Do you want to understand the Internet? Build a web crawler

May 27th, 2020

One of the best ways to understand how the web works is to try to crawl it.

It's no wonder that Google rules the webspace. Building two commercial products in this space, Proxies API (a rotating proxy service) and TeraCrawler (a high-scale crawler in the cloud), taught us a bunch of things about the web, from the programmer's perspective, that we had never known before.

We understood how the web is structured. The difficulties of parsing HTML taught us about the evolution of the web. Just as a geologist looks at rocks and can tell the different phases the land has gone through, we could see in the pages we crawled the remnants of the early web, the evolution of the HTTP protocol, and the evolution of HTML and CSS.

You will learn a lot about CSS selectors and XPath, and about thinking in patterns. A lot of web scraping is finding patterns that work, and that keep working reliably over time. Because every website is different, working out the various ways to achieve that often feels like detective work.
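To make the idea of a pattern concrete, here is a minimal sketch using only the Python standard library. The markup, the class names, and the `extract_products` helper are all made up for illustration. `xml.etree.ElementTree` supports only a small subset of XPath and needs well-formed input, so for real-world HTML you would likely reach for a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a product listing page.
SAMPLE_PAGE = """
<html><body>
  <div class="product"><h2>Blue Kettle</h2><span class="price">19.99</span></div>
  <div class="ad"><h2>Sponsored</h2></div>
  <div class="product"><h2>Red Toaster</h2><span class="price">34.50</span></div>
</body></html>
"""


def extract_products(page: str) -> list[tuple[str, float]]:
    """Return (name, price) pairs for every block matching the product pattern."""
    root = ET.fromstring(page)
    products = []
    # The pattern: any <div class="product">, wherever it sits in the tree.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        if name is None or price is None:
            # Skip blocks that do not fit the pattern instead of crashing.
            continue
        products.append((name.strip(), float(price)))
    return products


print(extract_products(SAMPLE_PAGE))
```

The pattern keys on the `product` class rather than on position in the page, so the ad block in the middle is ignored. That kind of choice is what tends to keep a scraper working when the layout shifts.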

You will learn about thinking at scale.

You are always projecting what will happen to your poor crawler, your scraper, your database, your network, and your servers when they are subjected to millions of URL requests, many of them running concurrently from your clients' multithreaded, aggressive bots. You will learn to measure, project, and test for scale very quickly, so you can be confident that things don't break at scale.
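A projection like this often starts as back-of-the-envelope arithmetic. The sketch below is one rough way to do it. The function names and all the input numbers are illustrative assumptions, not measurements from our systems, and the model ignores retries, rate limits, and uneven latency.

```python
def projected_crawl_hours(url_count: int, concurrency: int,
                          avg_seconds_per_request: float) -> float:
    """Rough wall-clock estimate, assuming workers stay evenly busy."""
    total_seconds = url_count * avg_seconds_per_request / concurrency
    return total_seconds / 3600


def projected_storage_gb(url_count: int, avg_page_kb: float) -> float:
    """Rough raw storage estimate, using decimal units (1 GB = 1,000,000 KB)."""
    return url_count * avg_page_kb / 1_000_000


# Assumed inputs: 1 million URLs, 50 concurrent workers,
# 0.8 seconds per request, 100 KB per page.
hours = projected_crawl_hours(1_000_000, 50, 0.8)
storage = projected_storage_gb(1_000_000, 100)
print(f"about {hours:.1f} hours and {storage:.0f} GB")
```

Numbers like these are only a starting point. The measuring and testing are what tell you how far off the assumptions were.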

You will learn about handling large amounts of data.

Every crawl job at TeraCrawler produces gigabytes of data. Handling it on our servers requires special skills that we acquired over time: what to store in a database, what to store in flat files, how to handle media like PDFs and images, and how to handle expiry and archival. Allowing our clients to download this data as quickly and conveniently as possible is a big challenge by itself.

You will learn about distributed computing.

You will learn about creating and managing asynchronous spiders, and learn a ton about sockets. You will be forced into learning distributed computing, and manhandled into learning how to create and manage background jobs and job queues.
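As a small illustration of the job-queue idea, here is a minimal sketch of asynchronous workers draining a shared queue with `asyncio`. The `fake_fetch` coroutine is a stand-in that only sleeps, so the example runs without a network. The names and the worker count are assumptions for the example, not a description of how TeraCrawler is built.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; it sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs off the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fake_fetch(url)
        finally:
            # Always mark the job done so queue.join() can finish.
            queue.task_done()


async def crawl(urls: list[str], worker_count: int = 3) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(worker_count)]

    await queue.join()           # wait until every queued URL is processed
    for task in workers:
        task.cancel()            # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


urls = [f"https://example.com/page/{n}" for n in range(5)]
pages = asyncio.run(crawl(urls))
print(len(pages), "pages fetched")
```

In a distributed setup the in-process queue would typically be replaced by an external one shared between machines, but the worker loop keeps much the same shape.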

It's a cat-and-mouse game between you and the web servers.

Much of web crawling is about pretending to be a human with a browser. You will learn about User-Agent strings, handling cookies, different types of headers, redirects, honey pots, and all the other things that ail web servers. You will learn how web servers and browsers talk to each other and how that has evolved over time. Things like AJAX and sessions become pretty obvious as a side effect of taking on the relentless challenge of web crawling and scraping.
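Here is a minimal sketch of what "looking like a browser" can mean at the HTTP level, using only Python's `urllib`. The header values, the `ExampleCrawler/0.1` token, and the helper names are illustrative assumptions. The opener built here keeps cookies across requests, and `urllib` follows redirects by default.

```python
import urllib.request
from http.cookiejar import CookieJar

# Illustrative browser-like headers; the exact values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_session():
    """Return an opener that stores cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_request(url: str) -> urllib.request.Request:
    """Attach the browser-like headers to a request for the given URL."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(opener, url: str, timeout: float = 10.0):
    """Fetch a page and return the final URL (after redirects) and the body."""
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.geturl(), response.read()
```

Calling `fetch(opener, "https://example.com/")` with an opener from `build_session()` would send the headers above, store any cookies the server sets in the jar, and report where the request finally landed after redirects.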

You will learn about stability inside chaos.

Web crawling is inherently chaotic because almost 90% of it is not in your control. You will learn how to deal with that, and how to improve incrementally and systematically. While initially you might, as a rule, suffer a couple of panic attacks, you will soon recover and come back with a more zen-like attitude.

You will learn that it is not all about coding.

If we look back, the original code that does the business logic of crawling and scraping is just about 10% of the entire TeraCrawler code base. Every time we start a web scraping project, we get fooled into thinking it is all about the code.
