It helps to think of a web crawler as a system rather than a piece of code.
This shift in perspective is important, and it is forced on any developer who attempts web scraping at scale.
It is also one of the best ways to learn to think in systems.
We can see the whole crawling process as a workflow with multiple possible points of failure. In fact, any place where the scraper depends on an external resource is a place where it can, and eventually will, fail. So the developer ends up spending most of the time, perhaps 90% of it, fixing these inevitable issues bit by bit.
At Proxies API, we went through the drudgery of not thinking systematically about web scraping until one day we took a step back and identified the central problem. The code was never the problem. The whole thing didn't work as a system. We finally settled on our own set of rules for building crawlers that work systematically.
Here are the rules that the system has to obey:
- The system should handle fetching issues (timeouts, redirects, headers, browser spoofing, CAPTCHAs, and IP blocks).
- Where the crawler doesn't have a solution to one of these issues (for example, CAPTCHAs), it should at least catch and log it.
- The system should be able to "step over" any issue instead of stumbling and bringing everything down with it.
- The system should immediately alert the developer about an issue.
- The system should help the developer diagnose the latest issue quickly, with as much context as possible, so that it is easily reproducible.
- The system should be as generic as possible at the code level and should push individual website logic to an external database as much as possible.
- The system should have enough levers to control the speed and scale of the crawl.
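
To make a few of these rules concrete, here is a minimal sketch in Python of a crawl loop that steps over failures, logs them with context, alerts the developer, and exposes a couple of speed levers. It is only an illustration, not how Proxies API implements it. All the names here (`FetchError`, `Failure`, `CrawlReport`, `crawl`, and the `fetch` and `alert` callbacks) are hypothetical, and the actual fetching is assumed to be done by whatever function you pass in as `fetch`.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

logger = logging.getLogger("crawler")


class FetchError(Exception):
    """Hypothetical error a fetcher raises for a known issue (timeout, captcha, ip_block...)."""

    def __init__(self, kind: str, detail: str = ""):
        super().__init__(f"{kind}: {detail}")
        self.kind = kind
        self.detail = detail


@dataclass
class Failure:
    """Context kept for each failed URL so the issue can be reproduced later."""
    url: str
    kind: str
    detail: str
    attempt: int


@dataclass
class CrawlReport:
    pages: Dict[str, str] = field(default_factory=dict)
    failures: List[Failure] = field(default_factory=list)


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    alert: Optional[Callable[[Failure], None]] = None,
    delay_seconds: float = 0.0,   # lever: pause between URLs
    max_retries: int = 1,         # lever: extra attempts per URL
    sleep: Callable[[float], None] = time.sleep,
) -> CrawlReport:
    report = CrawlReport()
    for url in urls:
        for attempt in range(1, max_retries + 2):
            try:
                report.pages[url] = fetch(url)
                break  # success, move on to the next URL
            except FetchError as err:
                failure = Failure(url, err.kind, err.detail, attempt)
            except Exception as err:
                # Unknown issue: still caught and logged, never fatal to the crawl.
                failure = Failure(url, "unexpected", repr(err), attempt)
            logger.warning("fetch failed: %s", failure)
            if attempt > max_retries:
                # Out of attempts: record it, alert, and step over this URL.
                report.failures.append(failure)
                if alert is not None:
                    alert(failure)
        sleep(delay_seconds)
    return report
```

The loop never lets a single URL bring the crawl down: every exception is turned into a `Failure` record holding the URL, the kind of issue, and the attempt number, which is the context you need to reproduce it. The `alert` callback is where an email or chat notification would go, and `delay_seconds` and `max_retries` are two simple examples of the levers mentioned above.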