Here are some rules of thumb to follow when building web crawlers that can scale.
- Decouple the crawling process (discovering and fetching pages) from the scraping process (extracting data from them). You can then measure and tune the performance of each one separately.
- Do not use regular expressions for scraping. Use XPath or CSS selectors instead. A regex is usually the first thing to break when the target page's HTML changes even slightly.
- Assume every externally dependent process or method will fail, and write handlers and loggers for each one. For example, assume the URL fetch will fail, time out, redirect, return an empty body, or show a CAPTCHA. Anticipate and log each of these cases.
- Make your app easy to debug by tracing every step your crawler goes through. Make the logs as rich as possible, and send yourself alerts so you know immediately when something goes wrong.
- Learn how to get your crawlers to pretend to be human.
- Build your crawler around an established framework. Custom code will have many more points of failure.
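
As a rough illustration of the "assume every fetch will fail" rule, here is a minimal sketch using only the Python standard library. The names `FetchResult`, `classify`, and `fetch` are hypothetical, and the CAPTCHA check is a deliberately crude assumption (a substring match). A real crawler would need site-specific detection. The fetch step returns a plain result object and does no parsing, which also keeps crawling and scraping decoupled.

```python
import logging
import socket
import urllib.error
import urllib.request
from dataclasses import dataclass

# Hypothetical logger name; attach whatever handlers and alerts you use.
logger = logging.getLogger("crawler.fetch")


@dataclass
class FetchResult:
    """Outcome of one fetch attempt, handed to a separate scraping step."""
    url: str
    outcome: str  # ok, redirected, empty, captcha, timeout, http_error, failed
    body: str = ""
    detail: str = ""


def classify(requested_url: str, final_url: str, body: str) -> str:
    """Label a response that arrived without raising an exception."""
    if not body.strip():
        return "empty"
    if "captcha" in body.lower():  # assumption: crude substring heuristic
        return "captcha"
    if final_url != requested_url:
        return "redirected"
    return "ok"


def fetch(url: str, timeout: float = 10.0) -> FetchResult:
    """Fetch a URL, turning every anticipated failure into a logged result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            final_url = resp.geturl()
    except urllib.error.HTTPError as exc:  # must precede URLError (subclass)
        logger.warning("HTTP %s for %s", exc.code, url)
        return FetchResult(url, "http_error", detail=str(exc.code))
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            logger.warning("Timed out connecting to %s", url)
            return FetchResult(url, "timeout", detail=str(exc.reason))
        logger.error("Fetch failed for %s: %s", url, exc.reason)
        return FetchResult(url, "failed", detail=str(exc.reason))
    except (socket.timeout, TimeoutError) as exc:
        # Timeouts while reading the body are raised directly.
        logger.warning("Timed out reading %s", url)
        return FetchResult(url, "timeout", detail=str(exc))

    outcome = classify(url, final_url, body)
    if outcome != "ok":
        logger.warning("Fetched %s with outcome %r (final URL %s)",
                       url, outcome, final_url)
    return FetchResult(url, outcome, body=body, detail=final_url)
```

A scraping stage can then consume only the results whose `outcome` is `"ok"`, while the other outcomes feed retry queues, metrics, or alerts.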