Web scraping projects are known to fail often, so we thought it more appropriate to compile a list of DON'Ts rather than a list of DOs. So here goes.
- If the crawler depends on any external data or event happening in a particular way, DON'T assume it will happen that way. More often than not, it won't. For example, fetching a URL can break because of timeouts, redirects, CAPTCHA challenges, IP blocks, and more.
- DON'T build everything as custom code. Use a framework like Scrapy.
- DON'T be too aggressive with a website. Check the website's response time first. In fact, at crawltohell.com, our crawlers adjust their concurrency based on each domain's response time, so we don't overburden their servers.
- DON'T write linear code. Don't write code that crawls, scrapes data, processes it, and stores it all in one linear process. If one step breaks, so does everything else, and you also lose the ability to measure and optimize the performance of each step independently. Batch the stages instead.
- DON'T depend on your own IPs. They will eventually get blocked. Always build in the ability to route your requests through a rotating proxy service like Proxies API.
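To make the first DON'T concrete, here is a minimal sketch (using only the Python standard library; the function name and parameters are our own) of a fetch that assumes failure is normal: it times out, retries with exponential backoff, and returns None instead of crashing the whole crawl.

```python
import time
import urllib.request
import urllib.error

def fetch(url, retries=3, backoff=1.0, timeout=10):
    """Fetch a URL, retrying transient failures with exponential backoff.

    Returns the decoded body, or None if every attempt failed, so the
    caller can skip the URL and keep crawling.
    """
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == retries:
                print(f"giving up on {url}: {exc}")
                return None
            # Wait 1s, 2s, 4s, ... before retrying
            time.sleep(backoff * (2 ** attempt))
```

This doesn't handle CAPTCHAs or soft blocks (those need content inspection), but it covers the timeout and connection-error cases that kill naive crawlers.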
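The "check the response time first" advice can be sketched as a small throttle that scales the delay between requests with the server's observed latency (the class and its parameters are illustrative, not any particular framework's API; Scrapy ships a similar idea as its AutoThrottle extension).

```python
import time

class AdaptiveThrottle:
    """Scale the delay between requests with observed server latency.

    target_ratio=2.0 means we wait roughly twice as long as the server
    took to answer, so slow sites automatically get gentler treatment.
    """

    def __init__(self, target_ratio=2.0, min_delay=0.5, max_delay=30.0):
        self.target_ratio = target_ratio
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay

    def record(self, response_seconds):
        # Blend the new observation with the current delay (simple EWMA),
        # then clamp to sane bounds.
        wanted = response_seconds * self.target_ratio
        self.delay = 0.5 * self.delay + 0.5 * wanted
        self.delay = max(self.min_delay, min(self.max_delay, self.delay))

    def wait(self):
        time.sleep(self.delay)
```

Call `record()` with each response's elapsed time and `wait()` before the next request; a domain that starts answering in 10 seconds quickly pushes the delay up, and a fast one drifts back toward the minimum.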
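The "don't write linear code" point boils down to putting a queue between stages. Here is a toy in-process sketch (in production these would be separate batch jobs or message-queue consumers): fetching, parsing, and storing each read from one queue and write to the next, so a parser crash never takes down the fetcher, and each stage can be timed on its own.

```python
import queue

# One queue between each pair of stages decouples them: each stage can
# fail, restart, or be scaled independently of the others.
url_q = queue.Queue()
html_q = queue.Queue()
item_q = queue.Queue()

def crawl_stage(fetch):
    """Drain url_q, fetching pages into html_q; skip failed fetches."""
    while not url_q.empty():
        url = url_q.get()
        html = fetch(url)
        if html is not None:
            html_q.put((url, html))

def scrape_stage(parse):
    """Drain html_q, parsing pages into structured items in item_q."""
    while not html_q.empty():
        url, html = html_q.get()
        item_q.put(parse(url, html))
```

A storage stage would drain `item_q` the same way. Because each stage only touches its queues, you can batch-run them on separate schedules and measure throughput per stage.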
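For the last DON'T, the simplest self-hosted version is cycling requests through a pool of proxies. The sketch below uses placeholder proxy addresses; a rotating proxy service like Proxies API instead exposes a single endpoint that rotates exit IPs for you, which replaces the pool entirely.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints -- substitute your own pool, or a single
# rotating-proxy endpoint from a service like Proxies API.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

def proxy_cycle(proxies):
    """Endlessly cycle through the proxy pool, one proxy per request."""
    return itertools.cycle(proxies)

def fetch_via_proxy(url, pool, timeout=10):
    """Route a single request through the next proxy in the pool."""
    proxy = next(pool)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=timeout).read()
```

Round-robin is the crudest policy; real pools also evict proxies that start failing or getting blocked, which is exactly the bookkeeping a managed rotating service does for you.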