Here are some common mistakes that can get you into trouble while web crawling:
- Not respecting robots.txt (see the sketch after this list).
- Not using asynchronous connections to speed up crawling (sketch below).
- Not using CSS selectors or XPath to reliably scrape data (sketch below).
- Not sending a user-agent string.
- Not rotating user-agent strings.
- Not adding a random delay between requests to the same domain (one sketch below covers these three user-agent and delay items together).
- Not using a framework (sketch below).
- Not monitoring the progress of your crawlers (the framework sketch below also logs crawl stats as it runs).
- Not using a rotating proxy service like Proxies API (sketch below).
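A minimal way to respect robots.txt in Python is the standard library's urllib.robotparser. The example.com URLs and the user-agent name below are placeholders:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler/1.0"  # placeholder; use your crawler's real name

# Fetch and parse the site's robots.txt once per domain.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

url = "https://example.com/some/page"
if parser.can_fetch(USER_AGENT, url):
    print("allowed to crawl:", url)
else:
    print("disallowed by robots.txt:", url)

# Some sites also declare a Crawl-delay directive; honor it if present.
delay = parser.crawl_delay(USER_AGENT)
if delay:
    print("site requests", delay, "seconds between requests")
```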
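For asynchronous connections, one common approach (an assumption here, not something this post prescribes) is asyncio with the third-party aiohttp library. A sketch fetching several placeholder URLs concurrently:

```python
import asyncio

import aiohttp  # third-party: pip install aiohttp

URLS = [  # placeholder URLs
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Each coroutine yields while waiting on the network, so many
    # requests can be in flight at once on a single thread.
    async with session.get(url) as response:
        return await response.text()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))
        for url, page in zip(URLS, pages):
            print(url, len(page), "bytes")

asyncio.run(main())
```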
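For reliable extraction, select elements with CSS selectors or XPath rather than regexes or string slicing. A sketch using the third-party lxml library (my choice for illustration); the markup and class names are made up:

```python
from lxml import html  # third-party: pip install lxml

# A stand-in for a fetched page; the structure is invented for this example.
page = """
<html><body>
  <h1 class="title">Example product</h1>
  <span class="price">$19.99</span>
</body></html>
"""

tree = html.fromstring(page)

# CSS selector (needs the cssselect package) and the equivalent XPath.
title = tree.cssselect("h1.title")[0].text_content()
price = tree.xpath("//span[@class='price']/text()")[0]

print(title, price)
```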
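The three user-agent and delay items fit in one sketch: always send a user-agent, rotate it per request, and sleep a random interval between hits to the same domain. This uses the third-party requests library; the user-agent strings and delay bounds are arbitrary examples:

```python
import random
import time

import requests  # third-party: pip install requests

# A small pool to rotate through; in practice use current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

urls = ["https://example.com/page%d" % i for i in range(1, 4)]  # placeholders

for url in urls:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate per request
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    # Random delay so requests to the same domain don't arrive in lockstep.
    time.sleep(random.uniform(1.0, 5.0))
```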
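A framework such as Scrapy (one option among several, not necessarily what we use at Teracrawler) handles scheduling, retries, and throttling for you, and logs crawl statistics as it runs, which covers basic progress monitoring. A minimal spider with a placeholder URL and illustrative selectors:

```python
import scrapy  # third-party: pip install scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]  # placeholder

    # Politeness settings the framework enforces for you.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,             # base delay between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "ROBOTSTXT_OBEY": True,          # respect robots.txt automatically
    }

    def parse(self, response):
        # Selectors are illustrative; adapt them to the real page structure.
        yield {"url": response.url, "title": response.css("title::text").get()}
        for link in response.css("a::attr(href)").getall():
            yield response.follow(link, callback=self.parse)
```

Run it with `scrapy runspider spider.py`; Scrapy periodically logs pages crawled per minute, so you can watch progress without extra code.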
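With a rotating proxy service, each request can exit from a different IP. The endpoint and credentials below are made-up placeholders, not the real Proxies API interface; check your provider's documentation for the actual format. A sketch with requests:

```python
import requests  # third-party: pip install requests

# HYPOTHETICAL endpoint and key: substitute your provider's real values.
PROXY = "http://YOUR_KEY:@proxy.example-provider.com:8000"

proxies = {
    "http": PROXY,
    "https": PROXY,
}

# The provider routes each request through a different exit IP, so the
# target site sees the traffic spread across many addresses.
response = requests.get("https://example.com/", proxies=proxies, timeout=30)
print(response.status_code, response.text[:200])
```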
Being smart about web crawling means realizing that it's not really about the code. In our experience at Teracrawler, building cloud-based web crawlers at scale, most of web crawling and web scraping comes down to controlling these variables. A systematic approach that delivers frequent, reliable data at scale, day in and day out, can change the fortunes of your company.