Web crawling operations are like navigating a ship during a storm. Apart from your code, almost nothing is under your control. Being well prepared is half the battle.
Here are some conditions you are likely to face and will need to be ready for before you set sail.
- The target web servers will go down
- The target website will time out on your fetches
- Connections will hang and consume too many resources when your crawler encounters an unusually large file.
- The web server will block your crawler as it can't identify you as a browser.
- The website might IP ban you.
- The website might rate-limit you below the speeds you want, so some of your requests will fail. It might also temporarily restrict all access.
- The website might throw a CAPTCHA challenge.
- The robots.txt file will be different from what you expect.
- The website might change its page structure, making scraping code such as CSS selectors or XPaths obsolete
- It might have links to external websites, so your crawler veers way off course
- The images and documents you want may be on a CDN, and your external domain restrictions might mean you won't crawl these.
- Your crawler will hang because of the load and unexpected behaviors.
- Your crawler will suffer memory overloads.
- You will have problems handling large amounts of data. For example, you might be storing all your files in a single folder and, after a few weeks, have millions of them, making them a nightmare to manage.
- You will run out of resources as you ask more from your crawler over time. These could be CPU, memory, network bandwidth, or even storage space
- Some websites are just too large, and if you have no crawl policy, you will get stuck fetching data that you don't want
- Your crawler keeps breaking, but you have no idea where among the thousands of links it is fetching, because you have not built in a sufficiently informative logger
- Your crawler has been getting gibberish or no data at all from certain parts of a site for weeks, and you didn't even notice
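Several of the conditions above (servers going down, timeouts, oversized files, rate limiting) can be handled at the fetch layer. The following is a minimal sketch using only the Python standard library, not a complete crawler. The size cap, timeout, retry statuses, `User-Agent` string, and the names `fetch`, `read_capped`, and `backoff_delay` are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

# All of these values are illustrative assumptions; tune them per target site.
MAX_BYTES = 5 * 1024 * 1024                  # skip bodies larger than 5 MiB
TIMEOUT_S = 10                               # per-request timeout in seconds
RETRY_STATUSES = {429, 500, 502, 503, 504}   # throttling and server errors
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"  # hypothetical


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base, 2*base, 4*base, ... limited to cap."""
    return min(cap, base * (2 ** attempt))


def read_capped(response, max_bytes=MAX_BYTES):
    """Read at most max_bytes; return None if the body is larger."""
    body = response.read(max_bytes + 1)
    if len(body) > max_bytes:
        return None
    return body


def fetch(url, retries=3, opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body as bytes, or None if it is too large or retries run out."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(retries + 1):
        try:
            with opener(request, timeout=TIMEOUT_S) as response:
                return read_capped(response)
        except urllib.error.HTTPError as err:
            # HTTPError must be caught before OSError, its base class.
            if err.code not in RETRY_STATUSES:
                raise  # e.g. 403 or 404: retrying will not help
        except OSError:
            pass  # connection refused, DNS failure, or timeout
        if attempt < retries:
            sleep(backoff_delay(attempt))
    return None
```

The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network. A real crawler would likely add jitter to the delay and honor a `Retry-After` header when the server sends one.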
There are more, but these are all starting points to think about, and you should have handlers, loggers, and alerting mechanisms in place for them. You might have to use a rotating proxy service to overcome many of the IP blocks and other access-related problems above. With these problems in mind, we have developed a cloud-based crawling service called TeraCrawler, which handles all these issues automatically behind the scenes and removes more or less 99% of the headaches connected with large-scale web crawling. TeraCrawler also uses our rotating proxy service behind the scenes to crawl almost any kind of website without getting IP banned.
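For a crawler you run yourself, the loggers and alerting mechanisms mentioned above can start small. The sketch below logs every page that yields no data and raises a flag when too many recent pages come back empty, which is one way to avoid going weeks without noticing. The `ExtractionMonitor` class, its window size, and its threshold are hypothetical names and values picked for illustration, and the "alert" here is only a log line and a return value.

```python
import logging
from collections import deque

logger = logging.getLogger("crawler")


class ExtractionMonitor:
    """Tracks a rolling window of pages and flags a high share of empty ones."""

    def __init__(self, window=100, max_empty_ratio=0.5):
        # True means "nothing useful was extracted from this page".
        self.results = deque(maxlen=window)
        self.max_empty_ratio = max_empty_ratio

    def record(self, url, fields):
        """Record one page's extracted fields; return True if an alert fires."""
        empty = not any(
            value is not None and str(value).strip()
            for value in fields.values()
        )
        self.results.append(empty)
        if empty:
            # Logging the URL tells you where the crawl is breaking.
            logger.warning("no data extracted from %s", url)

        # Only judge once the window is full, to avoid alerting on startup.
        window_full = len(self.results) == self.results.maxlen
        ratio = sum(self.results) / len(self.results)
        alert = window_full and ratio > self.max_empty_ratio
        if alert:
            logger.error(
                "%.0f%% of the last %d pages were empty",
                ratio * 100,
                len(self.results),
            )
        return alert
```

In use, the crawler would call `record(url, fields)` after parsing each page and hand a `True` result to whatever notification channel it already has, such as email or a chat webhook. Detecting gibberish as opposed to empty fields would need a site-specific check, which this sketch does not attempt.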