Web scrapers are known to die on us. It's because so much depends on things on the internet we can't control. We at Proxies API always say: if you want to understand the internet, build a web crawler.
Here are a bunch of oh-so-very-common issues and their fixes:
Issue | Fix |
---|---|
The scraper keeps breaking AFTER it has worked | Use XPath or CSS selectors to scrape. Don't use regular expressions; they break on the smallest markup change |
It's too slow | Use asynchronous URL fetches to get data, but be mindful of the server's ability to handle the load |
I keep getting blocked | Spoof and rotate your User-Agent header |
I keep getting CAPTCHAs | It's a hard problem to solve on your own. My best advice is to use a rotating proxy service like Proxies API, so you don't face this problem |
My URL fetches keep breaking | Build automatic retries into your code. URL fetches break all the time because it's the internet. Try again after a delay, and build that into the code |
I don't know where the problem is; the crawler keeps hanging | Add a logger at every point that depends on something external. Never assume the external dependencies will behave themselves |
My code keeps having issues | Writing custom code is always going to be prone to issues. Use a framework like Scrapy for web crawling |
My laptop doesn't scale | Of course it doesn't. Put it on an Amazon AWS instance or use a cloud-based crawler like crawltohell to get your data |
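To see why selectors beat regex, here's a minimal sketch using Python's standard-library `xml.etree.ElementTree` (a real scraper would more likely use lxml, parsel, or BeautifulSoup, and the HTML snippet here is made up for illustration). The XPath expression targets elements by structure and attribute, so it keeps working even if surrounding markup, whitespace, or attribute order changes:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed HTML fragment standing in for a scraped page.
html = """<html><body>
<div class="product"><span class="price">$9.99</span></div>
<div class="product"><span class="price">$19.99</span></div>
</body></html>"""

root = ET.fromstring(html)
# XPath selects by structure, not by raw text position like a regex would.
prices = [span.text for span in root.findall(".//span[@class='price']")]
# prices == ['$9.99', '$19.99']
```

A regex like `\$\d+\.\d\d` would also match prices in comments, scripts, or unrelated parts of the page, and would break the moment the site reformats its numbers.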
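User-Agent rotation is just picking a fresh, realistic browser string per request. A minimal sketch (the UA strings below are examples, and `requests` in the usage note is assumed to be your HTTP client):

```python
import random

# A small pool of realistic browser User-Agent strings (examples only;
# in practice keep this list current with real browser releases).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def build_headers():
    """Return headers with a randomly chosen User-Agent for this request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage with requests (assumed installed):
#   requests.get(url, headers=build_headers())
```

Rotating the string per request (rather than per session) makes your traffic look like many different browsers instead of one very busy script.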
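Auto-retries are easy to build in as a small wrapper. This is a generic sketch (the helper name and defaults are my own, not from any library): call the flaky function, and on failure wait with exponential backoff before trying again:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last exception once attempts are exhausted, so the
    caller still sees genuinely persistent failures.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage with any fetch call, e.g.:
#   page = with_retries(lambda: urllib.request.urlopen(url).read())
```

Scoping `retry_on` to transient errors (timeouts, connection resets) rather than all exceptions keeps real bugs from being silently retried.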
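And for the hanging-crawler row: logging every externally dependent step means that when the crawler stalls, the last log line tells you exactly which call it stalled in. A minimal sketch with the stdlib `logging` module, where `do_request` stands in for whatever HTTP client you use:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("crawler")

def fetch_page(url, do_request):
    """Fetch url via do_request (your HTTP client), logging before and
    after the external call so a hang pinpoints itself in the log."""
    log.info("fetching %s", url)
    try:
        body = do_request(url)
    except Exception:
        log.exception("fetch failed: %s", url)  # logs the traceback too
        raise
    log.info("fetched %s (%d bytes)", url, len(body))
    return body
```

If the log ends at "fetching https://example.com/page/7" with no matching "fetched" line, you know precisely which request hung, instead of guessing.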