Web scrapers are known to die on us. It's because so much depends on things on the internet we can't control. We at Proxies API always say: if you want to understand the internet, build a web crawler.
Here are a bunch of oh-so-very-common issues and their fixes:
Issue | Fix |
---|---|
The scraper keeps breaking AFTER it has worked | Use XPath or CSS selectors to scrape. Don't use regular expressions; they break on the smallest markup change |
It's too slow | Use asynchronous URL fetches to get data, but be mindful of the server's ability to handle the load |
I keep getting blocked | Spoof and rotate your User-Agent header |
I keep getting CAPTCHAs | It's a hard problem to solve on your own. My best advice is to use a rotating proxy service like Proxies API, so you don't face this problem |
My URL fetches keep breaking | Build automatic retries into your code. URL fetches break all the time because it's the internet. Try again after a delay, and build that into the code |
I don't know where the problem is; the crawler keeps hanging | Add a logger at every point that depends on something external. Never assume the external dependencies will behave themselves |
My code keeps having issues | Writing custom code is always going to be prone to issues. Use a framework like Scrapy for web crawling |
My laptop doesn't scale | Of course it doesn't. Put it on an Amazon AWS instance or use a cloud-based crawler like crawltohell to get your data |
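To see why selectors beat regex, here's a minimal sketch using Python's standard-library `xml.etree.ElementTree` (a real scraper would more likely use lxml, parsel, or BeautifulSoup, and the HTML snippet here is made up for illustration). The XPath expression targets elements by structure and attribute, so it keeps working even if surrounding markup, whitespace, or attribute order changes:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed HTML fragment standing in for a scraped page.
html = """<html><body>
<div class="product"><span class="price">$9.99</span></div>
<div class="product"><span class="price">$19.99</span></div>
</body></html>"""

root = ET.fromstring(html)
# XPath selects by structure, not by raw text position like a regex would.
prices = [span.text for span in root.findall(".//span[@class='price']")]
# prices == ['$9.99', '$19.99']
```

A regex like `\$\d+\.\d\d` would also match prices in comments, scripts, or unrelated parts of the page, and would break the moment the site reformats its numbers.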
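User-Agent rotation is just picking a fresh, realistic browser string per request. A minimal sketch (the UA strings below are examples, and `requests` in the usage note is assumed to be your HTTP client):

```python
import random

# A small pool of realistic browser User-Agent strings (examples only;
# in practice keep this list current with real browser releases).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def build_headers():
    """Return headers with a randomly chosen User-Agent for this request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage with requests (assumed installed):
#   requests.get(url, headers=build_headers())
```

Rotating the string per request (rather than per session) makes your traffic look like many different browsers instead of one very busy script.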
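Auto-retries are easy to build in as a small wrapper. This is a generic sketch (the helper name and defaults are my own, not from any library): call the flaky function, and on failure wait with exponential backoff before trying again:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last exception once attempts are exhausted, so the
    caller still sees genuinely persistent failures.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage with any fetch call, e.g.:
#   page = with_retries(lambda: urllib.request.urlopen(url).read())
```

Scoping `retry_on` to transient errors (timeouts, connection resets) rather than all exceptions keeps real bugs from being silently retried.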
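And for the hanging-crawler row: logging every externally dependent step means that when the crawler stalls, the last log line tells you exactly which call it stalled in. A minimal sketch with the stdlib `logging` module, where `do_request` stands in for whatever HTTP client you use:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("crawler")

def fetch_page(url, do_request):
    """Fetch url via do_request (your HTTP client), logging before and
    after the external call so a hang pinpoints itself in the log."""
    log.info("fetching %s", url)
    try:
        body = do_request(url)
    except Exception:
        log.exception("fetch failed: %s", url)  # logs the traceback too
        raise
    log.info("fetched %s (%d bytes)", url, len(body))
    return body
```

If the log ends at "fetching https://example.com/page/7" with no matching "fetched" line, you know precisely which request hung, instead of guessing.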