May 27th, 2020
Fix Your Web Scrapers with This 15 Point Checklist

Web scrapers are known to die on us, because so much depends on things on the internet that we can't control. We at Proxies API always say: if you want to understand the internet, build a web crawler.

Here are a bunch of oh-so-very-common issues and their fixes:

| Issue | Fix |
| --- | --- |
| The scraper keeps breaking AFTER it has worked | Use XPath or CSS selectors to scrape. DON'T use regex. |
| It's too slow | Use asynchronous URL fetches to get data, but be mindful of the server's ability to handle the load. |
| I keep getting blocked | Spoof and rotate your User-Agent header. |
| I keep getting CAPTCHAs | It's a hard problem to solve on your own. My best advice is to use a rotating proxy service like Proxies API, so you don't face this problem. |
| My URL fetches keep breaking | Build automatic retries into your code. URL fetches break all the time because it is the internet, so try again later, and build that into the code. |
| I don't know where the problem is. The crawler keeps hanging | Build logging into all the places that depend on external systems. Never assume the external dependencies will behave themselves. |
| My code keeps having issues | Custom code is always going to be prone to issues. Use a framework like Scrapy for web crawling. |
| My laptop doesn't scale | Of course it doesn't. Put it on an Amazon AWS instance or use a cloud-based crawler like crawltohell to get your data. |
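
Three of the fixes above (rotating the User-Agent, automatic retries, and logging around external calls) fit naturally into one small fetch helper. Here is a minimal sketch using only the Python standard library. The function name `fetch`, the `USER_AGENTS` pool, the default retry and backoff values, and the `opener` and `sleep` parameters are all assumptions made for illustration, not part of any particular library or service.

```python
import logging
import random
import time
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

# Hypothetical pool of User-Agent strings; swap in current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def fetch(url, retries=3, backoff=1.0, timeout=10.0,
          opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, rotating the User-Agent and retrying on network errors.

    `opener` and `sleep` are injectable only so the helper can be tested
    without touching the network; the defaults are the real functions.
    """
    for attempt in range(1, retries + 1):
        # Pick a different User-Agent at random for every attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        request = urllib.request.Request(url, headers=headers)
        try:
            with opener(request, timeout=timeout) as response:
                body = response.read()
            log.info("fetched %s on attempt %d (%d bytes)", url, attempt, len(body))
            return body
        except OSError as exc:
            # urllib's URLError and HTTPError, and socket timeouts, are all OSError subclasses.
            log.warning("attempt %d/%d for %s failed: %s", attempt, retries, url, exc)
            if attempt == retries:
                raise
            # Exponential backoff: wait backoff, then 2x backoff, then 4x, and so on.
            sleep(backoff * 2 ** (attempt - 1))
```

Calling `fetch("https://example.com/")` returns the response body as bytes, or re-raises the last error once the retries run out. This sketch retries every failure, including HTTP errors such as 404 that are unlikely to succeed on a second try, so a real crawler would probably want to retry only on timeouts and 5xx responses.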
