May 27th, 2020
Do you make these 9 mistakes while web crawling?

Here are nine common mistakes that can get you in trouble while web crawling:

  1. Not respecting robots.txt (see the first sketch after this list).
  2. Not using asynchronous connections to speed up crawling (second sketch).
  3. Not using CSS selectors or XPath to reliably scrape data (third sketch).
  4. Not sending a user-agent string (first sketch).
  5. Not rotating user-agent strings (fourth sketch).
  6. Not adding a random delay between requests to the same domain (first sketch).
  7. Not using a framework.
  8. Not monitoring the progress of your crawlers.
  9. Not using a rotating proxy service like Proxies API (fourth sketch).

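Several of these fixes take only a few lines of Python. Here is a minimal sketch covering mistakes 1, 4, and 6, using the standard library's robotparser and the requests package; the target site, paths, and bot URL are placeholders.

```python
import random
import time
from urllib.robotparser import RobotFileParser

import requests

BASE_URL = "https://example.com"  # placeholder target site

# Mistake 1: check robots.txt before fetching anything
robots = RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()

# Mistake 4: identify your crawler with a user-agent string (placeholder value)
USER_AGENT = "MyCrawler/1.0 (+https://example.com/bot)"

for path in ["/page1", "/page2", "/page3"]:  # placeholder paths
    url = BASE_URL + path
    if not robots.can_fetch(USER_AGENT, url):
        print(f"skipping {url}: disallowed by robots.txt")
        continue
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print(url, response.status_code)
    # Mistake 6: random delay between requests to the same domain
    time.sleep(random.uniform(1.0, 3.0))
```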
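For mistake 2, here is a sketch of fetching pages concurrently with asyncio and aiohttp instead of making one blocking request at a time; the URL list is hypothetical.

```python
import asyncio

import aiohttp

URLS = [f"https://example.com/page{i}" for i in range(1, 6)]  # placeholder URLs

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Requests run concurrently on one event loop, and connections are reused
    async with session.get(url) as response:
        return await response.text()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in URLS))
    for url, html in zip(URLS, pages):
        print(url, len(html), "bytes")

asyncio.run(main())
```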
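For mistake 3, here is the same extraction done two ways, with CSS selectors via BeautifulSoup and with XPath via lxml. The HTML snippet is a made-up example.

```python
from bs4 import BeautifulSoup
from lxml import html

PAGE = """
<html><body>
  <div class="product"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2 class="title">Gadget</h2><span class="price">$19.99</span></div>
</body></html>
"""

# CSS selectors with BeautifulSoup
soup = BeautifulSoup(PAGE, "html.parser")
for product in soup.select("div.product"):
    print(product.select_one("h2.title").text, product.select_one("span.price").text)

# The same data with XPath via lxml
tree = html.fromstring(PAGE)
print(tree.xpath("//div[@class='product']/span[@class='price']/text()"))
```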
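For mistakes 5 and 9, here is a sketch of rotating user-agent strings and routing traffic through a rotating proxy endpoint. The user-agent pool and proxy URL are placeholders, not Proxies API's actual endpoint; consult their documentation for the real integration details.

```python
import random

import requests

# Mistake 5: rotate through a pool of user-agent strings (sample values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# Mistake 9: send every request through a rotating proxy endpoint.
# This URL is a placeholder, not Proxies API's actual endpoint format.
PROXY = "http://user:YOUR_KEY@rotating-proxy.example.com:8080"

response = requests.get(
    "https://example.com/page1",  # placeholder target
    headers={"User-Agent": random.choice(USER_AGENTS)},
    proxies={"http": PROXY, "https": PROXY},
    timeout=10,
)
print(response.status_code)
```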
Being smart about web crawling means realizing that it's not just about the code. In our experience at Teracrawler developing cloud-based web crawlers at scale, most of web crawling and web scraping is about controlling these variables. A systematic approach to web crawling, one that delivers frequent and reliable data day in and day out, can change the fortunes of your company.
