An Evolution of a Programmer New to Web Scraping

May 27th, 2020

Programmers new to web crawling tend to follow a typical progression, and we wanted to document it. We went through it, our developer friends went through it, and our new hires get bullied into skipping it.

Use requests and Beautiful Soup, learned from a tutorial on Medium or Quora that shows you step by step how to "crawl the web".

Hit issues with link extraction, speed, and obeying robots.txt, then reconsider everything or try to write custom code for each.
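To give a sense of how much "custom code for each" this turns into, here is a minimal sketch of just two of those pieces: extracting links and checking them against robots.txt. It uses only the Python standard library. The page, the robots.txt content, and the "example-crawler" user agent are made-up values, inlined so the sketch runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect absolute, fragment-free http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        # Skip mailto:, javascript:, and similar schemes, plus duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def allowed_links(html, base_url, robots_txt, user_agent="example-crawler"):
    """Return links found in `html` that `robots_txt` lets `user_agent` fetch."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return [url for url in extractor.links if robots.can_fetch(user_agent, url)]


# Hypothetical page and robots.txt, inlined so the sketch runs offline.
PAGE = (
    '<a href="/blog/">Blog</a> <a href="/private/x">X</a> '
    '<a href="#top">Top</a> <a href="mailto:a@example.com">Mail</a>'
)
ROBOTS = "User-agent: *\nDisallow: /private/"

print(allowed_links(PAGE, "https://example.com/index.html", ROBOTS))
```

Even this small version has to deal with relative URLs, fragments, non-HTTP schemes, and duplicates, and it still ignores crawl delays, redirects, and malformed HTML.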

Hit issues with downloading images, PDFs, and other media.

Realize that this is a lot of wheels that need to be reinvented. Like anything else complex, maybe there is a framework for this. Learn about frameworks.

Use a framework. Solves 80% of all problems. Life is good again.

Get restricted by many websites because you just unleashed Scrapy on them.
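For contrast, a less aggressive Scrapy configuration might look something like the sketch below. The setting names are Scrapy's own; the values are assumptions chosen for illustration and would need tuning for each target site.

```python
# Illustrative values only; tune them to what each target site tolerates.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                  # skip URLs that robots.txt disallows
    "DOWNLOAD_DELAY": 2.0,                   # seconds between requests to one site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # cap parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 5.0,         # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 60.0,          # upper bound when the server is slow
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # average parallel requests to aim for
}
```

A dictionary like this can be assigned to a spider's custom_settings class attribute, or the same keys can go into the project's settings.py.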

Learn clever ways to pretend to be human to overcome blocks.
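Two of the usual tricks at this stage are varying the User-Agent header and adding random pauses between requests. The sketch below shows both with the standard library. The User-Agent strings, the Accept-Language value, and the delay numbers are illustrative assumptions, and build_request and jittered_delay are hypothetical helper names.

```python
import random
import urllib.request

# Illustrative User-Agent strings; a real crawler would keep these current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1 Safari/605.1.15",
]


def build_request(url, rng=random):
    """Return a urllib Request with a randomly chosen User-Agent header."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def jittered_delay(base=2.0, spread=1.5, rng=random):
    """Return a pause in seconds drawn uniformly from [base, base + spread]."""
    return base + rng.uniform(0, spread)
```

In a crawl loop this would be used as time.sleep(jittered_delay()) followed by urllib.request.urlopen(build_request(url)). As the next step shows, none of this helps once the block is applied to your IP address.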

Get completely blocked at the IP level. You now learn the term "IP block", the cancer of web crawlers.

Try some free proxies (after much googling) and write everything around an auto-scraped database of free online proxies that you aggressively update every 5 minutes "just to be safe".

Realize they simply don't work.

Go away for a week.

Reluctantly look at solving IP blocks with a commercial rotating proxy service like Proxies API, and realize the value of not reinventing the wheel versus spending months fighting every possible vagary of the internet. Or decide that you are done coding this, that it is not your core business anyway, and that your web crawler code is probably bigger than your main business logic, and switch to a cloud-based web crawler like TeraCrawler.io.
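For a sense of what a rotating proxy service takes off your hands, here is a minimal sketch of doing the rotation yourself with the standard library. The proxy addresses are placeholders, ProxyRotator is a hypothetical class name, and nothing here reflects how Proxies API or any other commercial service works internally.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints; real ones would come from a proxy provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


class ProxyRotator:
    """Hand out proxies in round-robin order, skipping ones marked as bad."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._bad = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_bad(self, proxy):
        """Exclude a proxy after it fails or gets blocked."""
        self._bad.add(proxy)

    def next_proxy(self):
        # One full pass over the pool is enough to find a usable proxy.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._bad:
                return proxy
        raise RuntimeError("no usable proxies left")

    def opener(self):
        """Build a urllib opener that routes HTTP and HTTPS via the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
```

A crawler would call rotator.opener().open(url) for each request and call mark_bad() whenever a proxy times out or gets blocked. Keeping the pool healthy, sourcing clean addresses, and retrying failed requests is the part that grows into months of work.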
