Or issuing CAPTCHAs.
Then developers Google something to the effect of "How not to get blocked by websites when crawling," and they get myriad blog posts (including one of mine from earlier) about all the tricks they can use to stop this from happening.
They learn about:
- Spoofing a browser by using User-Agent strings
- Rotating User-Agent strings
- Rate limiting
- Automatic CAPTCHA solvers
- Pretending to be human
- Sending cookies back
- Sending more headers to look even more like a web browser
- Getting plastic surgery done, etc.
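To make the first few tricks concrete, here is a minimal sketch of what they usually look like in Python, using only the standard library. The User-Agent strings and the example URL are placeholders, not recommendations:

```python
import random
import time
import urllib.request

# Placeholder pool of User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_request(url: str) -> urllib.request.Request:
    """Build a request with a rotated User-Agent and a few
    extra browser-like headers."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)

# Crude rate limiting: sleep between requests.
# req = build_request("https://example.com/some-page")
# time.sleep(2)
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```

Sending cookies back would add an `http.cookiejar` on top of this, and none of it will save you from an IP-level block.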
They go ahead and implement it all. It will probably take a month to do properly and correctly, especially if you are a novice in web scraping.
Then everything goes swimmingly for a few days or weeks until, as invariably happens, you get IP blocked!
Now let's look at all the solutions implemented earlier:
Spoofing a browser by using User-Agent strings - REDUNDANT!
Rotating User-Agent strings - REDUNDANT!
Rate limiting - REDUNDANT!
Automatic CAPTCHA solvers - REDUNDANT!
Pretending to be human - REDUNDANT!
Sending cookies back - REDUNDANT!
Sending more headers to look even more like a web browser - REDUNDANT!
Getting plastic surgery done, etc. - MAY JUST WORK!
I'm just going to come out and say it: none of these matters, and there is only one solution.
JUST INVEST IN A PROFESSIONAL ROTATING PROXY SERVICE.
It solves all these headaches for as little as $29 per month. The alternative is spending a month digging through Stack Overflow answers and duct-taping together an increasingly shaky piece of web crawling software, only to end up back at square one anyway.
So I am done pretending and being polite about my PAID offering. I have done web crawling for more than ten years (it was the first commercial project I did in college), and my team has, between us, 30 years of experience in this field, and in my expert opinion, Sir/Madam, I advise you to please just buy the damn thing.
Buy any of them; there are quite a few, and they are all okay. We have one that does a great job of rotating proxy servers behind a simple API you can plug into your code in two minutes, and it works just fine too.
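To show how little code "plug and play" actually involves, here is a minimal sketch of pointing standard-library Python at a rotating proxy. The endpoint, port, and credentials are hypothetical placeholders for whatever your provider gives you:

```python
import urllib.request

# Hypothetical rotating-proxy endpoint; substitute the actual
# host, port, and credentials from your provider.
PROXY = "http://username:password@proxy.example.com:8000"

# Route all HTTP and HTTPS traffic through the single proxy
# endpoint; the provider rotates the exit IP behind it, so
# every request can leave from a different address.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)

# html = opener.open("https://example.com/some-page").read()
```

Your crawler code stays exactly the same; the only change is that requests go out through the opener instead of directly.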
If you want to save even more time and not bother creating your own web crawling setup at all, you can check out our cloud-based web crawler, teracrawler.io.
TeraCrawler handles all of the above. It uses distributed servers with a large IP range, along with millions of residential proxy IPs, to make certain that you get your data.