The internet is rife with articles offering working code snippets, like "Web scraping Amazon reviews with Python and Beautiful Soup."
While these are great for learning, they are horrible as a starting point. Try out these pieces of code, play with them, tweak them, and then please just abandon them.
Don't use them as a starting point in any of your serious projects.
You practice on a lonely mud road when you are learning to drive because it is easy and safe, as there is no traffic. But you don't set off on that road to reach San Francisco, which is 700 miles away! You need the highway for that.
As far as web crawling projects are concerned, the highway is a web scraping framework.
Any framework will do. Just don't go at it on your own. If you do, you will only get so far before you run out of steam and run into near-unsolvable complexities.
There is one for nearly every language: Scrapy (Python), Nutch (Java), Goutte (PHP), etc.
Start with one of them. Many of the web crawling projects that end up using our service, teracrawler.io, made the same mistake. They hit a wall of never-ending complexity because they started off wrong.
Not using a framework means you will need to worry about a lot of extraneous things that are not central to your business logic, such as:
- How do you handle robots.txt? Now you are building a robots.txt parser.
- You will need to make it asynchronous, and deal with all the complexities that come with that.
- You will need to throttle your request rate to match how responsive each website is.
- How do you handle images? Documents? Large, unruly files? How do you ignore small images like icons?
- How do you build in rules that allow/disallow downloads of certain file types?
- How do you build in rules that limit external link following, or following links that you don't want?
- How do we handle user-agent strings, other headers, cookies, and even sessions if needed?
- How do we handle IP blocks? Do we use a Proxy Rotation API like Proxies API?
- How do we handle Ajax content that we need to scrape?
I can go on. These are some of the wheels you will end up reinventing.
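To give a sense of scale, here is a minimal sketch of just the first item on that list, using Python's standard-library `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, and the helper names are all made up for illustration. A real crawler would still have to download and cache this file per host, refresh it, and wire the crawl delay into its scheduler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch this from each host.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# The public page is allowed, the /private/ one is not.
allowed_public = is_allowed(parser, "my-crawler", "https://example.com/public/page")
allowed_private = is_allowed(parser, "my-crawler", "https://example.com/private/data")

# Seconds to wait between requests, or None if the site does not specify one.
delay = parser.crawl_delay("my-crawler")
```

And that is with the parsing itself already done for you. Every other bullet needs its own version of this plumbing.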
A framework typically handles all this stuff out of the box. Most are open-source, with a community ready to help you through the difficult areas.
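As a rough illustration, assuming you pick Scrapy, several of the questions above shrink to a few lines in the project's `settings.py`. The setting names below are Scrapy's own, but every value is a placeholder and the user agent string is hypothetical, so treat this as a sketch rather than a recommended configuration.

```python
# settings.py of a Scrapy project. Illustrative values only.

# robots.txt: Scrapy downloads and honours it for each site.
ROBOTSTXT_OBEY = True

# Throttling: adjust the delay to how quickly each site responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0   # seconds
AUTOTHROTTLE_MAX_DELAY = 30.0    # seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Large, unruly files: give up on responses bigger than this many bytes.
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024

# User-agent string and cookies.
USER_AGENT = "my-crawler/0.1 (+https://example.com/bot)"  # hypothetical
COOKIES_ENABLED = True

# Link following: stop after this many hops from the start URLs.
DEPTH_LIMIT = 3

# Images: skip anything icon-sized.
# These two only take effect when Scrapy's ImagesPipeline is enabled.
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100
```

The remaining rules live on the spider rather than in settings: `allowed_domains` keeps the crawl from wandering onto external sites, and `LinkExtractor` takes a `deny_extensions` argument for skipping file types you don't want.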
Deciding to use one is among the wisest moves you can make, and it will spare you a world of trouble in the future. If you can't be bothered even with a framework, you can consider a cloud-based crawler like terascraper.io, which does all this and more out of the box, with no overhead to build on your part.