If you are looking for cloud-based web scrapers or crawlers, you might have heard of both Scrapy and TeraCrawler.io.
While many products in this domain look more or less the same, their ideal use cases can vary tremendously. In reality, it's a bit ridiculous to compare the two directly, since one is an open-source framework and the other is a hosted service.
Here are some of the differences between the two.
The main use case
TeraCrawler is ideal when you just need to download a large amount of data regularly and want to set rules such as the crawling depth, the URL patterns to download or block, and the types of files or documents to fetch. You want all this in the cloud, without investing in massive in-house infrastructure.
Generally, what you will get as an output is a big fat file that you can download and process locally at your convenience.
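To make the idea of crawl rules concrete, here is a minimal sketch in plain Python of how a depth limit, allow/block URL patterns, and a file-type filter can be combined into one decision. This is not TeraCrawler's API or configuration format; the names CrawlRules and should_download, the field names, and the example.com URLs are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlRules:
    """Hypothetical rule set; the field names are illustrative only."""
    max_depth: int = 2
    allow_patterns: tuple = ()   # regexes a URL must match (empty = allow all)
    block_patterns: tuple = ()   # regexes that always exclude a URL
    file_types: tuple = (".html", ".pdf")


def should_download(url: str, depth: int, rules: CrawlRules) -> bool:
    """Return True if the URL passes the depth, pattern, and file-type rules."""
    if depth > rules.max_depth:
        return False
    if any(re.search(p, url) for p in rules.block_patterns):
        return False
    if rules.allow_patterns and not any(
        re.search(p, url) for p in rules.allow_patterns
    ):
        return False
    # Treat extension-less paths such as "/blog/post" as HTML pages.
    last_segment = urlparse(url).path.lower().rsplit("/", 1)[-1]
    if "." in last_segment:
        extension = "." + last_segment.rsplit(".", 1)[-1]
    else:
        extension = ".html"
    return extension in rules.file_types


rules = CrawlRules(
    max_depth=2,
    allow_patterns=(r"/blog/",),
    block_patterns=(r"/tag/",),
)
print(should_download("https://example.com/blog/post", 1, rules))
print(should_download("https://example.com/blog/tag/news", 1, rules))
```

Block patterns are checked before allow patterns here, so a blocked URL is never downloaded even if it also matches an allow rule. That ordering is a design choice in this sketch, not a statement about how any particular product behaves.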
Scrapy is a Python-based web crawling/scraping framework. It fits best when you have the development resources to start from scratch and code your own crawler or scraper.
Who is it for?
TeraCrawler is ideal for developers who would like control over how they scrape the data but don't want to deal with setting up and monitoring resources, or with the pitfalls of crawling at scale.
Scrapy is ideal for developers and teams that want full control over everything. It's also useful for small projects that can be set up quickly on a PC to extract data.
Company focus
TeraCrawler is in the business of crawling data efficiently, quickly, predictably, and at scale for developers who deal with the problem of large-scale web crawling. It's a fully automated SaaS offering for both mid-market and enterprise-grade customers.
Scrapy's focus is on the developer community. The software is open source and is already in use in all sorts of projects worldwide.
Features comparison
Scrapy is like a powerful engine that you can build a car around. It can do almost anything if you know how to use it and have the time and developer resources to build something awesome.
TeraCrawler is a commercial application focused on crawling speed, and it can handle almost any kind and scale of crawling. Additionally, it uses proxy IP rotation behind the scenes, through the Proxies API infrastructure, to get data that is difficult to access. It can also render JavaScript-based content, so you can crawl AJAX-based websites, which Scrapy does not do out of the box. Its progress reporting is slightly better, and it allows for three download formats. For data extraction, TeraCrawler has built-in support for boilerplate removal, article-type text extraction, and even summary extraction.