Here are several aspects to consider when selecting web crawling software.
Ease of Setting Up a Crawler
The software should reduce the technical complexity of setting up a crawl by defaulting to sensible parameters and adjusting several of them automatically on the fly. For example, the crawler's speed should match the target web server's capacity to handle requests.
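One way such on-the-fly adjustment can work is an adaptive throttle that backs off when the server slows down. This is a minimal sketch of the idea, not any particular product's implementation; the class name and thresholds are illustrative:

```python
import time

class AdaptiveThrottle:
    """Adjusts the delay between requests based on how fast the
    target server responds (hypothetical sketch, not a real API)."""

    def __init__(self, base_delay=1.0, min_delay=0.25, max_delay=30.0):
        self.delay = base_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    def record(self, response_time):
        # Slow responses -> back off sharply; fast ones -> speed up gently.
        if response_time > 2.0:
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay * 0.9, self.min_delay)

    def wait(self):
        time.sleep(self.delay)

throttle = AdaptiveThrottle()
throttle.record(3.5)   # server struggled: delay doubles to 2.0s
throttle.record(0.2)   # server recovered: delay eases back to 1.8s
```

A real crawler would call `record()` with each response's latency and `wait()` before each request, so a struggling server is automatically given breathing room.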
Most people start web crawling with a Chrome extension or a website ripper. A serious crawler, though, should scale easily to millions of URLs using distributed systems.
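The core pattern behind that scaling is a shared URL frontier that many workers pull from. A single-process sketch, assuming a fake `fetch()` in place of real HTTP requests; distributed crawlers back the frontier with a shared queue (e.g. Redis or Kafka) so workers can run on many machines:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import threading

# Toy URL frontier; distributed crawlers replace this deque with a
# network-backed queue shared across machines.
frontier = deque("https://example.com/page/%d" % i for i in range(100))
lock = threading.Lock()
crawled = []

def fetch(url):
    # Placeholder for a real HTTP fetch.
    return "<html>...</html>"

def worker():
    while True:
        with lock:
            if not frontier:
                return
            url = frontier.popleft()
        page = fetch(url)
        with lock:
            crawled.append(url)

with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(8):
        pool.submit(worker)

print(len(crawled))  # all 100 URLs processed
```

The same loop works whether there are 100 URLs or millions; only the backing store for `frontier` changes.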
It's fine to use your local machine for small crawling jobs, and some desktop-based crawlers are built around exactly that. They give you a sense of control but have little power behind them: a desktop tool is limited by your machine's RAM, CPU, and network speed. You also have to keep the machine on; if it is shut down, your scheduled runs fail. And making the downloaded data available to teammates becomes yet another chore of archiving and uploading via Dropbox or similar. In these cases, it is best to use a cloud-based crawler that decouples the heavy crawling operations from your own machine.
While being intuitive, the crawler should also be highly customizable. You should be able to set the crawl depth, cap the total number of pages, decide which types of pages and content to crawl or skip, and selectively download images, PDFs, and other documents.
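Those knobs can be pictured as a small configuration object plus a filter the crawler consults before fetching each URL. Everything below (field names, defaults) is hypothetical, purely to illustrate the shape of such settings:

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class CrawlConfig:
    # Hypothetical settings, for illustration only.
    max_depth: int = 3
    max_pages: int = 10_000
    allowed_extensions: set = field(
        default_factory=lambda: {"", ".html", ".pdf", ".jpg", ".png"})

def should_crawl(config, url, depth, pages_seen):
    """Decide whether to fetch a URL given depth and page limits."""
    if depth > config.max_depth or pages_seen >= config.max_pages:
        return False
    path = urlparse(url).path
    ext = path[path.rfind("."):] if "." in path else ""
    return ext.lower() in config.allowed_extensions

cfg = CrawlConfig(max_depth=2)
print(should_crawl(cfg, "https://example.com/a.html", 1, 0))  # True
print(should_crawl(cfg, "https://example.com/a.exe", 1, 0))   # False
print(should_crawl(cfg, "https://example.com/b.html", 3, 0))  # False: too deep
```

The point is that depth, page count, and content-type rules should all be first-class settings rather than things you script around.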
It Should Play Nice
The crawler should have built-in mechanisms to be as polite to target servers as possible. It should respect robots.txt if you so choose, and rate-limit automatically based on the target server's ability to handle concurrent requests.
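Python's standard library already covers the robots.txt part. A sketch using `urllib.robotparser`, with the robots.txt content inlined here for illustration (a crawler would fetch it from the target site):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
print(rp.crawl_delay("MyCrawler"))  # 5
```

A polite crawler checks `can_fetch()` before every request and honors `crawl_delay()` when the site declares one.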
All this power is useless if the crawler cannot bypass the inevitable IP blocks. A built-in residential proxy rotation system is a must for reliably getting the data you want, on schedule, every time.
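At its simplest, rotation means cycling through a proxy pool and skipping endpoints that have been blocked. The addresses below are placeholders; a commercial rotation service hides this bookkeeping behind a single endpoint:

```python
import itertools

# Hypothetical pool of residential proxy endpoints.
proxies = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
rotation = itertools.cycle(proxies)

def next_proxy(blocked=frozenset()):
    # Skip proxies we have seen get blocked; try the next one instead.
    for _ in range(len(proxies)):
        proxy = next(rotation)
        if proxy not in blocked:
            return proxy
    raise RuntimeError("all proxies blocked")

print(next_proxy())                                   # http://10.0.0.1:8080
print(next_proxy(blocked={"http://10.0.0.2:8080"}))   # http://10.0.0.3:8080
```

Each outgoing request would be routed through `next_proxy()`, so a block on any single IP never stalls the whole crawl.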
Get Article Data
The crawler should be able to cut through the HTML and boilerplate and extract article-type data where possible, reliably getting to the heart of the article: the text, keywords, and event summaries.
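To make the idea concrete, here is a toy boilerplate stripper built on the standard-library `html.parser`. Real extractors use heuristics such as text density; this sketch just keeps text inside an `<article>` tag and drops navigation and footer chrome:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Minimal sketch: keep only text inside <article>, ignore the rest."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.chunks.append(data.strip())

html = """
<nav>Home | About</nav>
<article><h1>Title</h1><p>The heart of the story.</p></article>
<footer>Copyright</footer>
"""
parser = ArticleExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Title The heart of the story.
```

A production extractor has to handle pages without semantic tags at all, which is exactly why this belongs inside the crawler rather than in your post-processing scripts.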
All these features were kept in mind when we built our cloud-based web crawling software xxx. We are the same team that built the rotating-proxies service Proxies API; in fact, xx uses the Proxies API infrastructure to bypass IP blocks in the crawler. As with Proxies API, we offer 10,000 free URL crawls in our free trial. Sign up now to start your first crawl.