May 25th, 2020
Web Crawling an Entire Blog

One of the most common applications of web crawling, judging by the patterns we see with many of our customers at crawltohell, is scraping blog posts. Today, let's look at how we can build a simple scraper to pull out and save blog posts from a blog like CopyBlogger.

Here is how the CopyBlogger blog section looks.

You can see that there are about ten posts on this page. We will try and scrape them all.

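First, install Scrapy if you haven't already. The code below also uses BeautifulSoup to strip HTML tags from the titles, so install that too.

pip install scrapy beautifulsoup4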

Once installed, we will add a simple file with some barebones code like so:

# -*- coding: utf-8 -*-
import scrapy


# scrapy.Spider is enough here: we write parse() ourselves and use no crawl rules.
class SimpleNextPage(scrapy.Spider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    # Keep Scrapy's console output short.
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        # Extraction code goes here; we fill it in further down.
        pass

Let's examine this code before we proceed.

The allowed_domains list restricts all further crawling to the domains specified here.
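As a rough illustration of what that filtering amounts to, here is a stdlib-only sketch. It is not Scrapy's own code: the function name is_allowed is made up for this example, and it assumes the rule that a URL passes when its host is a listed domain or a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["copyblogger.com"]
print(is_allowed("https://copyblogger.com/blog/", allowed))      # same domain
print(is_allowed("https://www.copyblogger.com/blog/", allowed))  # subdomain
print(is_allowed("https://example.com/blog/", allowed))          # different site
```

Links that fail a check like this are simply not followed, which keeps the crawl from wandering off to other sites.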

start_urls is the list of URLs to crawl for us; in this example, we only need one URL.

The LOG_LEVEL setting makes the Scrapy output less verbose, so it is easier to read.

The parse(self, response) method is called by Scrapy after every successful URL crawl. This is where we write the code to extract the data we want.

Now let's see what we can write in the parse function.

For this, let's find the CSS patterns that we can use as selectors for finding the blog posts on this page.

When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function like this.

titles = response.css('.entry-title').extract()

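This returns the full H2 markup for each headline, tags included, so we will strip the tags later to get the plain title text. We also need the href of the 'a' tag inside it, which has the class entry-title-link, so we extract that as well.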

links = response.css('.entry-title  a::attr(href)').extract()
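To see what these two selectors are after without running a spider, here is a stdlib-only sketch. The sample markup is made up, modeled on the structure described above (an H2 with class entry-title wrapping an 'a' tag), and the EntryTitleParser class is a hypothetical stand-in; in the spider itself, response.css does this work.

```python
from html.parser import HTMLParser

# Made-up markup modeled on the structure described in the article.
SAMPLE_HTML = """
<h2 class="entry-title"><a class="entry-title-link" href="https://copyblogger.com/first-post/">First Post</a></h2>
<h2 class="entry-title"><a class="entry-title-link" href="https://copyblogger.com/second-post/">Second Post</a></h2>
"""


class EntryTitleParser(HTMLParser):
    """Collect [title text, href] pairs from <h2 class="entry-title"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.posts = []  # one [title, link] entry per headline

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "entry-title" in (attrs.get("class") or "").split():
            self.in_title = True
            self.posts.append(["", None])
        elif tag == "a" and self.in_title:
            self.posts[-1][1] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Text inside the headline is the title; text elsewhere is ignored.
        if self.in_title:
            self.posts[-1][0] += data


parser = EntryTitleParser()
parser.feed(SAMPLE_HTML)
for title, link in parser.posts:
    print(title, "->", link)
```

Each headline yields one title and one link, which is why the two lists from response.css can later be paired up item by item.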

So let's put this all together.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class blogCrawlerSimple(scrapy.Spider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        # Raw <h2> markup of every headline, and the matching post URLs.
        titles = response.css('.entry-title').extract()
        links = response.css('.entry-title a::attr(href)').extract()

        for title_html, link in zip(titles, links):
            all_items = {
                # Strip the HTML tags to get the plain headline text.
                'title': BeautifulSoup(title_html, 'html.parser').text,
                'link': link,
            }
            # Found the link, now let's scrape it...
            yield scrapy.Request(link, callback=self.parse_blog_post)

            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)

Let's save it as BlogCrawler.py and then run it with these parameters, which tell Scrapy to ignore robots.txt and to present itself as a web browser.

scrapy runspider BlogCrawler.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False

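When you run it, Scrapy should log the title and link of each post and print a 'Fetched blog post' line for every post it downloads.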

Those are all the blog posts. Let's save them into files.

# -*- coding: utf-8 -*-
import os

import scrapy
from bs4 import BeautifulSoup


class blogCrawlerSimple(scrapy.Spider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        # Raw <h2> markup of every headline, and the matching post URLs.
        titles = response.css('.entry-title').extract()
        links = response.css('.entry-title a::attr(href)').extract()

        for title_html, link in zip(titles, links):
            all_items = {
                # Strip the HTML tags to get the plain headline text.
                'title': BeautifulSoup(title_html, 'html.parser').text,
                'link': link,
            }
            # Found the link, now let's scrape it...
            yield scrapy.Request(link, callback=self.parse_blog_post)

            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)
        # Make sure the output folder exists before writing into it.
        os.makedirs('storage', exist_ok=True)
        # Name the file after the first path segment of the post URL.
        filename = 'storage/' + response.url.split("/")[3] + '.html'
        print('Saved post as: ' + filename)
        with open(filename, 'wb') as f:
            f.write(response.body)

When you run it now, it will save all the blog posts into the storage folder.
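The filename comes from splitting the post URL on "/" and taking the fourth piece, which is the post's slug. A minimal sketch of the same idea using urllib.parse is shown here; the post_filename helper and the 'index' fallback for URLs with no path are assumptions made for this example, not part of the spider.

```python
from urllib.parse import urlparse


def post_filename(url, folder="storage"):
    """Build a local file path from the first path segment of a post URL."""
    # "/some-post/" -> ["some-post"]; empty pieces from the slashes are dropped.
    segments = [part for part in urlparse(url).path.split("/") if part]
    # Assumed fallback: a URL with no path is saved as index.html.
    slug = segments[0] if segments else "index"
    return f"{folder}/{slug}.html"


print(post_filename("https://copyblogger.com/some-post/"))
```

For a typical post URL this produces the same name as response.url.split("/")[3], but it does not raise an IndexError when the URL is shorter than expected.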

But if you look at the blog, there are 320-odd pages like this on CopyBlogger. We need a way to paginate through them and fetch them all.


When we inspect the Next Page link in the Google Chrome inspect tool, we can see that it is inside an LI element with the CSS class pagination-next. This is good enough for us. We can select it using the CSS selector function like this.

nextpage = response.css('.pagination-next').extract()

This will give us the markup of the whole LI element, including the text 'Next Page', though. What we need is the href of the 'a' tag inside the LI tag. So we modify it to this.

nextpage = response.css('.pagination-next a::attr(href)').extract()

The moment we have the URL, we can ask Scrapy to fetch its contents like this. Passing parse as the callback means every following page is processed the same way as the first one.

yield scrapy.Request(nextpage[0], callback=self.parse)
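One detail worth guarding against: on the last page there is no Next Page link, so the list returned by extract() is empty and nextpage[0] would raise an IndexError. A small sketch of that check follows; the next_page_url helper and the /blog/page/2/ path are made up for illustration, and it also resolves a relative href against the current page in case the site uses one.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Return the absolute URL of the next page, or None on the last page."""
    if not hrefs:
        return None  # no .pagination-next link: nothing left to fetch
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(current_url, hrefs[0])


print(next_page_url("https://copyblogger.com/blog/", ["/blog/page/2/"]))
print(next_page_url("https://copyblogger.com/blog/", []))
```

In the spider, the equivalent is to wrap the yield in an "if nextpage:" check so the crawl ends cleanly when the pagination runs out.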

So the whole code looks like this.

# -*- coding: utf-8 -*-
import os

import scrapy
from bs4 import BeautifulSoup


class blogCrawler(scrapy.Spider):
    name = 'blogCrawler'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        # Raw <h2> markup of every headline, and the matching post URLs.
        titles = response.css('.entry-title').extract()
        links = response.css('.entry-title a::attr(href)').extract()

        # Follow the "Next Page" link; there is none on the last page.
        nextpage = response.css('.pagination-next a::attr(href)').extract()
        if nextpage:
            yield scrapy.Request(nextpage[0], callback=self.parse)

        for title_html, link in zip(titles, links):
            all_items = {
                # Strip the HTML tags to get the plain headline text.
                'title': BeautifulSoup(title_html, 'html.parser').text,
                'link': link,
            }
            # Found the link, now let's scrape it...
            yield scrapy.Request(link, callback=self.parse_blog_post)

            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)
        # Make sure the output folder exists before writing into it.
        os.makedirs('storage', exist_ok=True)
        # Name the file after the first path segment of the post URL.
        filename = 'storage/' + response.url.split("/")[3] + '.html'
        print('Saved post as: ' + filename)
        with open(filename, 'wb') as f:
            f.write(response.body)

    def parse_next_page(self, response):
        # Alternative callback that only logs the page; parse() is what runs above.
        print('Fetched next page ' + response.url)

And when you run it, it should download all the blog posts that were ever written on CopyBlogger onto your system.

Scaling Scrapy

The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speeds, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser's user agent string to the CopyBlogger web server, so it doesn't block you.

Like this.

-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" \
-s ROBOTSTXT_OBEY=False

In more advanced implementations, you will even need to rotate this string, so CopyBlogger can't tell it's the same browser! Welcome to web scraping.
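A minimal sketch of that rotation, assuming a small hand-picked list of user agent strings: the list contents and the random_headers helper are illustrative only, and a real pool would be larger and kept up to date.

```python
import random

# Example strings only; these are assumptions for the sketch.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0",
]


def random_headers():
    """Pick a User-Agent at random and wrap it in a headers dict."""
    return {"User-Agent": random.choice(USER_AGENTS)}


print(random_headers())
```

In the spider, the result could be passed per request, for example scrapy.Request(link, headers=random_headers(), callback=self.parse_blog_post), so each fetch may go out with a different string.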

If you want to use this in production and scale to thousands of links, you will find that you get IP blocked easily by CopyBlogger. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
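If you manage the proxy list yourself, the rotation can be sketched as a simple round-robin. The proxy addresses below are placeholders and make_proxy_rotator is a hypothetical helper; the one Scrapy-specific assumption is that the built-in HttpProxyMiddleware picks up the 'proxy' key from a request's meta.

```python
from itertools import cycle

# Placeholder endpoints; replace with proxies you actually have access to.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]


def make_proxy_rotator(proxies):
    """Return a function that hands out request meta for the next proxy in turn."""
    pool = cycle(proxies)

    def next_meta():
        return {"proxy": next(pool)}

    return next_meta


next_meta = make_proxy_rotator(PROXIES)
for _ in range(4):
    print(next_meta())  # wraps back to the first proxy on the fourth call
```

In the spider, each request would then be built as scrapy.Request(link, meta=next_meta(), callback=self.parse_blog_post). A hosted rotating proxy service does this rotation for you behind a single endpoint.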

If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.
