We are currently writing a scraper for one of our internal projects at Proxies API. It needs to scrape a large number of websites quickly and in a distributed fashion.
We are using Scrapy as the main framework for that.
We use Beautiful Soup as part of that pipeline to clean up the HTML and extract keywords and meta tags, so that we can summarize each article using NLTK.
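To make the Beautiful Soup step concrete, here is a minimal sketch of that kind of cleanup. It is not our production code: the sample HTML, the extract_article helper, and the choice of which meta tags to keep are all assumptions made up for illustration, and the NLTK summarization step is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical article HTML, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Scraping at scale</title>
    <meta name="keywords" content="scrapy, proxies, crawling">
    <meta name="description" content="Notes on distributed scraping.">
    <script>var tracking = true;</script>
    <style>body { color: black; }</style>
  </head>
  <body>
    <h1>Scraping at scale</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_article(html):
    """Return keywords, description, and cleaned body text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect <meta name="..." content="..."> pairs into a dict.
    meta = {}
    for tag in soup.find_all("meta"):
        name = tag.get("name")
        content = tag.get("content")
        if name and content:
            meta[name.lower()] = content.strip()

    # The keywords meta tag is a comma-separated string; split it into a list.
    keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]

    # Remove script and style elements so their contents do not leak into the text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Take the visible body text and collapse runs of whitespace.
    body = soup.body or soup
    text = " ".join(body.get_text(separator=" ").split())

    return {
        "keywords": keywords,
        "description": meta.get("description", ""),
        "text": text,
    }


article = extract_article(SAMPLE_HTML)
print(article["keywords"])
print(article["text"])
```

The cleaned text is what you would then hand to a summarizer; the meta keywords are a cheap first guess at what the article is about.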
This should give enough perspective. Scrapy is a much larger system that helps you crawl, scrape, and manage data in various ways. Beautiful Soup cannot crawl. It takes HTML you have already fetched and lets you query it in various ways. For example, you can use CSS selectors to get at a particular piece of the HTML, such as the tag that holds an article headline. Scrapy, however, has built-in support for both CSS selectors and XPath.
Selenium is used when you need to get at dynamically generated data (for example, content loaded via AJAX).
Scrapy, by default, doesn't do that, but it can use various external tools like Selenium as middlewares to get dynamic data. Here is an example of how you can “paginate” on eBay using Scrapy and Selenium together.
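import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One Firefox instance, driven by Selenium, is reused for the whole crawl.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same URL in the real browser so JavaScript-driven content renders.
        self.driver.get(response.url)

        while True:
            # Locate the "next page" link in the pagination bar.
            next = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')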