May 26th, 2020
Scraping Wayfair Products with Python and Beautiful Soup

Today we are going to see how to scrape Wayfair products using Python and BeautifulSoup in a simple and elegant manner.

This article aims to get you started on solving a real-world problem while keeping things simple, so you get familiar with the tools and see practical results as fast as possible.

The first thing we need is Python 3. If you don't have it, install it before you proceed.

Then you can install Beautiful Soup with:

pip3 install beautifulsoup4

We will also need the libraries requests, lxml, and soupsieve to fetch the data, parse the HTML, and use CSS selectors. Install them using:

pip3 install requests soupsieve lxml

Once installed, open an editor and type in:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

Now let's go to the Wayfair products listing page and inspect the data we can get.

This is how it looks.

Back to our code now. Let's try to get this data by pretending we are a browser, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.wayfair.com/rugs/sb0/area-rugs-c215386.html'

response = requests.get(url, headers=headers)

# Parse the HTML and print it so we can check what came back
soup = BeautifulSoup(response.content, 'lxml')
print(soup.prettify())

Save this as scrapeWayfair.py.

Run it with:

python3 scrapeWayfair.py

You will see the whole HTML page.
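
Before parsing, it can help to confirm that the request actually succeeded, since a blocked or failed request would otherwise be parsed as if it were the product page. Here is a minimal sketch of that check. The function name fetch_html, the optional session parameter, and the 10-second timeout are our own assumptions for illustration, not part of the script above.

```python
import requests

# Same browser-like User-Agent used in the script above
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}


def fetch_html(url, session=None, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP error statuses."""
    # Hypothetical helper: a session can be passed in for reuse or testing
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

You could then build the soup with BeautifulSoup(fetch_html(url), 'lxml') and know that an error status stops the script instead of being parsed silently.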

Now let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool.

We notice that all the individual product data is contained in an element with the class 'ProductCard-container'. We can extract these elements easily with the CSS selector '.ProductCard-container'. Here is how the code looks:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.wayfair.com/rugs/sb0/area-rugs-c215386.html'

response=requests.get(url,headers=headers)


soup=BeautifulSoup(response.content,'lxml')


for item in soup.select('.ProductCard-container'):
	try:
		print('----------------------------------------')
		print(item)
	except Exception as e:
		#raise e
		print('')

This will print all the content in each of the elements that hold the product data.
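
To see what select returns without touching the network, here is a small sketch that runs the same class selector on a made-up HTML snippet. The sample markup and the use of the built-in 'html.parser' are assumptions for illustration only; the real Wayfair page is far larger and its markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above
SAMPLE_HTML = """
<div class="ProductCard-container"><p class="ProductCard-name">Rug A</p></div>
<div class="ProductCard-container"><p class="ProductCard-name">Rug B</p></div>
<div class="Banner">Sale</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# select() returns a list of every element matching the CSS selector
cards = soup.select('.ProductCard-container')
print(len(cards))

# A selector that matches nothing returns an empty list, not an error
missing = soup.select('.NoSuchClass')
print(len(missing))
```

An empty result on the live page is a useful hint that either the class names have changed or the response was not the product listing at all.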

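We can now pick out the classes inside these elements that contain the data we want. We notice that the title is inside an element with the class 'ProductCard-name', and the prices, review count, and image each have their own classes. Here is how the code looks with those selectors: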

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.wayfair.com/rugs/sb0/area-rugs-c215386.html'

response=requests.get(url,headers=headers)


soup=BeautifulSoup(response.content,'lxml')


for item in soup.select('.ProductCard-container'):
	try:
		print('----------------------------------------')
		#print(item)

		# Product name
		print(item.select('.ProductCard-name')[0].get_text().strip())
		# List price
		print(item.select('.ProductCard-price--listPrice')[0].get_text().strip())
		# Current price
		print(item.select('.ProductCard-price')[0].get_text().strip())
		# Review count
		print(item.select('.pl-ReviewStars-reviews')[0].get_text().strip())
		# Third visually hidden element in the card (index 2)
		print(item.select('.pl-VisuallyHidden')[2].get_text().strip())
		# Image URL
		print(item.select('.pl-FluidImage-image')[0]['src'])

	except Exception as e:
		# Any missing field raises here, so the rest of that card is skipped
		#raise e
		print('')

If you run it, it will print out all the details.

Bingo!! We got them all.
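
One thing to keep in mind is that the loop above wraps all the fields in a single try/except, so a card that lacks just one field (say, a list price) loses the rest of its output too. Here is a hedged sketch of one way around that, using select_one and returning None for missing fields. The helper names first_text and parse_card, the sample markup, and the 'html.parser' choice are our own assumptions for illustration.

```python
from bs4 import BeautifulSoup


def first_text(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match else None


def parse_card(card):
    """Collect the fields of one product card into a dict."""
    image = card.select_one('.pl-FluidImage-image')
    return {
        'name': first_text(card, '.ProductCard-name'),
        'list_price': first_text(card, '.ProductCard-price--listPrice'),
        'price': first_text(card, '.ProductCard-price'),
        'reviews': first_text(card, '.pl-ReviewStars-reviews'),
        'image': image.get('src') if image else None,
    }


# Made-up card with no list price and no reviews
SAMPLE_CARD = """
<div class="ProductCard-container">
  <p class="ProductCard-name"> Example Rug </p>
  <div class="ProductCard-price">$99.99</div>
  <img class="pl-FluidImage-image" src="https://example.com/rug.jpg">
</div>
"""

soup = BeautifulSoup(SAMPLE_CARD, 'html.parser')
products = [parse_card(card) for card in soup.select('.ProductCard-container')]
print(products)
```

Returning a dict per card also makes it easier to write the results to CSV or JSON later instead of only printing them.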

If you want to use this in production and scale to thousands of links, you will find that Wayfair blocks your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
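
To make the idea of rotating IPs concrete, here is a generic sketch that cycles through a list of proxies using the proxies argument of requests. It does not represent the API of any particular proxy service. The proxy URLs are placeholders, and the names PROXY_URLS, next_proxies, and fetch_via_proxy are hypothetical.

```python
import itertools

import requests

# Placeholder proxy endpoints; replace with the ones your provider gives you
PROXY_URLS = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
]

# cycle() loops over the list forever, giving simple round-robin rotation
_proxy_cycle = itertools.cycle(PROXY_URLS)


def next_proxies():
    """Return a proxies dict for requests that uses the next proxy in the list."""
    proxy_url = next(_proxy_cycle)
    return {'http': proxy_url, 'https': proxy_url}


def fetch_via_proxy(url, headers=None, timeout=10):
    """Fetch a URL through the next proxy, raising on HTTP error statuses."""
    response = requests.get(url, headers=headers, proxies=next_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Each call to fetch_via_proxy goes out through a different proxy in turn. A managed rotating proxy service typically handles this rotation for you behind a single endpoint.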

If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.
