Today we are going to see how we can scrape MercadoLibre product information using Python and BeautifulSoup in a simple and elegant manner.
This article aims to get you started on a real-world problem while keeping it super simple, so you get familiar with the tools and get practical results as fast as possible.
So the first thing we need is to make sure we have Python 3 installed. If not, you can just get Python 3 and get it installed before you proceed.
Then you can install Beautiful Soup with:
pip3 install beautifulsoup4
We will also need the libraries requests, lxml, and soupsieve to fetch the page, parse the HTML, and use CSS selectors. Install them using:
pip3 install requests soupsieve lxml
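Before moving on, it can help to confirm that the packages are importable. The snippet below is only a convenience check and not part of the scraper; the helper name check_installed is made up for this example.

```python
import importlib.util

# Import names of the packages installed above (beautifulsoup4 is imported as bs4)
REQUIRED = ['bs4', 'requests', 'lxml', 'soupsieve']

def check_installed(modules):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

status = check_installed(REQUIRED)
for name, ok in status.items():
    print(name, 'OK' if ok else 'MISSING')
```

If any line says MISSING, rerun the matching pip3 command before continuing.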
Once installed, open an editor and type in:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
Now let's go to the MercadoLibre search page and inspect the data we can get.
This is how it looks.
Back to our code now. Let's try and get this data by pretending we are a browser like this.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url='https://listado.mercadolibre.com.mx/phone#D[A:phone]'
response=requests.get(url,headers=headers)
# response.text holds the HTML; print(response) alone would only show the status object
print(response.text)
Save this as scrapeMercado.py.
If you run it:
python3 scrapeMercado.py
You will see the whole HTML page.
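One detail worth knowing about the URL used above: everything after the # is a fragment, which stays on the client side and is not sent to the server in the HTTP request. A quick check with the standard library shows which part of the address the request actually targets.

```python
from urllib.parse import urldefrag

url = 'https://listado.mercadolibre.com.mx/phone#D[A:phone]'

# Split the address into the part sent over HTTP and the client-side fragment
request_url, fragment = urldefrag(url)

print(request_url)
print(fragment)
```

So the search term that matters to the server is the phone path segment; the fragment can be left in place or dropped without changing the response.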
Now, let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool. We now need to get to all the articles.
If you look closely, each article title is contained in an h2 element inside a block with the results-item class, so we can get to it like this.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.11 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
'Accept-Encoding': 'identity'
}
#'Accept-Encoding': 'identity'
url = 'https://listado.mercadolibre.com.mx/phone#D[A:phone]'
response=requests.get(url,headers=headers)
#print(response.content)
soup=BeautifulSoup(response.content,'lxml')
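for item in soup.select('.results-item'):
    try:
        print('---------------------------')
        # The product title is the first h2 inside each result block
        print(item.select('h2')[0].get_text())
    except Exception as e:
        # Skip blocks without an h2 instead of stopping the loop
        #raise e
        print('')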
This selects all the results-item article blocks and runs through them, looking for the h2 element and printing its text.
So when you run it, you get
Bingo!! We got the product titles.
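The try/except in the loop is there because some blocks may not contain an h2. As an alternative sketch, select_one returns None when nothing matches, so the missing case can be handled explicitly. The HTML below is a made-up sample that only borrows the class name from this article, and it uses Python's built-in html.parser so it runs without a network call; the function name extract_titles is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates two result blocks; the second has no title
SAMPLE_HTML = """
<ol>
  <li class="results-item">
    <h2><a href="https://example.com/phone-a">Phone A</a></h2>
  </li>
  <li class="results-item">
    <span>No heading here</span>
  </li>
</ol>
"""

def extract_titles(html):
    """Return one title per .results-item block, or None where no h2 exists."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for item in soup.select('.results-item'):
        heading = item.select_one('h2')  # None if the block has no h2
        titles.append(heading.get_text(strip=True) if heading else None)
    return titles

titles = extract_titles(SAMPLE_HTML)
print(titles)
```

This keeps real errors visible instead of hiding every exception behind a blank line.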
Now with the same process, we get the class names of all the other data like product image, the link, and price.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.11 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
'Accept-Encoding': 'identity'
}
#'Accept-Encoding': 'identity'
url = 'https://listado.mercadolibre.com.mx/phone#D[A:phone]'
response=requests.get(url,headers=headers)
#print(response.content)
soup=BeautifulSoup(response.content,'lxml')
for item in soup.select('.results-item'):
    try:
        print('---------------------------')
        # Title, product link, price, and lazy-loaded image URL for each result
        print(item.select('h2')[0].get_text())
        print(item.select('h2 a')[0]['href'])
        print(item.select('.price__container .item__price')[0].get_text())
        print(item.select('.image-content a img')[0]['data-src'])
    except Exception as e:
        # Skip blocks that are missing any of the fields above
        #raise e
        print('')
That, when run, should print everything we need from each product like this.
That was fun
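Printing is fine for a first look, but you will probably want to keep the results. Here is a minimal sketch that cleans the price text and writes rows out as CSV using only the standard library. The rows are made-up stand-ins for the scraped values, the helper names parse_price and rows_to_csv are hypothetical, and the price format (a comma as thousands separator) is an assumption about how the site displays prices.

```python
import csv
import io
import re

def parse_price(text):
    """Turn a price label such as '$ 12,999' into an integer.

    Assumes a comma is the thousands separator and ignores any decimals.
    Returns None when no digits are found.
    """
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['title', 'link', 'price', 'image'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows standing in for the values printed by the loop above
rows = [
    {'title': 'Phone A', 'link': 'https://example.com/a',
     'price': parse_price('$ 1,999'), 'image': 'https://example.com/a.jpg'},
    {'title': 'Phone B', 'link': 'https://example.com/b',
     'price': parse_price('$ 12,499'), 'image': 'https://example.com/b.jpg'},
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

To save to disk instead, the same DictWriter can be pointed at a file opened with open('products.csv', 'w', newline='', encoding='utf-8').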
If you want to use this in production and want to scale to thousands of links, then you will find that you will get IP blocked quickly by MercadoLibre. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
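To make the idea of rotating IPs concrete, here is a simple round-robin sketch over a list of proxies you supply yourself. It is not how any particular proxy service works internally; the addresses are placeholders from a documentation-only IP range, and the helper name proxy_rotation is made up. The dict it yields has the shape that the proxies argument of requests.get accepts.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation-only IP range); replace with real ones
PROXIES = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
]

def proxy_rotation(proxy_urls):
    """Yield proxy dicts forever, cycling through the given list in order."""
    for proxy in cycle(proxy_urls):
        yield {'http': proxy, 'https': proxy}

rotation = proxy_rotation(PROXIES)
first_four = [next(rotation) for _ in range(4)]
for proxies in first_four:
    print(proxies['https'])
    # A real call would look like:
    # requests.get(url, headers=headers, proxies=proxies, timeout=10)
```

With three proxies, the fourth request goes out through the first proxy again.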
If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.