We at TeraCrawler.io do a lot of web crawling and web scraping, and we write a lot about it too, here as well as on our blog. Every day is a new challenge. When a new developer joins our team, we throw a few challenges at them to test their ability to think their way through the quagmire that is web scraping.
We thought we would share one such challenge here so you can have fun with it. It will make you rack your brains, force you to research and think outside the box, and in the end you will find you know five more things about scraping than you did before. Please post your answers as comments so everyone can see the various ways to approach this problem. Don't worry, we will post our answer in a week's time as a follow-up post.
Suppose you are a phone retailer who wants to keep tabs on all the phones listed on Flipkart, the eCommerce website in India (there is a reason we picked Flipkart and not a global website like Amazon, as you will see): their prices, their ratings and, especially, the percentage discounts on offer.
What you have to do:
You will have to crawl product data from Flipkart.com for the product category phones and extract the following details:
Title of the product, URL, Rating, Price, and discount percentage.
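Before writing any scraping logic, it can help to pin down what one extracted record should look like. The following is only a sketch of one possible shape: the `PhoneListing` class, the helper names, the sample strings such as "₹12,999" and "18% off", and the placeholder URL are all our own assumptions for illustration, not formats confirmed from the site.

```python
import re
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PhoneListing:
    """One scraped product row (hypothetical shape, named by us)."""
    title: str
    url: str
    rating: Optional[float]
    price: Optional[int]
    discount_percent: Optional[int]


def parse_price(text: str) -> Optional[int]:
    # Assumes whole-rupee prices such as '₹12,999'; keeps only the digits.
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def parse_discount(text: str) -> Optional[int]:
    # Assumes the discount is written like '18% off'.
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None


def parse_rating(text: str) -> Optional[float]:
    # Assumes the rating is a plain number such as '4.3'.
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# Made-up sample values, only to show the record being filled in.
listing = PhoneListing(
    title="Example Phone (Black, 64 GB)",
    url="/example-phone",
    rating=parse_rating("4.3"),
    price=parse_price("₹12,999"),
    discount_percent=parse_discount("18% off"),
)
print(asdict(listing))
```

Every field except the title and URL is optional here, on the assumption that some listings may lack a rating or a discount; the helpers return `None` rather than raising when the text does not match.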
What's challenging about this:
As you will discover, Flipkart offers no obvious way to point at the data we want. Websites like Amazon are generally easy to scrape because their HTML and CSS classes are laid out meaningfully, so you can use CSS selectors to point at the data and simply scrape it. None of that will fly with Flipkart, because its CSS classes are all auto-generated gibberish. Also, having scraped the website before, we know you will need to jump through different hoops for different pieces of data. So have at it.
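To make the contrast concrete, here is a small sketch of class-based scraping with Beautiful Soup. Both HTML snippets are invented for illustration: the readable class names (`product`, `product-title`, `product-price`) and the gibberish ones (`_x1Yz9` and friends) are our own placeholders, not markup copied from Amazon or Flipkart. It shows why the easy approach stops working, not how to solve the challenge.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with meaningful class names (not from any real site).
READABLE_HTML = """
<div class="product">
  <a class="product-title" href="/phone-a">Phone A</a>
  <span class="product-price">₹12,999</span>
</div>
"""

# Hypothetical markup with machine-generated class names (made up by us).
OBFUSCATED_HTML = """
<div class="_x1Yz9">
  <a class="_3kQp7" href="/phone-a">Phone A</a>
  <span class="_9aB2c">₹12,999</span>
</div>
"""


def scrape_by_class(html):
    """Extract title, URL and price using readable CSS class selectors."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.select("div.product"):
        title = card.select_one("a.product-title")
        price = card.select_one("span.product-price")
        results.append({
            "title": title.get_text(strip=True),
            "url": title["href"],
            "price": price.get_text(strip=True),
        })
    return results


readable = scrape_by_class(READABLE_HTML)
obfuscated = scrape_by_class(OBFUSCATED_HTML)
print(readable)    # one record found
print(obfuscated)  # empty list: the selectors match nothing
```

The same function that works on the readable markup silently returns nothing on the obfuscated one. Hard-coding the gibberish class names would work only until they are regenerated, which is why the challenge asks for something more reliable.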
What you will learn:
You will learn how to find creative ways to reliably point at the data you want. A sort of web scraping parkour, if you will. You will fail, fail, fail, and then you will learn.
Tools you can use:
Ideally, you will use Python and Beautiful Soup to do the scraping. We are not strict about the language or the library as long as it gets the job done, but our answers a week from now will use Python and Beautiful Soup.
Enjoy the hunt. See you a week later :-)