Web scraping is one of the most effective ways to generate leads for your business.
Today we will look at a practical example of how to get leads from online business directories.
Here are 57 of them that you can scrape leads from:
List of Online Directories
That's a lot. So in this article, let's learn how to scrape one of them, Yellow Pages, so that you can apply the same techniques to the others. Imagine a scenario where we want to build a list of dentists that we can target.
BeautifulSoup will help us extract the crucial pieces of information from Yellow Pages.
To start with, this is the boilerplate code we need to fetch the Yellowpages.com search results page and set up BeautifulSoup so we can use CSS selectors to query the page for meaningful data.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.yellowpages.com/los-angeles-ca/dentists'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
We are also passing a User-Agent header to simulate a browser request, so we don't get blocked.
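Before parsing, it's also worth confirming the request actually went through. Here is a minimal sketch; the assumption (ours, not part of the original code) is that a blocked request comes back with a non-200 status code.

# A non-200 status usually means the request was blocked or redirected
if response.status_code != 200:
    raise RuntimeError('Request failed with status %s' % response.status_code)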
Now let's analyse the Yellow Pages search results. This is how it looks.
And when we inspect the page, we find that each item's HTML is encapsulated in a tag with the class v-card.
We can use this class to break the HTML document into individual cards, each containing one item's information, like this.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.yellowpages.com/los-angeles-ca/dentists'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

# Each search result is wrapped in an element with the class 'v-card'
for item in soup.select('.v-card'):
    print('----------------------------------------')
    print(item)
And when you run it:
python3 scrapeYellow.py
You can see that the code isolates each v-card's HTML.
On further inspection, you can see that the name of the place always has the class business-name. So let's try and retrieve that.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.yellowpages.com/los-angeles-ca/dentists'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.v-card'):
    try:
        print('----------------------------------------')
        # The business name always carries the class 'business-name'
        print(item.select('.business-name')[0].get_text())
    except Exception:
        # Skip cards that don't have a business name
        print('')
That will get us the names.
Bingo!
Now let's get the other data pieces.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.yellowpages.com/los-angeles-ca/dentists'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.v-card'):
    try:
        print('----------------------------------------')
        print(item.select('.business-name')[0].get_text())
        # The star rating is encoded in this div's class names
        print(item.select('.rating div')[0]['class'])
        # The review count sits in a span inside the rating block
        print(item.select('.rating div span')[0].get_text())
        print(item.select('.phone')[0].get_text())
        print(item.select('.adr')[0].get_text())
        print('----------------------------------------')
    except Exception:
        # Skip cards that are missing one of these fields
        print('')
And when run, it produces all the info we need: ratings, review counts, phone numbers, and addresses.
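To turn this into an actual lead list, you could write those fields to a CSV instead of printing them. Here is a minimal sketch using Python's csv module; the file name leads.csv and the column names are illustrative choices, not part of the original code.

import csv

# Collect the same fields into rows instead of printing them
rows = []
for item in soup.select('.v-card'):
    try:
        rows.append({
            'name': item.select('.business-name')[0].get_text(),
            'phone': item.select('.phone')[0].get_text(),
            'address': item.select('.adr')[0].get_text(),
        })
    except Exception:
        # Skip cards missing any of these fields
        continue

# 'leads.csv' is an illustrative file name
with open('leads.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'phone', 'address'])
    writer.writeheader()
    writer.writerows(rows)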
If you want to use this in production and scale to thousands of links, you will find that Yellow Pages blocks your IP quickly. In that scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
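With the requests library, routing a call through a proxy only takes the proxies argument. A minimal sketch; the proxy URL below is a placeholder, not a real Proxies API endpoint, so substitute whatever endpoint your provider gives you.

# Placeholder proxy URL; substitute your provider's endpoint
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}

response = requests.get(url, headers=headers, proxies=proxies)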
If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.