In this article, we're going to build a Google search result scraper in Python!
We'll start with creating everything ourselves. And then, we'll make our lives easier by implementing a SERP API.
Let's dive in!
Let's start by creating a folder that will hold our project.
mkdir google-scraper
cd google-scraper
touch scraper.py
Next, we should have a way to retrieve the HTML of Google.
As a first test, I add the following code to get the HTML of the Google home page.
# Use urllib to perform the request
import urllib.request
url = 'https://google.com'
# Perform the request
request = urllib.request.Request(url)
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run this, everything works as expected! 🎉
python3 scraper.py
// A lot of HTML gibberish here
We've got the HTML of the Google home page. But, there is not a lot of interesting information there.
Let's update our script to get a search result page.
The URL format of Google is https://google.com/search?q=Your+Search+Query
Note that spaces are replaced with + symbols.
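Rather than replacing the spaces by hand, we can let the standard library build the query string. Here is a minimal sketch, assuming a helper called build_search_url (a name I made up for this example, it is not part of the script in this article):

```python
from urllib.parse import urlencode

def build_search_url(query):
    """Return a Google search URL for the given query (illustrative helper)."""
    # urlencode escapes special characters and turns spaces into '+'
    return 'https://google.com/search?' + urlencode({'q': query})

print(build_search_url('Your Search Query'))
# https://google.com/search?q=Your+Search+Query
```

This also takes care of characters such as & that would otherwise break the URL.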
For the next step, I update the url variable to a question that has been burning in my mind: "What is the answer to life the universe and everything".
url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'
Let's run the program and see the result.
python3 scraper.py
urllib.error.HTTPError: HTTP Error 403: Forbidden
Hmm, something is going wrong.
It turns out that Google is not too keen on automated programs getting the search result page.
A solution is to mask the fact that we are an automated program by setting a normal User-Agent header.
# Use urllib to perform the request
import urllib.request
url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run the program again, it prints the HTML gibberish of the search result page! 🙌
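If Google refuses the request anyway, urlopen raises urllib.error.HTTPError, just like the 403 we saw earlier. The sketch below wraps the same request in two small functions and reports the status code instead of crashing. The names build_request and fetch_html are my own, and returning None on failure is just one possible choice.

```python
import urllib.error
import urllib.request

# Same User-Agent string as in the script above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/87.0.4280.88 Safari/537.36')

def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', USER_AGENT)
    return request

def fetch_html(url):
    """Return the page HTML as a string, or None if the server refuses."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as error:
        # For example: HTTP Error 403: Forbidden
        print(f'Request failed with status {error.code}')
        return None
```

With this in place, the rest of the script can check for None before it tries to parse anything.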
To extract information from the raw HTML I'm going to use the BeautifulSoup package.
pip3 install beautifulsoup4
Next, we should import the package.
from bs4 import BeautifulSoup
Then, we can construct a soup object from the html.
# Other code here
# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')
# Let's print the title to see if everything works
print(soup.title)
For now, we use the soup object to print out the page title, just to see if everything works correctly.
python3 scraper.py
<title>What is the answer to life the universe and everything - Google search</title>
Great, it extracts the title of our search page! 🔥
Let's take it a step further and extract the actual search results from the page.
To figure out how to access the search results I fire up Chrome and inspect a Google search result page.
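There are 2 things I notice.
All search results are contained in an element with the #search id.
Each search result sits in a div with the g class name (div.g).
We can use this information to extract the search results with BeautifulSoup.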
# Other code here
# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # For now just print the text contents.
    print(div.get_text() + "\n\n")
Let's run the program and see if it works.
python3 scraper.py
Results 42 (number) - Wikipediaen.wikipedia.org › wiki › 42_(num...en.wikipedia.org › wiki ›
42_(num...In cacheVergelijkbaarVertaal deze paginaThe number 42 is, in The Hitchhiker's Guide
to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and
Everything", calculated by an enormous supercomputer named Deep Thought over a period of 7.5
million years.Phrases from The Hitchhiker's ... · 43 (number) · 41 (number) · Pronic number
// And many more results
The good news: It kind of works.
The bad news: A lot of gibberish is still included.
Let's only extract the search titles. When I inspect the page I see that the search titles are contained in h3 tags.
We can use that information to extract the titles.
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")
    # Check if we have found a result
    if len(results) >= 1:
        # Print the title
        h3 = results[0]
        print(h3.get_text())
And now the moment of truth. Let's run it and see if it works.
python3 scraper.py
42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
the answer to life, universe and everything - YouTube
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
42 | Dictionary.com
Five Reasons Why 42 May Actually Be the Answer to Life, the ...
Ultimate Question | Hitchhikers | Fandom
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
Why 42 is NOT the answer to life, the universe and everything ...
Amazing! We've just confirmed that the answer to everything is 42.
Nice, we've just constructed a basic Google search scraper!
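To try the title extraction without hitting Google at all, we can run it against a small piece of HTML. The sketch below puts the parsing logic in a function; the name extract_titles and the sample markup are made up for this example, and the markup only mimics the #search and div.g structure described above.

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    """Return the h3 text of every '#search div.g' result in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for div in soup.select('#search div.g'):
        # select_one returns the first match, or None if there is none
        h3 = div.select_one('h3')
        if h3 is not None:
            titles.append(h3.get_text())
    return titles

# Made-up markup that mimics the structure of a result page
sample_html = '''
<div id="search">
  <div class="g"><h3>42 (number) - Wikipedia</h3><span>snippet</span></div>
  <div class="g"><span>A result without a title</span></div>
  <div class="g"><h3>42 | Dictionary.com</h3></div>
</div>
'''
print(extract_titles(sample_html))
# ['42 (number) - Wikipedia', '42 | Dictionary.com']
```

Keeping the parsing separate from the request also makes it easier to adjust the selectors when Google changes its markup.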
There is a catch though.
Google will quickly figure out that this is a bot and block the IP address.
A possible solution would be to scrape very sparsely and wait 10 seconds between requests. However, this is not the best solution if you need to scrape a lot of search queries.
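Waiting between requests could look roughly like the sketch below. The function name scrape_slowly and its parameters are my own. The fetch argument stands for whatever function downloads one result page, and the sleep argument is only there so the delay can be swapped out while testing.

```python
import time

def scrape_slowly(queries, fetch, delay_seconds=10, sleep=time.sleep):
    """Call fetch(query) for each query, pausing between consecutive calls."""
    results = []
    for index, query in enumerate(queries):
        if index > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        results.append(fetch(query))
    return results
```

With 10 seconds between requests, 100 queries take about 16.5 minutes, which shows why this does not scale well.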
Another solution would be to buy proxy servers. This way you can scrape from different IP addresses.
But once again, there is a catch here. A lot of people want to scrape Google search results, so most proxies have already been blocked by Google.
One way to overcome this is to buy residential proxies. These are IP addresses that are indistinguishable from real users.
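Whatever kind of proxy you end up with, urllib can send requests through it with a ProxyHandler. Here is a minimal sketch: the function name build_proxy_opener is my own, and the proxy address is a placeholder, not a real server.

```python
import urllib.request

def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler({
        'http': proxy_url,
        'https': proxy_url,
    })
    return urllib.request.build_opener(proxy_handler)

# Placeholder address, replace it with a proxy you actually have access to
opener = build_proxy_opener('http://user:password@proxy.example.com:8080')
# opener.open(request) would then send the request via that proxy
```

To scrape from different IP addresses, you would build one opener per proxy and switch between them.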
Alright, that's it.
We've set up a Google scraper in Python using the BeautifulSoup package.
And, not to forget: we've figured out that the answer to the universe is 42.
Happy coding! 👨💻