How to Scrape Google Search Results with Python

By Dirk Hoekstra on 28 Dec, 2020


In this article, we're going to build a Google search result scraper in Python!

We'll start with creating everything ourselves. And then, we'll make our lives easier by implementing a SERP API.

Let's dive in!


Setting up the project

Let's start by creating a folder that will hold our project.

mkdir google-scraper
cd google-scraper
touch scraper.py

Next, we should have a way to retrieve the HTML of Google.

As a first test, I add the following code to get the HTML of the Google home page.

# Use urllib to perform the request
import urllib.request

url = 'https://google.com'

# Perform the request
request = urllib.request.Request(url)
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

print(html)

And when we run this, everything works as expected! 🎉

python3 scraper.py

# A lot of HTML gibberish here

Getting a Search Result page

We've got the HTML of the Google home page. But, there is not a lot of interesting information there.

Let's update our script to get a search result page.

The URL format for a Google search is https://google.com/search?q=Your+Search+Query

Note that spaces are replaced with + symbols.
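We don't have to do this replacement by hand: the standard library's urllib.parse.quote_plus does exactly this encoding (and also escapes other special characters). A minimal sketch:

```python
import urllib.parse

# Build a Google search URL from a free-text query.
# quote_plus replaces spaces with '+' and escapes other special characters.
query = "What is the answer to life the universe and everything"
url = "https://google.com/search?q=" + urllib.parse.quote_plus(query)

print(url)
```

This produces the same URL we'll type out manually below, and it's the safer option once queries contain characters like & or ?.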

For the next step, I update the url variable to a question that has been burning in my mind: "What is the answer to life the universe and everything"

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

Let's run the program and see the result.

python3 scraper.py

urllib.error.HTTPError: HTTP Error 403: Forbidden

Hmm, something is going wrong.

It turns out that Google is not too keen on automated programs fetching its search result pages.

A solution is to mask the fact that we are an automated program by setting a normal User-Agent header.

# Use urllib to perform the request
import urllib.request

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

print(html)

And when we run the program again, it prints the HTML gibberish of the search result page! 🙌

Setting up BeautifulSoup

To extract information from the raw HTML I'm going to use the BeautifulSoup package.

pip3 install beautifulsoup4

Next, we should import the package.

from bs4 import BeautifulSoup

Then, we can construct a soup object from the HTML.

# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Let's print the title text to see if everything works
print(soup.title.get_text())

For now, we use the soup object to print out the page title. Just to see if everything works correctly.

python3 scraper.py
What is the answer to life the universe and everything - Google search

Great, it extracts the title of our search page! 🔥

Extracting the Search Results

Let's take it a step further and extract the actual search results from the page.

To figure out how to access the search results I fire up Chrome and inspect a Google search result page.

Google search result page

There are 2 things I notice.

  1. The search results are contained in a #search div.
  2. Each result has the g class name.

We can use this information to extract the search results with BeautifulSoup.

# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # For now just print the text contents.
    print(div.get_text() + "\n\n")

Let's run the program and see if it works.

python3 scraper.py
Results 42 (number) - Wikipediaen.wikipedia.org › wiki › 42_(num...en.wikipedia.org › wiki › 
42_(num...In cacheVergelijkbaarVertaal deze paginaThe number 42 is, in The Hitchhiker's Guide 
to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and 
Everything", calculated by an enormous supercomputer named Deep Thought over a period of 7.5 
million years.‎Phrases from The Hitchhiker's ... · ‎43 (number) · ‎41 (number) · ‎Pronic number


# And many more results

The good news: It kind of works.

The bad news: A lot of gibberish is still included.

Let's only extract the search titles. When I inspect the page I see that the search titles are contained in h3 tags.

We can use that information to extract the titles.

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for an h3 tag inside the result
    h3 = div.select_one("h3")

    # Check if we found a title, and print it
    if h3:
        print(h3.get_text())

And now the moment of truth. Let's run it and see if it works.

python3 scraper.py

42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
the answer to life, universe and everything - YouTube
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
42 | Dictionary.com
Five Reasons Why 42 May Actually Be the Answer to Life, the ...
Ultimate Question | Hitchhikers | Fandom
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
Why 42 is NOT the answer to life, the universe and everything ...

Amazing! We've just confirmed that the answer to everything is 42.
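The titles are only half the story: usually you also want the link each result points to. On the pages I inspected, each h3 title sits inside an a tag, so we can read the href from the title's parent. Here is a sketch of that idea on a small stand-in snippet of HTML (the real page structure may differ and Google changes it regularly):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real search-result HTML: same #search container,
# div.g results, and an <a> wrapping each <h3>, as seen when inspecting the page.
html = """
<div id="search">
  <div class="g">
    <a href="https://en.wikipedia.org/wiki/42_(number)"><h3>42 (number) - Wikipedia</h3></a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for div in soup.select("#search div.g"):
    h3 = div.select_one("h3")
    # Only keep results where the title is wrapped in a link
    if h3 and h3.parent.name == "a":
        results.append((h3.get_text(), h3.parent["href"]))

print(results)
```

Swapping the stand-in html for the real page HTML from our scraper should give you (title, URL) pairs instead of bare titles.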

Scraping a lot of pages

Nice, we've just constructed a basic Google search scraper!

There is a catch though. Google will quickly figure out that this is a bot and block the IP address.

A possible solution would be to scrape very sparingly and wait 10 seconds between requests. However, this is not the best solution if you need to scrape a lot of search queries.
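If you do go the slow-and-steady route, the throttling is a one-liner with time.sleep. A minimal sketch (the search_url and scrape_all helpers are my own names, and search_url stands in for the full request logic above):

```python
import time
import urllib.parse

def search_url(query):
    # Stand-in for the full request logic above; just builds the URL here.
    return "https://google.com/search?q=" + urllib.parse.quote_plus(query)

def scrape_all(queries, delay=10):
    """Process each query, pausing `delay` seconds between requests."""
    urls = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay)  # be polite: space out the requests
        urls.append(search_url(query))
    return urls

# delay=0 here only so the demo finishes instantly; keep the default 10s for real scraping.
print(scrape_all(["life", "universe", "everything"], delay=0))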

Another solution would be to buy proxy servers. This way you can scrape from different IP addresses.
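With urllib, routing requests through a proxy is done with a ProxyHandler. Here is a sketch; the proxy address and credentials are placeholders, you'd substitute a proxy you actually control:

```python
import urllib.request

# Hypothetical proxy address: replace with a real proxy you control.
proxy = urllib.request.ProxyHandler({
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
})

# Requests made through this opener are routed via the proxy,
# so Google sees the proxy's IP address instead of yours.
opener = urllib.request.build_opener(proxy)

request = urllib.request.Request("https://google.com/search?q=test")
request.add_header("User-Agent", "Mozilla/5.0")
# opener.open(request) would perform the actual proxied request.
```

By rotating through a pool of such proxies you can spread requests over many IP addresses.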

But once again, there is a catch here. A lot of people want to scrape Google search results, so most proxies have already been blocked by Google.

You could buy dedicated or residential proxies, but that can quickly become very expensive.

In my opinion, the best and simplest solution is to use a SERP API!

Using a SERP API

SERP stands for Search Engine Results Page. In this example, I'm going to use the ScraperBox Google Search API.

The documentation shows an example, so let's use that as our base and tweak it a bit.

import urllib.parse
import urllib.request
import ssl
import json

# Disable certificate verification (taken from the documentation example)
ssl._create_default_https_context = ssl._create_unverified_context

# Urlencode the query string
q = urllib.parse.quote_plus("What is the answer to life the universe and everything")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q

# Call the API.
request = urllib.request.Request(query)

raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the result titles.
for result in response['organic_results']:
    print(result['title'])

Make sure to replace YOUR_API_TOKEN with your scraperbox API token.

And when running this it displays the search result titles!

python3 scraper.py

42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
the answer to life, universe and everything - YouTube
Answer To The Ultimate Question - The ... - YouTube
42 | Dictionary.com
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
The Answer to Life, the Universe, and Everything - MATLAB ...

And once again the search results are shown! 🎉
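In a real program you'll also want to handle the case where the API call fails, for example with a wrong token or an exhausted quota. The exact error responses depend on the API, but a defensive sketch might look like this (the helper name serp_search is my own):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def serp_search(query, token):
    """Call the SERP API and return the parsed JSON, or None on an HTTP error."""
    url = "https://api.scraperbox.com/google"
    url += "?token=%s" % token
    url += "&q=%s" % urllib.parse.quote_plus(query)
    try:
        raw_response = urllib.request.urlopen(url).read()
        return json.loads(raw_response.decode("utf-8"))
    except urllib.error.HTTPError as err:
        # A bad token or exhausted quota typically surfaces as an HTTP error.
        print("API request failed: %s" % err)
        return None
```

This way one failed query prints a message and returns None instead of crashing a long scraping run.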

Conclusion

Alright, that's it.

We've set up a Google scraper in Python using the BeautifulSoup package.

Then, we created a program that uses a SERP API.

And, not to forget: we've confirmed that the answer to life, the universe and everything is 42.

Happy coding! 👨‍💻


Dirk Hoekstra has a Computer Science and Artificial Intelligence degree. He is a technical author on Medium where his articles have been read over 100,000 times. Founder of multiple tech companies of which one was acquired in 2020.