How to Scrape Google Search Results with Python

By Dirk Hoekstra on 28 Dec, 2020 - updated on 06 Mar, 2024

In this article, we're going to build a Google search result scraper in Python!

We'll start with creating everything ourselves. And then, we'll make our lives easier by implementing a SERP API.

Let's dive in!

Setting up the project

Let's start by creating a folder that will hold our project.

mkdir google-scraper
cd google-scraper
touch scraper.py

Next, we should have a way to retrieve the HTML of Google.

As a first test, I add the following code to get the HTML of the Google home page.

# Use urllib to perform the request
import urllib.request

url = 'https://google.com'

# Perform the request
request = urllib.request.Request(url)
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

print(html)

And when we run this, everything works as expected! 🎉

python3 scraper.py

// A lot of HTML gibberish here

Getting a Search Result page

We've got the HTML of the Google home page. But, there is not a lot of interesting information there.

Let's update our script to get a search result page.

The URL format of a Google search is https://google.com/search?q=Your+Search+Query

Note that spaces are replaced with + symbols.
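Instead of replacing the spaces by hand, the query string can also be built with the standard library. The snippet below is a minimal sketch; build_search_url is a hypothetical helper name that is not used elsewhere in this article.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL for the given query (hypothetical helper)."""
    # urlencode escapes special characters and turns spaces into '+'
    return 'https://google.com/search?' + urlencode({'q': query})


url = build_search_url('What is the answer to life the universe and everything')
print(url)
```

This also takes care of characters such as & or +, which would otherwise break the query.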

For the next step, I update the url variable to a question that has been burning on my mind: "What is the answer to life the universe and everything"

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

Let's run the program and see the result.

python3 scraper.py

urllib.error.HTTPError: HTTP Error 403: Forbidden

Hmm, something is going wrong.

It turns out that Google is not too keen on automated programs getting the search result page.

A solution is to mask the fact that we are an automated program by setting a normal User-Agent header.

# Use urllib to perform the request
import urllib.request

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

print(html)

And when we run the program again, it prints the HTML gibberish of the search result page! 🙌
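Since Google can still reject a request at any time, it may help to wrap the request in a small function that handles the error instead of crashing. This is only a sketch: build_request and fetch_html are hypothetical helper names, and the 10 second timeout is an assumption, not something Google requires.

```python
import urllib.error
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/87.0.4280.88 Safari/537.36')


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', USER_AGENT)
    return request


def fetch_html(url):
    """Return the page HTML as a string, or None if the server rejects us."""
    try:
        # The timeout value is an assumption, pick what suits your use case
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as error:
        # For example the 403 Forbidden we ran into earlier
        print(f'Request failed with HTTP status {error.code}')
        return None
```

Calling fetch_html(url) then gives either the HTML string or None, so the rest of the script can decide what to do when a request is blocked.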

Setting up BeautifulSoup

To extract information from the raw HTML I'm going to use the BeautifulSoup package.

pip3 install beautifulsoup4

Next, we should import the package.

from bs4 import BeautifulSoup

Then, we can construct a soup object from the HTML.

# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Let's print the title to see if everything works
print(soup.title)

For now, we use the soup object to print out the page title, just to see if everything works correctly.

python3 scraper.py
<title>What is the answer to life the universe and everything - Google search</title>

Great, it extracts the title of our search page! 🔥

Extracting the Search Results

Let's take it a step further and extract the actual search results from the page.

To figure out how to access the search results I fire up Chrome and inspect a Google search result page.

There are 2 things I notice.

The search results are contained in a #search div.
Each result has the g class name.

We can use this information to extract the search results with BeautifulSoup.

# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # For now just print the text contents.
    print(div.get_text() + "\n\n")

Let's run the program and see if it works.

python3 scraper.py
Results 42 (number) - Wikipediaen.wikipedia.org › wiki › 42_(num...en.wikipedia.org › wiki › 
42_(num...In cacheVergelijkbaarVertaal deze paginaThe number 42 is, in The Hitchhiker's Guide 
to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and 
Everything", calculated by an enormous supercomputer named Deep Thought over a period of 7.5 
million years.‎Phrases from The Hitchhiker's ... · ‎43 (number) · ‎41 (number) · ‎Pronic number


//  And many more results

The good news: It kind of works.

The bad news: A lot of gibberish is still included.

Let's only extract the search titles. When I inspect the page I see that the search titles are contained in h3 tags.

We can use that information to extract the titles.

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if results:

        # Print the title
        h3 = results[0]
        print(h3.get_text())

And now the moment of truth. Let's run it and see if it works.

python3 scraper.py

42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
the answer to life, universe and everything - YouTube
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
42 | Dictionary.com
Five Reasons Why 42 May Actually Be the Answer to Life, the ...
Ultimate Question | Hitchhikers | Fandom
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
Why 42 is NOT the answer to life, the universe and everything ...

Amazing! We've just confirmed that the answer to everything is 42.
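Google changes its markup from time to time, so it can be handy to try the selectors on a small piece of HTML without hitting Google at all. The sketch below does that. SAMPLE_HTML is made-up markup that only mimics the #search and g structure described above, and extract_titles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not real Google output)
SAMPLE_HTML = """
<div id="search">
  <div class="g"><h3>First result</h3><span>Some snippet text</span></div>
  <div class="g"><span>A result without a title</span></div>
  <div class="g"><h3>Second result</h3></div>
</div>
"""


def extract_titles(html):
    """Return the h3 text of every search result div that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for div in soup.select('#search div.g'):
        # select_one returns None when the div contains no h3 tag
        h3 = div.select_one('h3')
        if h3 is not None:
            titles.append(h3.get_text())
    return titles


print(extract_titles(SAMPLE_HTML))
```

The same extract_titles function can then be fed the real HTML that the scraper downloads.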

Scraping a lot of pages

Nice, we've just constructed a basic Google search scraper!

There is a catch though.

Google will quickly figure out that this is a bot and block the IP address.

A possible solution would be to scrape very sparsely and wait 10 seconds between requests. However, this is not the best solution if you need to scrape a lot of search queries.
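Waiting between requests could look roughly like the sketch below. scrape_slowly is a hypothetical helper; the fetch argument stands for whatever function downloads a single page, and the sleep argument exists only so the delay can be swapped out in tests.

```python
import time


def scrape_slowly(urls, fetch, delay_seconds=10, sleep=time.sleep):
    """Fetch every URL in order, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        # No need to wait before the very first request
        if index > 0:
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

With three URLs and the default delay, this pauses twice, so the run takes at least 20 seconds on top of the request time. There is no guarantee that this delay is enough to stay under Google's radar.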

Another solution would be to buy proxy servers. This way you can scrape from different IP addresses.

But once again, there is a catch here. A lot of people want to scrape Google search results, so most proxies have already been blocked by Google.

One way to overcome this is to buy residential proxies. These are IP addresses that are indistinguishable from real users.
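With urllib, sending requests through a proxy can be sketched as follows. The proxy address is a placeholder and build_proxy_opener is a hypothetical helper name; a real proxy provider will give you its own host, port and usually credentials.

```python
import urllib.request

# Placeholder address (an assumption), replace it with your provider's proxy
PROXY_URL = 'http://proxy.example.com:8080'


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https traffic via the proxy."""
    handler = urllib.request.ProxyHandler({
        'http': proxy_url,
        'https': proxy_url,
    })
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
```

Calling opener.open(request) instead of urllib.request.urlopen(request) would then send the request through the proxy.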

Conclusion

Alright, that's it.

We've set up a Google scraper in Python using the BeautifulSoup package.

And, not to forget: we've figured out that the answer to the universe is 42.

Happy coding! 👨‍💻