How to Scrape Google Search Results with Python
By Dirk Hoekstra on 28 Dec, 2020
In this article, we're going to build a Google search result scraper in Python!
We'll start with creating everything ourselves. And then, we'll make our lives easier by implementing a SERP API.
Let's dive in!
- Setting up a project
- Getting a search result page
- Setting up beautifulsoup
- Extracting the search results
- Scraping a lot of pages
- Using a SERP API
Setting up the project
Let's start by creating a folder that will hold our project.
mkdir google-scraper
cd google-scraper
touch scraper.py
Next, we should have a way to retrieve the HTML of Google.
As a first test, I add the following code to get the HTML of the Google home page.
# Use urllib to perform the request
import urllib.request

url = 'https://google.com'

# Perform the request
request = urllib.request.Request(url)
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run this, everything works as expected! 🎉
python3 scraper.py
// A lot of HTML gibberish here
Getting a Search Result page
We've got the HTML of the Google home page. But, there is not a lot of interesting information there.
Let's update our script to get a search result page.
The URL format of a Google search is https://google.com/search?q=your+search+query
Note that spaces are replaced with a + character.
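The query string can also be built in code instead of by hand. Here is a minimal sketch using quote_plus from the standard library's urllib.parse module, which encodes spaces as + signs. The helper name build_search_url is hypothetical and not part of the original script.

```python
import urllib.parse


def build_search_url(query):
    """Return a Google search URL for the given query text."""
    # quote_plus replaces spaces with '+' and percent-encodes other unsafe characters
    return "https://google.com/search?q=" + urllib.parse.quote_plus(query)


url = build_search_url("What is the answer to life the universe and everything")
print(url)
```

Encoding the query this way also keeps characters such as & or ? inside a search phrase from breaking the URL.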
For the next step, I update the url variable to a question that has been burning on my mind: "What is the answer to life the universe and everything"
url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'
Let's run the program and see the result.
python3 scraper.py
urllib.error.HTTPError: HTTP Error 403: Forbidden
Hmm, something is going wrong.
It turns out that Google is not too keen on automated programs getting the search result page.
A solution is to mask the fact that we are an automated program by setting a normal User-Agent header.
# Use urllib to perform the request
import urllib.request

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run the program again, it prints the HTML gibberish of the search result page! 🙌
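The request setup can be wrapped in a small function that also catches the 403 error we ran into earlier. This is only a sketch: the names build_request and fetch_html are hypothetical, and the User-Agent string is simply the one from the script above.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the script above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/87.0.4280.88 Safari/537.36')


def build_request(url):
    """Create a request object that carries the User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', USER_AGENT)
    return request


def fetch_html(url):
    """Return the page HTML as a string, or None if the server refuses."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # For example: 403 Forbidden
        print("Request failed:", error.code, error.reason)
        return None
```

Calling fetch_html(url) then either returns the HTML or prints the error and returns None, so the rest of the script can decide what to do when a request is refused.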
Setting up BeautifulSoup
To extract information from the raw HTML I'm going to use the BeautifulSoup package.
pip3 install beautifulsoup4
Next, we should import the package.
from bs4 import BeautifulSoup
Then, we can construct a soup object from the html string.
# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Let's print the title to see if everything works
print(soup.title)
For now, we use the soup object to print out the page title, just to see if everything works correctly.
What is the answer to life the universe and everything - Google search
Great, it extracts the title of our search page! 🔥
Extracting the Search Results
Let's take it a step further and extract the actual search results from the page.
To figure out how to access the search results I fire up Chrome and inspect a Google search result page.
There are 2 things I notice.
- The search results are contained in a div with the id search.
- Each result has the g class.
We can use this information to extract the search results with BeautifulSoup.
# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # For now just print the text contents.
    print(div.get_text() + "\n\n")
Let's run the program and see if it works.
python3 scraper.py
Results 42 (number) - Wikipediaen.wikipedia.org › wiki › 42_(num...en.wikipedia.org › wiki › 42_(num...In cacheVergelijkbaarVertaal deze paginaThe number 42 is, in The Hitchhiker's Guide to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and Everything", calculated by an enormous supercomputer named Deep Thought over a period of 7.5 million years.Phrases from The Hitchhiker's ... · 43 (number) · 41 (number) · Pronic number
// And many more results
The good news: It kind of works.
The bad news: A lot of gibberish is still included.
Let's only extract the search titles. When I inspect the page I see that the search titles are contained in h3 tags.
We can use that information to extract the titles.
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for an h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if len(results) >= 1:
        # Print the title of the first h3 tag
        h3 = results[0]
        print(h3.get_text())
And now the moment of truth. Let's run it and see if it works.
python3 scraper.py
42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
the answer to life, universe and everything - YouTube
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
42 | Dictionary.com
Five Reasons Why 42 May Actually Be the Answer to Life, the ...
Ultimate Question | Hitchhikers | Fandom
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
Why 42 is NOT the answer to life, the universe and everything ...
Amazing! We've just confirmed that the answer to everything is 42.
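To try the selector logic without sending a request to Google, the same extraction can be run against a small hand-written HTML snippet. The markup below is a made-up stand-in that only mimics the #search, div.g and h3 structure used above. Google's real markup is more complex and can change at any time. The function name extract_titles is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not real Google output
SAMPLE_HTML = """
<html><body>
  <div id="search">
    <div class="g"><h3>42 (number) - Wikipedia</h3><p>Some snippet text</p></div>
    <div class="g"><p>A result block without a title</p></div>
    <div class="g"><h3>42 | Dictionary.com</h3></div>
  </div>
</body></html>
"""


def extract_titles(html):
    """Return the h3 text of every div.g inside #search."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for div in soup.select("#search div.g"):
        # select_one returns the first matching tag, or None if there is none
        h3 = div.select_one("h3")
        if h3 is not None:
            titles.append(h3.get_text())
    return titles


print(extract_titles(SAMPLE_HTML))
```

Returning a list instead of printing inside the loop makes the function easy to check against known input, which helps when the page structure changes and the selectors need updating.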
Scraping a lot of pages
Nice, we've just constructed a basic Google search scraper!
There is a catch though. Google will quickly figure out that this is a bot and block the IP address.
A possible solution would be to scrape very sparsely and wait 10 seconds between requests. However, this is not the best solution if you need to scrape a lot of search queries.
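The waiting approach could look roughly like the sketch below. The function scrape_with_delay and its parameters are hypothetical. The fetch argument stands for whatever function retrieves the HTML for one query, and the sleep argument exists only so the pause can be swapped out when trying the code.

```python
import time


def scrape_with_delay(queries, fetch, delay_seconds=10, sleep=time.sleep):
    """Call fetch(query) for each query, pausing between consecutive requests."""
    results = {}
    for index, query in enumerate(queries):
        if index > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        results[query] = fetch(query)
    return results


# Demonstration with a stand-in fetch function and no real waiting
demo = scrape_with_delay(
    ["first query", "second query"],
    fetch=lambda query: "html for " + query,
    sleep=lambda seconds: None,
)
print(demo)
```

With the default 10 second delay, 1,000 queries would take almost three hours, which shows why this approach does not scale.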
Another solution would be to buy proxy servers. This way you can scrape from different IP addresses.
But once again, there is a catch here. A lot of people want to scrape Google search results, so most proxies have already been blocked by Google.
You could buy dedicated or residential proxies, but that can quickly become very expensive.
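For reference, routing urllib requests through a proxy could be sketched as follows, using ProxyHandler and build_opener from the standard library. The proxy addresses are placeholders, and the helper names build_opener_for and next_opener are hypothetical.

```python
import itertools
import urllib.request

# Placeholder proxy addresses; replace them with proxies you have access to
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


def build_opener_for(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# cycle() yields the proxies in order and starts over after the last one
proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return an opener for the next proxy in the rotation."""
    return build_opener_for(next(proxy_cycle))
```

A request would then be sent with next_opener().open(request) instead of urllib.request.urlopen(request), so each call goes out through the next proxy in the list.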
In my opinion, the best and simplest solution is to use a SERP API!
Using a SERP API
SERP stands for Search Engine Results Page. In this example, I'm going to use the ScraperBox Google Search API.
The documentation shows an example, so let's use that as our base and tweak it a bit.
import urllib.parse
import urllib.request
import ssl
import json

# Note: this line turns off TLS certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

# Urlencode the query string
q = urllib.parse.quote_plus("What is the answer to life the universe and everything")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q

# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the result titles.
for result in response['organic_results']:
    print(result['title'])
Make sure to replace YOUR_API_TOKEN with your ScraperBox API token.
And when running this it displays the search result titles!
python3 scraper.py
42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
The answer to life, the universe, and everything | MIT News ...
42: The answer to life, the universe and everything | The ...
the answer to life, universe and everything - YouTube
Answer To The Ultimate Question - The ... - YouTube
42 | Dictionary.com
For Math Fans: A Hitchhiker's Guide to the Number 42 ...
The Answer to Life, the Universe, and Everything - MATLAB ...
And once again the search results are shown! 🎉
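If you plan to call the API for many queries, the URL building and the title extraction can be split into two small helpers. This is a sketch under a few assumptions: the names build_api_url and titles_from_response are hypothetical, the endpoint and the organic_results and title fields are taken from the example above, and the sample JSON is hand-written rather than real API output.

```python
import json
import urllib.parse


def build_api_url(query, token):
    """Build the API URL; urlencode escapes both parameters."""
    params = urllib.parse.urlencode({"token": token, "q": query})
    return "https://api.scraperbox.com/google?" + params


def titles_from_response(raw_json):
    """Return the result titles from a JSON response body."""
    response = json.loads(raw_json)
    # .get() with a default avoids a KeyError when a field is missing
    return [result.get("title", "") for result in response.get("organic_results", [])]


# A hand-written stand-in for a response body, not real API output
sample = '{"organic_results": [{"title": "42 (number) - Wikipedia"}, {"title": "42 | Dictionary.com"}]}'
print(build_api_url("answer to life", "YOUR_API_TOKEN"))
print(titles_from_response(sample))
```

Keeping the parsing separate from the network call means it can be checked against a saved response without spending API credits.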
Alright, that's it.
We've set up a Google scraper in Python using the BeautifulSoup package.
Then, we created a program that uses a SERP API.
And, not to forget: we've figured out that the answer to the universe is 42.
Happy coding! 👨‍💻
Dirk Hoekstra has a Computer Science and Artificial Intelligence degree.
He is a technical author on Medium where his articles have been read over 100,000 times.
Founder of multiple tech companies of which one was acquired in 2020.