In this example we're going to set up an Amazon product scraper in Python.
But before we dive into how to set up ScraperBox, let's talk about the problem it solves.
Getting data from websites can be hard. Most big websites like amazon.com have protections in place to detect automated web scrapers.
If you're not careful your scraper will run into a lot of these "robot check" pages.
ScraperBox solves this problem by combining headless Chrome browsers with proxy pools.
This makes it very hard for websites to detect that you are using a web scraper.
For this example let's try to scrape the hand shovel product from Amazon.
We can do this by sending a GET request to https://api.scraperbox.com/scrape with your API token and the Amazon product detail URL.
There is one catch though: we must URL-encode the Amazon URL. To do this I use this online url-encoder tool.
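Alternatively, you can do the URL-encoding in Python itself with the standard library; a minimal sketch:

```python
import urllib.parse

# URL-encode the Amazon product URL so it can safely be passed
# as a query parameter to the ScraperBox API.
amazon_url = "https://www.amazon.com/Edward-Tools-Bend-proof-Garden-Trowel/dp/B01N297HU0"
encoded = urllib.parse.quote_plus(amazon_url)
print(encoded)
# → https%3A%2F%2Fwww.amazon.com%2FEdward-Tools-Bend-proof-Garden-Trowel%2Fdp%2FB01N297HU0
```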
Putting it all together the final request looks like this:
https://api.scraperbox.com/scrape?api_key=YOUR_API_KEY&url=https%3A%2F%2Fwww.amazon.com%2FEdward-Tools-Bend-proof-Garden-Trowel%2Fdp%2FB01N297HU0
You can simply copy and paste this URL in your browser. This will call the ScraperBox API and return the Amazon product page! 🎉
We've just executed our first API query by hand. But, we can also write a program to scrape the Amazon detail page!
For this example, let's create a Python program that calls the ScraperBox API.
To do this I create a new file scraper.py and add the following code.
import urllib.parse
import urllib.request
import ssl
# Disable SSL certificate verification (convenient for a quick test,
# but not recommended for production code)
ssl._create_default_https_context = ssl._create_unverified_context
# Urlencode the URL
url = urllib.parse.quote_plus("https://www.amazon.com/Edward-Tools-Bend-proof-Garden-Trowel/dp/B01N297HU0")
# Create the query URL.
query = "https://api.scraperbox.com/scrape"
query += "?api_key=%s" % "YOUR_API_KEY"
query += "&url=%s" % url
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
html = raw_response.decode("utf-8")
print(html)
Make sure to replace YOUR_API_KEY with your API key!
You can then execute the program by running:
python3 scraper.py
And this will print out the HTML of the Amazon web page!
To extract data from the HTML you can use a DOM parser.
Almost every programming language has a DOM parser package. In our case we can use the Python BeautifulSoup package.
pip3 install beautifulsoup4
Next, let's import the BeautifulSoup package.
from bs4 import BeautifulSoup
# Rest of the code below
Let's try and extract the product title and price from the page.
If I open the hand shovel page again, I can right-click > Inspect to open the devtools.
I see that the title has id=title. Similarly, the price has id=price_inside_buybox.
We can use this information to extract these values from the HTML.
# Rest of the code here
# Setup beautifulsoup
soup = BeautifulSoup(html, 'html.parser')
# Find the element
title_element = soup.select_one("#title")
# Get the text content
title = title_element.getText().strip()
print("Title=" + title)
Let's run the program and see if it prints out the correct title.
python3 scraper.py
Title=Edward Tools Bend-Proof Garden Trowel - Heavy Duty Polished Stainless Steel - Rust Resistant Oversized Garden Hand Shovel for Quicker Work - Digs Through Rocky / Heavy soils - Comfort Grip (1)
Nice, it works! 🔥
Right now we can extract data from a single product page. Let's extend this by getting the title of multiple products.
First, I change the URL to a search result page.
https://www.amazon.com/s?k=phone
Using the chrome devtools I figure out that I can use the following selector to select all search results:
.s-result-list div.sg-col-inner h2 > a.a-link-normal
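Before running it against the live page, we can sanity-check what this selector does on a tiny HTML fixture. Note that the fixture below is my own simplified mock of Amazon's search-result markup, not the real page, which is far more complex:

```python
from bs4 import BeautifulSoup

# Minimal mock of Amazon's search-result structure,
# just enough for the selector to match.
html = """
<div class="s-result-list">
  <div class="sg-col-inner">
    <h2><a class="a-link-normal" href="/dp/B001">Phone A</a></h2>
  </div>
  <div class="sg-col-inner">
    <h2><a class="a-link-normal" href="/dp/B002">Phone B</a></h2>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.select(".s-result-list div.sg-col-inner h2 > a.a-link-normal")
print([a["href"] for a in links])
# → ['/dp/B001', '/dp/B002']
```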
Let's try it out.
# Setup beautifulsoup
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".s-result-list div.sg-col-inner h2 > a.a-link-normal")
for link in links:
    print(link['href'])
And when running this it shows all the product links!
python3 scraper.py
/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A00179512S0C4LGZIE8LT&url=%2FGoogle-Pixel-4a-Unlocked-Smartphone%2Fdp%2FB08CFSZLQ4%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26keywords%3Dphone%26qid%3D1610465114%26sr%3D8-1-spons%26psc%3D1&qualifier=1610465114&id=1892931732351271&widgetName=sp_atf
/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A06445422XZPWO7SQTXYP&url=%2FSamsung-Galaxy-G780F-Unlocked-Android%2Fdp%2FB08KYKS6M2%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26keywords%3Dphone%26qid%3D1610465114%26sr%3D8-2-spons%26psc%3D1&qualifier=1610465114&id=1892931732351271&widgetName=sp_atf
...
Now we should get the product page for each search result.
But before we continue, let's clean up the code a bit.
Let's create a get_html function that calls the ScraperBox API and returns the HTML.
def get_html(url):
    # Urlencode the URL
    url = urllib.parse.quote_plus(url)
    # Create the query URL.
    query = "https://api.scraperbox.com/scrape"
    query += "?api_key=%s" % "YOUR_API_KEY"
    query += "&url=%s" % url
    # Call the API.
    request = urllib.request.Request(query)
    raw_response = urllib.request.urlopen(request).read()
    html = raw_response.decode("utf-8")
    return html
Then, we can use this function to get the search result page.
search_page_html = get_html("https://amazon.com/s?k=phone")
soup = BeautifulSoup(search_page_html, 'html.parser')
links = soup.select(".s-result-list div.sg-col-inner h2 > a.a-link-normal")
for link in links:
    print(link['href'])
Alright, now we should get the product page for each link. To do this I add the following code.
for link in links:
    # Get the product page html.
    product_page_html = get_html("https://amazon.com" + link['href'])
    soup = BeautifulSoup(product_page_html, 'html.parser')
    # Find the title element
    title_element = soup.select_one("#title")
    # Get the text content
    title = title_element.getText().strip()
    print("Title=" + title)
    print()
And when running this it displays the title of each product! 🎉
python3 scraper.py
Title=Google Pixel 4a - New Unlocked Android Smartphone - 128 GB of Storage - Up to 24 Hour Battery - Just Black
Title=Samsung Galaxy S20 FE G780F 128GB Dual Sim GSM Unlocked Android Smart Phone - International Version - Cloud White
We've created an Amazon scraper using ScraperBox and Python. You could extend this program to get more product data from the product page.
Keep in mind that in this example we don't check whether the API requests succeed. In a real-world application we should, since the ScraperBox API can occasionally still get blocked by Amazon.
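One minimal way to add such a check is to retry on server errors. Here's a sketch; the fetch_with_retry helper and its retry policy are my own illustration, not part of the ScraperBox API:

```python
import urllib.error

def fetch_with_retry(fetch, retries=3):
    """Call fetch() up to `retries` times, retrying on HTTP 5xx errors.

    Returns the fetched result, or None if every attempt failed.
    """
    for _ in range(retries):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Client errors (4xx) won't fix themselves, so re-raise.
            if e.code < 500:
                raise
    return None

# Usage with the get_html function from above:
# html = fetch_with_retry(lambda: get_html("https://amazon.com/s?k=phone"))
```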
You can find the complete code on Github here.
See our documentation for all the advanced options our API offers!
Happy coding! 👨‍💻
Start now with 500 free API credits, no credit card required.
Try ScraperBox for free