Getting started with ScraperBox

By Erik on 06 Sep, 2020


In this example we're going to set up an Amazon product scraper in Python.

But before we dive into how to set up ScraperBox, let's talk about what problem it solves.

Getting data from websites can be hard. Most big websites, like amazon.com, have protections in place to detect automated web scrapers.

If you're not careful, your scraper will run into a lot of these "robot check" pages.

[Image: Amazon robot check page]

ScraperBox solves this problem by using headless Chrome browsers combined with proxy pools.

This makes it super hard for websites to detect that you are using a web scraper.

Scraping an Amazon page

For this example let's try to scrape the hand shovel product from Amazon.

We can do this by sending a GET request to https://api.scraperbox.com/scrape with your API token and the Amazon detail URL.

There is one catch though: we must URL-encode the Amazon URL. To do this I use an online URL-encoder tool.
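
If you'd rather not rely on an online tool, Python's standard library can do the same encoding. Here is a minimal sketch using urllib.parse.quote_plus; the variable names are just for illustration.

```python
import urllib.parse

product_url = "https://www.amazon.com/Edward-Tools-Bend-proof-Garden-Trowel/dp/B01N297HU0"

# quote_plus percent-encodes reserved characters such as ":" and "/"
# so the whole URL can be passed safely as a single query parameter.
encoded_url = urllib.parse.quote_plus(product_url)

print(encoded_url)
# https%3A%2F%2Fwww.amazon.com%2FEdward-Tools-Bend-proof-Garden-Trowel%2Fdp%2FB01N297HU0
```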

Let's also enable JavaScript rendering by setting javascript_enabled=true.

Putting it all together the final request looks like this:

https://api.scraperbox.com/scrape?token=YOUR_API_TOKEN&javascript_enabled=true&url=https%3A%2F%2Fwww.amazon.com%2FEdward-Tools-Bend-proof-Garden-Trowel%2Fdp%2FB01N297HU0

You can simply copy and paste this URL in your browser. This will call the ScraperBox API and return the Amazon product page! 🎉

Creating a Python program

We've just executed our first API query by hand. But we can also write a program to scrape the Amazon detail page!

For this example, let's create a Python program that calls the ScraperBox API.

To do this I create a new file scraper.py and add the following code.

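import urllib.parse
import urllib.request
import ssl

# Note: this turns off TLS certificate verification for every HTTPS request.
# Only keep it if you run into local certificate errors; otherwise remove it.
ssl._create_default_https_context = ssl._create_unverified_context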

# Urlencode the URL
url = urllib.parse.quote_plus("https://www.amazon.com/Edward-Tools-Bend-proof-Garden-Trowel/dp/B01N297HU0")

# Create the query URL.
query = "https://api.scraperbox.com/scrape"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&url=%s" % url
query += "&javascript_enabled=true"

# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
html = raw_response.decode("utf-8")

print(html)

Make sure to replace YOUR_API_TOKEN with your API token!

You can then execute the program by running:

python3 scraper.py

And this will print out the HTML of the Amazon web page!
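
As a variation, the query string can also be built with urllib.parse.urlencode, which URL-encodes every parameter for us. The sketch below also reads the token from an environment variable so it doesn't end up in the source file. The variable name SCRAPERBOX_TOKEN and the build_query helper are my own assumptions, not something ScraperBox requires.

```python
import os
import urllib.parse


def build_query(target_url, token):
    """Build the ScraperBox request URL for the page we want to scrape."""
    params = {
        "token": token,
        "url": target_url,  # urlencode takes care of the URL-encoding
        "javascript_enabled": "true",
    }
    return "https://api.scraperbox.com/scrape?" + urllib.parse.urlencode(params)


# SCRAPERBOX_TOKEN is a hypothetical variable name; fall back to the placeholder.
token = os.environ.get("SCRAPERBOX_TOKEN", "YOUR_API_TOKEN")

query = build_query(
    "https://www.amazon.com/Edward-Tools-Bend-proof-Garden-Trowel/dp/B01N297HU0",
    token,
)
```

The resulting query can be passed to urllib.request.Request in the same way as before.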

Extracting data from the HTML

To extract data from the HTML you can use a DOM parser.

Almost every programming language has a DOM parser package. In our case we can use the Python BeautifulSoup package.

pip3 install beautifulsoup4

Next, let's import the BeautifulSoup package.

from bs4 import BeautifulSoup

# Rest of the code below

Let's try and extract the product title and price from the page.

If I open the hand shovel page again I can right-click > Inspect to open the DevTools.

[Image: Amazon product page]

I see that the title has id=title. Similarly, the price has id=price_inside_buybox.

We can use this information to extract these values from the HTML.

# Rest of the code here

# Setup beautifulsoup
soup = BeautifulSoup(html, 'html.parser')

# Find the elements
title_element = soup.select_one("#title")
price_element = soup.select_one("#price_inside_buybox")

# Get the text contents
title = title_element.getText().strip()
price = price_element.getText().strip()

print("Title=" + title)
print("Price=" + price)

Let's run the program and see if it prints out the correct title and price.

python3 scraper.py

Title=Edward Tools Bend-Proof Garden Trowel - Heavy Duty Polished Stainless Steel - Rust Resistant Oversized Garden Hand Shovel for Quicker Work - Digs Through Rocky / Heavy soils - Comfort Grip (1)
Price=$9.95

Nice, it works! 🔥
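
One thing to watch out for: select_one returns None when nothing matches the selector (for example if the response turns out to be a robot check page), and calling getText() on None raises an AttributeError. A small helper can guard against that. The name element_text is hypothetical, and this is only a sketch of one way to handle it.

```python
def element_text(element, default=None):
    """Return the stripped text of a parsed element, or `default` if it is missing.

    `element` is whatever soup.select_one(...) returned, which may be None.
    """
    if element is None:
        return default
    return element.get_text().strip()
```

With this in place, element_text(soup.select_one("#title"), default="(not found)") gives back a placeholder instead of crashing the program.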

Scraping Multiple Pages

Scraping a search result page

Right now we can extract the title and price from a single product page. Let's extend this by getting the price and title of multiple products.

First, I change the URL to a search result page.

https://www.amazon.com/s?k=phone

Using the Chrome DevTools I figure out that I can use the following selector to select all search results: .s-result-list div.sg-col-inner h2 > a.a-link-normal.

Let's try it out.

# Setup beautifulsoup
soup = BeautifulSoup(html, 'html.parser')

links = soup.select(".s-result-list div.sg-col-inner h2 > a.a-link-normal")

for link in links:
    print(link['href'])

And when running this it shows all the product links!

python3 scraper.py

/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A00179512S0C4LGZIE8LT&url=%2FGoogle-Pixel-4a-Unlocked-Smartphone%2Fdp%2FB08CFSZLQ4%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26keywords%3Dphone%26qid%3D1610465114%26sr%3D8-1-spons%26psc%3D1&qualifier=1610465114&id=1892931732351271&widgetName=sp_atf
/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A06445422XZPWO7SQTXYP&url=%2FSamsung-Galaxy-G780F-Unlocked-Android%2Fdp%2FB08KYKS6M2%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26keywords%3Dphone%26qid%3D1610465114%26sr%3D8-2-spons%26psc%3D1&qualifier=1610465114&id=1892931732351271&widgetName=sp_atf

// etc.
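
Notice that these sponsored results point to a redirect page and carry the real product path in their url query parameter. If you want the direct product URL, you could unpack it with the standard library. This is a sketch that assumes sponsored links keep this shape; the product_url helper is hypothetical, and the href is a shortened version of the first link above.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.amazon.com"


def product_url(href, base=BASE_URL):
    """Turn a search-result href into an absolute product URL."""
    query = parse_qs(urlsplit(href).query)
    # Sponsored links store the product path, URL-encoded, in "url".
    # parse_qs already decodes it. Regular links are used as they are.
    target = query.get("url", [href])[0]
    return urljoin(base, target)


href = (
    "/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1"
    "?ie=UTF8&adId=A00179512S0C4LGZIE8LT"
    "&url=%2FGoogle-Pixel-4a-Unlocked-Smartphone%2Fdp%2FB08CFSZLQ4"
    "&widgetName=sp_atf"
)

print(product_url(href))
# https://www.amazon.com/Google-Pixel-4a-Unlocked-Smartphone/dp/B08CFSZLQ4
```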

Getting the product pages

Now we should get the product page for each search result.

But before we continue, let's clean up the code a bit. Let's create a get_html function that calls the ScraperBox API and returns the HTML.

def get_html(url):
    # Urlencode the URL
    url = urllib.parse.quote_plus(url)

    # Create the query URL.
    query = "https://api.scraperbox.com/scrape"
    query += "?token=%s" % "YOUR_API_TOKEN"
    query += "&url=%s" % url
    query += "&javascript_enabled=true"

    # Call the API.
    request = urllib.request.Request(query)
    raw_response = urllib.request.urlopen(request).read()
    html = raw_response.decode("utf-8")
    return html

Then, we can use this function to get the search result page.

search_page_html = get_html("https://www.amazon.com/s?k=phone")
soup = BeautifulSoup(search_page_html, 'html.parser')

links = soup.select(".s-result-list div.sg-col-inner h2 > a.a-link-normal")

for link in links:
    print(link['href'])

Alright, now we should get the product page for each link. To do this I add the following code.

for link in links:
    # Get the product page html.
    product_page_html = get_html("https://amazon.com" + link['href'])
    soup = BeautifulSoup(product_page_html, 'html.parser')
    
    # Find the elements
    title_element = soup.select_one("#title")
    price_element = soup.select_one("#price_inside_buybox")
    
    # Get the text contents
    title = title_element.getText().strip()
    price = price_element.getText().strip()
    
    print("Title=" + title)
    print("Price=" + price)
    print()

And when running this it displays the products with their price! 🎉

python3 scraper.py

Title=Google Pixel 4a - New Unlocked Android Smartphone - 128 GB of Storage - Up to 24 Hour Battery - Just Black
Price=$346.73

Title=Samsung Galaxy S20 FE G780F 128GB Dual Sim GSM Unlocked Android Smart Phone - International Version - Cloud White
Price=$559.99

Conclusion

We've created an Amazon scraper using ScraperBox and Python. You could extend this program to get more product data from the product page.

Keep in mind that in this example we don't check if the API requests are successful. In a real-world application we should do this, as sometimes the ScraperBox API will still get blocked by Amazon.
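
As a rough sketch of what such a check could look like, the function below retries a failed request a few times before giving up. It relies on urlopen raising an HTTPError for HTTP error status codes. Which status codes ScraperBox returns when it gets blocked is not covered here, so treat the retry policy, the name get_html_checked, and its parameters as assumptions. The opener parameter only exists so the function can be tried out without a network connection.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def get_html_checked(url, token, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch a page through the ScraperBox API, retrying when the request fails."""
    params = urllib.parse.urlencode(
        {"token": token, "url": url, "javascript_enabled": "true"}
    )
    query = "https://api.scraperbox.com/scrape?" + params

    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(query) as response:
                return response.read().decode("utf-8")
        except urllib.error.URLError as error:
            # HTTPError is a subclass of URLError, so HTTP error status
            # codes and connection problems both end up here.
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # wait a bit before the next attempt

    raise RuntimeError("Request failed after %d attempts" % retries) from last_error
```

For simplicity this retries every kind of failure. A stricter version might give up right away on errors that a retry cannot fix, such as an invalid token.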

You can find the complete code on GitHub.

See our documentation for all the advanced options our API offers!

Happy coding! 👨‍💻