In this article, we're going to set up a web scraper with Ruby!
I think it'll be fun to try to scrape all developer jobs from indeed.com.
We'll start by setting up a basic web scraper, and then improve it by automating real web browsers to do the scraping.
Let's dive in!
The very first step is to perform a GET request. This way we can retrieve the HTML code of the Indeed website.
To make our lives easier I'm going to use the HTTParty gem.
gem install httparty
And let's set up a project folder with our main ruby file.
mkdir ruby_scraper
cd ruby_scraper
touch scraper.rb
To perform the GET request I add the following code to the scraper.rb file.
require 'httparty'
response = HTTParty.get('https://indeed.com')
puts response.code
For now, we simply print the response code of the request. This should return a 200 OK
response. Let's try it out!
ruby scraper.rb
200
Nice, so far everything works!
It's great that the GET request works, but there is not a lot of useful information on the Indeed homepage.
Let's update the scraper to try and find all developer jobs in London.
The first step is understanding what URL we should get. So, I fire up a Chrome browser and go to the indeed homepage.
After filling out the form the browser redirects me to https://www.indeed.co.uk/jobs?q=developer&l=london
Let's update our scraper to use that URL!
require 'httparty'
# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')
# Print the response body this time
puts response.body
When running this program it will print out the entire HTML response, which kind of looks like gibberish.
The next step is to extract some useful data from this HTML!
We need a way to parse the HTML response. There is an amazing Ruby gem Nokogiri that does exactly this!
gem install nokogiri
Let's update the program to find the h1 tag.
require 'httparty'
require 'nokogiri'
# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')
# Create a Nokogiri doc from the html
doc = Nokogiri::HTML(response.body)
# Find the first h1 tag
h1_tag = doc.css("h1").first
# Get the text content, stripped of surrounding whitespace
h1_text = h1_tag.content.strip
print h1_text
When running this program it displays the correct title!
ruby scraper.rb
developer jobs in London
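A small Ruby detail worth knowing here: strip! mutates the string in place and returns nil when there was nothing to strip, while strip always returns a string. That makes strip the safer choice whenever you use the return value directly.

```ruby
# strip returns a (possibly unchanged) copy of the string,
# strip! returns nil when no change was made
p "  developer jobs in London  ".strip  # => "developer jobs in London"
p "developer jobs in London".strip!     # => nil
```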
We used a CSS selector to select the h1 element. The doc.css(...) method accepts any CSS selector, so we can use it to grab any element on the page.
Let's try to find the selector that gets all job titles from the page.
To do this I'm going to open up the Chrome dev-tools by right-clicking the page and clicking inspect.
If you look at the title elements you can see that they all look like this: <h2 class="title">
We can use this information to extract the titles!
require 'httparty'
require 'nokogiri'
# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')
doc = Nokogiri::HTML(response.body)
# Get all h2 tags with the title class
titles = doc.css("h2.title")
titles.each do |title|
  puts title.content.strip
end
When running this it prints out the titles.
ruby scraper.rb
Android Developer
newJunior Web Developer
Software Engineer
newCoding Academy Graduate Developer - London
Graduate Software Developer
Developer - All Levels
Senior Full Stack Developer
There is a small mistake though, the word "new" sometimes appears in front of the title.
If we look at the HTML in more detail we can see that the structure is actually as follows:
<h2 class="title">
<a>The Title</a>
<span>NEW</span>
</h2>
So, we can easily fix this by updating our CSS selector:
# Select the first a tag of all h2.title elements.
titles = doc.css("h2.title > a")
And running the program again now shows the correct titles! 🔥
Let's also get the URL of the job and the salary (if it's available).
To do this I'm going to update our initial selector a bit: I'll grab all the div.result elements and loop over them.
For the URL I simply grab the href attribute of the a tag.
For the salary, I grab the contents of the span.salaryText element if it exists.
Gluing it all together, the program now looks like this.
require 'httparty'
require 'nokogiri'
# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')
doc = Nokogiri::HTML(response.body)
# Get all the div.result elements
job_divs = doc.css("div.result")
job_divs.each do |job_div|
  # Find the a element
  a = job_div.css("h2 > a").first
  # Get the title and detail_url from the a element
  title = a.content.strip
  detail_url = "https://indeed.com" + a['href']
  # Find the salary element (not every job lists one)
  span = job_div.css(".salaryText").first
  salary = span ? span.content.strip : ""
  # Print the results
  print title + "\n"
  print salary + "\n"
  print detail_url + "\n\n"
end
When running this program it prints out the developer jobs correctly!
ruby scraper.rb
Software Developer
£20,000 - £25,000 a year
https://indeed.com/company/learn2earn/jobs/Software-Developer-a94218307336824d?fccid=daa38bc0b50695d4&vjs=3
...etc
In many real-world applications, you would want to store this data somewhere.
In this case, let's try and store these results in a CSV file.
The first step is to require the csv library that ships with Ruby.
require 'csv'
Then we can replace the print statements with code that writes the results to a CSV file.
# Write the results to the CSV file
CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
csv << [title, salary, detail_url]
end
When running the program you can see that the developer_jobs_in_london.csv file is created successfully!
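The "a+" mode opens the file for appending, so every run adds new rows instead of overwriting earlier results. Here is a minimal stdlib-only sketch of that behaviour, using a throwaway file in the temp directory:

```ruby
require 'csv'
require 'tmpdir'

# A throwaway file to demonstrate that "a+" appends instead of overwriting
path = File.join(Dir.tmpdir, "csv_append_demo.csv")
File.delete(path) if File.exist?(path)

# Simulate two separate runs of the scraper
2.times do |run|
  CSV.open(path, "a+") do |csv|
    csv << ["Developer job #{run}", "£30,000 a year", "https://example.com/job/#{run}"]
  end
end

p CSV.read(path).length  # => 2, both runs' rows are kept
```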
We've just created a basic scraper that works. However, if we were to run this every minute, the Indeed website would quickly figure out that we are using a scraper bot.
There are two ways Indeed can detect that we are using a bot to access their website:
1. Our plain GET requests don't behave like a real browser: they never execute JavaScript or load any assets.
2. Our requests come from the same IP address in a predictable pattern.
Let's look at how we can solve the first problem by using a real browser.
We can use the Kimurai framework to spin up real browsers using Ruby.
First, we need to install ChromeDriver. On macOS you can install it with Homebrew.
brew install --cask chromedriver
Or if you are on Ubuntu you can install it with this command.
sudo apt-get install chromium-chromedriver
Next, let's install the kimurai gem.
gem install kimurai
We must update our code to use the Kimurai framework.
require 'kimurai'
# Create our IndeedScraper class
class IndeedScraper < Kimurai::Base
  @name = "indeed_scraper"
  # This tells Kimurai to use the ChromeDriver we just installed
  @engine = :selenium_chrome
  # The URL that we should scrape
  @start_urls = ["https://www.indeed.co.uk/jobs?q=developer&l=london"]

  # This is called once the URL has been loaded
  def parse(response, url:, data: {})
    print "Browser retrieved the URL!\n"
  end
end

IndeedScraper.crawl!
This program is very basic: it loads the Indeed page and simply prints out a message when the page is loaded. But it's useful to check that everything works correctly.
When we add HEADLESS=false to the ruby command we can actually see what the browser is doing, so let's do that!
HEADLESS=false ruby scraper.rb
This spins up a Chrome browser and loads the Indeed page! ✨
The response parameter of the parse(...) method is a Nokogiri response object. We already used Nokogiri in our basic scraper, so we can simply copy and paste our extraction code. The whole parse method now looks like this.
def parse(response, url:, data: {})
  job_divs = response.css("div.result")
  job_divs.each do |job_div|
    # Find the a element
    a = job_div.css("h2 > a").first
    # Get the title and detail_url from the a element
    title = a.content.strip
    detail_url = "https://indeed.com" + a['href']
    # Find the salary element (not every job lists one)
    span = job_div.css(".salaryText").first
    salary = span ? span.content.strip : ""
    # Write the results to a CSV (requires "csv" at the top of the file)
    CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
      csv << [title, salary, detail_url]
    end
  end
end
And when we run this, the developer_jobs_in_london.csv file is filled with the jobs again!
Right now we just get the jobs on the first page. The cool thing about using a real browser is that we can interact with the page!
Let's set up a system that will keep clicking on the "next" button in order to get all job listings.
When inspecting the next button I see that it has an aria-label="Next" attribute. We can use this as a CSS selector to find and click the button.
# Find the next button using a css selector
next_button = browser.find(:css, "[aria-label='Next']")
next_button.click
Let's wrap the body of the parse(...) function in a loop so that we scrape 5 pages.
def parse(response, url:, data: {})
  5.times do
    # Use browser.current_response so that we always get the latest response object
    job_divs = browser.current_response.css("div.result")
    job_divs.each do |job_div|
      # Find the a element
      a = job_div.css("h2 > a").first
      # Get the title and detail_url from the a element
      title = a.content.strip
      detail_url = "https://indeed.com" + a['href']
      # Find the salary element (not every job lists one)
      span = job_div.css(".salaryText").first
      salary = span ? span.content.strip : ""
      # Write the results to a CSV (requires "csv" at the top of the file)
      CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
        csv << [title, salary, detail_url]
      end
    end
    # Find the next button using a CSS selector and click it
    next_button = browser.find(:css, "[aria-label='Next']")
    next_button.click
  end
end
When running this program it will click through the pages and add all the jobs to the CSV file!
With a real browser, it becomes harder to identify our program as a scraper bot.
However, if we were to run the program every 5 minutes (or even every day), after a while Indeed will probably figure out that this is a bot.
They do this by recognizing patterns in the IP addresses that access their website.
If an IP address loads the developer jobs in London every day at exactly 12:00 they will probably arrive at the conclusion that it is an automated program.
One way a lot of scrapers solve this is by using proxy servers. If you buy enough proxy servers you can have a different IP address each time the program runs.
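As a sketch of what that could look like in our basic HTTParty scraper: HTTParty accepts http_proxyaddr and http_proxyport options, so each run can pick a random proxy from a pool. The proxy addresses below are placeholders, and the actual request is guarded behind an environment variable so the sketch can run without a live proxy.

```ruby
# Hypothetical pool of rented proxy servers (placeholder addresses)
PROXIES = [
  { addr: "203.0.113.10", port: 8080 },
  { addr: "203.0.113.11", port: 8080 },
].freeze

# Pick a different proxy on every run
proxy = PROXIES.sample

# Guarded so the sketch runs even without a live proxy available
if ENV["RUN_SCRAPE"]
  require 'httparty'
  response = HTTParty.get(
    "https://www.indeed.co.uk/jobs?q=developer&l=london",
    http_proxyaddr: proxy[:addr],
    http_proxyport: proxy[:port]
  )
  puts response.code
end
```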
You can also choose to use a web scraping API to do the heavy lifting for you.
For example, an API such as Scraperbox handles proxy servers and browsers for you.
We can set up the API with our basic scraper like this.
require 'httparty'
require 'nokogiri'
require 'cgi'
# URL-encode the Indeed URL
url_to_get = CGI.escape("https://indeed.co.uk/Developer-jobs-in-London")
api_token = 'YOUR_API_TOKEN'
# Call the ScraperBox API with JavaScript enabled
response = HTTParty.get('https://scraperbox.com/api/scrape?token=' + api_token + '&url=' + url_to_get + '&javascript_enabled=true')
# The response is now the same Indeed HTML response
doc = Nokogiri::HTML(response.body)
# The rest of the code...
The API will spin up a real browser and connect to a random proxy server to get the Indeed website.
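The CGI.escape call percent-encodes the target URL so it can safely be embedded as a query parameter inside the API URL:

```ruby
require 'cgi'

url = "https://www.indeed.co.uk/jobs?q=developer&l=london"
p CGI.escape(url)
# => "https%3A%2F%2Fwww.indeed.co.uk%2Fjobs%3Fq%3Ddeveloper%26l%3Dlondon"
```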
That's it! I've shown you the basics of web scraping with Ruby.
We've set up a basic scraper that scrapes developer jobs in London from the Indeed website.
Next, we spun up a real Chrome browser to scrape the website, and I showed how you can avoid getting blocked entirely by using a web scraping API.
As a final note, keep in mind that web scraping should be done responsibly. Don't scrape data that isn't meant to be scraped.
Happy coding! 👨‍💻