Web Scraping with Ruby

By Dirk Hoekstra on 22 Dec, 2020


In this article, we're going to set up a web scraper with Ruby!

I think that it'll be fun to try and scrape all developer jobs from indeed.com.

We're going to start by setting up a basic web scraper, and then improve it by automating a real web browser to do the scraping.

Let's dive in!

Indeed Homepage

Creating a Basic Scraper

Performing a GET request

The very first step is to perform a GET request. This way we can retrieve the HTML code of the Indeed website.

To make our lives easier I'm going to use the HTTParty gem.

gem install httparty

And let's set up a project folder with our main Ruby file.

mkdir ruby_scraper
cd ruby_scraper
touch scraper.rb

To perform the GET request I add the following code to the scraper.rb file.

require 'httparty'

response = HTTParty.get('https://indeed.com')
puts response.code

For now, we simply print the response code of the request. This should return a 200 OK response. Let's try it out!

ruby scraper.rb
200

Nice, so far everything works! 🎉
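A request will not always come back with a 200, so it can help to check the code before using the body. The sketch below assumes only that the response object responds to `code` and `body`, as HTTParty responses do. The `body_or_raise` helper and the `FakeResponse` stand-in are names made up for this example.

```ruby
# Hypothetical helper: returns the body for a 200 response, raises otherwise.
# Works with any object that responds to `code` and `body`,
# such as the response returned by HTTParty.get.
def body_or_raise(response)
  unless response.code == 200
    raise "Unexpected response code: #{response.code}"
  end

  response.body
end

# Stand-in response so the example runs without a network call
FakeResponse = Struct.new(:code, :body)

puts body_or_raise(FakeResponse.new(200, "<html>ok</html>"))

begin
  body_or_raise(FakeResponse.new(503, ""))
rescue RuntimeError => e
  puts e.message
end
```

In the scraper itself this would be called as `body_or_raise(HTTParty.get('https://indeed.com'))`.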

Getting the right page

It's great that the GET request works, but there is not a lot of useful information on the Indeed homepage.

Let's update the scraper to try and find all developer jobs in London.

The first step is understanding what URL we should get. So, I fire up a Chrome browser and go to the Indeed homepage.

Indeed Homepage

After filling out the form the browser redirects me to https://www.indeed.co.uk/jobs?q=developer&l=london
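Judging by that URL, the search term appears to go in the `q` parameter and the location in `l`. As a minimal sketch under that assumption, the URL can be built from its parts with Ruby's standard `uri` library, which also takes care of encoding spaces and special characters. The `indeed_search_url` helper is a made-up name for this example.

```ruby
require 'uri'

# Hypothetical helper: builds the search URL from a query and a location.
# Assumes `q` is the search term and `l` is the location.
def indeed_search_url(query, location)
  params = URI.encode_www_form(q: query, l: location)
  "https://www.indeed.co.uk/jobs?#{params}"
end

puts indeed_search_url("developer", "london")
```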

Let's update our scraper to use that URL!

require 'httparty'

# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')

# Print the response body this time
puts response.body

When running this program it will print out the entire HTML response, which kind of looks like gibberish.

The next step is to extract some useful data from this HTML!

Extracting useful information

We need a way to parse the HTML response. There is an amazing Ruby gem called Nokogiri that does exactly this!

gem install nokogiri

Let's update the program to find the h1 tag.

require 'httparty'
require 'nokogiri'

# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')

# Create a Nokogiri doc from the html
doc = Nokogiri::HTML(response.body)

# Find the first h1 tag
h1_tag = doc.css("h1").first

# Get the text content, stripped of leading and trailing whitespace
h1_text = h1_tag.content.strip

puts h1_text

When running this program it displays the correct title! 🙌

ruby scraper.rb
developer jobs in London
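One Ruby detail is worth knowing when cleaning up scraped text: `strip` returns a new string, while `strip!` changes the string in place and returns nil when there was nothing to remove. Assigning the result of `strip!` to a variable can therefore leave you with nil. The sketch below uses made-up strings to show the difference.

```ruby
padded = "  developer jobs in London \n".dup
clean  = "developer jobs in London".dup

stripped = padded.strip   # new string without leading and trailing whitespace
result   = clean.strip!   # nil, because there was nothing to strip

puts stripped.inspect
puts result.inspect
```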

CSS Selectors

We used a CSS selector to select the h1 element.

We can use this doc.css(...) function to get any element on the page using CSS selectors.

Let's try to find the selector that gets all job titles from the page.

To do this I'm going to open up the Chrome dev-tools by right-clicking the page and clicking inspect.

If you look at the title elements you can see that they all look like this: <h2 class="title">

Indeed Developers in London page

We can use this information to extract the titles!

require 'httparty'
require 'nokogiri'

# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')

doc = Nokogiri::HTML(response.body)

# Get all h2 tags with the title class
titles = doc.css("h2.title")
titles.each do |title|
    # Use strip rather than strip!, which returns nil when nothing was stripped
    print title.content.strip
end

When running this it prints out the titles.

ruby scraper.rb

Android Developer
newJunior Web DeveloperSoftware Engineer
newCoding Academy Graduate Developer - LondonGraduate Software Developer
Developer - All LevelsSenior Full Stack Developer

There is a small mistake though: the word "new" sometimes appears in front of the title.

If we look at the HTML in more detail we can see that the structure is actually as follows:

<h2 class="title">
  <a>The Title</a>
  <span>NEW</span>
</h2>

So, we can easily fix this by updating our CSS selector:

# Select the a tags that are direct children of the h2.title elements.
titles = doc.css("h2.title > a")

And running the program again now shows the correct titles! 🔥

Scraping More Data

Let's also get the URL of the job and the salary (if it's available).

To do this I'm going to update our initial selector a bit. I'm going to grab all the div.result elements and loop over them.

For the URL I simply grab the href attribute of the a tag.

For the salary, I grab the contents from the span.salaryText if it exists.

Gluing it all together, the program now looks like this.

require 'httparty'
require 'nokogiri'

# Get the developers in London page
response = HTTParty.get('https://www.indeed.co.uk/jobs?q=developer&l=london')

doc = Nokogiri::HTML(response.body)

# Get all the div.result elements
job_divs = doc.css("div.result")
job_divs.each do |job_div|
  # Find the a element and skip results that don't have one
  a = job_div.css("h2 > a").first
  next unless a

  # Get the title and detail_url from the a element.
  # Use strip, not strip!, which returns nil when nothing was stripped.
  title = a.content.strip
  detail_url = "https://indeed.com" + a['href']

  # Find the salary element (not every job lists a salary)
  span = job_div.css(".salaryText").first
  salary = span ? span.content.strip : ""

  # Print the results
  puts title
  puts salary
  puts detail_url
  puts
end

When running this program it prints out the developer jobs correctly!

ruby scraper.rb

Software Developer
£20,000 - £25,000 a year
https://indeed.com/company/learn2earn/jobs/Software-Developer-a94218307336824d?fccid=daa38bc0b50695d4&vjs=3

...etc
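Gluing the host and the href together with `+` works as long as every href is a relative path. A slightly more forgiving sketch uses `URI.join` from the standard library, which also copes with hrefs that are already absolute. The second href below is made up for illustration.

```ruby
require 'uri'

base = "https://indeed.com"

relative = "/company/learn2earn/jobs/Software-Developer-a94218307336824d?fccid=daa38bc0b50695d4&vjs=3"
absolute = "https://indeed.com/viewjob?jk=123" # made-up href for illustration

# A relative href is resolved against the base URL
puts URI.join(base, relative)

# An absolute href is returned unchanged
puts URI.join(base, absolute)
```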

Writing the results to a CSV file

In many real-world applications, you would want to store this data somewhere.

In this case, let's try and store these results in a CSV file.

The first step is to require the csv library that ships with Ruby.

require 'csv'

Then we can replace the print statements with the code that writes the results to the CSV file.

# Write the results to the CSV file
CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
	csv << [title, salary, detail_url]
end

When running the program you can see that the developer_jobs_in_london.csv file is created successfully!
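Because the file is opened in append mode, running the scraper twice adds every job twice. One possible alternative, sketched here with made-up sample rows, is to collect the jobs first and write them in one go with a header row. The `write_jobs_csv` helper and the sample file name are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: writes all jobs in one go.
# Mode "w" overwrites the file, so re-running does not duplicate rows.
def write_jobs_csv(path, jobs)
  CSV.open(path, "w") do |csv|
    csv << ["title", "salary", "detail_url"]
    jobs.each do |job|
      csv << [job[:title], job[:salary], job[:detail_url]]
    end
  end
end

# Made-up sample rows
jobs = [
  { title: "Software Developer", salary: "£20,000 - £25,000 a year", detail_url: "https://indeed.com/job/1" },
  { title: "Android Developer", salary: "", detail_url: "https://indeed.com/job/2" }
]

write_jobs_csv("developer_jobs_sample.csv", jobs)
puts File.read("developer_jobs_sample.csv")
```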

Developer jobs in London spreadsheet

Avoid Getting Blocked

We've just created a basic scraper that works. However, if we were to run this every minute, the Indeed website would quickly figure out that we are using a scraper bot.

There are two ways Indeed can detect that we are using a bot to access their website:

  1. We are not using a real browser (we can't execute JavaScript).
  2. We keep running the requests through the same IP address.

Let's look at how we can solve this.

Using a real browser

We can use the Kimurai framework to spin up real browsers using Ruby.

First, we will need to install ChromeDriver. On macOS you can easily install it with Homebrew.

brew install --cask chromedriver

Or if you are on Ubuntu you can install with this command.

sudo apt-get install chromium-chromedriver

Next, let's install the kimurai gem.

gem install kimurai

We must update our code to use the Kimurai framework.

require 'kimurai'

# Create our IndeedScraper class
class IndeedScraper < Kimurai::Base
  @name = "indeed_scraper"

  # This tells Kimurai to use the ChromeDriver we just installed.
  @engine = :selenium_chrome
  
  # The URL that we should scrape.
  @start_urls = ["https://www.indeed.co.uk/jobs?q=developer&l=london"]

  # This is called once the url has been loaded
  def parse(response, url:, data: {})
    print "Browser retrieved the URL!\n"
  end
end

IndeedScraper.crawl!

This program is very basic: it loads the Indeed page and simply prints out a message when the page is loaded. But it is useful for checking that everything works correctly.

When adding HEADLESS=false to the ruby command we can actually see what the browser is doing, so let's do that!

HEADLESS=false ruby scraper.rb

This spins up a Chrome browser and loads the Indeed page! ✨

Extracting data from the browser

The response parameter of the parse(...) method is a Nokogiri object.

We've also used Nokogiri in our basic scraper, so we can simply copy and paste our extraction code (remember to keep require 'csv' at the top of the file). The whole parse method now looks like this.

def parse(response, url:, data: {})
  job_divs = response.css("div.result")
  job_divs.each do |job_div|
    # Find the a element and skip results that don't have one
    a = job_div.css("h2 > a").first
    next unless a

    # Get the title and detail_url from the a element.
    # Use strip, not strip!, which returns nil when nothing was stripped.
    title = a.content.strip
    detail_url = "https://indeed.com" + a['href']

    # Find the salary element (not every job lists a salary)
    span = job_div.css(".salaryText").first
    salary = span ? span.content.strip : ""

    # Write the results to a CSV
    CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
      csv << [title, salary, detail_url]
    end
  end
end

And when we run this, the developer_jobs_in_london.csv file is filled with the jobs again!

Getting all job listings

Right now we just get the jobs on the first page. The cool thing about using a real browser is that we can interact with the page!

Let's set up a system that will keep clicking on the "next" button in order to get all job listings.

When inspecting the next button I see that it has an aria-label="Next" attribute.

Indeed Job Board page

We can use this as a CSS selector to click the button.

# Find the next button using a css selector
next_button = browser.find(:css, "[aria-label='Next']")
next_button.click

Let's put the body of the parse(...) method in a loop so that we get 5 pages.

def parse(response, url:, data: {})
  5.times do
    # Use browser.current_response so that we always get the latest response object
    job_divs = browser.current_response.css("div.result")
    job_divs.each do |job_div|
      # Find the a element and skip results that don't have one
      a = job_div.css("h2 > a").first
      next unless a

      # Get the title and detail_url from the a element.
      # Use strip, not strip!, which returns nil when nothing was stripped.
      title = a.content.strip
      detail_url = "https://indeed.com" + a['href']

      # Find the salary element (not every job lists a salary)
      span = job_div.css(".salaryText").first
      salary = span ? span.content.strip : ""

      # Write the results to a CSV
      CSV.open("developer_jobs_in_london.csv", "a+") do |csv|
        csv << [title, salary, detail_url]
      end
    end

    # Find the next button using a css selector
    next_button = browser.find(:css, "[aria-label='Next']")
    next_button.click
  end
end

When running this program it will click through the pages and add all jobs to the CSV file! 🎉
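A fixed count of 5 assumes there are always at least 5 pages, and `browser.find` raises an error when the button is missing. As a rough sketch, the paging can be pulled into a helper that stops early when there is no "Next" button. It assumes `browser` behaves like the Capybara session that Kimurai exposes (`has_css?`, `find`, `current_response`). The `each_results_page` helper and the `FakeBrowser` stand-in are made up so the example runs without Chrome.

```ruby
# Hypothetical helper: yields each page's response, clicking "Next" in between.
# Stops early when the Next button is no longer on the page.
def each_results_page(browser, max_pages: 5, next_selector: "[aria-label='Next']")
  max_pages.times do |index|
    yield browser.current_response

    # No need to click after the last page we want
    break if index == max_pages - 1
    break unless browser.has_css?(next_selector)

    browser.find(:css, next_selector).click
  end
end

# Stand-in browser with a fixed number of pages, for demonstration only
class FakeBrowser
  def initialize(total_pages)
    @total_pages = total_pages
    @page = 1
  end

  def current_response
    "page #{@page}"
  end

  def has_css?(_selector)
    @page < @total_pages
  end

  def find(_kind, _selector)
    self
  end

  def click
    @page += 1
  end
end

each_results_page(FakeBrowser.new(3)) { |response| puts response }
```

Inside the Kimurai scraper, the block would hold the extraction code and the call would be `each_results_page(browser) { |response| ... }`.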

Rotating IP Addresses

With a real browser, it becomes harder to identify our program as a scraper bot.

However, if we were to run the program every 5 minutes (or even every day), after a while Indeed will probably figure out that this is a bot.

They do this by recognizing patterns in the IP addresses that access their website.

If an IP address loads the developer jobs in London every day at exactly 12:00 they will probably arrive at the conclusion that it is an automated program.

One way a lot of scrapers solve this is by using proxy servers. If you buy enough proxy servers you can have a different IP address each time the program runs.
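As a minimal sketch of that idea with the basic scraper, HTTParty accepts `http_proxyaddr` and `http_proxyport` options, so a random proxy can be picked for each request. The proxy hosts below are placeholders, and the `random_proxy_options` helper is a made-up name.

```ruby
# Hypothetical proxy list; replace with the proxies you actually have access to
PROXIES = [
  { host: "proxy1.example.com", port: 8080 },
  { host: "proxy2.example.com", port: 8080 },
  { host: "proxy3.example.com", port: 8080 }
].freeze

# Builds the proxy options for a single HTTParty request from a random proxy
def random_proxy_options(proxies = PROXIES)
  proxy = proxies.sample
  { http_proxyaddr: proxy[:host], http_proxyport: proxy[:port] }
end

options = random_proxy_options
puts options
```

The resulting hash can then be passed along with the request, for example `HTTParty.get(url, random_proxy_options)`.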

Using a Web Scraping API

You can also choose to use a web scraping API to do the heavy lifting for you.

For example, an API such as Scraperbox handles proxy servers and browsers for you.

We can set up the API with our basic scraper like this.

require 'cgi'
require 'httparty'
require 'nokogiri'

# URL-encode the Indeed URL
url_to_get = CGI.escape("https://indeed.co.uk/Developer-jobs-in-London")
api_token = 'YOUR_API_TOKEN'

# Call the Scraperbox API with JavaScript enabled
response = HTTParty.get('https://api.scraperbox.com/scrape?token=' + api_token + '&url=' + url_to_get + '&javascript_enabled=true')

# The response is now the same Indeed HTML response
doc = Nokogiri::HTML(response.body)

# The rest of the code...

The API will spin up a real browser and connect to a random proxy server to get the Indeed website.
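The request URL can also be assembled with `URI.encode_www_form`, which encodes every parameter in one step instead of concatenating strings by hand. This sketch only builds the URL and uses the same endpoint and parameter names as the code above. The `scraperbox_url` helper is a made-up name.

```ruby
require 'uri'

# Hypothetical helper: builds the API request URL from its parameters
def scraperbox_url(api_token, target_url)
  query = URI.encode_www_form(
    token: api_token,
    url: target_url,
    javascript_enabled: true
  )
  "https://api.scraperbox.com/scrape?#{query}"
end

puts scraperbox_url('YOUR_API_TOKEN', 'https://indeed.co.uk/Developer-jobs-in-London')
```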

Conclusion

That's it! I've shown you the basics of web scraping with Ruby.

We've set up a basic scraper that scrapes developer jobs in London from the Indeed website.

Next, we've spun up a real Chrome browser to scrape the website. And I've shown how you can lower the chance of getting blocked by using a web scraping API.

As a final note, keep in mind that web scraping should be done responsibly. Don't scrape data that isn't meant to be scraped.

Happy coding! 👨‍💻


Dirk Hoekstra has a Computer Science and Artificial Intelligence degree. He is a technical author on Medium where his articles have been read over 100,000 times. He is the founder of multiple tech companies, one of which was acquired in 2020.