How to scrape webpages using NodeJS

By Erik on 20 August, 2020


Web scraping is a technique for automatically extracting data from websites. It is especially important in data science and data mining projects. Using JavaScript and NodeJS, we can create a simple scraper in under 15 minutes.


We can extract almost anything we want from a webpage using web scraping. The first step is to fetch the HTML code of the webpage. From that HTML we can then extract text, URLs, image links, structure, style/CSS, and so on. If you set up a regularly running scraper, you can even detect changes to a webpage by comparing the HTML captured on different dates; a small sketch of that idea follows.
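Assuming a scheduled scraper has already saved the page's HTML to dated files (the file names below are hypothetical), comparing hashes of two captures is enough to detect a change:

const crypto = require('crypto')
const fs = require('fs')

// Hash a snapshot so two captures can be compared cheaply.
const hash = html => crypto.createHash('sha256').update(html).digest('hex')

// Hypothetical snapshot files saved by a scheduled scraper run.
const monday = fs.readFileSync('snapshot-2020-08-17.html', 'utf8')
const tuesday = fs.readFileSync('snapshot-2020-08-18.html', 'utf8')

if (hash(monday) !== hash(tuesday)) {
  console.log('The page changed between captures')
}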

Let's start by extracting some interesting data from Wikipedia using NodeJS.

Required knowledge

  • Basic understanding of JavaScript

Tools

  • NodeJS
  • A code editor (e.g. Visual Studio Code)

Scrape Wikipedia

Let's start by opening up wikipedia.org and finding a random article. For this example we are using the page on Kevin Bacon. Just leave this page open, we will come back to it later.

Set up NodeJS

In this section we will set up everything you need to follow along with the main article. This will be especially useful if you are not already familiar with NodeJS.

If you haven't done so already, install the LTS version of NodeJS. If you are on a Mac you can use Homebrew: brew install node.

We will create a simple hello world application. The application will just log a string of text to the terminal.

First, create a file called scraper.js and copy-paste the below code into it.

console.log('hello world')

Then, open your terminal (PowerShell on Windows), cd to the directory that contains your newly created scraper.js file, and run it with node.

cd path/to/directory
node scraper.js

If you see the text hello world: congratulations, you just wrote your first NodeJS application!

Fetching data

To fetch data we are going to use the axios npm package. Luckily, the package manager npm was installed along with NodeJS.

We can simply run the following command in our terminal.

npm i axios

In your scraper.js file, replace the console.log(...) line with this snippet.

const axios = require('axios')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    console.log(html)
  })
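If you prefer async/await over promise chains, an equivalent version of the same request looks like this:

const axios = require('axios')

const main = async () => {
  // Same request as the snippet above, written with async/await.
  const { data: html } = await axios.get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  console.log(html)
}

main()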

Run it using node:

node scraper.js

That is a lot of data. Continue to the next section to extract just the data that we need.

Parsing HTML and extracting useful data

In our hypothetical scenario, we would like to find out what Kevin Bacon's nickname is. That means we have to parse the HTML and look up a specific piece of content. For this we are going to use jsdom. jsdom parses the HTML for us and provides us with a DOM, similar to what you would use in frontend development.

First, install jsdom using npm:

npm i jsdom

Now edit scraper.js to add JSDom.

const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    const nickname = document.querySelector('.nickname')
    if (nickname) console.log(nickname.textContent)
  })
  .catch(e => {
    console.log(e)
  })

The magic all happens in document.querySelector('.nickname'). This function searches through the HTML and finds the first element with the class nickname. Using the Chrome DevTools, or similar tools in other browsers, we can see the code behind each piece of data on a webpage: simply right-click and select inspect. Look at the surrounding code and search specifically for id, class, and tag names. Now go back to that open Wikipedia page on Kevin Bacon. Right-click on his nickname (in the infobox on the right side of the page, in the row labeled Born) and click inspect. It should say: class="nickname".

Using jsdom we can create the DOM and search for any element on a website. If you are not familiar with the DOM, a good place to start is the MDN documentation for these two methods (a short sketch follows the list):

  • querySelector
  • querySelectorAll
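To get a feel for the difference between the two, here is a small self-contained sketch; the HTML is made up purely for illustration:

const { JSDOM } = require('jsdom')

// Illustrative HTML, not from a real page.
const html = `
  <h1 id="title">Example page</h1>
  <span class="nickname">First nickname</span>
  <span class="nickname">Second nickname</span>
`
const { document } = new JSDOM(html).window

// querySelector returns the first match, or null if nothing matches.
console.log(document.querySelector('#title').textContent)    // select by id
console.log(document.querySelector('.nickname').textContent) // select by class
console.log(document.querySelector('span').textContent)      // select by tag name

// querySelectorAll returns every match as a NodeList.
document.querySelectorAll('.nickname').forEach(n => console.log(n.textContent))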

Other examples

List all URLs on a page
const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    const links = document.querySelectorAll('a')
    links.forEach(l => console.log(l.href))
  })
  .catch(e => {
    console.log(e)
  })
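Note that many links on Wikipedia are relative (for example /wiki/Footloose). If you want absolute URLs, jsdom can resolve them when you pass the page's own URL via its url option; a minimal sketch:

const axios = require('axios')
const { JSDOM } = require('jsdom')

const pageUrl = 'https://en.wikipedia.org/wiki/Kevin_Bacon'

axios
  .get(pageUrl)
  .then(({ data: html }) => {
    // Passing the page's URL lets jsdom resolve relative hrefs to absolute URLs.
    const { document } = new JSDOM(html, { url: pageUrl }).window
    document.querySelectorAll('a').forEach(a => console.log(a.href))
  })
  .catch(e => {
    console.log(e)
  })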
List all images on a page
const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    const images = document.querySelectorAll('img')
    images.forEach(i => console.log(i.src))
  })
  .catch(e => {
    console.log(e)
  })
List all headers (titles) on a page
const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    const headings = document.querySelectorAll('h1, h2, h3, h4, h5, h6')
    headings.forEach(h => console.log(h.textContent.trim()))
  })
  .catch(e => {
    console.log(e)
  })
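If you also want each heading's level, the element's tagName (e.g. H2) carries it. A small variation that indents by level to sketch the page outline:

const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    document.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(h => {
      // tagName is e.g. 'H2'; indent by heading level to form an outline.
      const level = Number(h.tagName[1])
      console.log('  '.repeat(level - 1) + h.textContent.trim())
    })
  })
  .catch(e => {
    console.log(e)
  })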

Conclusion

We've built a simple web scraper in JavaScript that scrapes a Wikipedia page and extracts useful data. We have shown examples of how to extract a specific piece of data, all URLs, all image URLs, and all titles from a specific webpage.

This was a simple example, useful for lenient websites. We used Wikipedia because it is pretty permissive towards web scraping. However, not all websites make it this easy. Websites such as Amazon and Google build in countermeasures that make scraping much harder.

Just like everything in tech, however, harder does not mean impossible. If you still need to scrape these kinds of websites, you can investigate headless browsers like Puppeteer (a minimal sketch follows below) and use proxy servers to spread your requests. Of course, you can also use Scraperbox.com, which does all these things for you.
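For completeness, here is a minimal sketch of what fetching the same page through Puppeteer could look like (after npm i puppeteer); the parsing step stays exactly the same:

const puppeteer = require('puppeteer')
const { JSDOM } = require('jsdom')

const main = async () => {
  // A headless browser executes the page's JavaScript before we read the HTML.
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://en.wikipedia.org/wiki/Kevin_Bacon')
  const html = await page.content()
  await browser.close()

  const { document } = new JSDOM(html).window
  const nickname = document.querySelector('.nickname')
  if (nickname) console.log(nickname.textContent)
}

main()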

Web scraping is useful, but remember not to abuse websites, and only scrape data that you are allowed to scrape.

Happy coding!