How to scrape webpages using NodeJS

By Erik on 30 Nov, 2020


With web scraping, we can automatically extract data from websites!

It is used a lot in data science to acquire data from public websites that don't have an API.

In this article, I'm going to create a simple Wikipedia scraper in NodeJS. 

Let's dive in!

NodeJS Web scraping

The first step is to fetch the HTML code from the webpage. We can then extract the data we want from this HTML.

Let's start by extracting some interesting data from Wikipedia using NodeJS.

Setting up NodeJS

If you haven't done so already, install the LTS version of NodeJS.

If you are on a Mac, you can use Homebrew: brew install node. Note that this installs Homebrew's current node formula, which may be newer than the LTS release.

For starters, I create a file called scraper.js with the following code.

console.log('hello world')

Then I use the following command to run the program:

node scraper.js

Fetching data

It's a bit embarrassing, but I like Kevin Bacon.

So, because we need something to scrape, let's scrape some data from his Wikipedia article here.

To fetch the data we are going to use the axios npm package.

npm i axios

In your scraper.js file, replace the console.log(...) call with this snippet.

const axios = require('axios')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    console.log(html)
  })

And to run it using node:

node scraper.js

This will output a huge chunk of HTML. Let's extract some useful data from it!

Extracting useful data

To parse the HTML we are going to use the jsdom package.

First, install jsdom using npm:

npm i jsdom

Next, edit scraper.js to add jsdom.

const axios = require('axios')
const { JSDOM } = require('jsdom')

axios
  .get('https://en.wikipedia.org/wiki/Kevin_Bacon')
  .then(({ data: html }) => {
    const { document } = new JSDOM(html).window
    const nickname = document.querySelector('.nickname')
    if (nickname) console.log(nickname.textContent)
  })
  .catch(e => {
    console.log(e)
  })

The magic happens in document.querySelector('.nickname'). This function searches the parsed document and returns the first element that has the class nickname, or null if no element matches. That is why the snippet checks nickname before printing its text.

With the Chrome DevTools you can inspect webpages and find the class names of the data you want to scrape.

Conclusion

Alright, we've built a simple web scraper getting Kevin Bacon's nickname from Wikipedia.

And now we will know instantly when his nickname changes, because that is super important... right?

Happy coding!