This article explains how to scrape data from the web with Node.js efficiently. It is primarily aimed at programmers with some JavaScript experience, but even if you have a solid understanding of web scraping and are new to JavaScript, this article can still help you.
- Know JavaScript
- Know how to use DevTools to extract element selectors
- Know some ES6 (optional)
What will we learn?
- Use multiple HTTP clients to aid in the web scraping process
- Utilize several tried-and-tested libraries for scraping the web
Learn about Node.js
JavaScript is a simple, modern programming language that was originally developed to add dynamic effects to web pages in the browser. JavaScript code is run by the browser’s JavaScript engine when a website is loaded. For JavaScript to interact with the page, the browser provides a runtime environment (document, window, etc.). This also means that browser JavaScript cannot directly interact with or manipulate computer resources such as the file system, whereas a web server, for example, must be able to read and write files.
Node.js enables JavaScript to run not only on the client side but also on the server side. To make this possible, its creator, Ryan Dahl, took Google Chrome’s V8 JavaScript engine and embedded it in a Node program developed in C++. So Node.js is a runtime environment that allows JavaScript code to run on the server as well. Contrary to other languages, such as C or C++, that handle concurrency through multiple threads, Node.js utilizes a single main thread and executes tasks in a non-blocking manner with the help of an event loop.
Creating a simple web server, for example, is as easy as this:
const http = require('http');
const PORT = 3000;
const server = http.createServer((req, res) => {
res.statusCode = 200;
res.setHeader('Content-Type', 'text/plain');
res.end('Hello World');
});
server.listen(PORT, () => {
console.log(`Server running at PORT:${PORT}/`);
});
If you have Node.js installed, you can try running the above code. Node.js is great for I/O-intensive programs.
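To make the non-blocking model more concrete, here is a minimal sketch (the file path is just an example) showing that the event loop lets other work run while I/O is still in flight:
const fs = require('fs');
// Start an asynchronous read; the callback is queued and run later by the event loop.
fs.readFile(__filename, 'utf8', (err, contents) => {
  if (err) throw err;
  console.log('2. File read finished, length:', contents.length);
});
// This line runs immediately, without waiting for the read above to complete.
console.log('1. Reading the file without blocking the main thread...');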
HTTP Clients: Accessing the Web
An HTTP client is a tool that can send requests to a server and then receive responses from the server. All of the tools mentioned below use an HTTP client to access the website you want to scrape.
Request
Request is one of the most widely used HTTP clients in the JavaScript ecosystem, although its author has officially deprecated it. That doesn’t mean it’s unavailable, though; quite a few libraries still use it, and it’s still perfectly usable. Making HTTP requests with Request is very simple:
const request = require('request')
request('https://www.reddit.com/r/programming.json', function (
error,
response,
body
) {
console.error('error:', error)
console.log('body:', body)
})
You can find the Request library on GitHub, and installing it is very simple. You can also find the deprecation notice and what it means there.
Axios
Axios is a promise-based HTTP client that runs both in the browser and in Node.js. If you use TypeScript, Axios has you covered with built-in type definitions. Initiating HTTP requests with Axios is very simple. By default, it comes with Promise support instead of using callbacks as in Request:
const axios = require('axios')
axios
.get('https://www.reddit.com/r/programming.json')
.then((response) => {
console.log(response)
})
.catch((error) => {
console.error(error)
});
If you prefer the async/await syntactic sugar over the Promises API, you can use that too, but since top-level await is still at stage 3, we have to use an async function instead:
async function getForum() {
try {
const response = await axios.get(
'https://www.reddit.com/r/programming.json'
)
console.log(response)
} catch (error) {
console.error(error)
}
}
All you have to do is call getForum! The Axios library can be found at https://github.com/axios/axios.
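As a quick usage sketch (not part of the original example), note that Axios exposes the parsed JSON payload on response.data, so in practice you would usually log that instead of the whole response object:
// Call the function defined above.
getForum()

// Or, a variant that logs only the parsed JSON payload:
async function getForumJson() {
  try {
    const response = await axios.get('https://www.reddit.com/r/programming.json')
    console.log(response.data)
  } catch (error) {
    console.error(error)
  }
}

getForumJson()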
Superagent
Like Axios, Superagent is another powerful HTTP client that supports Promises and the async/await syntactic sugar. It has a fairly simple API like Axios, but Superagent is less popular because it has more dependencies.
Making HTTP requests with Superagent using promises, async/await, or callbacks looks like this:
const superagent = require("superagent")
const forumURL = "https://www.reddit.com/r/programming.json"
// callbacks
superagent
.get(forumURL)
.end((error, response) => {
console.log(response)
})
// promises
superagent
.get(forumURL)
.then((response) => {
console.log(response)
})
.catch((error) => {
console.error(error)
})
// promises with async/await
async function getForum() {
try {
const response = await superagent.get(forumURL)
console.log(response)
} catch (error) {
console.error(error)
}
}
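One detail worth knowing (a small sketch, not part of the original example): for JSON responses, Superagent parses the payload for you and exposes it on response.body:
superagent
  .get(forumURL)
  .then((response) => {
    // The parsed JSON payload is available on response.body.
    console.log(response.body)
  })
  .catch((error) => {
    console.error(error)
  })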
Regular Expressions: The Hard Way
The simplest way to do web scraping without any dependencies is to apply a bunch of regular expressions to the HTML string you receive when you query a web page with an HTTP client. However, regular expressions are not that flexible, and many professionals and amateurs alike have difficulty writing them correctly.
Let’s try that: assuming there is a label tag containing a username, we need to extract that username, which is similar to what you would have to do if you relied on regular expressions:
const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>(.+)<\/label>/)
console.log(result[1], result[1].split(": ")[1])
// Username: John Doe, John Doe
In JavaScript, match() usually returns an array containing the full match followed by any capture groups. The second element (at index 1) holds the textContent or innerHTML of the label tag, which is what we want. But this result still contains some unwanted text (“Username: “), which has to be removed.
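A slightly tighter pattern avoids the extra split() by keeping the static “Username: “ text outside the capture group (a minimal sketch reusing the same htmlString):
const tighter = htmlString.match(/<label>Username: (.+)<\/label>/)
// The capture group now contains only the name itself.
console.log(tighter[1])
// John Doe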
As you can see, there are a lot of steps and work to do for a very simple use case. That’s why you should rely on an HTML parser, which we’ll discuss later.
Cheerio: Core jQuery for traversing the DOM
Cheerio is an efficient and lightweight library that allows you to use jQuery’s rich and powerful API on the server side. If you have used jQuery before, Cheerio will feel familiar: it removes all the DOM inconsistencies and browser-specific features and exposes an efficient API for parsing and manipulating the DOM.
const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')
$('h2.title').text('Hello there!')
$('h2').addClass('welcome')
$.html()
// <h2 class="title welcome">Hello there!</h2>
As you can see, Cheerio works very similarly to jQuery.
However, Cheerio does not work the same way a web browser does, which means it cannot:
- render any parsed or manipulated DOM elements
- apply CSS or load external resources
- execute JavaScript
Therefore, if the website or web application you are trying to scrape relies heavily on JavaScript (such as a “single-page application”), Cheerio is not the best choice, and you may have to rely on the other options discussed later.
To demonstrate the power of Cheerio, we’ll try to scrape the r/programming forum on Reddit and get a list of post titles.
First, install Cheerio and Axios by running the following command: npm install cheerio axios.
Then create a new file crawler.js and copy and paste the following code:
const axios = require('axios');
const cheerio = require('cheerio');
const getPostTitles = async () => {
try {
const { data } = await axios.get(
'https://old.reddit.com/r/programming/'
);
const $ = cheerio.load(data);
const postTitles = [];
$('div > p.title > a').each((_idx, el) => {
const postTitle = $(el).text()
postTitles.push(postTitle)
});
return postTitles;
} catch (error) {
throw error;
}
};
getPostTitles()
.then((postTitles) => console.log(postTitles));
getPostTitles() is an asynchronous function that will crawl old Reddit’s r/programming forum. First, the HTML of the website is obtained with a simple HTTP GET request using the Axios HTTP client library; then the HTML data is fed into Cheerio using the cheerio.load() function.
Then, with the help of the browser’s DevTools, you can obtain a selector that targets all the list items. If you have used jQuery, the selector $('div > p.title > a') will look very familiar. It gets all the posts; since you only want the title of each post individually, you have to loop through each of them, which is done with the help of the each() function.
To extract the text from each title, you need to get the DOM element with the help of Cheerio (el refers to the current element); calling text() on each element then gives you its text.
Now, open a terminal and run node crawler.js, and you’ll see an array of titles; it will be quite long. Although this is a very simple use case, it demonstrates the simple nature of the API provided by Cheerio.
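As a small extension of the same idea (a sketch, not part of the original example), you could collect each post’s link as well by reading the href attribute with attr(); inside getPostTitles(), the each() loop could be replaced with:
const posts = [];
$('div > p.title > a').each((_idx, el) => {
  posts.push({
    title: $(el).text(),
    url: $(el).attr('href'),
  });
});
return posts;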
If your use case requires executing JavaScript and loading external resources, the following options will be helpful.
JSDOM: DOM for Node
JSDOM is a pure JavaScript implementation of the Document Object Model to be used in Node.js. As mentioned earlier, the DOM is not available to Node, but JSDOM is the closest thing: it more or less mimics a browser.
Since a DOM is created, it is possible to programmatically interact with the web application or website being crawled, for example to simulate clicking a button. If you are familiar with DOM manipulation, using JSDOM will be very simple.
const { JSDOM } = require('jsdom')
const { document } = new JSDOM(
'<h2 class="title">Hello world</h2>'
).window
const heading = document.querySelector('.title')
heading.textContent = 'Hello there!'
heading.classList.add('welcome')
heading.innerHTML
// <h2 class="title welcome">Hello there!</h2>
JSDOM is used to create a DOM in code, which you can then manipulate with the same methods and properties you would use on the browser DOM.
To demonstrate how to use JSDOM to interact with a website, we’ll get the first post of the Reddit r/programming forum, upvote it, and then verify that the post has been upvoted.
First, run the following command to install jsdom and Axios: npm install jsdom axios
Then create a file crawler.js and copy-paste the following code:
const { JSDOM } = require("jsdom")
const axios = require('axios')
const upvoteFirstPost = async () => {
try {
const { data } = await axios.get("https://old.reddit.com/r/programming/");
const dom = new JSDOM(data, {
runScripts: "dangerously",
resources: "usable"
});
const { document } = dom.window;
const firstPost = document.querySelector("div > div.midcol > div.arrow");
firstPost.click();
const isUpvoted = firstPost.classList.contains("upmod");
const msg = isUpvoted
? "Post has been upvoted successfully!"
: "The post has not been upvoted!";
return msg;
} catch (error) {
throw error;
}
};
upvoteFirstPost().then(msg => console.log(msg));
upvoteFirstPost() is an asynchronous function that fetches the first post in r/programming and then upvotes it. Axios sends an HTTP GET request to get the HTML of the specified URL. A new DOM is then created from the HTML that was fetched earlier. The JSDOM constructor takes the HTML as its first parameter and the options as its second parameter; the two options that have been added perform the following functions:
- runScripts: When set to "dangerously", it allows the execution of event handlers and any JavaScript code in the document. If you are not sure about the trustworthiness of the scripts your application will run, it is best to set runScripts to "outside-only", which attaches all of the JavaScript spec-provided globals to the window object while preventing any script inside the HTML from executing (a short sketch of this safer configuration follows this list).
- resources: When set to "usable", it allows the loading of any external script declared with a <script> tag (e.g. the jQuery library fetched from a CDN).
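If you don’t need the page’s own scripts to run at all, a safer configuration looks like this (a minimal sketch; safeDom is just an illustrative name):
const safeDom = new JSDOM(data, {
  // Scripts embedded in the fetched HTML are NOT executed,
  // but window still gets the usual JavaScript-spec globals attached.
  runScripts: "outside-only"
});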
After the DOM is created, the same DOM methods are used to get the upvote button of the first post and click it. To verify whether it was indeed clicked, you can check whether classList contains a class named upmod. If that class is present in classList, a message saying so is returned.
Open a terminal and run node crawler.js, and you’ll see a neat string indicating whether the post has been upvoted or not. Although this example is simple, you can build powerful things on top of it, for example, a bot that goes around upvoting a particular user’s posts.
If JSDOM’s level of abstraction isn’t expressive enough for you, if your scraping relies on many such manipulations, or if you need to recreate many different DOMs, the following options will be a better choice.
Puppeteer: headless browser
As the name suggests, Puppeteer allows you to manipulate the browser programmatically, like a puppet. It does this by providing developers with a high-level API that controls a headless version of Chrome by default.
Puppeteer is more useful than the above tools because it allows you to crawl the web as if a real human were interacting with a browser. This opens up some previously unavailable possibilities:
- Take screenshots or generate PDFs of pages.
- Crawl single-page applications and generate pre-rendered content.
- Automate many different user interactions, such as keyboard input, form submission, navigation, etc.
It can also play an important role in many tasks beyond web scraping, such as UI testing and assisting with performance optimization.
Usually, you’ll want to take screenshots of a website, perhaps to get a look at a competitor’s product catalog, and Puppeteer can do that. First, run the following command to install Puppeteer: npm install puppeteer
This will download the Chromium bundle, which is approximately 180 MB to 300 MB depending on the operating system. If you want to skip the download, you can disable it with an environment variable (for example, PUPPETEER_SKIP_CHROMIUM_DOWNLOAD in older releases) and point Puppeteer at a browser that is already installed on your machine.
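If you do skip the bundled Chromium, you can point Puppeteer at an existing browser via the executablePath launch option (a sketch; the path below is hypothetical and depends on your system):
const puppeteer = require('puppeteer')

async function launchWithLocalChrome() {
  const browser = await puppeteer.launch({
    // Hypothetical path - replace it with the location of Chrome/Chromium on your machine.
    executablePath: '/usr/bin/chromium-browser',
  })
  console.log('Browser version:', await browser.version())
  await browser.close()
}

launchWithLocalChrome().catch(console.error)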
Let’s try taking screenshots and PDFs of the r/programming forum in Reddit, create a new file crawler.js, and copy-paste the following code:
const puppeteer = require('puppeteer')
async function getVisual() {
try {
const URL = 'https://www.reddit.com/r/programming/'
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto(URL)
await page.screenshot({ path: 'screenshot.png' })
await page.pdf({ path: 'page.pdf' })
await browser.close()
} catch (error) {
console.error(error)
}
}
getVisual()
getVisual() is an asynchronous function that takes a screenshot and a PDF of the page at the address stored in the URL variable. First, a browser instance is created with puppeteer.launch(), and then a new page is created; think of this page as a tab in a regular browser. Then the previously created page is directed to the specified URL by calling page.goto() with the URL as an argument. Finally, the browser instance is destroyed along with the page.
Once that’s done and the page has finished loading, the screenshot and the PDF are captured with page.screenshot() and page.pdf() respectively. You can also listen for the JavaScript load event before performing these actions, which is strongly recommended at the production level. Run node crawler.js in a terminal, and after a few seconds you’ll notice that two files named screenshot.png and page.pdf have been created.
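To be more explicit about waiting for the page to finish loading before capturing anything, you can pass a waitUntil option to page.goto() or wait for a specific element (a sketch of how the goto call inside getVisual() could be adapted):
// Wait until the network has been mostly idle before continuing.
await page.goto(URL, { waitUntil: 'networkidle2' })
// Optionally, also wait for a specific element to be present before the screenshot.
await page.waitForSelector('body')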
Nightmare: An Alternative to Puppeteer
Nightmare is a Puppeteer-like high-level browser automation library that uses Electron, but it is said to be roughly twice as fast as its predecessor, PhantomJS. If you dislike Puppeteer in some way or are frustrated by the size of the Chromium bundle, Nightmare is an ideal choice. First, install the Nightmare library by running the following command: npm install nightmare. Then, once Nightmare has been installed, we’ll use it to find ScrapingBee’s website through the Google search engine. Create a file named crawler.js and copy-paste the following code into it:
const Nightmare = require('nightmare')
const nightmare = Nightmare()
nightmare
.goto('https://www.google.com/')
.type("input[title='Search']", 'ScrapingBee')
.click("input[value='Google Search']")
.wait('#rso > div:nth-child(1) > div > div > div.r > a')
.evaluate(
() =>
document.querySelector(
'#rso > div:nth-child(1) > div > div > div.r > a'
).href
)
.end()
.then((link) => {
console.log('ScrapingBee Web Link:', link)
})
.catch((error) => {
console.error('Search failed:', error)
})
First, a Nightmare instance is created, then this instance is directed to the Google search engine with goto(). Once it has loaded, the search box is fetched using its selector, and the value of the search box (an input tag) is changed to “ScrapingBee”. When that’s done, the search form is submitted by clicking the “Google Search” button. Nightmare is then told to wait until the first link has loaded, and once it has, a DOM method is used to get the value of the href attribute of the anchor tag containing the link. Finally, when everything is done, the link is printed to the console.