By: Waltir

Scraping Data With Cheerio.JS

What is Cheerio?

Cheerio is a Node module that lets you parse markup and extract the information you need using a jQuery-style API in JavaScript. It provides methods for traversing and manipulating the resulting data structure. Unlike a web browser, Cheerio does not interpret the result: it does no visual rendering, applies no CSS, loads no external resources, and executes no JavaScript. It simply provides access to the markup. Basically, if the data is present when you ‘View Source’, it will also be available to Cheerio. Content that only shows up under ‘Inspect’ because a script added it after the page loaded will not be.

What's the point?

Cheerio comes in handy if you need to scrape or verify a large amount of data quickly. We could use other popular QA tools to obtain this data, but Cheerio accomplishes the task in a fraction of the time it would take our normal automation framework to execute.

How have my projects benefited from it?

I’ve used it to:

- Verify placements of widgets and ad units on 300+ websites
- Verify data in our REST API for 300+ websites
- Obtain Google Play / App Store versions for 300+ mobile apps
- Obtain Google Play / App Store / Alexa reviews for 300+ mobile apps and skills
- Verify the accuracy of legal and contact information on 300+ websites
- Verify menu items and button links in 300+ apps using our REST API

The list goes on and on.

As you can see, it has been our go-to tool for the past two years whenever QA / Support needs to provide business owners with the information they are looking for.

Execution Time

While this section is probably not necessary, I feel that it's worth mentioning: Cheerio is fast! Typically, these audits take less than 5 minutes to run against our 300+ sites.


Okay, let's dig in!

Install Cheerio

In a new project directory, run npm init followed by npm install cheerio request. This initializes npm and installs the cheerio module along with the request module that the script below depends on.

Setting up our Cheerio script

First, we're going to create a new file in the root of our directory called test.js. Next, we need to add our two dependencies to the top of the new file, like so:

const cheerio = require('cheerio'); 
const request = require('request');

We will be using the Request module to fetch the data from our site and we will be using the Cheerio module to parse through the data and extract the information we want.

Running a basic Cheerio script

The following example shows a basic request that uses Cheerio to log the page title and the URLs of the posts found on NPR's National section (https://www.npr.org/sections/national/) to the terminal. It can be run by navigating to the root of your project's directory in the terminal and running node test.js.



Getting data from each post

Okay, so we have logged the title of the page and the URL of each post on the page, but what if we want to access data inside each post? Easy: we just need to send an extra request to each post and extract the data we want. The following Gist shows how to save data from a dozen NPR posts to a JSON file.


Iterating over each nav link and fetching all available posts

Finally, what if we want to gather posts from each page in the nav? Again, we simply need to adjust our script by adding one more request. The following working example shows how you can first iterate over each link in the navbar and then iterate over each post on the page to gather over 400 of the most recent posts on NPR.



As you can see, Cheerio is a very powerful tool that can assist with a wide array of tasks. Personally, it has become one of my favorite tools for automating many of my QA and Support tasks. It's worth mentioning that these simple examples are not perfect and could be rewritten a dozen different ways. If you have any further questions, I strongly recommend that you check out the Cheerio documentation.
