
Scraping Data With Cheerio.JS
What is Cheerio?
Cheerio is a Node module that lets you parse markup and extract the information you need using a jQuery-style API in JavaScript. It provides methods for traversing and manipulating the resulting data structure. Unlike a web browser, it does not interpret the result: it does no visual rendering, applies no CSS, loads no external resources, and executes no JavaScript. Cheerio simply provides access to the markup. Basically, if the data is present when you ‘View Source’, it will also be available to Cheerio. Content that only appears under ‘Inspect’ because a script added it after the page loaded will not be.
What's the point?
Cheerio comes in handy if you need to scrape or verify a large amount of data quickly. Sure, we could use other popular QA tools to obtain this data. However, Cheerio will accomplish the task in a fraction of the time it would take our normal automation framework to execute.
How have my projects benefited from it?
I’ve used it to:
Verify placements of widgets and ad units on 300+ websites
Verify data in our REST API for 300+ websites
Obtain Google Play / App Store versions for 300+ mobile apps
Obtain Google Play / App Store / Alexa reviews for 300+ mobile apps and skills
Verify the accuracy of legal and contact information on 300+ websites
Verify menu items and button links in 300+ apps using our REST API
… The list goes on and on.
As you can see, it has been our go-to tool for the past two years whenever QA / Support needs to provide the business owners with the information they are looking for.
Execution Time
While this section is probably not necessary, I feel that it's worth mentioning: Cheerio is fast! Typically these audits take less than 5 minutes to run against our 300+ sites.
Okay, let's dig in!
Install Cheerio
In a new project directory, run npm init followed by npm install cheerio request. This will initialize npm and install the Cheerio module, along with the Request module that the script requires alongside it.
Setting up our Cheerio script
First, we're going to create a new file in the root of our directory called test.js. Next, we need to add our two dependencies to the top of our new test file like so:
const cheerio = require('cheerio');
const request = require('request');
We will be using the Request module to fetch the data from our site and we will be using the Cheerio module to parse through the data and extract the information we want.
Running a basic Cheerio script
The following example shows a basic request that uses Cheerio to log the page title and the URLs of the posts found on the NPR National section page (https://www.npr.org/sections/national/) to the terminal. It can be run by navigating to the root of your project's directory in the terminal and running node test.js.
gist:waltir/f6288a615dd84ad5226e80317f329939
Getting data from each post
Okay, so we have logged the title of the page and the URLs of each post on the page, but what if we want to access data in each post? Easy: we just need to send an extra request to each post and extract the data we want. The following Gist shows how to save data from a dozen NPR posts to a JSON file.
gist:waltir/6c331bd494992fbdda97df9c4bfebadf
Iterating over each nav link and fetching all available posts
Finally, what if we want to gather posts from each page in the nav? Again, we simply need to adjust our script by adding one more request. The following working example shows how you can first iterate over each link in the navbar and then iterate over each post on the page to gather over 400 of the most recent posts on NPR.
gist:waltir/23da9daa52eb6864158fbae8f61218d5
As you can see, Cheerio is a very powerful tool that can be used to assist with a wide array of tasks. Personally, it has become one of my favorite tools for automating many of my QA and Support tasks. It's worth mentioning that these simple examples are far from perfect and could be rewritten a dozen different ways. If you have any further questions, I strongly recommend that you check out the Cheerio documentation.