Scraping Data With Cheerio.JS
What is Cheerio?
Cheerio is a fast, lightweight implementation of core jQuery designed to run on the server. It lets you load an HTML string and traverse it with familiar jQuery-style selectors, without spinning up a browser.
What's the point?
Cheerio comes in handy when you need to scrape or verify a large amount of data quickly. We could use other popular QA tools to obtain this data; however, Cheerio will accomplish the task in a fraction of the time it takes our normal automation framework to execute.
How have my projects benefited from it?
I’ve used it to:
- Verify placements of widgets and ad units on 300+ websites
- Verify data in our REST API for 300+ websites
- Obtain Google Play / App Store versions for 300+ mobile apps
- Obtain Google Play / App Store / Alexa reviews for 300+ mobile apps and skills
- Verify the accuracy of legal and contact information on 300+ websites
- Verify menu items and button links in 300+ apps using our REST API

… The list goes on and on.
As you can see, it has been our go-to tool for the past two years whenever QA / Support needs to provide business owners with the information they are looking for.
While this section is probably not necessary, I feel it's worth mentioning: Cheerio is fast! These audits typically take less than 5 minutes to run against our 300+ sites.
Okay, let's dig in!
In a new project directory, run `npm init` followed by `npm install cheerio request`. This will initialize npm and install the cheerio and request modules.
Setting up our Cheerio script
First we're going to create a new file in the root of our directory called test.js. Next we need to add our two dependencies to the top of our new test file like so:

```javascript
const cheerio = require('cheerio');
const request = require('request');
```
We will use the Request module to fetch the HTML from our site, and the Cheerio module to parse that HTML and extract the information we want.
Running a basic Cheerio script
The following example shows a basic request that uses Cheerio to log the page title and the URLs of the posts found on the NPR National section (https://www.npr.org/sections/national/) to the terminal. It can be run by navigating to the root of your project directory in the terminal and running `node test.js`.
Getting data from each post
Okay, so we have logged the title of the page and the URLs of each post on the page, but what if we want to access data in each post? Easy: we just need to send an extra request to each post and extract the data we want. The following example shows how to save data from a dozen NPR posts to a JSON file.
Iterating over each nav link and fetching all available posts
Finally, what if we want to gather posts from each page in the nav? Again, we simply need to adjust our script by adding one more request. The following working example shows how you can first iterate over each link in the navbar and then iterate over each post on the page, gathering over 400 of the most recent posts on NPR.
As you can see, Cheerio is a very powerful tool that can assist with a wide array of tasks. Personally, it has become one of my favorite tools for automating many of my QA and Support tasks. It's worth mentioning that these simple examples are not entirely perfect and could be rewritten a dozen different ways. If you have any further questions, I strongly recommend checking out the Cheerio documentation.