homeresume
 
   
🔍

Web scraping with cheerio

January 19, 2024

Web scraping means extracting data from websites. This post covers extracting data from the page's HTML tags.

Prerequisites

  • cheerio package is installed

  • HTML page is retrieved via an HTTP client

Usage

  • create a scraper object with load method by passing HTML content as an argument

    • set decodeEntities option to false to preserve encoded characters (like &) in their original form
    const $ = load('<div><!-- HTML content --></div>', { decodeEntities: false });
  • find DOM elements by using CSS-like selectors

    const items = $('.item');
  • iterate through found elements using each method

    items.each((index, element) => {
    // ...
    });
  • access element content using specific methods

    • text - $(element).text()

    • HTML - $(element).html()

    • attributes

      • all - $(element).attr()
      • specific one - $(element).attr('href')
    • child elements

      • first - $(element).first()
      • last - $(element).last()
      • all - $(element).children()
      • specific one - $(element).find('a')
    • siblings

      • previous - $(element).prev()
      • next - $(element).next()

Disclaimer

Please check the website's terms of service before scraping it. Some websites may have terms of service that prohibit such activity.

Demo

The demo with the mentioned examples is available here.

2023

Web scraping with jsdom

December 14, 2023

Web scraping means extracting data from websites. This post covers extracting data from the page's HTML when data is stored in JavaScript variable or stringified JSON.

The scraping prerequisite is retrieving an HTML page via an HTTP client.

Examples

The example below moves data into a global variable, executes the page scripts and accesses the data from the global variable.

import jsdom from 'jsdom';
fetch(URL)
.then((res) => res.text())
.then((response) => {
const dataVariable = 'someVariable.someField';
const html = response.replace(dataVariable, `var data=${dataVariable}`);
const dom = new jsdom.JSDOM(html, {
runScripts: 'dangerously',
virtualConsole: new jsdom.VirtualConsole()
});
console.log('data', dom?.window?.data);
});

The example below runs the page scripts, and access stringified JSON data.

import jsdom from 'jsdom';
fetch(URL)
.then((res) => res.text())
.then((response) => {
const dom = new jsdom.JSDOM(response, {
runScripts: 'dangerously',
virtualConsole: new jsdom.VirtualConsole()
});
const data = dom?.window?.document?.getElementById('someId')?.value;
console.log('data', JSON.parse(data));
});

Disclaimer

Please check the website's terms of service before scraping it. Some websites may have terms of service that prohibit such activity.

Boilerplate

Here is the link to the boilerplate I use for the development.