homecourse
 
   

Web scraping with cheerio

Published January 19, 2024Last updated December 16, 20242 min read

Web scraping means extracting data from websites. This post covers extracting data from the page's HTML tags.

Prerequisites

  • cheerio package is installed

  • HTML page is retrieved via an HTTP client

Usage

  • create a scraper object with load method by passing HTML content as an argument

    • set decodeEntities option to false to preserve encoded characters (like &) in their original form
    const $ = load('<div><!-- HTML content --></div>', { decodeEntities: false });
  • find DOM elements by using CSS-like selectors

    const items = $('.item');
  • iterate through found elements using each method

    items.each((index, element) => {
    // ...
    });
  • access element content using specific methods

    • text - $(element).text()

    • HTML - $(element).html()

    • attributes

      • all - $(element).attr()
      • specific one - $(element).attr('href')
    • child elements

      • first - $(element).first()
      • last - $(element).last()
      • all - $(element).children()
      • specific one - $(element).find('a')
    • siblings

      • previous - $(element).prev()
      • next - $(element).next()

Disclaimer

Please check the website's terms of service before scraping it. Some websites may have terms of service that prohibit such activity.

Course

Build your SaaS in 2 weeks - Start Now