Web scraping with jsdom

Published December 14, 2023 | Last updated September 9, 2024 | 1 min read

Web scraping means extracting data from websites. This post covers extracting data from a page's HTML when the data is stored in a JavaScript variable or as stringified JSON.

The first step in scraping is retrieving the HTML page with an HTTP client.
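As a minimal sketch of that retrieval step, the helper below wraps `fetch` and rejects on non-2xx responses. The name `fetchHtml` and the injectable `fetchImpl` parameter are assumptions made for illustration, not part of any library.

```javascript
// Fetch a page and resolve with its HTML as a string.
// `fetchImpl` defaults to the global fetch (Node 18+); it can be swapped
// for a stub so the helper can be exercised without network access.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    // Surface HTTP errors instead of silently parsing an error page.
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}

// Usage (performs a real request, so it is left commented out):
// fetchHtml('https://example.com').then((html) => console.log(html.length));
```

Checking `res.ok` matters here because `fetch` resolves successfully even for 404 or 500 responses.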

Examples

The example below rewrites the HTML so that the data is also assigned to a global variable, executes the page scripts, and reads the data from that global variable on the window object.

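import jsdom from 'jsdom';

// Placeholder: replace with the address of the page to scrape.
const pageUrl = 'https://example.com';

fetch(pageUrl)
  .then((res) => res.text())
  .then((html) => {
    // Expression in the page's script that holds the data.
    const dataVariable = 'someVariable.someField';
    // Prefix its first occurrence with a global declaration so the value ends up on window.
    const patchedHtml = html.replace(dataVariable, `var data=${dataVariable}`);
    const dom = new jsdom.JSDOM(patchedHtml, {
      runScripts: 'dangerously', // execute the page's scripts
      virtualConsole: new jsdom.VirtualConsole() // keep page console output out of the terminal
    });
    console.log('data', dom.window.data);
  });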

The example below runs the page scripts and reads stringified JSON data from an element's value.

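import jsdom from 'jsdom';

// Placeholder: replace with the address of the page to scrape.
const pageUrl = 'https://example.com';

fetch(pageUrl)
  .then((res) => res.text())
  .then((html) => {
    const dom = new jsdom.JSDOM(html, {
      runScripts: 'dangerously', // execute the page's scripts
      virtualConsole: new jsdom.VirtualConsole() // keep page console output out of the terminal
    });
    // The element's value holds the stringified JSON.
    const json = dom.window.document.getElementById('someId')?.value;
    // Parse only when the element exists and has a value.
    console.log('data', json ? JSON.parse(json) : undefined);
  });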

Disclaimer

Please check the website's terms of service before scraping it. Some websites may have terms of service that prohibit such activity.

Boilerplate

Here is the link to the boilerplate I use for development.