Natural language processing with Node.js

Published December 7, 2020 | Last updated December 16, 2024 | 3 min read

I recently worked on an NLP classifier for open borders under COVID-19 restrictions. The tech stack includes Node.js, TypeScript, NestJS as the back-end framework, Redis as the database, node-nlp for natural language processing, puppeteer and cheerio for scraping, @nestjs/schedule for the cron job, and React with Next.js for the front-end.

This blog post covers its main parts and their potential improvements.

Cron job

The data on the official website is updated once every several days on average, so the cron job runs twice daily to pick up any changes. It is first invoked when the database connection is established.

The cron job scrapes the data and maps every country to its information. The countries are then classified with the trained classifier and stored in the database.

@Cron(CronExpression.EVERY_12_HOURS)
async upsertData() {
  // Scrape the page, extract the per-country text, classify it, and store the result.
  const pageSource = await this.scraperService.getPageSource(WEBPAGE_URL);
  const countriesInfo = this.scraperService.getCountriesInfo(pageSource);
  const classifiedCountries = await this.nlpService.getClassifiedCountries(countriesInfo);
  return this.databaseService.set('countries', JSON.stringify(classifiedCountries));
}
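
The flow above can also be written as a plain function with its dependencies passed in, which makes it easy to exercise without a real scraper or Redis. This is a minimal sketch, not the project's actual code: the `PipelineDeps` interface, the `CountryInfo` and `ClassifiedCountry` shapes, and the `refreshCountries` name are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real project may store different fields.
interface CountryInfo {
  name: string;
  info: string;
}

interface ClassifiedCountry extends CountryInfo {
  status: string;
}

// Hypothetical bundle standing in for the scraper, NLP, and database services.
interface PipelineDeps {
  getPageSource(url: string): Promise<string>;
  getCountriesInfo(pageSource: string): CountryInfo[];
  getClassifiedCountries(countries: CountryInfo[]): Promise<ClassifiedCountry[]>;
  set(key: string, value: string): Promise<void>;
}

// Scrape, classify, then store everything as one JSON string under a single key.
async function refreshCountries(deps: PipelineDeps, url: string): Promise<number> {
  const pageSource = await deps.getPageSource(url);
  const countriesInfo = deps.getCountriesInfo(pageSource);
  const classified = await deps.getClassifiedCountries(countriesInfo);
  await deps.set('countries', JSON.stringify(classified));
  return classified.length;
}
```

In a test, `deps` can be backed by a `Map` instead of Redis, so the whole pipeline runs in memory.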

Scraper

Countries have text information that may contain links and/or e-mail addresses.

A headless browser is used for scraping since some JavaScript code has to be executed in order to show e-mail addresses. To run it on a Heroku dyno, an additional buildpack has to be added.

Natural language processing

Training

The classifier is trained with utterances and several intents, and the trained classifier is saved to a JSON file. The training data consists of 76 utterances, and the resulting model is used to classify 188 countries.

// nlp.data.ts
export const trainingData = [
  // ...
  {
    utterance,
    intent
  }
  // ...
];

// nlp.service.ts
trainAndSaveModel = async (): Promise<void> => {
  const modelFileName = this.getModelFileName();
  const manager = this.getNlpManager(modelFileName);
  this.addTrainingData(manager);
  await manager.train();
  manager.save(modelFileName);
};
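
The `addTrainingData` helper is not shown in the post. The sketch below shows what it could look like, assuming the manager exposes node-nlp's `addDocument(locale, utterance, intent)` method and that all utterances are in English. The `DocumentSink` interface, the sample utterances, and the intent names are made up for illustration.

```typescript
// One labelled example, matching the { utterance, intent } shape in nlp.data.ts.
interface TrainingExample {
  utterance: string;
  intent: string;
}

// Hypothetical minimal view of the manager: only the method this helper needs.
// node-nlp's NlpManager provides addDocument(locale, utterance, intent).
interface DocumentSink {
  addDocument(locale: string, utterance: string, intent: string): void;
}

// Illustrative data only; the real utterances and intents are not in the post.
const sampleTrainingData: TrainingExample[] = [
  { utterance: 'borders are open to all travellers', intent: 'open' },
  { utterance: 'entry is not permitted', intent: 'closed' },
];

// Register every example with the manager before training (assumes English text).
function addTrainingData(
  manager: DocumentSink,
  data: TrainingExample[],
  locale = 'en',
): void {
  for (const { utterance, intent } of data) {
    manager.addDocument(locale, utterance, intent);
  }
}
```

Typing the parameter as a narrow interface instead of the full manager keeps the helper easy to test with a recording stub.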

Preprocessing

Before processing, the data is split into sentences, links and e-mail addresses are skipped, and characters with diacritics are converted to their plain Latin equivalents.
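
As a rough illustration of this step, the sketch below splits text into sentences, drops sentences that contain a link or an e-mail address, and strips diacritics with Unicode normalization. Dropping whole sentences is only one possible reading of "skipped". The regular expressions and the `preprocess` name are assumptions rather than the project's actual rules, and the naive splitter would mishandle abbreviations such as "e.g.".

```typescript
// Assumed patterns; real-world links and addresses can be more varied.
const URL_PATTERN = /\b(?:https?:\/\/|www\.)\S+/i;
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/;

// Decompose accented characters (NFD) and remove the combining marks.
function stripDiacritics(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function preprocess(text: string): string[] {
  return text
    // Break after ., ! or ? only when whitespace follows, so dots inside URLs are kept.
    .replace(/([.!?])\s+/g, '$1\n')
    .split('\n')
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
    // Skip sentences that carry a link or an e-mail address.
    .filter((sentence) => !URL_PATTERN.test(sentence) && !EMAIL_PATTERN.test(sentence))
    .map(stripDiacritics);
}
```

For the input "Borders of Curaçao are open. Contact info@example.com or visit https://example.com/rules. Testing is required!", this returns the first and last sentences, with "Curaçao" turned into "Curacao".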

Processing

Information is processed sentence by sentence using the trained model. Some sentences are classified as skipped and jumped over since they do not provide enough information for classification.

for (let i = 0; i < sentences.length; i += 1) {
  // ...
  const { intent } = await nlpManager.process(sentences[i]);
  // ...
  if (!SKIPPED_INTENTS.includes(intent)) {
    return {
      ...country,
      status: intent
    };
  }
  // ...
}
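
The loop above can be read as "the first sentence with a decisive intent wins". The sketch below isolates that logic behind a hypothetical `classify` callback that stands in for `nlpManager.process`. The intent names in `SKIPPED_INTENTS`, the `Country` shape, and the `unknown` fallback status are assumptions, since the post does not list them.

```typescript
interface Country {
  name: string;
  sentences: string[];
}

interface CountryWithStatus extends Country {
  status: string;
}

// Assumed intent names; the real skipped intents are not listed in the post.
const SKIPPED_INTENTS: readonly string[] = ['None', 'skipped'];

// Stand-in for nlpManager.process: resolves to an object carrying the intent.
type Classify = (sentence: string) => Promise<{ intent: string }>;

async function classifyCountry(
  country: Country,
  classify: Classify,
  fallbackStatus = 'unknown', // assumed status when no sentence is decisive
): Promise<CountryWithStatus> {
  for (const sentence of country.sentences) {
    const { intent } = await classify(sentence);
    // The first sentence whose intent is not skipped decides the status.
    if (!SKIPPED_INTENTS.includes(intent)) {
      return { ...country, status: intent };
    }
  }
  return { ...country, status: fallbackStatus };
}
```

An explicit fallback avoids returning `undefined` when every sentence is skipped, which keeps the stored records uniform.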

API

There is one endpoint to get all of the data. Some potential improvements include pagination and filtering of the classified data.

const classifiedCountries = await this.databaseService.get('countries');
if (!classifiedCountries) return [];
return JSON.parse(classifiedCountries);
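
The pagination and filtering mentioned above could look roughly like this. It is a sketch only: the `status` field comes from the processing step, while the `ListQuery` parameters, the `Page` shape, and the default page size of 20 are assumptions, not part of the existing endpoint.

```typescript
interface CountryRecord {
  name: string;
  status: string;
}

// Hypothetical query parameters for the endpoint.
interface ListQuery {
  status?: string;   // keep only countries with this status
  page?: number;     // 1-based page number
  pageSize?: number; // items per page
}

interface Page<T> {
  items: T[];
  total: number; // number of items after filtering, before paging
  page: number;
  pageSize: number;
}

function listCountries(all: CountryRecord[], query: ListQuery = {}): Page<CountryRecord> {
  // Clamp to sane values so a bad query cannot produce negative offsets.
  const pageSize = Math.max(1, Math.floor(query.pageSize ?? 20));
  const page = Math.max(1, Math.floor(query.page ?? 1));
  const filtered = query.status === undefined
    ? all
    : all.filter((country) => country.status === query.status);
  const start = (page - 1) * pageSize;
  return {
    items: filtered.slice(start, start + pageSize),
    total: filtered.length,
    page,
    pageSize,
  };
}
```

Since the whole data set is under 1 MB and stored as a single JSON string, filtering in application code after `JSON.parse` is likely sufficient here.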

Database

Redis was chosen as the primary database because reading is the main operation, in-memory reads are fast, and the total amount of stored data is less than 1 MB.

Front-end

The front-end is a Progressive Web App that uses IndexedDB for caching the data (IndexedDB was not available in Firefox private mode at the time of writing), Bootstrap for styling, and React with Next.js for server-side rendering.
