Natural language processing with Node.js

December 7, 2020 · 3 min read

I recently worked on an NLP classifier for open borders related to COVID-19 restrictions. The tech stack includes Node.js, TypeScript, NestJS as the back-end framework, Redis as the database, node-nlp for natural language processing, Puppeteer and Cheerio for scraping, @nestjs/schedule for cron jobs, and React with Next.js for the front-end. This blog post covers its main parts and their potential improvements.

Cron job

Since the data on the official website is updated once every several days on average, the cron job starts when the database connection is established and runs twice per day to pick up any updates. It scrapes the data, maps every country to its information, classifies the countries with the trained classifier, and puts them into the database.

@Cron(CronExpression.EVERY_12_HOURS)
async upsertData() {
  const pageSource = await this.scraperService.getPageSource(WEBPAGE_URL);
  const countriesInfo = this.scraperService.getCountriesInfo(pageSource);
  const classifiedCountries = await this.nlpService.getClassifiedCountries(countriesInfo);
  return this.databaseService.set('countries', JSON.stringify(classifiedCountries));
}

Scraper

Each country has text information which may contain links and/or e-mail addresses. A headless browser is used for scraping because some JavaScript code has to be executed to reveal the e-mail addresses. To make it run on a Heroku dyno, an additional buildpack has to be added.
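As an illustration, the links and e-mail addresses mentioned above could be pulled out of the scraped text with plain regular expressions. The helper name and patterns below are assumptions, not the actual implementation:

```typescript
// Hypothetical helper: collect links and e-mail addresses from a country's
// scraped text so they can be handled separately from the prose.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const LINK_RE = /https?:\/\/[^\s)]+/g;

export const extractContacts = (text: string) => ({
  emails: text.match(EMAIL_RE) ?? [],
  links: text.match(LINK_RE) ?? [],
});
```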

Natural language processing

Training

The classifier is trained with utterances grouped into several intents, and the trained classifier is saved to a JSON file. 188 countries are classified with training data consisting of 76 utterances.

// nlp.data.ts
export const trainingData = [
  // ...
  {
    utterance,
    intent
  }
  // ...
];

// nlp.service.ts
trainAndSaveModel = async (): Promise<void> => {
  const modelFileName = this.getModelFileName();
  const manager = this.getNlpManager(modelFileName);
  this.addTrainingData(manager);
  await manager.train();
  manager.save(modelFileName);
};

Preprocessing

Before processing, the data is split into sentences; sentences containing links or e-mail addresses are skipped, and diacritics are converted to Latin characters.
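A minimal sketch of that preprocessing, assuming plain string utilities (the function names are illustrative, not the actual service methods):

```typescript
// Strip diacritics by decomposing characters and dropping combining marks,
// e.g. "Šengen" becomes "Sengen".
const removeDiacritics = (text: string): string =>
  text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

// Split text into sentences, skip those containing links or e-mail
// addresses, and normalize the remainder to Latin characters.
const toSentences = (text: string): string[] =>
  text
    .split(/(?<=[.!?])\s+/)
    .map((sentence) => sentence.trim())
    .filter(Boolean)
    .filter((sentence) => !/https?:\/\/|@/.test(sentence))
    .map(removeDiacritics);
```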

Processing

Information is processed sentence by sentence using the trained model. Sentences classified with one of the skipped intents are ignored since they don't provide enough information for classification.

for (let i = 0; i < sentences.length; i += 1) {
  // ...
  const { intent } = await nlpManager.process(sentences[i]);
  // ...
  if (!SKIPPED_INTENTS.includes(intent)) {
    return {
      ...country,
      status: intent
    };
  }
  // ...
}

API

There is one endpoint that returns all of the data. Potential improvements include pagination and filtering of the classified data.

const classifiedCountries = await this.databaseService.get('countries');
if (!classifiedCountries) return [];
return JSON.parse(classifiedCountries);
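The pagination improvement mentioned above could be as simple as slicing the parsed array. This is a sketch; the helper and parameter names are assumptions:

```typescript
// Hypothetical pagination helper for the endpoint above; `page` is 1-based.
const paginate = <T>(items: T[], page: number, perPage: number): T[] =>
  items.slice((page - 1) * perPage, page * perPage);
```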

Database

Redis was chosen as the main database since reading is the primary operation, in-memory reads are fast, and the total amount of stored data is less than 1 MB.
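The database calls in the snippets above imply a small get/set interface. Here it is sketched with an in-memory Map in place of a real Redis client, for illustration only:

```typescript
// Illustrative stand-in for the DatabaseService used above: a string
// key-value store with the same async get/set shape a Redis client exposes.
class InMemoryDatabaseService {
  private store = new Map<string, string>();

  async set(key: string, value: string): Promise<void> {
    this.store.set(key, value);
  }

  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }
}
```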

Front-end

The front-end is a Progressive Web App that uses IndexedDB for caching the data (not supported in Firefox's private mode), Bootstrap for styling, and React with Next.js for server-side rendering.

Demo

The demo can be checked out at https://otvorene-granice.com

Željko Šević

Software Engineer, Node.js Developer