Document loaders and chunking with LangChain

June 16, 2026

This post covers local file ingestion and chunking in Node.js. For LangChain basics (LCEL, packages, agents), see the LangChain overview post. For the full RAG chain with pgvector, see the RAG with pgvector post.

Prerequisites

  • Node.js version 26
  • langchain, @langchain/core, @langchain/classic, and @langchain/textsplitters installed
npm i langchain @langchain/core @langchain/classic @langchain/textsplitters

More loader types (web, cloud, audio) live in standalone integration packages - see the document loader integrations page.

The Document type

Every loader returns Document instances from @langchain/core:

  • pageContent - the text of the chunk or file
  • metadata - optional key/value pairs (source path, section, page) used for citations
import { Document } from '@langchain/core/documents';
const doc = new Document({
pageContent: 'pgvector adds vector search to PostgreSQL.',
metadata: { source: 'notes/pgvector.txt', section: 'basics' }
});

Load a single file

Use TextLoader for plain text or markdown files:

import { TextLoader } from '@langchain/classic/document_loaders/fs/text';
const loader = new TextLoader('./notes/pgvector.txt');
const docs = await loader.load();
console.log(docs[0].pageContent);
console.log(docs[0].metadata.source);

The loader sets metadata.source to the file path - keep it for citations in RAG answers.
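
As a rough illustration of why the source field is worth keeping, the sketch below turns a document's metadata into a short citation label. formatCitation is a hypothetical helper, not part of LangChain, and it only assumes the { pageContent, metadata } shape described above.

```javascript
// Hypothetical helper: builds a short citation label from a document's metadata.
// It works on any object with the { pageContent, metadata } shape.
function formatCitation(doc) {
  const { source = 'unknown source', section } = doc.metadata ?? {};
  return section ? `${source} (${section})` : source;
}

const example = {
  pageContent: 'pgvector adds vector search to PostgreSQL.',
  metadata: { source: 'notes/pgvector.txt', section: 'basics' }
};

console.log(formatCitation(example)); // notes/pgvector.txt (basics)
```

Documents loaded by TextLoader would only carry source unless you add more fields yourself, so the label falls back to the path alone.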

Load a directory

Use DirectoryLoader when you have many files. Map extensions to loader factories:

import { DirectoryLoader } from '@langchain/classic/document_loaders/fs/directory';
import { TextLoader } from '@langchain/classic/document_loaders/fs/text';
const loader = new DirectoryLoader('./notes', {
'.txt': (path) => new TextLoader(path),
'.md': (path) => new TextLoader(path)
});
const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);

PDF, CSV, and JSON loaders are available via other integration packages. This post uses .txt and .md files.
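
The extension-to-factory mapping passed to DirectoryLoader can be pictured as a simple lookup. The sketch below is a simplified illustration, not the loader's actual source: pickLoader is a hypothetical name, the factories return plain objects instead of real loaders, and how the real loader treats unknown extensions is not modeled.

```javascript
// Simplified illustration of extension-based dispatch; not DirectoryLoader's real code.
// `factories` maps a lowercase extension (with the dot) to a function that builds a loader.
function pickLoader(filePath, factories) {
  const dot = filePath.lastIndexOf('.');
  const slash = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
  // No extension when the final path segment has no dot, or only a leading dot (".env").
  if (dot <= slash + 1) return null;
  const ext = filePath.slice(dot).toLowerCase();
  const factory = factories[ext];
  return factory ? factory(filePath) : null;
}

// Stand-in factories that return plain objects instead of real loaders.
const factories = {
  '.txt': (path) => ({ kind: 'text', path }),
  '.md': (path) => ({ kind: 'text', path })
};

console.log(pickLoader('notes/pgvector.txt', factories)); // a text loader stand-in
console.log(pickLoader('notes/diagram.png', factories)); // null, no factory registered
```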

Split documents

Chunking makes retrieval more precise, because each embedding then covers one focused passage instead of a whole file. Split large files into smaller overlapping parts by passing the docs array from TextLoader or DirectoryLoader to a splitter.

Two parameters matter most:

  • chunkSize - target maximum size per chunk (characters or tokens, depending on splitter)
  • chunkOverlap - shared text between adjacent chunks so context is not lost at boundaries

Start with chunkSize: 800 and chunkOverlap: 120, then tune based on document style and answer quality.

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 800,
chunkOverlap: 120
});
const chunks = await splitter.splitDocuments(docs);
console.log(chunks.length);
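
To see what chunkSize and chunkOverlap do in isolation, here is a minimal fixed-window sketch. It is a deliberately naive stand-in, not how RecursiveCharacterTextSplitter works internally: slidingWindowChunks is a hypothetical function that ignores paragraph and word boundaries.

```javascript
// Fixed-window chunker with overlap. Each window starts (chunkSize - chunkOverlap)
// characters after the previous one, so adjacent chunks share chunkOverlap characters.
function slidingWindowChunks(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = 'abcdefghijklmnopqrstuvwxyz';
console.log(slidingWindowChunks(sample, 10, 4));
// [ 'abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz' ]
```

The first two chunks share 'ghij', which is the overlap. A real splitter moves the cut points to natural boundaries, so its chunks are usually shorter than chunkSize.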

Splitter comparison

The example above uses RecursiveCharacterTextSplitter, the default for most RAG setups. Alternatives:

Splitter | Best for
RecursiveCharacterTextSplitter | Default choice; tries paragraph breaks, then line breaks, then spaces, then single characters
CharacterTextSplitter | Splitting on one separator (a blank line by default) when finer structure does not matter
TokenTextSplitter | When chunk limits must match model token budgets
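
The recursive strategy is easier to follow in code. The sketch below is a simplified illustration under stated assumptions: recursiveSplit is a hypothetical function, it uses the separator order paragraph break, line break, space, single character, and it leaves out overlap and the library's other merging details.

```javascript
// Simplified recursive splitting: try coarse separators first, fall back to finer ones.
function recursiveSplit(text, chunkSize, separators = ['\n\n', '\n', ' ', '']) {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const [separator, ...rest] = separators;
  // Last resort: cut at fixed character positions.
  if (separator === undefined || separator === '') {
    const pieces = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }
  const chunks = [];
  let current = '';
  for (const part of text.split(separator)) {
    const candidate = current ? current + separator + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate; // greedily pack parts into the current chunk
      continue;
    }
    if (current) chunks.push(current);
    if (part.length <= chunkSize) {
      current = part;
    } else {
      // This part alone is too big: retry it with the finer separators.
      chunks.push(...recursiveSplit(part, chunkSize, rest));
      current = '';
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(recursiveSplit('First paragraph.\n\nSecond paragraph.', 20));
// [ 'First paragraph.', 'Second paragraph.' ]
console.log(recursiveSplit('aaaa bbbb cccc', 9));
// [ 'aaaa bbbb', 'cccc' ]
```

Paragraphs stay whole when they fit; only oversized pieces are broken at finer boundaries.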

Character-based:

import { CharacterTextSplitter } from '@langchain/textsplitters';
const splitter = new CharacterTextSplitter({
chunkSize: 800,
chunkOverlap: 120
});
const chunks = await splitter.splitDocuments(docs);

Token-based:

import { TokenTextSplitter } from '@langchain/textsplitters';
const splitter = new TokenTextSplitter({
encodingName: 'cl100k_base',
chunkSize: 200,
chunkOverlap: 20
});
const chunks = await splitter.splitDocuments(docs);

Use token-based splitting when chunk sizes must be counted in tokens, for example to stay under an embedding model's input limit or a fixed prompt budget. Character-based recursive splitting is the usual starting point for RAG over prose.
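
If you split by characters but still care about token limits, a quick sanity check can flag chunks that are likely too large. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text and not a real tokenizer; estimateTokens and chunksOverBudget are hypothetical helpers.

```javascript
// Rough token estimate. ASSUMPTION: about 4 characters per token. A real tokenizer
// (such as cl100k_base) will give different counts, especially for code or non-English text.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// Returns the chunks whose estimated token count exceeds the budget.
function chunksOverBudget(chunks, tokenBudget) {
  return chunks.filter((chunk) => estimateTokens(chunk) > tokenBudget);
}

const sampleChunks = ['short chunk', 'x'.repeat(1000)];
console.log(chunksOverBudget(sampleChunks, 200).length); // 1
```

Treat the result as a warning only. When the limit is strict, count with the tokenizer that matches your model.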

Metadata through the pipeline

Pass metadata when creating documents manually, or rely on loader metadata - splitters preserve it on each chunk:

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 400,
chunkOverlap: 60
});
const chunks = await splitter.createDocuments(
['First paragraph.\n\nSecond paragraph.'],
[{ source: 'manual', section: 'intro' }]
);
console.log(chunks[0].metadata);

After splitDocuments(docs), each chunk keeps fields like source from the parent document. Use those fields when storing chunks in a vector database or displaying citations.
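
The propagation itself is simple to picture. In the sketch below, splitKeepingMetadata is a hypothetical function that cuts fixed windows and copies the parent's metadata onto each chunk. The chunkIndex field is an illustrative extra, not something the library is claimed to add.

```javascript
// Illustration of metadata propagation: every chunk gets a copy of its parent's metadata.
function splitKeepingMetadata(doc, chunkSize) {
  const chunks = [];
  let index = 0;
  for (let start = 0; start < doc.pageContent.length; start += chunkSize) {
    chunks.push({
      pageContent: doc.pageContent.slice(start, start + chunkSize),
      // Spread creates a shallow copy, so the parent's metadata object is not mutated.
      metadata: { ...doc.metadata, chunkIndex: index }
    });
    index += 1;
  }
  return chunks;
}

const parent = { pageContent: 'abcdefgh', metadata: { source: 'manual', section: 'intro' } };
console.log(splitKeepingMetadata(parent, 3).map((chunk) => chunk.metadata));
```

Every chunk carries source and section, so a retrieved chunk can always be traced back to its file.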

Choosing parameters

  • Short FAQs or API docs - smaller chunkSize (300–500) for precise retrieval
  • Long guides or blog posts - larger chunkSize (800–1200) to keep sections together
  • More overlap - helps when answers span chunk boundaries; increases storage and embedding cost
  • Less overlap - fewer redundant chunks; risk losing context at splits

Tune with real questions from your domain.
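
Before tuning, it can help to estimate how many chunks a setting produces, since each chunk costs one embedding. The sketch below assumes fixed windows with no boundary snapping, so real splitters will deviate from it; estimateChunkCount is a hypothetical helper.

```javascript
// Estimated chunk count for fixed windows: the first chunk covers chunkSize characters,
// each further chunk adds (chunkSize - chunkOverlap) new characters.
function estimateChunkCount(textLength, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  return Math.ceil((textLength - chunkOverlap) / (chunkSize - chunkOverlap));
}

console.log(estimateChunkCount(10000, 800, 0));   // 13
console.log(estimateChunkCount(10000, 800, 120)); // 15
console.log(estimateChunkCount(10000, 800, 240)); // 18
```

For a 10,000-character document, doubling the overlap from 120 to 240 adds three chunks under these assumptions, which is the storage and embedding cost mentioned above.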

Demo

Runnable loader and splitter scripts for this post live in the langchain-loaders-chunking-demo folder. Get access via code demos.

RAG with OpenAI Embeddings, pgvector and LangChain

June 2, 2026

Retrieval-Augmented Generation (RAG) is a practical pattern: store knowledge as embeddings, retrieve the most relevant chunks with semantic search, then generate an answer grounded in that context.

This guide shows an end-to-end RAG flow with LangChain, OpenAI embeddings, PostgreSQL + pgvector, and an LCEL answer chain. For LangChain basics, see the LangChain overview post. For loaders and splitter choice, see the loaders and chunking post.

Prerequisites

  • OpenAI account
  • Generated API key
  • Enabled billing
  • Node.js version 26
  • PostgreSQL with pgvector extension enabled
  • npm packages: @langchain/pgvector, @langchain/openai, @langchain/core, @langchain/textsplitters, langchain, pg
npm i @langchain/pgvector @langchain/openai @langchain/core @langchain/textsplitters langchain pg

What are embeddings?

Embeddings are numeric vectors that represent the semantic meaning of text. Similar text should produce vectors that are close in vector space.
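
"Close" is usually measured with cosine similarity. The sketch below computes it for toy vectors; the numbers are made up for illustration and have far fewer dimensions than real embedding vectors.

```javascript
// Cosine similarity: dot product divided by the product of the vector lengths.
// 1 means same direction, 0 means orthogonal (unrelated), -1 means opposite.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention chosen here for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```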

In this pipeline:

  • Split source documents into chunks
  • Embed chunks with OpenAIEmbeddings and store them in pgvector via PGVectorStore
  • Embed the user question at query time and retrieve nearest chunks with a LangChain retriever
  • Pass retrieved context into an LCEL chain that calls ChatOpenAI

Chunk documents

Chunking makes retrieval more precise. Instead of embedding one large document, split it into smaller overlapping parts. Start with chunkSize: 800 and chunkOverlap: 120, then adjust based on your document style and answer quality.

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 800,
chunkOverlap: 120
});
const docs = await splitter.createDocuments(
['RAG combines retrieval and generation. Store chunks as vectors and fetch similar chunks at query time.'],
[{ source: 'notes.md' }]
);

Store chunks in pgvector

Use PGVectorStore from @langchain/pgvector. PGVectorStore.initialize creates the table if it does not exist; addDocuments then embeds each chunk and stores the vector alongside its content and metadata.

import pg from 'pg';
import { OpenAIEmbeddings } from '@langchain/openai';
import { PGVectorStore } from '@langchain/pgvector';
const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-small' });
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
const vectorStore = await PGVectorStore.initialize(embeddings, {
pool,
tableName: 'rag_documents',
columns: {
idColumnName: 'id',
vectorColumnName: 'vector',
contentColumnName: 'content',
metadataColumnName: 'metadata'
},
distanceStrategy: 'cosine'
});
await vectorStore.addDocuments(docs);
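
For intuition about what a cosine search against this table could look like, the sketch below builds a parameterized query by hand. It is an assumption about the general shape of such a query, not the SQL that PGVectorStore actually issues. buildSimilarityQuery is a hypothetical helper; the column names match the configuration above, and <=> is pgvector's cosine distance operator.

```javascript
// Builds a { text, values } pair in the shape accepted by pg's pool.query().
// Illustrative only: the store generates its own SQL.
function buildSimilarityQuery(tableName, queryVector, k) {
  // Table names cannot be passed as parameters, so allow plain identifiers only.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(tableName)) {
    throw new Error('invalid table name');
  }
  // pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
  const vectorLiteral = `[${queryVector.join(',')}]`;
  return {
    text:
      `SELECT content, metadata, vector <=> $1::vector AS distance ` +
      `FROM ${tableName} ORDER BY distance LIMIT $2`,
    values: [vectorLiteral, k]
  };
}

console.log(buildSimilarityQuery('rag_documents', [0.1, 0.2, 0.3], 4));
```

Smaller distance means a closer match, so ordering by distance ascending returns the most similar chunks first.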

Retrieve context

Turn the vector store into a retriever to fetch the top-k relevant chunks for a question:

const retriever = vectorStore.asRetriever({ k: 4 });
const chunks = await retriever.invoke('How does pgvector semantic search work?');
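
Conceptually, top-k retrieval is a sort by similarity. The in-memory sketch below shows the idea with toy data: the three-dimensional vectors are invented for illustration, and topK is a hypothetical function, not the retriever's implementation.

```javascript
// Cosine similarity for non-zero vectors of equal length.
function cosine(a, b) {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return dot / (norm(a) * norm(b));
}

// Scores every entry against the query vector and keeps the k best.
function topK(queryVector, entries, k) {
  return entries
    .map((entry) => ({ ...entry, score: cosine(queryVector, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy store: made-up vectors standing in for real embeddings.
const store = [
  { text: 'pgvector adds vector search to PostgreSQL.', vector: [0.9, 0.1, 0.0] },
  { text: 'Chunk overlap keeps context at boundaries.', vector: [0.1, 0.9, 0.1] },
  { text: 'LCEL composes runnables into chains.', vector: [0.0, 0.2, 0.9] }
];

console.log(topK([1, 0, 0], store, 2).map((hit) => hit.text));
```

A database does the same ranking, but with an index so it does not have to score every row.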

RAG chain with LCEL

Wire retrieval and generation with LCEL. The retriever supplies context, and the system prompt instructs the model to answer from that context only.

import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { RunnablePassthrough, RunnableSequence } from '@langchain/core/runnables';
import { ChatOpenAI } from '@langchain/openai';
const prompt = ChatPromptTemplate.fromMessages([
[
'system',
'Answer only from the provided context. If context is insufficient, say you need more data.'
],
['human', 'Context:\n{context}\n\nQuestion: {question}']
]);
const model = new ChatOpenAI({ model: 'gpt-5.5' });
const formatDocs = (documents) =>
documents.map((doc) => doc.pageContent).join('\n\n---\n\n');
const chain = RunnableSequence.from([
{
context: retriever,
question: new RunnablePassthrough()
},
(input) => ({
context: formatDocs(input.context),
question: input.question
}),
prompt,
model,
new StringOutputParser()
]);
const answer = await chain.invoke('How does pgvector semantic search work?');
console.log(answer);
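
If you want answers to cite their sources, one option is to include each chunk's source in the context string. The sketch below is a hypothetical variant of formatDocs; the numbering scheme and the 'unknown' fallback are illustrative choices.

```javascript
// Variant of formatDocs that labels each chunk with a number and its source,
// so the prompt can ask the model to cite chunks by number.
function formatDocsWithSources(documents) {
  return documents
    .map((doc, i) => `[${i + 1}] ${doc.metadata?.source ?? 'unknown'}\n${doc.pageContent}`)
    .join('\n\n---\n\n');
}

const retrieved = [
  { pageContent: 'RAG combines retrieval and generation.', metadata: { source: 'notes.md' } },
  { pageContent: 'Store chunks as vectors.', metadata: {} }
];

console.log(formatDocsWithSources(retrieved));
```

It can replace formatDocs in the chain above without other changes, since it takes the same array of documents and returns a string.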

Demo

Runnable scripts for this post live in the rag-openai-embeddings-pgvector-demo folder in the private demos repository. Get access via code demos.