Document loaders and chunking with LangChain
This post covers local file ingestion and chunking in Node.js. For LangChain basics (LCEL, packages, agents), see the LangChain overview post. For embeddings, pgvector, and the full RAG flow, see the RAG with pgvector post - it uses one splitter inline; this post goes deeper on loaders and splitter choice.
Prerequisites
- Node.js version 26
- langchain, @langchain/core, @langchain/classic, and @langchain/textsplitters installed:

    npm i langchain @langchain/core @langchain/classic @langchain/textsplitters

More loader types (web, cloud, audio) live in standalone integration packages - see the document loader integrations page.
The Document type
Every loader returns Document instances from @langchain/core:
- pageContent - the text of the chunk or file
- metadata - optional key/value pairs (source path, section, page) used for citations

    import { Document } from '@langchain/core/documents';

    const doc = new Document({
      pageContent: 'pgvector adds vector search to PostgreSQL.',
      metadata: { source: 'notes/pgvector.txt', section: 'basics' }
    });

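To make the shape concrete, here is a dependency-free sketch that treats a document as a plain object with the same two fields and builds a citation label from its metadata. The object is only a stand-in for a real Document instance, and `formatCitation` is a hypothetical helper, not part of LangChain.

```javascript
// Plain object with the same two fields as a LangChain Document.
// (A stand-in for illustration; real loaders return Document instances.)
const exampleDoc = {
  pageContent: 'pgvector adds vector search to PostgreSQL.',
  metadata: { source: 'notes/pgvector.txt', section: 'basics' }
};

// Hypothetical helper: build a short citation label from whatever
// metadata fields happen to be present.
function formatCitation(doc) {
  const { source = 'unknown source', section, page } = doc.metadata ?? {};
  const parts = [source];
  if (section !== undefined) parts.push(`section: ${section}`);
  if (page !== undefined) parts.push(`page: ${page}`);
  return parts.join(', ');
}

console.log(formatCitation(exampleDoc));
// prints "notes/pgvector.txt, section: basics"
```

Because metadata is optional, the helper falls back to a placeholder instead of throwing when a field is missing.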
Load a single file
Use TextLoader for plain text or markdown files:

    import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

    const loader = new TextLoader('./notes/pgvector.txt');
    const docs = await loader.load();

    console.log(docs[0].pageContent);
    console.log(docs[0].metadata.source);

The loader sets metadata.source to the file path - keep it for citations in RAG answers.
Load a directory
Use DirectoryLoader when you have many files. Map extensions to loader factories:

    import { DirectoryLoader } from '@langchain/classic/document_loaders/fs/directory';
    import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

    const loader = new DirectoryLoader('./notes', {
      '.txt': (path) => new TextLoader(path),
      '.md': (path) => new TextLoader(path)
    });

    const docs = await loader.load();
    console.log(`Loaded ${docs.length} documents`);

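The extension map is easier to reason about once the dispatch is spelled out. The sketch below shows one way a path could be matched to a factory. It is not DirectoryLoader's actual implementation: `extensionOf`, `pickFactory`, the stand-in factories, and the choice to lowercase extensions are all assumptions made for illustration.

```javascript
// Hypothetical sketch of extension-based dispatch, the idea behind the
// loader map. This is not DirectoryLoader's actual implementation.
function extensionOf(filePath) {
  const name = filePath.split('/').pop();
  const dot = name.lastIndexOf('.');
  // No dot, or only a leading dot (".gitignore"), means no extension.
  return dot > 0 ? name.slice(dot).toLowerCase() : '';
}

function pickFactory(filePath, factories) {
  return factories[extensionOf(filePath)] ?? null;
}

// Stand-in factories that only describe what would be loaded.
const factories = {
  '.txt': (path) => ({ kind: 'text', path }),
  '.md': (path) => ({ kind: 'text', path })
};

const files = ['notes/pgvector.txt', 'notes/README.md', 'notes/diagram.png'];
for (const file of files) {
  const factory = pickFactory(file, factories);
  console.log(file, factory ? factory(file).kind : 'skipped (no loader mapped)');
}
```

Files with an unmapped extension, such as the .png here, have no factory, so it is worth checking how many documents were actually loaded.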
PDF, CSV, and JSON loaders are available via other integration packages. This post uses .txt and .md files.
Split documents
Chunking makes retrieval more precise, because each embedding then represents one focused passage instead of a whole file. Rather than embedding one large file, split it into smaller overlapping parts. Pass the docs array from TextLoader or DirectoryLoader to a splitter.
Two parameters matter most:
- chunkSize - target maximum size per chunk (characters or tokens, depending on splitter)
- chunkOverlap - shared text between adjacent chunks so context is not lost at boundaries
Start with chunkSize: 800 and chunkOverlap: 120, then tune based on document style and answer quality.

    import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 800,
      chunkOverlap: 120
    });

    const chunks = await splitter.splitDocuments(docs);
    console.log(chunks.length);

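To see how the two parameters interact, consider a naive fixed-window chunker. This is a minimal sketch with a hypothetical `slidingWindowChunks` function. It ignores paragraph and word boundaries, so it is not how RecursiveCharacterTextSplitter works, but the arithmetic of size and overlap is the same idea.

```javascript
// Naive fixed-window chunker: each window starts (chunkSize - chunkOverlap)
// characters after the previous one. Illustrative only; it cuts through
// words and paragraphs.
function slidingWindowChunks(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const sample = 'abcdefghijklmnopqrstuvwxyz';
console.log(slidingWindowChunks(sample, 10, 3));
// → [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk repeats the last three characters of the previous one ('hij', 'opq', 'vwx'), which is the overlap that keeps context across a boundary.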
Splitter comparison
The example above uses RecursiveCharacterTextSplitter, the default for most RAG setups. Alternatives:
| Splitter | Best for |
|---|---|
| RecursiveCharacterTextSplitter | Default choice; tries paragraphs, then lines, then words, then single characters |
| CharacterTextSplitter | Splitting on one separator (a blank line by default) when finer structure does not matter |
| TokenTextSplitter | When chunk limits must match model token budgets |
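The fallback behaviour of recursive splitting can be sketched without the library. The function below is a simplified illustration, not LangChain's algorithm: `recursiveSplit` is a hypothetical name, the separator list is an assumption chosen to go from coarse to fine, and overlap is left out to keep the logic short.

```javascript
// Simplified recursive split: try the coarsest separator first and fall
// back to finer ones only for pieces that are still too long.
// No overlap handling. A sketch of the idea, not the library's algorithm.
function recursiveSplit(text, chunkSize, separators = ['\n\n', '\n', ' ', '']) {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const [separator, ...finer] = separators;
  if (separator === undefined || separator === '') {
    // Last resort: hard cut into chunkSize-character pieces.
    const pieces = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }

  const chunks = [];
  let current = '';
  for (const part of text.split(separator)) {
    const candidate = current === '' ? part : current + separator + part;
    if (candidate.length <= chunkSize) {
      current = candidate; // keep merging small parts into one chunk
      continue;
    }
    if (current !== '') chunks.push(current);
    if (part.length <= chunkSize) {
      current = part;
    } else {
      // This part alone is too long: retry it with finer separators.
      chunks.push(...recursiveSplit(part, chunkSize, finer));
      current = '';
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

const prose =
  'Intro paragraph.\n\nA second paragraph that is longer than the limit.\n\nEnd.';
console.log(recursiveSplit(prose, 30));
// → [ 'Intro paragraph.', 'A second paragraph that is', 'longer than the limit.', 'End.' ]
```

Only the second paragraph exceeds the limit, so only it is broken at word boundaries. The short paragraphs survive intact, which is why this style tends to suit prose.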
Character-based:

    import { CharacterTextSplitter } from '@langchain/textsplitters';

    const splitter = new CharacterTextSplitter({
      chunkSize: 800,
      chunkOverlap: 120
    });

    const chunks = await splitter.splitDocuments(docs);

Token-based:

    import { TokenTextSplitter } from '@langchain/textsplitters';

    const splitter = new TokenTextSplitter({
      encodingName: 'cl100k_base',
      chunkSize: 200,
      chunkOverlap: 20
    });

    const chunks = await splitter.splitDocuments(docs);

Use token-based splitting when chunks must fit within a model's context window. Character-based recursive splitting is the usual starting point for RAG over prose.
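Before switching splitters, a quick check can show whether character-based chunks are likely to exceed a token budget. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English prose and not a real tokenizer count. `estimateTokens`, `chunksOverBudget`, and the budget of 200 are hypothetical.

```javascript
// Rough heuristic only: about 4 characters per token for English prose.
// Real counts depend on the tokenizer; a token-based splitter is the
// reliable option when limits are strict.
const ASSUMED_CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / ASSUMED_CHARS_PER_TOKEN);
}

// Return the index and estimate of every chunk above the budget.
function chunksOverBudget(chunks, maxTokens) {
  return chunks
    .map((text, index) => ({ index, estimatedTokens: estimateTokens(text) }))
    .filter((entry) => entry.estimatedTokens > maxTokens);
}

const candidateChunks = ['short chunk', 'x'.repeat(1000)];
console.log(chunksOverBudget(candidateChunks, 200));
// → [ { index: 1, estimatedTokens: 250 } ]
```

If many chunks land near or above the budget, that is a sign to lower chunkSize or move to token-based splitting.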
Metadata through the pipeline
Pass metadata when creating documents manually, or rely on loader metadata - splitters preserve it on each chunk:
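
    import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 400,
      chunkOverlap: 60
    });

    // createDocuments takes the texts and a parallel array of metadata objects.
    const chunks = await splitter.createDocuments(
      ['First paragraph.\n\nSecond paragraph.'],
      [{ source: 'manual', section: 'intro' }]
    );

    console.log(chunks[0].metadata);
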
After splitDocuments(docs), each chunk keeps fields like source from the parent document. Use those fields when storing chunks in a vector database or displaying citations.
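The propagation itself is simple to picture: each chunk receives a copy of its parent's metadata. The sketch below uses plain objects and a stand-in paragraph splitter. `splitWithMetadata` and the extra `chunkIndex` field are hypothetical additions for illustration, not something the library is claimed to produce.

```javascript
// Sketch of metadata propagation: every chunk gets a copy of the parent's
// metadata. chunkIndex is a hypothetical extra field added for illustration.
function splitWithMetadata(doc, splitText) {
  return splitText(doc.pageContent).map((text, chunkIndex) => ({
    pageContent: text,
    metadata: { ...doc.metadata, chunkIndex }
  }));
}

const parent = {
  pageContent: 'First paragraph.\n\nSecond paragraph.',
  metadata: { source: 'manual', section: 'intro' }
};

// Stand-in splitter: one chunk per blank-line-separated paragraph.
const byParagraph = (text) => text.split('\n\n');

const pieces = splitWithMetadata(parent, byParagraph);
console.log(pieces[1].metadata);
// → { source: 'manual', section: 'intro', chunkIndex: 1 }
```

Copying with the spread operator means a later change to one chunk's metadata does not leak into its siblings or the parent.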
Choosing parameters
- Short FAQs or API docs - smaller chunkSize (300–500) for precise retrieval
- Long guides or blog posts - larger chunkSize (800–1200) to keep sections together
- More overlap - helps when answers span chunk boundaries; increases storage and embedding cost
- Less overlap - fewer redundant chunks; risk losing context at splits
Tune with real questions from your domain.
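The storage cost of overlap can be estimated with a little arithmetic before running anything. The sketch below assumes plain fixed-size windows, so real splitters that cut at separators will give somewhat different counts. `estimateChunkCount` and the 100,000-character corpus are hypothetical.

```javascript
// Estimate how many fixed-size windows cover a text of a given length.
// Assumes plain sliding windows; real splitters cut at separators, so
// actual counts will differ somewhat.
function estimateChunkCount(textLength, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  // Each window after the first advances by (chunkSize - chunkOverlap).
  return Math.ceil((textLength - chunkOverlap) / (chunkSize - chunkOverlap));
}

const textLength = 100_000; // hypothetical corpus size in characters
for (const chunkOverlap of [0, 120, 240]) {
  console.log(chunkOverlap, estimateChunkCount(textLength, 800, chunkOverlap));
}
// prints "0 125", then "120 147", then "240 179"
```

Under these assumptions, going from no overlap to 240 characters of overlap raises the chunk count from 125 to 179, which means proportionally more embeddings to compute and store.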
Demo
Runnable loader and splitter scripts for this post live in the langchain-loaders-chunking-demo folder. Get access via code demos.