RAG with OpenAI Embeddings, pgvector and LangChain
Retrieval-Augmented Generation (RAG) is a practical pattern: store knowledge as embeddings, retrieve the most relevant chunks with semantic search, then generate an answer grounded in that context.
This guide shows a simple end-to-end flow with OpenAI embeddings, PostgreSQL + pgvector, and LangChain chunking.
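Architecture at a glance
Indexing: documents → LangChain text splitter → OpenAI embeddings → rows in a pgvector table.
Querying: user question → OpenAI embedding → nearest-neighbor search in pgvector → top chunks as context → LLM answer.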
Prerequisites
- OpenAI account
- Generated API key
- Enabled billing
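- Node.js version 26, with the snippets run as ES modules (.mjs files or "type": "module" in package.json), since they use import and top-level await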
- PostgreSQL with pgvector extension enabled
- npm packages: openai, langchain, pg, pgvector
What are embeddings?
Embeddings are numeric vectors that represent the semantic meaning of text. Similar text should produce vectors that are close in vector space.
In practice:
- Convert document chunks to vectors and store them in pgvector
- Convert a user question to a vector
- Run a nearest-neighbor search to find the most relevant chunks
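To make "close in vector space" concrete, here is a minimal sketch of cosine distance, the measure pgvector's <=> operator computes (1 minus cosine similarity). The cosineDistance function and the tiny 3-dimensional vectors are illustrative assumptions; real text-embedding-3-small vectors have 1536 dimensions.

```javascript
// Cosine distance = 1 - (a · b) / (|a| * |b|).
// 0 means same direction, 1 means orthogonal, 2 means opposite.
function cosineDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values, not real model output)
const pgvectorDoc = [0.9, 0.1, 0.0];
const sqlQuestion = [0.8, 0.2, 0.1];
const cookingDoc = [0.0, 0.1, 0.9];

console.log(cosineDistance(sqlQuestion, pgvectorDoc)); // small distance: similar meaning
console.log(cosineDistance(sqlQuestion, cookingDoc)); // larger distance: unrelated topic
```

The nearest-neighbor search in pgvector does the same comparison between the question vector and every stored chunk vector (or an index-accelerated approximation of it).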
OpenAI client setup
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
Embedding one input element
Use a single string when embedding a user query.
const response = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'How do I connect pgvector to PostgreSQL?',
});

const queryEmbedding = response.data[0].embedding;
console.log(queryEmbedding.length); // 1536 for text-embedding-3-small (default dimensions)
Embedding multiple input elements
Use an array to embed multiple chunks in one API call.
const chunks = [
  'pgvector adds vector similarity search to PostgreSQL.',
  'LangChain helps split long documents into retrieval-friendly chunks.',
  'RAG retrieves context first, then asks an LLM to answer.',
];

const response = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunks,
});

// Each result carries an index pointing back to its position in the input array
const rows = response.data.map((item) => ({
  text: chunks[item.index],
  embedding: item.embedding,
}));

console.log(rows.length); // 3
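The embeddings endpoint limits how many inputs and tokens a single request may contain (check the OpenAI API reference for the current numbers), so a large corpus usually needs several calls. A minimal sketch, assuming a hypothetical toBatches helper and an arbitrary batch size of 100, splits the chunk list into slices that can each go into one client.embeddings.create call:

```javascript
// Split an array into consecutive slices of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage with the OpenAI client (not executed here; 100 is an illustrative assumption, not an official limit):
// for (const batch of toBatches(allChunks, 100)) {
//   const res = await client.embeddings.create({ model: 'text-embedding-3-small', input: batch });
//   // res.data[i].index points back into `batch`, not into `allChunks`
// }

console.log(toBatches(['a', 'b', 'c', 'd', 'e'], 2));
```

Keeping batches moderate also makes it cheaper to retry a single failed request instead of re-embedding the whole corpus.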
Chunking documents with LangChain
Chunking makes retrieval more precise. Instead of embedding one large document, split it into smaller overlapping parts.
Start with chunkSize: 800 and chunkOverlap: 120 (both measured in characters by default), then adjust them based on your document style and the quality of the answers you get.
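import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
// If this path is unavailable in your LangChain version, import from '@langchain/textsplitters' instead

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120,
});

// createDocuments takes an array of raw strings and returns Document objects
const docs = await splitter.createDocuments([
  `RAG combines retrieval and generation. Store chunks as vectors and fetch similar chunks at query time.`,
]);

// This sample is shorter than chunkSize, so it comes back as a single chunk
console.log(docs.map((doc) => doc.pageContent));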
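To see what chunkSize and chunkOverlap do, here is a deliberately simplified fixed-window chunker. It is not LangChain's algorithm: RecursiveCharacterTextSplitter first tries to break on separators such as blank lines and newlines, while this hypothetical chunkText helper just slides a window of chunkSize characters forward by chunkSize - chunkOverlap each step.

```javascript
// Simplified sliding-window chunker (illustration only, not LangChain's splitter).
function chunkText(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this window reaches the end of the text
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

Neighboring chunks share chunkOverlap characters ("d" and "g" above), so text near a boundary shows up in both chunks and is less likely to be split away from its context.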
Store embeddings in pgvector
Create a table with a vector column sized to the embedding model. text-embedding-3-small outputs 1536 dimensions by default, so the column is VECTOR(1536).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS rag_chunks (
  id BIGSERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  source TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
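PostgreSQL rejects a vector whose length differs from the column's declared dimensions, and pgvector also rejects NaN and infinite values, so it can help to validate embeddings before they reach the database. The sketch below is a hypothetical helper (not part of the pgvector package) that checks the length against 1536 and serializes the array into pgvector's text format, [x,y,z]:

```javascript
const EMBEDDING_DIMENSIONS = 1536; // matches VECTOR(1536) in rag_chunks

// Hypothetical helper: validate an embedding and format it as a pgvector text literal.
function toVectorLiteral(embedding, dimensions = EMBEDDING_DIMENSIONS) {
  if (embedding.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('Embedding contains NaN or Infinity, which pgvector rejects');
  }
  // pgvector accepts vectors as text such as "[0.1,0.2,0.3]"
  return `[${embedding.join(',')}]`;
}

console.log(toVectorLiteral([1, 2.5, -3], 3)); // [1,2.5,-3]
```

The returned string can be bound as a query parameter in the same position where the examples pass pgvector.toSql(...); the early length check turns a model or dimensions mismatch into a clear error before any SQL runs.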
Insert chunk vectors from Node.js:
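import pg from 'pg';
import pgvector from 'pgvector/pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Register the vector type on each new pool connection (registerTypes works on a client)
pool.on('connect', async (client) => {
  await pgvector.registerTypes(client);
});

// Store each chunk with its own embedding (rows comes from the multiple-input embedding example)
for (const row of rows) {
  await pool.query(
    'INSERT INTO rag_chunks (content, embedding, source) VALUES ($1, $2, $3)',
    [row.text, pgvector.toSql(row.embedding), 'notes.md']
  );
}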
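One INSERT per chunk means one database round trip per row. A possible alternative is a single multi-row INSERT. The buildChunkInsert helper below is hypothetical: it assumes each row has the text and embedding fields produced by the multiple-input embedding example, and it serializes embeddings with JSON.stringify, which yields pgvector's [x,y,z] text format for plain number arrays.

```javascript
// Hypothetical helper: build one multi-row INSERT for rag_chunks.
// Each chunk uses three placeholders: content, embedding, source.
function buildChunkInsert(rows, source) {
  const values = [];
  const tuples = rows.map((row, i) => {
    const base = i * 3;
    values.push(row.text, JSON.stringify(row.embedding), source);
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });
  return {
    text: `INSERT INTO rag_chunks (content, embedding, source) VALUES ${tuples.join(', ')}`,
    values,
  };
}

const query = buildChunkInsert(
  [
    { text: 'chunk one', embedding: [0.1, 0.2] },
    { text: 'chunk two', embedding: [0.3, 0.4] },
  ],
  'notes.md'
);
console.log(query.text);
// With a pg pool this would run as: await pool.query(query.text, query.values);
```

PostgreSQL caps the number of bind parameters in one statement (65535), so very large corpora still need to be inserted in slices.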
Semantic search in pgvector
Embed the user question, then retrieve the nearest chunks by cosine distance (pgvector's <=> operator).
Lower distance means a closer semantic match; cosine distance runs from 0 (same direction) to 2 (opposite direction).
Top-k is the number of nearest chunks you return; this query uses k = 4 via LIMIT 4.
You can also apply a simple distance threshold (for example 0.4) to discard weak matches.
A cutoff somewhere in the 0.35 to 0.45 range is a reasonable starting point for cosine distance, but the right value depends on the embedding model and your content, so tune it with real questions from your domain.
// queryEmbedding comes from embedding the user question (see the single-input example)
const searchResult = await pool.query(
  `SELECT id, content, source, embedding <=> $1::vector AS distance
   FROM rag_chunks
   ORDER BY embedding <=> $1::vector
   LIMIT 4`,
  [pgvector.toSql(queryEmbedding)]
);

const contextChunks = searchResult.rows.map((row) => row.content);
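The query ranks rows by cosine distance and keeps the first k. An in-memory equivalent can be handy for unit-testing retrieval logic without a database; this sketch uses a hypothetical topK helper, a brute-force (exact) scan, and toy 2-dimensional vectors, and mirrors the SQL ordering while also returning the distance column.

```javascript
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbors: score every row, sort by ascending distance, keep k.
function topK(queryEmbedding, rows, k) {
  return rows
    .map((row) => ({ ...row, distance: cosineDistance(queryEmbedding, row.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Toy rows (hypothetical data, not real embeddings)
const sampleRows = [
  { id: 1, content: 'pgvector adds vector search', embedding: [1, 0] },
  { id: 2, content: 'unrelated note', embedding: [0, 1] },
  { id: 3, content: 'PostgreSQL extension setup', embedding: [0.9, 0.1] },
];

console.log(topK([1, 0], sampleRows, 2).map((r) => r.id)); // [ 1, 3 ]
```

A full scan like this is fine for tests and small tables; in the database, the same ORDER BY ... LIMIT pattern is what pgvector indexes are designed to speed up.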
Threshold filtering example:
const DISTANCE_THRESHOLD = 0.4;

// Keep only chunks that are close enough to the question
const filteredChunks = searchResult.rows
  .filter((row) => Number(row.distance) <= DISTANCE_THRESHOLD)
  .map((row) => row.content);
If no chunks pass the threshold, skip answer generation and return a fallback message:
if (filteredChunks.length === 0) {
  console.log('I do not have enough context to answer this.');
  process.exit(0); // in a long-running server, return this message from the handler instead of exiting
}
Generate answer from retrieved context
Use retrieved chunks as grounded context for the final model call.
// Same question that was embedded for the search
const question = 'How do I connect pgvector to PostgreSQL?';

// Use the threshold-filtered chunks (swap in contextChunks if you skip filtering)
const context = filteredChunks.join('\n\n---\n\n');

const answer = await client.responses.create({
  model: 'gpt-5.5',
  instructions:
    'Answer only from the provided context. If context is insufficient, respond with: I do not have enough context to answer this.',
  input: `Context:\n${context}\n\nQuestion: ${question}`,
});

console.log(answer.output_text);
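Put together, one request runs: embed the question, search, filter by threshold, then either fall back or generate. The sketch below wires those steps through injected functions so the control flow can be tested with fakes. embed, search and generate are hypothetical parameters; in a real app you would back them with the client.embeddings.create, pool.query and client.responses.create calls shown earlier (including the instructions string).

```javascript
const FALLBACK = 'I do not have enough context to answer this.';

// Orchestrates one RAG request; dependencies are injected (real API/DB calls or test fakes).
async function answerQuestion(question, { embed, search, generate, k = 4, threshold = 0.4 }) {
  const queryEmbedding = await embed(question);
  // search is expected to return rows sorted by ascending distance, like the SQL query
  const rows = await search(queryEmbedding, k);
  const chunks = rows
    .filter((row) => Number(row.distance) <= threshold)
    .map((row) => row.content);
  if (chunks.length === 0) {
    return FALLBACK; // skip the model call when nothing is relevant enough
  }
  const context = chunks.join('\n\n---\n\n');
  return generate(`Context:\n${context}\n\nQuestion: ${question}`);
}

// Usage with fakes: generate simply echoes the prompt it would send to the model
const fakeDeps = {
  embed: async () => [1, 0],
  search: async () => [
    { content: 'pgvector adds vector similarity search to PostgreSQL.', distance: 0.2 },
    { content: 'Unrelated chunk', distance: 0.9 },
  ],
  generate: async (input) => input,
};
answerQuestion('What does pgvector do?', fakeDeps).then(console.log);
```

Injecting the three steps keeps the threshold and fallback logic in one place, and it avoids calling the model at all when retrieval finds nothing useful.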
Demo
Runnable scripts for this post live in the rag-openai-embeddings-pgvector-demo folder in the private demos repository. Get access via code demos.