Introduction
I built a RAG (Retrieval-Augmented Generation) application that lets users ask natural language questions about a research project’s published reports (10 volumes) and receive answers with citations to the relevant source materials.
This article covers the tech stack and key design decisions behind the app.
Architecture Overview
User
↓ Question
Next.js (App Router)
↓ API Route
Query Rewrite (LLM)
↓ Refined search query
Embedding Generation (text-embedding-3-small)
↓ Vector
Pinecone (Vector Search, topK=8)
↓ Relevant chunks
LLM (Claude Sonnet) ← System prompt + Context
↓ SSE Streaming
Display answer to user
Frontend
Next.js 16 + React 19 + TypeScript
The app uses the App Router with a simple 3-page structure.
| Path | Content |
|---|---|
| / | Landing page (with example question links) |
| /chat | Chat UI |
| /about | About the site |
The chat page uses useSearchParams so that clicking an example question on the landing page navigates to /chat?q=... and automatically submits the query.
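The auto-submit behavior boils down to reading `q` from the URL. As a rough sketch, the logic (minus the React wiring) might look like this pure helper — the function name and empty-value handling are illustrative, not the app's actual code:

```typescript
// Hypothetical helper: derive the initial query from the URL's search string.
// The real page uses useSearchParams inside a React component; this pure
// function captures the same logic so it can run anywhere.
function initialQueryFromSearch(search: string): string | null {
  const params = new URLSearchParams(search);
  const q = params.get("q");
  // Treat missing, empty, or whitespace-only values as "no query"
  return q && q.trim().length > 0 ? q.trim() : null;
}
```

On mount, the chat page would call this with `window.location.search` (or the `useSearchParams` result) and, if a query is present, submit it immediately.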
Tailwind CSS v4
Tailwind CSS v4 is used for styling. In v4, setup is as simple as @import "tailwindcss", eliminating the need for tailwind.config.js.
Streaming Display
Answers are streamed via SSE (Server-Sent Events). The frontend reads chunks incrementally using ReadableStream’s getReader() and renders Markdown in real-time with react-markdown.
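The reader loop ultimately has to turn raw SSE text into content deltas. The parsing step could look like the sketch below — it assumes the OpenAI-style event shape (`choices[0].delta.content`) shown later in this post, and a real reader would also buffer partial lines across chunks:

```typescript
// Sketch (not the app's exact code): parse a buffered SSE payload into the
// text deltas the frontend appends to the current assistant message.
function extractDeltas(sseText: string): string[] {
  const deltas: string[] = [];
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    try {
      const json = JSON.parse(payload);
      const content = json.choices?.[0]?.delta?.content;
      if (typeof content === "string") deltas.push(content);
    } catch {
      // Incomplete JSON (event split mid-chunk); a real reader buffers it
    }
  }
  return deltas;
}
```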
While waiting for a response, a dot-pulse animation (CSS @keyframes) indicates that processing is underway.
// Show loading animation for empty assistant messages
{msg.content ? (
<ReactMarkdown>{msg.content}</ReactMarkdown>
) : (
<div className="flex items-center gap-1.5">
<span className="w-2 h-2 rounded-full dot-pulse" />
<span className="w-2 h-2 rounded-full dot-pulse" />
<span className="w-2 h-2 rounded-full dot-pulse" />
</div>
)}
Backend
API Route (/api/chat)
A single Next.js Route Handler processes requests through the following pipeline:
- Query Rewrite — If there’s conversation history, the LLM rewrites the search query into a self-contained question
- Embedding Generation — Vectorize the search query
- Vector Search — Similarity search in Pinecone (topK=8)
- Filtering — Remove low-quality chunks (score threshold, Japanese character ratio, CSV detection, minimum length)
- Answer Generation — Stream an LLM response using the context-enriched prompt
AI Provider Abstraction
AI providers are abstracted in src/lib/ai.ts and can be switched via the AI_PROVIDER environment variable.
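The shape of that abstraction might look something like the following — the interface and function names are illustrative, not the actual contents of src/lib/ai.ts:

```typescript
// Hedged sketch of the provider switch. Both providers must expose the same
// two capabilities the pipeline needs: embedding and (streaming) chat.
interface AIProvider {
  name: string;
  embed(text: string): Promise<number[]>;
  chat(messages: { role: string; content: string }[]): Promise<ReadableStream>;
}

function selectProvider(
  env: Record<string, string | undefined>
): "openrouter" | "bedrock" {
  // OpenRouter is the default; AWS Bedrock is opt-in via AI_PROVIDER=bedrock
  return env.AI_PROVIDER === "bedrock" ? "bedrock" : "openrouter";
}
```

The rest of the route handler only talks to the `AIProvider` interface, which is what makes the environment-variable switch a one-line change.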
| Provider | Embedding | Chat |
|---|---|---|
| OpenRouter (default) | OpenAI text-embedding-3-small | Claude Sonnet 4 |
| AWS Bedrock | Amazon Titan Embed Text v2 | Claude Sonnet 4 |
OpenRouter provides an OpenAI-compatible API, allowing various models to be used through a unified interface. AWS Bedrock uses the Anthropic Messages API format, and the response stream is converted to SSE format for consistent handling.
// Convert Bedrock stream to SSE format
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
for await (const event of response.body) {
if (event.chunk?.bytes) {
const json = JSON.parse(new TextDecoder().decode(event.chunk.bytes));
if (json.type === "content_block_delta" && json.delta?.text) {
const sseData = {
choices: [{ delta: { content: json.delta.text } }],
};
controller.enqueue(
encoder.encode(`data: ${JSON.stringify(sseData)}\n\n`)
);
}
}
}
controller.close();
},
});
Vector DB — Pinecone
Pinecone was chosen as the vector database.
Initially, I considered Supabase’s pgvector and had even prepared the schema, but migrated to Pinecone because its managed vector search service is simpler to operate and delivers sufficient performance on the free tier.
Pinecone’s free tier (Starter plan) limits are as follows:
| Item | Limit |
|---|---|
| Indexes | 5 |
| Storage | 2 GB |
| Writes | 2M write units/month |
| Reads | 1M read units/month |
| Region | us-east-1 only |
For a RAG app with a few thousand chunks like this one, the free tier is more than enough. However, note that indexes are paused after 3 weeks of inactivity.
// Simple query
const results = await index.query({
vector: queryEmbedding,
topK: 8,
includeMetadata: true,
});
Data Ingestion Pipeline
The historical documents are provided as HTML files (some in Shift_JIS encoding) and text files. The ingestion script (scripts/ingest.mjs) performs the following steps.
1. Text Extraction
- HTML: Parsed with cheerio, removing script/style tags to extract text
- Character Encoding: UTF-8 is tried first; if decoding fails, iconv-lite decodes as Shift_JIS
- TXT: Buffer-based reading with the same encoding detection
The source materials consist of approximately 7,600 HTML files and 420 text files, with HTML the vast majority. Many of the older HTML documents use Shift_JIS encoding, so automatic encoding detection is an easy-to-overlook but essential step.
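The ingest script uses iconv-lite, but the same "try UTF-8 first, fall back to Shift_JIS" idea can be sketched dependency-free with Node's built-in TextDecoder (which supports shift_jis on the default full-ICU builds):

```typescript
// Illustrative fallback decoder; the actual script uses iconv-lite instead.
function decodeSmart(bytes: Uint8Array): string {
  try {
    // fatal: true makes invalid UTF-8 throw instead of emitting U+FFFD
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("shift_jis").decode(bytes);
  }
}
```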
2. Chunking
A fixed-length (1000 characters) sliding window approach with 200-character overlap is used. Chunks shorter than 50 characters are discarded.
function splitIntoChunks(text, maxChars = 1000, overlap = 200) {
const chunks = [];
let start = 0;
while (start < text.length) {
const end = Math.min(start + maxChars, text.length);
const chunk = text.slice(start, end);
    if (chunk.length >= 50) chunks.push(chunk); // discard chunks under 50 chars
start += maxChars - overlap;
}
return chunks;
}
3. Idempotent Ingestion
Chunk IDs are generated as MD5 hashes of the content + source path. Before ingestion, existing IDs are checked against Pinecone, and only unregistered chunks are processed. A --fresh flag enables full re-ingestion.
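The ID scheme can be sketched as follows; the separator byte is my addition to avoid ambiguity between path and content, and the actual script's exact concatenation may differ:

```typescript
import { createHash } from "node:crypto";

// Illustrative chunk-ID scheme: MD5 over content + source path, so re-running
// ingestion produces identical IDs for unchanged chunks (idempotency).
function chunkId(content: string, sourcePath: string): string {
  return createHash("md5")
    .update(content + "\u0000" + sourcePath)
    .digest("hex");
}
```

Because the ID is a pure function of the chunk, a simple "fetch existing IDs, skip matches" pass before upserting makes re-runs cheap and safe.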
4. Batch Processing
Embedding generation and Pinecone upserts are executed in batches of 50, with 1-second intervals for rate limiting. Failed operations are retried up to 3 times.
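The batching itself is straightforward; a minimal helper might look like this (the batch size matches the text, the helper itself is illustrative — in the real script each batch call would additionally be wrapped in the retry-and-sleep logic described above):

```typescript
// Split an array into fixed-size batches for embedding + upsert calls.
function toBatches<T>(items: T[], size = 50): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```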
Search Quality Improvements
Query Rewrite
In multi-turn conversations, users often ask follow-up questions with pronouns like “tell me more about that.” By rewriting these into self-contained search queries using conversation history, vector search accuracy is maintained.
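The post doesn't show the actual rewrite prompt, but the step amounts to feeding the transcript plus the latest turn to the LLM with a narrow instruction. A hypothetical prompt builder:

```typescript
// Hypothetical prompt builder for the rewrite step; the app's real wording
// is not shown in this post.
type Turn = { role: "user" | "assistant"; content: string };

function buildRewritePrompt(history: Turn[], question: string): string {
  const transcript = history.map((t) => `${t.role}: ${t.content}`).join("\n");
  return [
    "Rewrite the final question as a single self-contained search query,",
    "resolving pronouns and references using the conversation below.",
    "",
    transcript,
    `user: ${question}`,
  ].join("\n");
}
```

The LLM's one-line output then replaces the raw user turn as the input to embedding and vector search.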
Low-Quality Chunk Filtering
The following filters are applied to remove noise from vector search results:
- Exclude chunks with similarity scores below 0.5
- Exclude chunks with less than 20% Japanese characters (to filter out HTML remnants and alphanumeric noise)
- Exclude chunks where commas exceed 5% of the content (to filter out CSV data)
- Exclude chunks shorter than 100 characters
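The four filters above translate directly into one predicate. The thresholds below match the text; the character-class regex and other implementation details are assumptions:

```typescript
// Combined quality filter for retrieved chunks. Returns true if the chunk
// should be kept as context for the LLM.
function isUsableChunk(text: string, score: number): boolean {
  if (score < 0.5) return false; // low similarity
  if (text.length < 100) return false; // too short to be useful
  // Hiragana, katakana, and CJK ideographs as a proxy for "Japanese prose"
  const jp = (text.match(/[\u3040-\u30ff\u3400-\u9fff]/g) ?? []).length;
  if (jp / text.length < 0.2) return false; // HTML remnants / alphanumeric noise
  const commas = (text.match(/,/g) ?? []).length;
  if (commas / text.length > 0.05) return false; // likely CSV data
  return true;
}
```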
Source Citations
Every answer includes citations (volume number, document name) with links to the original sources. The system prompt explicitly instructs the LLM to “never supplement, guess, or fabricate information not found in the reference materials,” suppressing hallucinations.
Authentication
Setting BASIC_USER / BASIC_PASS environment variables enables Basic authentication. If unset, the app is publicly accessible. This is convenient for switching between staging and production environments.
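The credential check behind that switch is simple enough to sketch as a pure function. In Next.js it would live in middleware; this version only shows the decode-and-compare logic, and the names are illustrative:

```typescript
// Sketch of the Basic auth check driven by BASIC_USER / BASIC_PASS.
// Buffer is Node's built-in base64 decoder.
function checkBasicAuth(
  header: string | null,
  user: string | undefined,
  pass: string | undefined
): boolean {
  if (!user || !pass) return true; // auth disabled: publicly accessible
  if (!header?.startsWith("Basic ")) return false;
  const decoded = Buffer.from(header.slice(6), "base64").toString("utf-8");
  const sep = decoded.indexOf(":");
  if (sep < 0) return false;
  return decoded.slice(0, sep) === user && decoded.slice(sep + 1) === pass;
}
```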
Tech Stack Summary
| Category | Technology |
|---|---|
| Framework | Next.js 16 (App Router) |
| Language | TypeScript, React 19 |
| Styling | Tailwind CSS v4 |
| Vector DB | Pinecone |
| Embedding | OpenAI text-embedding-3-small (via OpenRouter) |
| LLM | Claude Sonnet 4 (via OpenRouter / AWS Bedrock) |
| Data Ingestion | cheerio, iconv-lite |
| Markdown Rendering | react-markdown |
| Authentication | Basic Auth (optional) |
Conclusion
For specialized domains like historical documents, RAG provides a significant advantage by enabling “source-grounded answers.” Relying solely on LLM knowledge tends to produce inaccurate information, but by retrieving relevant documents through vector search and citing sources explicitly, trustworthy answers become possible.
On the technical side, the architecture leverages each tool’s strengths: OpenRouter for API unification, Pinecone for managed vector search, and Next.js App Router for streaming support. Switching to AWS Bedrock requires only changing an environment variable thanks to the provider abstraction, allowing flexible choices based on cost and requirements.