Introduction

I built a RAG (Retrieval-Augmented Generation) application that lets users ask natural language questions about a research project’s published reports (10 volumes) and receive answers with citations to the relevant source materials.

This article covers the tech stack and key design decisions behind the app.

Architecture Overview

User
  ↓ Question
Next.js (App Router)
  ↓ API Route
Query Rewrite (LLM)
  ↓ Refined search query
Embedding Generation (text-embedding-3-small)
  ↓ Vector
Pinecone (Vector Search, topK=8)
  ↓ Relevant chunks
LLM (Claude Sonnet) ← System prompt + Context
  ↓ SSE Streaming
Display answer to user

Frontend

Next.js 16 + React 19 + TypeScript

The app uses the App Router with a simple 3-page structure.

Path      Content
/         Landing page (with example question links)
/chat     Chat UI
/about    About the site

The chat page uses useSearchParams so that clicking an example question on the landing page navigates to /chat?q=... and automatically submits the query.
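The parsing side of that flow can be sketched as a small helper (the function name and exact wiring are illustrative, not the app's actual code; the real page presumably calls `useSearchParams()` inside a client component):

```typescript
// Hypothetical helper: extract the initial question from the URL search string.
export function initialQuestion(search: string): string | null {
  const q = new URLSearchParams(search).get("q");
  return q && q.trim() !== "" ? q.trim() : null;
}

// In the chat page (illustrative usage):
//   const params = useSearchParams();
//   const q = initialQuestion(params.toString());
//   useEffect(() => { if (q) submit(q); }, []);
```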

Tailwind CSS v4

Tailwind CSS v4 is used for styling. In v4, setup is as simple as @import "tailwindcss", eliminating the need for tailwind.config.js.

Streaming Display

Answers are streamed via SSE (Server-Sent Events). The frontend reads chunks incrementally using ReadableStream’s getReader() and renders Markdown in real time with react-markdown.
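A sketch of that read loop, assuming OpenAI-style `data: {...}` lines (the same shape the Bedrock converter shown later emits); the parsing is factored into a helper, and the `[DONE]` sentinel check is defensive rather than confirmed app behavior:

```typescript
// Hypothetical helper: pull text deltas out of a decoded SSE chunk.
export function extractDeltas(sseText: string): string[] {
  const out: string[] = [];
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    try {
      const json = JSON.parse(line.slice(6));
      const delta = json.choices?.[0]?.delta?.content;
      if (typeof delta === "string") out.push(delta);
    } catch {
      // ignore partial JSON split across network chunks
    }
  }
  return out;
}

// Illustrative read loop in the chat component:
//   const reader = res.body!.getReader();
//   const decoder = new TextDecoder();
//   for (;;) {
//     const { done, value } = await reader.read();
//     if (done) break;
//     for (const d of extractDeltas(decoder.decode(value, { stream: true })))
//       setContent((prev) => prev + d);
//   }
```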

While waiting for a response, a dot-pulse animation (CSS @keyframes) indicates that processing is underway.

// Show loading animation for empty assistant messages
{msg.content ? (
  <ReactMarkdown>{msg.content}</ReactMarkdown>
) : (
  <div className="flex items-center gap-1.5">
    <span className="w-2 h-2 rounded-full dot-pulse" />
    <span className="w-2 h-2 rounded-full dot-pulse" />
    <span className="w-2 h-2 rounded-full dot-pulse" />
  </div>
)}

Backend

API Route (/api/chat)

A single Next.js Route Handler processes requests through the following pipeline:

  1. Query Rewrite — If there’s conversation history, the LLM rewrites the search query into a self-contained question
  2. Embedding Generation — Vectorize the search query
  3. Vector Search — Similarity search in Pinecone (topK=8)
  4. Filtering — Remove low-quality chunks (score threshold, Japanese character ratio, CSV detection, minimum length)
  5. Answer Generation — Stream an LLM response using the context-enriched prompt
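The five steps above could be wired together roughly like this. This is a sketch with dependencies injected to keep the flow testable; the helper names and signatures are my own, and the real handler streams the answer over SSE rather than returning a string:

```typescript
// Hypothetical shape of the /api/chat pipeline.
type Deps = {
  rewrite: (q: string, history: string[]) => Promise<string>;
  embed: (q: string) => Promise<number[]>;
  search: (v: number[], topK: number) => Promise<{ text: string; score: number }[]>;
  generate: (q: string, context: string[]) => Promise<string>; // streams in the real app
};

export async function answer(question: string, history: string[], deps: Deps): Promise<string> {
  // 1. Rewrite into a self-contained query only when history exists
  const query = history.length > 0 ? await deps.rewrite(question, history) : question;
  // 2. Vectorize the search query
  const vector = await deps.embed(query);
  // 3. Similarity search (topK=8)
  const hits = await deps.search(vector, 8);
  // 4. Filtering (score threshold shown; the app applies further heuristics)
  const context = hits.filter((h) => h.score >= 0.5).map((h) => h.text);
  // 5. Generate the answer from the context-enriched prompt
  return deps.generate(question, context);
}
```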

AI Provider Abstraction

AI providers are abstracted in src/lib/ai.ts and can be switched via the AI_PROVIDER environment variable.
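The switch itself might look like this (a minimal sketch; the actual contents of src/lib/ai.ts aren't shown in this article, and the default/validation behavior is assumed):

```typescript
// Hypothetical provider selection keyed off the AI_PROVIDER env var.
export type Provider = "openrouter" | "bedrock";

export function resolveProvider(env: Record<string, string | undefined>): Provider {
  const p = (env.AI_PROVIDER ?? "openrouter").toLowerCase();
  if (p === "openrouter" || p === "bedrock") return p;
  throw new Error(`Unknown AI_PROVIDER: ${p}`);
}

// Usage (illustrative): const provider = resolveProvider(process.env);
```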

Provider              Embedding                      Chat
OpenRouter (default)  OpenAI text-embedding-3-small  Claude Sonnet 4
AWS Bedrock           Amazon Titan Embed Text v2     Claude Sonnet 4

OpenRouter provides an OpenAI-compatible API, allowing various models to be used through a unified interface. AWS Bedrock uses the Anthropic Messages API format, and the response stream is converted to SSE format for consistent handling.

// Convert Bedrock stream to SSE format
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const stream = new ReadableStream({
  async start(controller) {
    for await (const event of response.body) {
      if (event.chunk?.bytes) {
        const json = JSON.parse(decoder.decode(event.chunk.bytes));
        if (json.type === "content_block_delta" && json.delta?.text) {
          // Re-wrap the Anthropic delta in an OpenAI-compatible SSE payload
          const sseData = {
            choices: [{ delta: { content: json.delta.text } }],
          };
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify(sseData)}\n\n`)
          );
        }
      }
    }
    controller.close();
  },
});

Vector DB — Pinecone

Pinecone was chosen as the vector database.

Initially, I considered Supabase’s pgvector and had even prepared the schema, but migrated to Pinecone because its managed vector search service is simpler to operate and delivers sufficient performance on the free tier.

Pinecone’s free tier (Starter plan) limits are as follows:

Item     Limit
Indexes  5
Storage  2 GB
Writes   2M write units/month
Reads    1M read units/month
Region   us-east-1 only

For a RAG app with a few thousand chunks like this one, the free tier is more than enough. However, note that indexes are paused after 3 weeks of inactivity.

// Simple query
const results = await index.query({
  vector: queryEmbedding,
  topK: 8,
  includeMetadata: true,
});

Data Ingestion Pipeline

The historical documents are provided as HTML files (some in Shift_JIS encoding) and text files. The ingestion script (scripts/ingest.mjs) performs the following steps.

1. Text Extraction

  • HTML: Parsed with cheerio, removing script/style tags to extract text
  • Character Encoding: UTF-8 is tried first; if decoding fails, iconv-lite decodes as Shift_JIS
  • TXT: Buffer-based reading with the same encoding detection

The source materials consist of approximately 7,600 HTML files and 420 text files, with HTML making up the vast majority. Many of the older HTML documents use Shift_JIS encoding, so automatic encoding detection is a small but essential feature.
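The fallback logic can be sketched as follows. Note the real script uses iconv-lite; here Node's built-in TextDecoder stands in (it supports the `shift_jis` label on standard Node builds with full ICU) so the sketch is self-contained:

```typescript
// Hypothetical decode-with-fallback: try strict UTF-8 first, then Shift_JIS.
export function decodeWithFallback(buf: Uint8Array): { text: string; encoding: string } {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 byte sequences
    return { text: new TextDecoder("utf-8", { fatal: true }).decode(buf), encoding: "utf-8" };
  } catch {
    return { text: new TextDecoder("shift_jis").decode(buf), encoding: "shift_jis" };
  }
}
```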

2. Chunking

A fixed-length (1000 characters) sliding window approach with 200-character overlap is used. Chunks shorter than 50 characters are discarded.

// Fixed-length sliding window with overlap
function splitIntoChunks(text, maxChars = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    const chunk = text.slice(start, end);
    if (chunk.length >= 50) chunks.push(chunk); // discard chunks shorter than 50 chars
    start += maxChars - overlap; // advance 800 chars, keeping 200 chars of overlap
  }
  return chunks;
}

3. Idempotent Ingestion

Chunk IDs are generated as MD5 hashes of the content + source path. Before ingestion, existing IDs are checked against Pinecone, and only unregistered chunks are processed. A --fresh flag enables full re-ingestion.
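The ID scheme can be sketched like this (assumed to match the ingest script's "MD5 of content + source path" description; the exact concatenation order is my guess):

```typescript
// Hypothetical chunk-ID generation: deterministic, so re-running ingestion
// produces the same IDs and already-registered chunks can be skipped.
import { createHash } from "node:crypto";

export function chunkId(content: string, sourcePath: string): string {
  return createHash("md5").update(content + sourcePath).digest("hex");
}
```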

4. Batch Processing

Embedding generation and Pinecone upserts are executed in batches of 50, with 1-second intervals for rate limiting. Failed operations are retried up to 3 times.
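The batching/retry pattern described above might look like this. The batch size, delay, and retry count mirror the text; the function shape is illustrative:

```typescript
// Hypothetical batch runner: process items in fixed-size batches, pause
// between batches for rate limiting, retry each batch on failure.
export async function processInBatches<T>(
  items: T[],
  handle: (batch: T[]) => Promise<void>,
  { size = 50, delayMs = 1000, retries = 3 } = {}
): Promise<void> {
  for (let i = 0; i < items.length; i += size) {
    const batch = items.slice(i, i + size);
    for (let attempt = 1; ; attempt++) {
      try {
        await handle(batch);
        break;
      } catch (err) {
        if (attempt >= retries) throw err; // give up after 3 attempts
      }
    }
    // 1-second pause between batches (skipped after the last one)
    if (i + size < items.length) await new Promise((r) => setTimeout(r, delayMs));
  }
}
```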

Search Quality Improvements

Query Rewrite

In multi-turn conversations, users often ask follow-up questions with pronouns like “tell me more about that.” By rewriting these into self-contained search queries using conversation history, vector search accuracy is maintained.
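The prompt for that rewrite step could be assembled like this. The wording is my own illustration, not the app's actual prompt:

```typescript
// Hypothetical rewrite-prompt builder: fold the history into an instruction
// asking the LLM to resolve pronouns into a self-contained search query.
type Turn = { role: string; content: string };

export function buildRewritePrompt(history: Turn[], question: string): string {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  return [
    "Rewrite the user's latest question as a self-contained search query,",
    "resolving pronouns and references using the conversation below.",
    "",
    transcript,
    `user: ${question}`,
  ].join("\n");
}
```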

Low-Quality Chunk Filtering

The following filters are applied to remove noise from vector search results:

  • Exclude chunks with similarity scores below 0.5
  • Exclude chunks with less than 20% Japanese characters (to filter out HTML remnants and alphanumeric noise)
  • Exclude chunks where commas exceed 5% of the content (to filter out CSV data)
  • Exclude chunks shorter than 100 characters
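Taken together, the four filters amount to something like the following. The thresholds mirror the text; details such as which characters count as "commas" are my assumptions:

```typescript
// Hypothetical implementation of the four noise filters.
type Hit = { text: string; score: number };

const japaneseRatio = (s: string): number => {
  // hiragana, katakana, and CJK ideographs
  const ja = s.match(/[\u3040-\u30ff\u4e00-\u9fff]/g)?.length ?? 0;
  return s.length === 0 ? 0 : ja / s.length;
};

export function filterChunks(hits: Hit[]): Hit[] {
  return hits.filter(({ text, score }) => {
    if (score < 0.5) return false;               // low similarity
    if (text.length < 100) return false;         // too short to be useful
    if (japaneseRatio(text) < 0.2) return false; // HTML remnants / alphanumeric noise
    const commas = text.match(/,/g)?.length ?? 0;
    return commas / text.length <= 0.05;         // CSV detection
  });
}
```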

Source Citations

Every answer includes citations (volume number, document name) with links to the original sources. The system prompt explicitly instructs the LLM to “never supplement, guess, or fabricate information not found in the reference materials,” suppressing hallucinations.

Authentication

Setting BASIC_USER / BASIC_PASS environment variables enables Basic authentication. If unset, the app is publicly accessible. This is convenient for switching between staging and production environments.
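The check behind this might look as follows. The article doesn't show the actual mechanism, so treat this as one possible shape (e.g. called from Next.js middleware); the function name is mine:

```typescript
// Hypothetical Basic auth check: auth is disabled when the env vars are unset.
export function isAuthorized(
  authHeader: string | null,
  user: string | undefined,
  pass: string | undefined
): boolean {
  if (!user || !pass) return true; // no credentials configured → public access
  if (!authHeader?.startsWith("Basic ")) return false;
  const decoded = Buffer.from(authHeader.slice(6), "base64").toString("utf-8");
  return decoded === `${user}:${pass}`;
}

// In middleware.ts (illustrative): respond 401 with a WWW-Authenticate header
// when isAuthorized(req.headers.get("authorization"),
// process.env.BASIC_USER, process.env.BASIC_PASS) is false.
```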

Tech Stack Summary

Category            Technology
Framework           Next.js 16 (App Router)
Language            TypeScript, React 19
Styling             Tailwind CSS v4
Vector DB           Pinecone
Embedding           OpenAI text-embedding-3-small (via OpenRouter)
LLM                 Claude Sonnet 4 (via OpenRouter / AWS Bedrock)
Data Ingestion      cheerio, iconv-lite
Markdown Rendering  react-markdown
Authentication      Basic Auth (optional)

Conclusion

For specialized domains like historical documents, RAG provides a significant advantage by enabling “source-grounded answers.” Relying solely on LLM knowledge tends to produce inaccurate information, but by retrieving relevant documents through vector search and citing sources explicitly, trustworthy answers become possible.

On the technical side, the architecture leverages each tool’s strengths: OpenRouter for API unification, Pinecone for managed vector search, and Next.js App Router for streaming support. Switching to AWS Bedrock requires only changing an environment variable thanks to the provider abstraction, allowing flexible choices based on cost and requirements.