Introduction

I built a RAG (Retrieval-Augmented Generation) application that lets users ask natural language questions about a research project’s published reports (10 volumes) and receive answers with citations to the relevant source materials.

This article covers the tech stack and key design decisions behind the app.

Architecture Overview

User
  ↓ Question
Next.js (App Router)
  ↓ API Route
Query Rewrite (LLM)
  ↓ Refined search query
Embedding Generation (text-embedding-3-small)
  ↓ Vector
Pinecone (Vector Search, topK=8)
  ↓ Relevant chunks
LLM (Claude Sonnet) ← System prompt + Context
  ↓ SSE Streaming
Display answer to user

Frontend

Next.js 16 + React 19 + TypeScript

The app uses the App Router with a simple 3-page structure.

Path      Content
/         Landing page (with example question links)
/chat     Chat UI
/about    About the site

The chat page uses useSearchParams so that clicking an example question on the landing page navigates to /chat?q=... and automatically submits the query.
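The parsing side of that flow can be sketched as a small helper (the function name and exact wiring are illustrative, not the app's actual code; the real page presumably calls `useSearchParams()` inside a client component):

```typescript
// Hypothetical helper: extract the initial question from the URL search string.
export function initialQuestion(search: string): string | null {
  const q = new URLSearchParams(search).get("q");
  return q && q.trim() !== "" ? q.trim() : null;
}

// In the chat page (illustrative usage):
//   const params = useSearchParams();
//   const q = initialQuestion(params.toString());
//   useEffect(() => { if (q) submit(q); }, []);
```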

Tailwind CSS v4

Tailwind CSS v4 is used for styling. In v4, setup is as simple as @import "tailwindcss", eliminating the need for tailwind.config.js.

Streaming Display

Answers are streamed via SSE (Server-Sent Events). The frontend reads chunks incrementally using ReadableStream’s getReader() and renders Markdown in real time with react-markdown.
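A sketch of that read loop, assuming OpenAI-style `data: {...}` lines (the same shape the Bedrock converter shown later emits); the parsing is factored into a helper, and the `[DONE]` sentinel check is defensive rather than confirmed app behavior:

```typescript
// Hypothetical helper: pull text deltas out of a decoded SSE chunk.
export function extractDeltas(sseText: string): string[] {
  const out: string[] = [];
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    try {
      const json = JSON.parse(line.slice(6));
      const delta = json.choices?.[0]?.delta?.content;
      if (typeof delta === "string") out.push(delta);
    } catch {
      // ignore partial JSON split across network chunks
    }
  }
  return out;
}

// Illustrative read loop in the chat component:
//   const reader = res.body!.getReader();
//   const decoder = new TextDecoder();
//   for (;;) {
//     const { done, value } = await reader.read();
//     if (done) break;
//     for (const d of extractDeltas(decoder.decode(value, { stream: true })))
//       setContent((prev) => prev + d);
//   }
```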

While waiting for a response, a dot-pulse animation (CSS @keyframes) indicates that processing is underway.

// Show loading animation for empty assistant messages
{msg.content ? (
  <ReactMarkdown>{msg.content}</ReactMarkdown>
) : (
  <div className="flex items-center gap-1.5">
    <span className="w-2 h-2 rounded-full dot-pulse" />
    <span className="w-2 h-2 rounded-full dot-pulse" />
    <span className="w-2 h-2 rounded-full dot-pulse" />
  </div>
)}

Backend

API Route (/api/chat)

A single Next.js Route Handler processes requests through the following pipeline:

  1. Query Rewrite — If there’s conversation history, the LLM rewrites the search query into a self-contained question
  2. Embedding Generation — Vectorize the search query
  3. Vector Search — Similarity search in Pinecone (topK=8)
  4. Filtering — Remove low-quality chunks (score threshold, Japanese character ratio, CSV detection, minimum length)
  5. Answer Generation — Stream an LLM response using the context-enriched prompt
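The five steps above could be wired together roughly like this. This is a sketch with dependencies injected to keep the flow testable; the helper names and signatures are my own, and the real handler streams the answer over SSE rather than returning a string:

```typescript
// Hypothetical shape of the /api/chat pipeline.
type Deps = {
  rewrite: (q: string, history: string[]) => Promise<string>;
  embed: (q: string) => Promise<number[]>;
  search: (v: number[], topK: number) => Promise<{ text: string; score: number }[]>;
  generate: (q: string, context: string[]) => Promise<string>; // streams in the real app
};

export async function answer(question: string, history: string[], deps: Deps): Promise<string> {
  // 1. Rewrite into a self-contained query only when history exists
  const query = history.length > 0 ? await deps.rewrite(question, history) : question;
  // 2. Vectorize the search query
  const vector = await deps.embed(query);
  // 3. Similarity search (topK=8)
  const hits = await deps.search(vector, 8);
  // 4. Filtering (score threshold shown; the app applies further heuristics)
  const context = hits.filter((h) => h.score >= 0.5).map((h) => h.text);
  // 5. Generate the answer from the context-enriched prompt
  return deps.generate(question, context);
}
```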

AI Provider Abstraction

AI providers are abstracted in src/lib/ai.ts and can be switched via the AI_PROVIDER environment variable.
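The switch itself might look like this (a minimal sketch; the actual contents of src/lib/ai.ts aren't shown in this article, and the default/validation behavior is assumed):

```typescript
// Hypothetical provider selection keyed off the AI_PROVIDER env var.
export type Provider = "openrouter" | "bedrock";

export function resolveProvider(env: Record<string, string | undefined>): Provider {
  const p = (env.AI_PROVIDER ?? "openrouter").toLowerCase();
  if (p === "openrouter" || p === "bedrock") return p;
  throw new Error(`Unknown AI_PROVIDER: ${p}`);
}

// Usage (illustrative): const provider = resolveProvider(process.env);
```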

Provider              Embedding                      Chat
OpenRouter (default)  OpenAI text-embedding-3-small  Claude Sonnet 4
AWS Bedrock           Amazon Titan Embed Text v2     Claude Sonnet 4

OpenRouter provides an OpenAI-compatible API, allowing various models to be used through a unified interface. AWS Bedrock uses the Anthropic Messages API format, and the response stream is converted to SSE format for consistent handling.

// Convert Bedrock stream to SSE format
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const stream = new ReadableStream({
  async start(controller) {
    for await (const event of response.body) {
      if (event.chunk?.bytes) {
        const json = JSON.parse(decoder.decode(event.chunk.bytes));
        if (json.type === "content_block_delta" && json.delta?.text) {
          // Re-wrap the Anthropic delta in an OpenAI-compatible SSE payload
          const sseData = {
            choices: [{ delta: { content: json.delta.text } }],
          };
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify(sseData)}\n\n`)
          );
        }
      }
    }
    controller.close();
  },
});

Vector DB — Pinecone

Pinecone was chosen as the vector database.

Initially, I considered Supabase’s pgvector and had even prepared the schema, but migrated to Pinecone because its managed vector search service is simpler to operate and delivers sufficient performance on the free tier.

Pinecone’s free tier (Starter plan) limits are as follows:

Item     Limit
Indexes  5
Storage  2 GB
Writes   2M write units/month
Reads    1M read units/month
Region   us-east-1 only

For a RAG app with a few thousand chunks like this one, the free tier is more than enough. However, note that indexes are paused after 3 weeks of inactivity.

// Simple query
const results = await index.query({
  vector: queryEmbedding,
  topK: 8,
  includeMetadata: true,
});

Data Ingestion Pipeline

The historical documents are provided as HTML files (some in Shift_JIS encoding) and text files. The ingestion script (scripts/ingest.mjs) performs the following steps.

1. Text Extraction

  • HTML: Parsed with cheerio, removing script/style tags to extract text
  • Character Encoding: UTF-8 is tried first; if decoding fails, iconv-lite decodes as Shift_JIS
  • TXT: Buffer-based reading with the same encoding detection

The source materials consist of approximately 7,600 HTML files and 420 text files, with HTML making up the vast majority. Many of the older HTML documents use Shift_JIS encoding, so automatic encoding detection is a small but essential feature.
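The fallback logic can be sketched as follows. Note the real script uses iconv-lite; here Node's built-in TextDecoder stands in (it supports the `shift_jis` label on standard Node builds with full ICU) so the sketch is self-contained:

```typescript
// Hypothetical decode-with-fallback: try strict UTF-8 first, then Shift_JIS.
export function decodeWithFallback(buf: Uint8Array): { text: string; encoding: string } {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 byte sequences
    return { text: new TextDecoder("utf-8", { fatal: true }).decode(buf), encoding: "utf-8" };
  } catch {
    return { text: new TextDecoder("shift_jis").decode(buf), encoding: "shift_jis" };
  }
}
```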

2. Chunking

A fixed-length (1000 characters) sliding window approach with 200-character overlap is used. Chunks shorter than 50 characters are discarded.

// Fixed-length sliding window with overlap
function splitIntoChunks(text, maxChars = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    const chunk = text.slice(start, end);
    if (chunk.length >= 50) chunks.push(chunk); // discard chunks shorter than 50 chars
    start += maxChars - overlap; // advance 800 chars, keeping 200 chars of overlap
  }
  return chunks;
}

3. Idempotent Ingestion

Chunk IDs are generated as MD5 hashes of the content + source path. Before ingestion, existing IDs are checked against Pinecone, and only unregistered chunks are processed. A --fresh flag enables full re-ingestion.
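The ID scheme can be sketched like this (assumed to match the ingest script's "MD5 of content + source path" description; the exact concatenation order is my guess):

```typescript
// Hypothetical chunk-ID generation: deterministic, so re-running ingestion
// produces the same IDs and already-registered chunks can be skipped.
import { createHash } from "node:crypto";

export function chunkId(content: string, sourcePath: string): string {
  return createHash("md5").update(content + sourcePath).digest("hex");
}
```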

4. Batch Processing

Embedding generation and Pinecone upserts are executed in batches of 50, with 1-second intervals for rate limiting. Failed operations are retried up to 3 times.
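The batching/retry pattern described above might look like this. The batch size, delay, and retry count mirror the text; the function shape is illustrative:

```typescript
// Hypothetical batch runner: process items in fixed-size batches, pause
// between batches for rate limiting, retry each batch on failure.
export async function processInBatches<T>(
  items: T[],
  handle: (batch: T[]) => Promise<void>,
  { size = 50, delayMs = 1000, retries = 3 } = {}
): Promise<void> {
  for (let i = 0; i < items.length; i += size) {
    const batch = items.slice(i, i + size);
    for (let attempt = 1; ; attempt++) {
      try {
        await handle(batch);
        break;
      } catch (err) {
        if (attempt >= retries) throw err; // give up after 3 attempts
      }
    }
    // 1-second pause between batches (skipped after the last one)
    if (i + size < items.length) await new Promise((r) => setTimeout(r, delayMs));
  }
}
```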

Search Quality Improvements

Query Rewrite

In multi-turn conversations, users often ask follow-up questions with pronouns like “tell me more about that.” By rewriting these into self-contained search queries using conversation history, vector search accuracy is maintained.
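The prompt for that rewrite step could be assembled like this. The wording is my own illustration, not the app's actual prompt:

```typescript
// Hypothetical rewrite-prompt builder: fold the history into an instruction
// asking the LLM to resolve pronouns into a self-contained search query.
type Turn = { role: string; content: string };

export function buildRewritePrompt(history: Turn[], question: string): string {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  return [
    "Rewrite the user's latest question as a self-contained search query,",
    "resolving pronouns and references using the conversation below.",
    "",
    transcript,
    `user: ${question}`,
  ].join("\n");
}
```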

Low-Quality Chunk Filtering

The following filters are applied to remove noise from vector search results:

  • Exclude chunks with similarity scores below 0.5
  • Exclude chunks with less than 20% Japanese characters (to filter out HTML remnants and alphanumeric noise)
  • Exclude chunks where commas exceed 5% of the content (to filter out CSV data)
  • Exclude chunks shorter than 100 characters
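Taken together, the four filters amount to something like the following. The thresholds mirror the text; details such as which characters count as "commas" are my assumptions:

```typescript
// Hypothetical implementation of the four noise filters.
type Hit = { text: string; score: number };

const japaneseRatio = (s: string): number => {
  // hiragana, katakana, and CJK ideographs
  const ja = s.match(/[\u3040-\u30ff\u4e00-\u9fff]/g)?.length ?? 0;
  return s.length === 0 ? 0 : ja / s.length;
};

export function filterChunks(hits: Hit[]): Hit[] {
  return hits.filter(({ text, score }) => {
    if (score < 0.5) return false;               // low similarity
    if (text.length < 100) return false;         // too short to be useful
    if (japaneseRatio(text) < 0.2) return false; // HTML remnants / alphanumeric noise
    const commas = text.match(/,/g)?.length ?? 0;
    return commas / text.length <= 0.05;         // CSV detection
  });
}
```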

Source Citations

Every answer includes citations (volume number, document name) with links to the original sources. The system prompt explicitly instructs the LLM to “never supplement, guess, or fabricate information not found in the reference materials,” suppressing hallucinations.

Authentication

Setting BASIC_USER / BASIC_PASS environment variables enables Basic authentication. If unset, the app is publicly accessible. This is convenient for switching between staging and production environments.
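The check behind this might look as follows. The article doesn't show the actual mechanism, so treat this as one possible shape (e.g. called from Next.js middleware); the function name is mine:

```typescript
// Hypothetical Basic auth check: auth is disabled when the env vars are unset.
export function isAuthorized(
  authHeader: string | null,
  user: string | undefined,
  pass: string | undefined
): boolean {
  if (!user || !pass) return true; // no credentials configured → public access
  if (!authHeader?.startsWith("Basic ")) return false;
  const decoded = Buffer.from(authHeader.slice(6), "base64").toString("utf-8");
  return decoded === `${user}:${pass}`;
}

// In middleware.ts (illustrative): respond 401 with a WWW-Authenticate header
// when isAuthorized(req.headers.get("authorization"),
// process.env.BASIC_USER, process.env.BASIC_PASS) is false.
```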

Tech Stack Summary

Category            Technology
Framework           Next.js 16 (App Router)
Language            TypeScript, React 19
Styling             Tailwind CSS v4
Vector DB           Pinecone
Embedding           OpenAI text-embedding-3-small (via OpenRouter)
LLM                 Claude Sonnet 4 (via OpenRouter / AWS Bedrock)
Data Ingestion      cheerio, iconv-lite
Markdown Rendering  react-markdown
Authentication      Basic Auth (optional)

Conclusion

For specialized domains like historical documents, RAG provides a significant advantage by enabling “source-grounded answers.” Relying solely on LLM knowledge tends to produce inaccurate information, but by retrieving relevant documents through vector search and citing sources explicitly, trustworthy answers become possible.

On the technical side, the architecture leverages each tool’s strengths: OpenRouter for API unification, Pinecone for managed vector search, and Next.js App Router for streaming support. Switching to AWS Bedrock requires only changing an environment variable thanks to the provider abstraction, allowing flexible choices based on cost and requirements.