Notebook: Open in Google Colab / GitHub
TL;DR
- Collected 617 bibliographic records from the National Diet Library Search API (SRU endpoint)
- Fine-tuned llm-jp-3-1.8b with LoRA, training only 0.67% of all parameters
- Accuracy before fine-tuning: 22.0% → after: 78.0% (+56 points)
- LoRA teaches the model how to perform a task, not just memorize facts
What is NDC?
The Nippon Decimal Classification (NDC) is the standard book classification system used across Japanese libraries. Every book is assigned a numeric code, where the first digit indicates one of ten broad categories:
| NDC | Category |
|---|---|
| 0 | General works (encyclopedias, information science) |
| 1 | Philosophy & religion |
| 2 | History & geography |
| 3 | Social sciences (law, economics, education) |
| 4 | Natural sciences (math, physics, medicine) |
| 5 | Technology & engineering |
| 6 | Industry (agriculture, commerce, transport) |
| 7 | Arts & sports |
| 8 | Language |
| 9 | Literature |
Assigning NDC codes during cataloging requires subject analysis expertise. An AI that can estimate the broad category from a title alone would be useful as a first-pass screening tool, supporting librarians in the classification workflow.
LoRA in a Nutshell
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models. Instead of updating all 1.8 billion parameters, LoRA freezes the original model and inserts small adapter matrices into the Attention layers:
Base model (1.8B parameters) → Frozen (unchanged)
↓
LoRA adapters (~9M parameters) → Only these are trained
In this project, only about 0.67% of the total parameters (12,582,912 / 1,880,197,120) are trainable. This keeps GPU memory usage low while still achieving task-specific performance. Task adaptation often lies in a low-rank subspace, so updating a small fraction of the parameters can be sufficient.
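The idea can be sketched in a few lines: the pretrained weight W stays frozen, and a low-rank product BA (scaled by alpha/r) is added to it. A minimal NumPy illustration with this project's dimensions (hidden size 2048, r=32); this is a conceptual sketch, not the actual PEFT implementation:

```python
import numpy as np

d, r = 2048, 32                     # hidden size and LoRA rank used in this project
alpha = 32                          # scaling factor (lora_alpha)

W = np.random.randn(d, d)           # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

# Effective weight during the forward pass: W plus the scaled low-rank update.
# With B at zero, training starts exactly from the base model's behavior.
W_eff = W + (alpha / r) * B @ A

# Parameter comparison for one adapted matrix:
full = W.size                # 2048 * 2048 = 4,194,304
lora = A.size + B.size       # 2 * 2048 * 32 = 131,072
print(f"trainable fraction per matrix: {lora / full:.3%}")  # 3.125%
```

Zero-initializing B means the adapter contributes nothing at step 0, which is why LoRA training is stable from the start.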
Step 1. Data Collection from the NDL Search API
The National Diet Library Search provides a free SRU (Search/Retrieve via URL) API. We fetch up to 80 books per NDC category (roughly 800 records before filtering). After filtering by title length (3–80 characters), the per-category counts are as follows:
| NDC | Category | Records |
|---|---|---|
| 0 | General works | 65 |
| 1 | Philosophy | 67 |
| 2 | History | 73 |
| 3 | Social sciences | 59 |
| 4 | Natural sciences | 52 |
| 5 | Technology | 63 |
| 6 | Industry | 65 |
| 7 | Arts & sports | 57 |
| 8 | Language | 67 |
| 9 | Literature | 49 |
Note that the dataset contains some noise — the API returns records with very short or ambiguous titles that are difficult to classify even for humans.
import time
import urllib.parse

NDC_NAMES = {
    "0": "General works", "1": "Philosophy", "2": "History",
    "3": "Social sciences", "4": "Natural sciences", "5": "Technology",
    "6": "Industry", "7": "Arts & sports", "8": "Language", "9": "Literature",
}
def fetch_ndl_books(ndc_digit, count=80, start=1):
    """Fetch bibliographic records for a given NDC digit from the NDL SRU API."""
    base_url = "https://ndlsearch.ndl.go.jp/api/sru"
    query = f'ndc="{ndc_digit}*"'
    params = (
        f"?operation=searchRetrieve"
        f"&query={urllib.parse.quote(query)}"
        f"&maximumRecords={count}"
        f"&startRecord={start}"
        f"&recordSchema=dcndl"
    )
    url = base_url + params
    books = []
    # Parse the XML response, extract title and NDC code,
    # and keep only titles between 3 and 80 characters
    return books

# Collect from all 10 categories
all_books = []
for digit in "0123456789":
    books = fetch_ndl_books(digit, count=80)
    all_books.extend(books)
    time.sleep(1)  # Rate limiting between API requests
The data is shuffled and split: 5 samples per category (50 total) are reserved for testing, and the rest (567) are used for training.
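The stratified split can be sketched as follows. The records below are synthetic stand-ins (in the notebook, all_books comes from the API collection step), and the variable names are illustrative:

```python
import random

# Illustrative records; the real all_books holds 617 filtered NDL records
all_books = [{"title": f"book-{d}-{i}", "ndc": d}
             for d in "0123456789" for i in range(60)]

random.seed(0)
random.shuffle(all_books)

test_set, train_set = [], []
reserved = {d: 0 for d in "0123456789"}
for book in all_books:
    digit = book["ndc"][0]
    if reserved[digit] < 5:          # reserve 5 test samples per category → 50 total
        test_set.append(book)
        reserved[digit] += 1
    else:
        train_set.append(book)

print(len(test_set), len(train_set))  # 50 and the remainder (567 in the real run)
```

Reserving a fixed 5 samples per category keeps the small test set balanced even though the per-category training counts vary.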
Step 2. Prompt Design
The model is given a classification task: look at a book title and output the first digit of its NDC code.
以下の本のタイトルから、NDC(日本十進分類法)の1桁目を答えてください。
【NDC一覧】
0: 総記
1: 哲学
2: 歴史
3: 社会科学
4: 自然科学
5: 技術・工学
6: 産業
7: 芸術・スポーツ
8: 言語
9: 文学
【タイトル】吾輩は猫である
【NDC】
The prompt is in Japanese (matching the model’s training language). For training samples, the correct answer digit is appended after 【NDC】. At inference time, the model generates freely and the first digit (0-9) in the output is taken as the prediction.
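A make_prompt helper matching this template might look like the following. This is a sketch of the described behavior; the exact notebook implementation may differ:

```python
NDC_LEGEND = """0: 総記
1: 哲学
2: 歴史
3: 社会科学
4: 自然科学
5: 技術・工学
6: 産業
7: 芸術・スポーツ
8: 言語
9: 文学"""

def make_prompt(book, include_answer=True):
    """Build the classification prompt; append the gold digit for training samples."""
    prompt = (
        "以下の本のタイトルから、NDC(日本十進分類法)の1桁目を答えてください。\n"
        f"【NDC一覧】\n{NDC_LEGEND}\n"
        f"【タイトル】{book['title']}\n"
        "【NDC】"
    )
    if include_answer:
        prompt += book["ndc"][0]   # training target: the first NDC digit
    return prompt

example = make_prompt({"title": "吾輩は猫である", "ndc": "913.6"})
print(example[-1])  # → "9"
```

At inference time the same function is called with include_answer=False, so the prompt ends at 【NDC】 and the model completes it.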
Step 3. Model and LoRA Configuration
All experiments were run on Google Colab with a Tesla T4 GPU. The base model is llm-jp/llm-jp-3-1.8b, a Japanese-focused causal language model (~1.88 billion parameters).
MODEL_NAME = "llm-jp/llm-jp-3-1.8b"
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto",
)
The LoRA configuration:
lora_config = LoraConfig(
r=32, # Rank of the adapter matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # All attention projections
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
Total parameters: 1,880,197,120
Trainable (LoRA): 12,582,912 (0.67%)
→ Only 1/149 of all parameters are trained
With r=32, each weight update is factored through a rank-32 bottleneck: for every adapted projection, a d×32 matrix times a 32×d matrix replaces a full d×d update. Applying LoRA to all four attention projections (Q, K, V, O) gives the model enough flexibility to learn the classification mapping. The resulting trainable parameters are roughly 0.67% of the total.
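The reported trainable-parameter count can be verified by hand. Assuming llm-jp-3-1.8b has 24 transformer layers with hidden size 2048 (so each attention projection is a 2048×2048 matrix), the arithmetic works out exactly:

```python
hidden, r = 2048, 32
layers, projections = 24, 4          # q_proj, k_proj, v_proj, o_proj

# Each adapted projection adds A (r x hidden) and B (hidden x r)
per_matrix = 2 * hidden * r          # 131,072
trainable = per_matrix * projections * layers
print(trainable)                     # 12,582,912 — matches the reported count

total = 1_880_197_120
print(f"{trainable / total:.2%}")    # 0.67%
```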
Step 4. Training
Training uses TRL’s SFTTrainer for supervised fine-tuning:
training_args = SFTConfig(
output_dir="./lora_ndc_output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 16
learning_rate=5e-4,
num_train_epochs=5,
logging_steps=10,
save_strategy="epoch",
bf16=True,
dataset_text_field="text",
report_to="none",
)
trainer = SFTTrainer(
model=model, train_dataset=train_dataset, args=training_args,
)
trainer.train()
The training runs for 5 epochs (180 steps total) over 567 samples with an effective batch size of 16. With bf16 precision and LoRA, the entire training completes in a few minutes on a Tesla T4 GPU.
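The 180-step count follows directly from the dataset size and the effective batch size:

```python
import math

samples, per_device, grad_accum, epochs = 567, 4, 4, 5
effective_batch = per_device * grad_accum               # 16
steps_per_epoch = math.ceil(samples / effective_batch)  # 36
print(steps_per_epoch * epochs)                         # 180 optimizer steps in total
```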
Reading the Training Loss
During training, the Training Loss indicates how wrong the model’s predictions are — lower is better.
| Step | Training Loss |
|---|---|
| 10 | 0.8592 |
| 20 | 0.5131 |
| 30 | 0.4557 |
| 40 | 0.4147 |
| 50 | 0.3825 |
| 60 | 0.3772 |
| 70 | 0.3523 |
| 80 | 0.3150 |
| 90 | 0.3094 |
| 100 | 0.2815 |
| 110 | 0.2545 |
| 120 | 0.2198 |
| 130 | 0.2311 |
| 140 | 0.2286 |
| 150 | 0.1983 |
| 160 | 0.1681 |
| 170 | 0.1775 |
| 180 | 0.1766 |
The loss decreased from 0.86 to 0.18 over 5 epochs (180 steps). This is a cross-entropy loss, so Loss = -log(probability of correct token):
| Loss | Correct Prediction Probability | Interpretation |
|---|---|---|
| 2.30 | ~10% | Random guessing (10 classes) |
| 0.86 | ~42% | Step 10 |
| 0.46 | ~63% | Step 30 |
| 0.18 | ~84% | Final step |
Note: This loss is averaged over the entire prompt (including the boilerplate NDC legend and title text). The template portions are memorized quickly and contribute low loss, while the actual NDC digit prediction — the part we care about — has higher loss than the average suggests. So the final test accuracy can’t be read directly from the loss number; you need to evaluate on held-out test data.
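The probabilities in the table are simply the inverse of the cross-entropy formula, p = exp(−loss):

```python
import math

for loss in (2.30, 0.86, 0.46, 0.18):
    print(f"loss {loss:.2f} → p ≈ {math.exp(-loss):.0%}")
# loss 2.30 → p ≈ 10%
# loss 0.86 → p ≈ 42%
# loss 0.46 → p ≈ 63%
# loss 0.18 → p ≈ 84%
```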
What LoRA Actually Teaches — Behavior, Not Knowledge
LoRA is not cramming NDC classification expertise into the model. It teaches a behavioral skill: given this input format, produce this output format. This is the same pattern seen across LoRA use cases:
| | Legal Exam Example | NDC Classification |
|---|---|---|
| What’s taught | Answer format (a/b/c/d for multiple choice) | Answer format (0-9 for NDC) |
| What’s NOT taught | Legal knowledge | NDC classification expertise |
| Result | Accuracy improves | Accuracy improves |
LoRA efficiently teaches the model to leverage its pre-existing Japanese vocabulary knowledge (e.g., “programming” → technology, “poetry collection” → literature) in a task-appropriate output format. The domain knowledge was already latent in the base model’s 1.8 billion parameters.
Step 5. Results
Before vs. After
- Before LoRA: 11/50 correct = 22.0% accuracy (above the 10% random baseline, but unstable)
- After LoRA: 39/50 correct = 78.0% accuracy (+56 points)
This was achieved by training only 0.67% of the model’s parameters.
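The accuracy numbers were obtained by extracting the first digit from each generated output and comparing it against the gold label. A sketch of that evaluation loop (helper names are illustrative; `outputs` stands for the raw generated strings):

```python
def extract_digit(generated):
    """Take the first character 0-9 in the model output as the prediction."""
    for ch in generated:
        if ch in "0123456789":
            return ch
    return "?"

def evaluate(outputs, test_set):
    """outputs: raw generated strings, aligned with the test_set records."""
    correct = sum(
        extract_digit(out) == book["ndc"][0]
        for out, book in zip(outputs, test_set)
    )
    return correct / len(test_set)

# With the counts reported above: 11/50 before LoRA, 39/50 after
print(11 / 50, 39 / 50)  # 0.22 0.78
```

Scanning for the first digit rather than requiring an exact match is deliberate: the untrained model often emits full NDC codes, and this keeps the comparison fair for both checkpoints.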
Output Format: Before vs. After
A notable difference is in the output format itself:
- Before: The model tends to output full NDC codes like “910.2”, “010”, “010.3”, or “369.3”; it doesn’t understand that the task requires a single digit
- After: Outputs are consistently single digits (“1”, “7”, “9”), showing that the model has learned the expected task format
Here is a sample from the 50-item test set comparing predictions before and after LoRA:
| Title | Answer | Before | After |
|---|---|---|---|
| 嗚呼孝子元政上人 | 1 (Philosophy) | 9 (Literature) | 1 (Philosophy) |
| 「アーカイブ中核拠点形成モデル事業」(撮影所等に) | 7 (Arts) | 9 (Literature) | 7 (Arts) |
| アーカーシャ年代記より | 1 (Philosophy) | 0 (General) | 1 (Philosophy) |
| ああ言えばこう食う 往復エッセイ | 9 (Literature) | 9 (Literature) | 9 (Literature) |
| あゝ愛宕丘の灯:追憶の四十有余年 | 3 (Social sci.) | 9 (Literature) | 3 (Social sci.) |
| アーク溶接作業における粉じん対策に関する調査研究報告 | 4 (Natural sci.) | 3 (Social sci.) | 4 (Natural sci.) |
| アーキテクチャとプログラミングの基礎 | 4 (Natural sci.) | 0 (General) | 5 (Technology) |
| ああアメリカ:傷だらけの巨象 | 3 (Social sci.) | 9 (Literature) | 3 (Social sci.) |
Before training, the model defaults to “9 (Literature)” or “0 (General)” for most inputs, showing no real classification ability. After LoRA, most predictions are correct. Cases like “アーキテクチャとプログラミングの基礎” (Fundamentals of Architecture and Programming) remain misclassified — distinguishing Technology (5) from Natural Sciences (4) based on title alone is inherently difficult.
Per-Category Analysis
Performance varies across categories. Categories with distinctive vocabulary in their titles (e.g., “Philosophy” with characteristic terms) tend to score higher. “General works” (NDC 0), which encompasses a broad range of topics, is harder to classify from title alone.
Step 6. Interactive Demo
The trained model can classify arbitrary book titles:
def predict_ndc(model, title):
"""Predict NDC category from a book title."""
book = {"title": title, "ndc": "?", "ndc_name": "?"}
prompt = make_prompt(book, include_answer=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs, max_new_tokens=5,
do_sample=False, pad_token_id=tokenizer.pad_token_id,
)
generated = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
).strip()
predicted = "?"
for ch in generated:
if ch in "0123456789":
predicted = ch
break
return predicted, NDC_NAMES.get(predicted, "Unknown")
Example predictions:
| Title | Predicted NDC | Category |
|---|---|---|
| 吾輩は猫である (I Am a Cat) | 8 | Language |
| 相対性理論入門 (Intro to Relativity) | 4 | Natural sciences |
| 日本経済の構造改革 (Structural Reform of Japan’s Economy) | 3 | Social sciences |
| フランス料理の基本技法 (French Cooking Techniques) | 7 | Arts & sports |
| はじめてのPython入門 (Intro to Python) | 4 | Natural sciences |
| 万葉集を読む (Reading the Man’yoshu) | 9 | Literature |
| 西洋美術史 (History of Western Art) | 7 | Arts & sports |
| 憲法判例百選 (100 Constitutional Law Cases) | 3 | Social sciences |
| 英語の語源辞典 (English Etymology Dictionary) | 8 | Language |
| 鉄道の歴史と未来 (History and Future of Railways) | 6 | Industry |
Most predictions are reasonable. Some are incorrect — “I Am a Cat” (a classic novel) is predicted as Language (8) instead of Literature (9), and “Intro to Python” gets Natural Sciences (4) instead of General Works (0) — but on the whole the model shows a reasonable title-to-category mapping from 567 training examples.
Practical Considerations
Applications
- Library cataloging support: Auto-classify incoming books as a first pass, reducing manual effort for librarians
- Bookstore/publisher categorization: Automatic shelf assignment for inventory management
- Bibliographic data enrichment: Fill in missing classification codes in incomplete records
Limitations
- Top-level classification only: Real-world use requires 3-digit NDC codes (e.g., 913 = Japanese novels, 490 = Medicine). This should be achievable with more training data and finer-grained labels.
- Title-only input: Adding author names, publisher, and table of contents would improve accuracy substantially.
- Data bias: Books available through the API skew toward recent publications.
Future Directions
- Extend to 3-digit NDC classification for practical utility
- Incorporate additional metadata (author, publisher) into the prompt
- Combine with RAG (Retrieval-Augmented Generation) to reference similar books’ classifications during inference
Acknowledgments
I would like to thank Toru Aoike of the National Diet Library for introducing me to LoRA.