Notebook: Open in Google Colab / GitHub
TL;DR
- Collected 617 bibliographic records from the National Diet Library Search API (SRU endpoint)
- Fine-tuned llm-jp-3-1.8b with LoRA, training only 0.67% of all parameters
- Accuracy before fine-tuning: 22.0% → after: 78.0% (+56 points)
- LoRA teaches the model how to perform a task, not just memorize facts
What is NDC?
The Nippon Decimal Classification (NDC) is the standard book classification system used across Japanese libraries. Every book is assigned a numeric code, where the first digit indicates one of ten broad categories:
| NDC | Category |
|---|---|
| 0 | General works (encyclopedias, information science) |
| 1 | Philosophy & religion |
| 2 | History & geography |
| 3 | Social sciences (law, economics, education) |
| 4 | Natural sciences (math, physics, medicine) |
| 5 | Technology & engineering |
| 6 | Industry (agriculture, commerce, transport) |
| 7 | Arts & sports |
| 8 | Language |
| 9 | Literature |
Assigning NDC codes during cataloging requires subject analysis expertise. An AI that can estimate the broad category from a title alone would be useful as a first-pass screening tool, supporting librarians in the classification workflow.
LoRA in a Nutshell
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models. Instead of updating all 1.8 billion parameters, LoRA freezes the original model and inserts small adapter matrices into the Attention layers:
Base model (1.8B parameters) → Frozen (unchanged)
↓
LoRA adapters (~9M parameters) → Only these are trained
In this project, only about 0.67% of the total parameters (12,582,912 / 1,880,197,120) are trainable. This keeps GPU memory usage low while still achieving task-specific performance. Task adaptation often lies in a low-rank subspace, so updating a small fraction of the parameters can be sufficient.
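The idea can be sketched in a few lines: the pretrained weight W stays frozen, and a low-rank product BA (scaled by alpha/r) is added to it. A minimal NumPy illustration with this project's dimensions (hidden size 2048, r=32); this is a conceptual sketch, not the actual PEFT implementation:

```python
import numpy as np

d, r = 2048, 32                     # hidden size and LoRA rank used in this project
alpha = 32                          # scaling factor (lora_alpha)

W = np.random.randn(d, d)           # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

# Effective weight during the forward pass: W plus the scaled low-rank update.
# With B at zero, training starts exactly from the base model's behavior.
W_eff = W + (alpha / r) * B @ A

# Parameter comparison for one adapted matrix:
full = W.size                # 2048 * 2048 = 4,194,304
lora = A.size + B.size       # 2 * 2048 * 32 = 131,072
print(f"trainable fraction per matrix: {lora / full:.3%}")  # 3.125%
```

Zero-initializing B means the adapter contributes nothing at step 0, which is why LoRA training is stable from the start.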
Step 1. Data Collection from the NDL Search API
The National Diet Library Search provides a free SRU (Search/Retrieve via URL) API. We fetch up to 80 books per NDC category (roughly 800 records before filtering). After filtering by title length (3–80 characters), the per-category counts are as follows:
| NDC | Category | Records |
|---|---|---|
| 0 | General works | 65 |
| 1 | Philosophy | 67 |
| 2 | History | 73 |
| 3 | Social sciences | 59 |
| 4 | Natural sciences | 52 |
| 5 | Technology | 63 |
| 6 | Industry | 65 |
| 7 | Arts & sports | 57 |
| 8 | Language | 67 |
| 9 | Literature | 49 |
Note that the dataset contains some noise — the API returns records with very short or ambiguous titles that are difficult to classify even for humans.
import time
import urllib.parse

NDC_NAMES = {
    "0": "General works", "1": "Philosophy", "2": "History",
    "3": "Social sciences", "4": "Natural sciences", "5": "Technology",
    "6": "Industry", "7": "Arts & sports", "8": "Language", "9": "Literature",
}
def fetch_ndl_books(ndc_digit, count=80, start=1):
    """Fetch bibliographic records for a given NDC digit from the NDL SRU API."""
    base_url = "https://ndlsearch.ndl.go.jp/api/sru"
    query = f'ndc="{ndc_digit}*"'
    params = (
        f"?operation=searchRetrieve"
        f"&query={urllib.parse.quote(query)}"
        f"&maximumRecords={count}"
        f"&startRecord={start}"
        f"&recordSchema=dcndl"
    )
    url = base_url + params
    books = []
    # Parse the XML response, extract title and NDC code,
    # and keep only titles between 3 and 80 characters
    return books

# Collect from all 10 categories
all_books = []
for digit in "0123456789":
    books = fetch_ndl_books(digit, count=80)
    all_books.extend(books)
    time.sleep(1)  # Rate limiting between API requests
The data is shuffled and split: 5 samples per category (50 total) are reserved for testing, and the rest (567) are used for training.
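The stratified split can be sketched as follows. The records below are synthetic stand-ins (in the notebook, all_books comes from the API collection step), and the variable names are illustrative:

```python
import random

# Illustrative records; the real all_books holds 617 filtered NDL records
all_books = [{"title": f"book-{d}-{i}", "ndc": d}
             for d in "0123456789" for i in range(60)]

random.seed(0)
random.shuffle(all_books)

test_set, train_set = [], []
reserved = {d: 0 for d in "0123456789"}
for book in all_books:
    digit = book["ndc"][0]
    if reserved[digit] < 5:          # reserve 5 test samples per category → 50 total
        test_set.append(book)
        reserved[digit] += 1
    else:
        train_set.append(book)

print(len(test_set), len(train_set))  # 50 and the remainder (567 in the real run)
```

Reserving a fixed 5 samples per category keeps the small test set balanced even though the per-category training counts vary.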
Step 2. Prompt Design
The model is given a classification task: look at a book title and output the first digit of its NDC code.
以下の本のタイトルから、NDC(日本十進分類法)の1桁目を答えてください。
【NDC一覧】
0: 総記
1: 哲学
2: 歴史
3: 社会科学
4: 自然科学
5: 技術・工学
6: 産業
7: 芸術・スポーツ
8: 言語
9: 文学
【タイトル】吾輩は猫である
【NDC】
The prompt is in Japanese (matching the model’s training language). For training samples, the correct answer digit is appended after 【NDC】. At inference time, the model generates freely and the first digit (0-9) in the output is taken as the prediction.
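A make_prompt helper matching this template might look like the following. This is a sketch of the described behavior; the exact notebook implementation may differ:

```python
NDC_LEGEND = """0: 総記
1: 哲学
2: 歴史
3: 社会科学
4: 自然科学
5: 技術・工学
6: 産業
7: 芸術・スポーツ
8: 言語
9: 文学"""

def make_prompt(book, include_answer=True):
    """Build the classification prompt; append the gold digit for training samples."""
    prompt = (
        "以下の本のタイトルから、NDC(日本十進分類法)の1桁目を答えてください。\n"
        f"【NDC一覧】\n{NDC_LEGEND}\n"
        f"【タイトル】{book['title']}\n"
        "【NDC】"
    )
    if include_answer:
        prompt += book["ndc"][0]   # training target: the first NDC digit
    return prompt

example = make_prompt({"title": "吾輩は猫である", "ndc": "913.6"})
print(example[-1])  # → "9"
```

At inference time the same function is called with include_answer=False, so the prompt ends at 【NDC】 and the model completes it.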
Step 3. Model and LoRA Configuration
All experiments were run on Google Colab with a Tesla T4 GPU. The base model is llm-jp/llm-jp-3-1.8b, a Japanese-focused causal language model (~1.88 billion parameters).
MODEL_NAME = "llm-jp/llm-jp-3-1.8b"
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto",
)
The LoRA configuration:
lora_config = LoraConfig(
r=32, # Rank of the adapter matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # All attention projections
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
Total parameters: 1,880,197,120
Trainable (LoRA): 12,582,912 (0.67%)
→ Only 1/149 of all parameters are trained
With r=32, each weight update is factored through a rank-32 bottleneck: for every adapted projection, a d×32 matrix times a 32×d matrix replaces a full d×d update. Applying LoRA to all four attention projections (Q, K, V, O) gives the model enough flexibility to learn the classification mapping. The resulting trainable parameters are roughly 0.67% of the total.
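The reported trainable-parameter count can be verified by hand. Assuming llm-jp-3-1.8b has 24 transformer layers with hidden size 2048 (so each attention projection is a 2048×2048 matrix), the arithmetic works out exactly:

```python
hidden, r = 2048, 32
layers, projections = 24, 4          # q_proj, k_proj, v_proj, o_proj

# Each adapted projection adds A (r x hidden) and B (hidden x r)
per_matrix = 2 * hidden * r          # 131,072
trainable = per_matrix * projections * layers
print(trainable)                     # 12,582,912 — matches the reported count

total = 1_880_197_120
print(f"{trainable / total:.2%}")    # 0.67%
```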
Step 4. Training
Training uses TRL’s SFTTrainer for supervised fine-tuning:
training_args = SFTConfig(
output_dir="./lora_ndc_output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 16
learning_rate=5e-4,
num_train_epochs=5,
logging_steps=10,
save_strategy="epoch",
bf16=True,
dataset_text_field="text",
report_to="none",
)
trainer = SFTTrainer(
model=model, train_dataset=train_dataset, args=training_args,
)
trainer.train()
The training runs for 5 epochs (180 steps total) over 567 samples with an effective batch size of 16. With bf16 precision and LoRA, the entire training completes in a few minutes on a Tesla T4 GPU.
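The 180-step count follows directly from the dataset size and the effective batch size:

```python
import math

samples, per_device, grad_accum, epochs = 567, 4, 4, 5
effective_batch = per_device * grad_accum               # 16
steps_per_epoch = math.ceil(samples / effective_batch)  # 36
print(steps_per_epoch * epochs)                         # 180 optimizer steps in total
```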
Reading the Training Loss
During training, the Training Loss indicates how wrong the model’s predictions are — lower is better.
| Step | Training Loss |
|---|---|
| 10 | 0.8592 |
| 20 | 0.5131 |
| 30 | 0.4557 |
| 40 | 0.4147 |
| 50 | 0.3825 |
| 60 | 0.3772 |
| 70 | 0.3523 |
| 80 | 0.3150 |
| 90 | 0.3094 |
| 100 | 0.2815 |
| 110 | 0.2545 |
| 120 | 0.2198 |
| 130 | 0.2311 |
| 140 | 0.2286 |
| 150 | 0.1983 |
| 160 | 0.1681 |
| 170 | 0.1775 |
| 180 | 0.1766 |
The loss decreased from 0.86 to 0.18 over 5 epochs (180 steps). This is a cross-entropy loss, so Loss = -log(probability of correct token):
| Loss | Correct Prediction Probability | Interpretation |
|---|---|---|
| 2.30 | ~10% | Random guessing (10 classes) |
| 0.86 | ~42% | Step 10 |
| 0.46 | ~63% | Step 30 |
| 0.18 | ~84% | Final step |
Note: This loss is averaged over the entire prompt (including the boilerplate NDC legend and title text). The template portions are memorized quickly and contribute low loss, while the actual NDC digit prediction — the part we care about — has higher loss than the average suggests. So the final test accuracy can’t be read directly from the loss number; you need to evaluate on held-out test data.
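The probabilities in the table are simply the inverse of the cross-entropy formula, p = exp(−loss):

```python
import math

for loss in (2.30, 0.86, 0.46, 0.18):
    print(f"loss {loss:.2f} → p ≈ {math.exp(-loss):.0%}")
# loss 2.30 → p ≈ 10%
# loss 0.86 → p ≈ 42%
# loss 0.46 → p ≈ 63%
# loss 0.18 → p ≈ 84%
```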
What LoRA Actually Teaches — Behavior, Not Knowledge
LoRA is not cramming NDC classification expertise into the model. It teaches a behavioral skill: given this input format, produce this output format. This is the same pattern seen across LoRA use cases:
| | Legal Exam Example | NDC Classification |
|---|---|---|
| What’s taught | Answer format (a/b/c/d for multiple choice) | Answer format (0-9 for NDC) |
| What’s NOT taught | Legal knowledge | NDC classification expertise |
| Result | Accuracy improves | Accuracy improves |
LoRA efficiently teaches the model to leverage its pre-existing Japanese vocabulary knowledge (e.g., “programming” → technology, “poetry collection” → literature) in a task-appropriate output format. The domain knowledge was already latent in the base model’s 1.8 billion parameters.
Step 5. Results
Before vs. After
- Before LoRA: 11/50 correct = 22.0% accuracy (above the 10% random baseline, but unstable)
- After LoRA: 39/50 correct = 78.0% accuracy (+56 points)
This was achieved by training only 0.67% of the model’s parameters.
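The accuracy numbers were obtained by extracting the first digit from each generated output and comparing it against the gold label. A sketch of that evaluation loop (helper names are illustrative; `outputs` stands for the raw generated strings):

```python
def extract_digit(generated):
    """Take the first character 0-9 in the model output as the prediction."""
    for ch in generated:
        if ch in "0123456789":
            return ch
    return "?"

def evaluate(outputs, test_set):
    """outputs: raw generated strings, aligned with the test_set records."""
    correct = sum(
        extract_digit(out) == book["ndc"][0]
        for out, book in zip(outputs, test_set)
    )
    return correct / len(test_set)

# With the counts reported above: 11/50 before LoRA, 39/50 after
print(11 / 50, 39 / 50)  # 0.22 0.78
```

Scanning for the first digit rather than requiring an exact match is deliberate: the untrained model often emits full NDC codes, and this keeps the comparison fair for both checkpoints.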
Output Format: Before vs. After
A notable difference is in the output format itself:
- Before: The model tends to output full NDC codes like “910.2”, “010”, “010.3”, or “369.3”; it doesn’t understand that the task requires a single digit
- After: Outputs are consistently single digits (“1”, “7”, “9”), showing that the model has learned the expected task format
Here is a sample from the 50-item test set comparing predictions before and after LoRA:
| Title | Answer | Before | After |
|---|---|---|---|
| 嗚呼孝子元政上人 | 1 (Philosophy) | 9 (Literature) | 1 (Philosophy) |
| 「アーカイブ中核拠点形成モデル事業」(撮影所等に) | 7 (Arts) | 9 (Literature) | 7 (Arts) |
| アーカーシャ年代記より | 1 (Philosophy) | 0 (General) | 1 (Philosophy) |
| ああ言えばこう食う 往復エッセイ | 9 (Literature) | 9 (Literature) | 9 (Literature) |
| あゝ愛宕丘の灯:追憶の四十有余年 | 3 (Social sci.) | 9 (Literature) | 3 (Social sci.) |
| アーク溶接作業における粉じん対策に関する調査研究報告 | 4 (Natural sci.) | 3 (Social sci.) | 4 (Natural sci.) |
| アーキテクチャとプログラミングの基礎 | 4 (Natural sci.) | 0 (General) | 5 (Technology) |
| ああアメリカ:傷だらけの巨象 | 3 (Social sci.) | 9 (Literature) | 3 (Social sci.) |
Before training, the model defaults to “9 (Literature)” or “0 (General)” for most inputs, showing no real classification ability. After LoRA, most predictions are correct. Cases like “アーキテクチャとプログラミングの基礎” (Fundamentals of Architecture and Programming) remain misclassified — distinguishing Technology (5) from Natural Sciences (4) based on title alone is inherently difficult.
Per-Category Analysis
Performance varies across categories. Categories with distinctive vocabulary in their titles (e.g., “Philosophy” with characteristic terms) tend to score higher. “General works” (NDC 0), which encompasses a broad range of topics, is harder to classify from title alone.
Step 6. Interactive Demo
The trained model can classify arbitrary book titles:
def predict_ndc(model, title):
"""Predict NDC category from a book title."""
book = {"title": title, "ndc": "?", "ndc_name": "?"}
prompt = make_prompt(book, include_answer=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs, max_new_tokens=5,
do_sample=False, pad_token_id=tokenizer.pad_token_id,
)
generated = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
).strip()
predicted = "?"
for ch in generated:
if ch in "0123456789":
predicted = ch
break
return predicted, NDC_NAMES.get(predicted, "Unknown")
Example predictions:
| Title | Predicted NDC | Category |
|---|---|---|
| 吾輩は猫である (I Am a Cat) | 8 | Language |
| 相対性理論入門 (Intro to Relativity) | 4 | Natural sciences |
| 日本経済の構造改革 (Structural Reform of Japan’s Economy) | 3 | Social sciences |
| フランス料理の基本技法 (French Cooking Techniques) | 7 | Arts & sports |
| はじめてのPython入門 (Intro to Python) | 4 | Natural sciences |
| 万葉集を読む (Reading the Man’yoshu) | 9 | Literature |
| 西洋美術史 (History of Western Art) | 7 | Arts & sports |
| 憲法判例百選 (100 Constitutional Law Cases) | 3 | Social sciences |
| 英語の語源辞典 (English Etymology Dictionary) | 8 | Language |
| 鉄道の歴史と未来 (History and Future of Railways) | 6 | Industry |
Most predictions are reasonable. Some are incorrect — “I Am a Cat” (a classic novel) is predicted as Language (8) instead of Literature (9), and “Intro to Python” gets Natural Sciences (4) instead of General Works (0) — but on the whole the model shows a reasonable title-to-category mapping from 567 training examples.
Practical Considerations
Applications
- Library cataloging support: Auto-classify incoming books as a first pass, reducing manual effort for librarians
- Bookstore/publisher categorization: Automatic shelf assignment for inventory management
- Bibliographic data enrichment: Fill in missing classification codes in incomplete records
Limitations
- Top-level classification only: Real-world use requires 3-digit NDC codes (e.g., 913 = Japanese novels, 490 = Medicine). This should be achievable with more training data and finer-grained labels.
- Title-only input: Adding author names, publisher, and table of contents would improve accuracy substantially.
- Data bias: Books available through the API skew toward recent publications.
Future Directions
- Extend to 3-digit NDC classification for practical utility
- Incorporate additional metadata (author, publisher) into the prompt
- Combine with RAG (Retrieval-Augmented Generation) to reference similar books’ classifications during inference
Acknowledgments
I would like to thank Toru Aoike of the National Diet Library for introducing me to LoRA.