LLM | Digital Archive Systems Tech Blog

Building an NDC Book Classifier with LoRA: Fine-Tuning a Japanese LLM on Library Data

Notebook: Open in Google Colab / GitHub
TL;DR: Collected 617 bibliographic records from the National Diet Library Search API (SRU endpoint). Fine-tuned llm-jp-3-1.8b with LoRA, training only 0.67% of all parameters. Accuracy before fine-tuning: 22.0% → after fine-tuning: 78.0% (+56 points). LoRA teaches the model how to perform a task, not just memorize facts.
What is NDC? The Nippon Decimal Classification (NDC) is the standard book classification system used across Japanese libraries. Every book is assigned a numeric code, where the first digit indicates one of ten broad categories: ...

March 19, 2026 · 8 min · Nakamura
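The first-digit rule described in the NDC excerpt above can be illustrated with a small lookup. This is a minimal sketch, not code from the article: the function name `ndc_main_class` is hypothetical, and the English labels are informal glosses of the ten NDC main classes rather than the wording the article itself uses (its list is truncated in this excerpt).

```python
import re

# The ten NDC main classes, keyed by the first digit of the code.
# Assumption: English labels are informal glosses, not the article's wording.
NDC_MAIN_CLASSES = {
    "0": "General works",
    "1": "Philosophy",
    "2": "History",
    "3": "Social sciences",
    "4": "Natural sciences",
    "5": "Technology",
    "6": "Industry",
    "7": "Arts",
    "8": "Language",
    "9": "Literature",
}


def ndc_main_class(code: str) -> tuple[str, str]:
    """Return (first_digit, label) for an NDC code such as '913.6'."""
    match = re.match(r"\s*(\d)", code)
    if match is None:
        raise ValueError(f"not an NDC code: {code!r}")
    digit = match.group(1)
    return digit, NDC_MAIN_CLASSES[digit]


if __name__ == "__main__":
    # Hypothetical example codes, chosen only to show the lookup.
    for example in ("913.6", "007.13", "210"):
        print(example, ndc_main_class(example))
```

Reducing a predicted code to its first digit like this is one way to score a classifier at the coarsest level before checking the full code.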

Azure OpenAI GPT-4 vs Document Intelligence: Comparative Evaluation of Japanese Vertical Text OCR

Overview: We performed OCR processing on Japanese vertical-writing manuscript paper using two OCR services provided by Microsoft Azure (Azure OpenAI GPT-4 Vision and Azure Document Intelligence), and conducted a detailed comparative evaluation of the results.
Test Image: Image source: Canva template (400-character manuscript paper). URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/ Image characteristics: 20x20 grid, 400-character manuscript paper; vertical writing layout; light grid lines (cells); distinction between title and body sections.
Ground Truth: 原稿のタイトル佐藤ちあき原稿用紙に書くテキストが入ります。作文や小論文を作ったり、小説を書いたりなどにご活用ください。このテキストを使用する場合は、日本語の全角を使うことでマスにあった文字を打つことができます。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。
1. Recognition Results by Azure OpenAI GPT-4.1
Recognized Text: 原稿のタイトル佐藤　ちあき原稿用紙に書くテキストが入ります。作文や小論文を作ったり、小説を書いたりなどにご活用ください。このテキストを使用する場合は、日本語の全角を使うことでマスにあった文字を打つことができます。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。
Evaluation: GPT-4.1 demonstrated the following characteristics with vertical-writing manuscript paper: ...

September 29, 2025 · Updated: September 29, 2025 · 4 min · Nakamura
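A comparison between a ground truth string and recognized text, as in the entry above, is commonly quantified with the character error rate (CER): the Levenshtein edit distance divided by the length of the reference. The sketch below is an assumption about how such a score could be computed, not the article's own evaluation code. The function names are hypothetical, and the sample strings are the author-name fragment from the excerpt, where the recognized text contains one extra full-width space (U+3000).

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "佐藤ちあき"
    recognized = "佐藤\u3000ちあき"  # extra full-width space, as in the excerpt
    print(edit_distance(truth, recognized))  # one inserted character
    # Stripping full-width spaces first treats the two strings as identical.
    print(character_error_rate(truth, recognized.replace("\u3000", "")))
```

Whether inserted spaces should count as errors is a design choice. Scoring both the raw and the whitespace-normalized text makes that choice explicit.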

LLM-Based Manuscript Paper OCR Performance Comparison: Verification of Vertical Japanese Recognition Accuracy

Introduction: In this article, we compare the OCR performance of major LLMs using actual manuscript paper images. While many OCR benchmarks target printed documents and horizontally written text, we evaluate recognition accuracy on the special format of Japanese vertical manuscript paper to verify each model’s Japanese document understanding capabilities under more practical conditions.
Features of This Verification:
Uses the uniquely Japanese manuscript paper format: verification with images containing complex elements such as characters placed in grid cells, vertical writing layout, and distinctive margin composition.
Assumes practical use cases: performance evaluation on manuscript paper used in actual writing scenarios such as essays, novels, and academic papers.
Compares the latest models comprehensively: GPT-5, GPT-4.1, Gemini 2.5 Pro, Claude Opus 4.1, and Claude Sonnet 4, all under identical conditions.
Verification Overview:
Image used: Canva template (400-character manuscript paper). URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/ Image characteristics: 20x20 grid, 400-character manuscript paper; vertical writing layout; faint grid lines (cells); distinction between title area and body area ...

September 27, 2025 · Updated: September 27, 2025 · 4 min · Nakamura

Notes on LLM-Related Tools

Overview: This is a memo on tools related to LLMs.
LangChain: https://www.langchain.com/ The site describes it as follows: "LangChain is a composable framework to build with LLMs. LangGraph is the orchestration framework for controllable agentic workflows."
LlamaIndex: https://docs.llamaindex.ai/en/stable/ The site describes it as follows: "LlamaIndex is a framework for building context-augmented generative AI applications with LLMs including agents and workflows."
LangChain and LlamaIndex: The response from gpt-4o was as follows. ...

November 29, 2024 · Updated: November 29, 2024 · 3 min · Nakamura

Running a Local LLM Using mdx.jp 1GPU Pack and Ollama

Overview: I had the opportunity to run a local LLM using mdx.jp’s 1GPU pack and Ollama, so this is a memo of the process. https://mdx.jp/mdx1/p/guide/charge
References: I referred to the following article. https://highreso.jp/edgehub/machinelearning/ollamainference.html
Downloading the Model: Here, we target llama3.1:70b. After the download is complete, it becomes selectable as shown below.
Usage Example: We use the following “Shibusawa Eiichi Biographical Materials.” https://github.com/shibusawa-dlab/lab1
Using the API: Documentation was found at the following location. ...

November 4, 2024 · Updated: November 4, 2024 · 3 min · Nakamura
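For the API usage mentioned in the Ollama entry above, the following is a minimal sketch of a non-streaming call to Ollama's `/api/generate` endpoint using only the Python standard library. It is not the author's code. The host `http://localhost:11434` is Ollama's default local address and is an assumption about the setup, and the helper names `build_generate_request` and `generate` are hypothetical. The model tag `llama3.1:70b` is the one named in the entry.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_HOST = "http://localhost:11434"


def build_generate_request(prompt, model="llama3.1:70b", host=OLLAMA_HOST):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already downloaded.
    print(generate("Summarize the role of Shibusawa Eiichi in one sentence."))
```

Separating request construction from sending keeps the payload easy to inspect without a GPU host available.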