Important Usage Notice
The system described in this article may place load on external servers. Please exercise caution when using it.
- Server load: Parallel requests place load on target servers
- DoS risk: A large number of simultaneous accesses may be mistaken for a DoS attack
- Recommended approach: Download images locally in advance and run only the OCR processing in parallel
- Check terms of service: Always review the target server’s terms of service and obtain prior permission if necessary
- Appropriate rate limiting: In production, conservative concurrency settings (around 5-10 parallel requests) are strongly recommended
- Responsible use: Always be considerate of server administrators and other users
This article is a record of a technical proof of concept. We ask readers to use the system responsibly.
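The recommended approach in the notice (conservative concurrency over images that are already stored locally) could be sketched as follows. This is a minimal illustration, not the article's implementation: `run_limited` and `fake_ocr` are hypothetical names, and `fake_ocr` only stands in for real OCR work on a downloaded file.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=5):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # The semaphore blocks here once max_concurrent workers are running
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_ocr(path):
    # Hypothetical stand-in for OCR on a locally downloaded image
    await asyncio.sleep(0.01)
    return f"text:{path}"


results = asyncio.run(
    run_limited(["p1.jpg", "p2.jpg", "p3.jpg"], fake_ocr, max_concurrent=2)
)
```

Keeping the limit as an explicit parameter makes it easy to start at the low end of the suggested 5-10 range and raise it only with the server operator's agreement.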
Background
This article introduces a case study of building a scalable OCR processing system on Azure Container Apps using NDL Classical Japanese OCR Lite, developed by the National Diet Library (NDL) of Japan. We describe the design and implementation of a system that achieves pay-per-use billing and auto-scaling through a cloud-native architecture.
System Overview
Architecture
IIIF Images -> Azure Container Apps -> NDL Classical Japanese OCR -> TEI XML Output
                       |
                  Auto-scaling
                 (0-30 replicas)
Key Components
- OCR engine: NDL Classical Japanese OCR Lite (specialized for Japanese classical texts)
- Infrastructure: Azure Container Apps (serverless containers)
- API design: REST API (image URL -> OCR result)
- Output format: TEI P5-compliant XML
- Scaling: Automatic scaling based on demand
Features of NDL Classical Japanese OCR Lite
OCR Optimized for Japanese Classical Texts
- Vertical text layout support: Vertical writing structure specific to classical texts
- Reading order optimization: Right-to-left, top-to-bottom Japanese reading order
- Classical character recognition: Support for cursive script (kuzushiji) and variant kana
- Lightweight implementation: Cloud-deployable through Docker containerization
Why Azure Container Apps
Benefits of Serverless Containers
# Scaling configuration example
scale:
  minReplicas: 0        # Idle: zero cost
  maxReplicas: 30       # On demand: auto-expand
  cooldownPeriod: 300   # Scale down after 5 minutes
Cost Optimization
- Pay-per-use billing: Charged only for actual usage
- Zero replicas: Completely zero cost when idle
- Auto-scaling: Resource adjustment based on demand
System Implementation
Server-Side Implementation
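# Flask + NDL OCR integration
from flask import Flask, request
from flask_restx import Api, Resource
from simple_ocr_service import OCRService

app = Flask(__name__)
api = Api(app, doc='/docs/')

# Shared OCR service instance (assumes OCRService takes no constructor arguments)
ocr_service = OCRService()


@api.route('/api/image')
class ImageOCR(Resource):
    def get(self):
        image_url = request.args.get('image_url')
        if not image_url:
            # Reject requests that omit the required query parameter
            return {'error': 'image_url is required'}, 400
        # Process image with NDL OCR
        result = ocr_service.process_single_image(image_url)
        return result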
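On the client side, the image URL has to be percent-encoded before it is passed as the `image_url` query parameter, because IIIF image URLs contain characters such as `/` and `:`. A minimal sketch using only the standard library is shown below; `build_ocr_request_url` is a hypothetical helper, the base URL is the placeholder used elsewhere in this article, and the IIIF URL is an invented example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_ocr_request_url(base_url, image_url):
    """Build the GET URL for the /api/image endpoint."""
    # urlencode percent-encodes the image URL so it survives as one parameter
    query = urlencode({"image_url": image_url})
    return f"{base_url.rstrip('/')}/api/image?{query}"


# Placeholder service URL and a made-up IIIF image URL (both assumptions)
iiif_image = "https://example.org/iiif/page1/full/max/0/default.jpg"
request_url = build_ocr_request_url(
    "https://your-ocr-service.azurecontainerapps.io/", iiif_image
)

# Decoding the query string recovers the original image URL
decoded = parse_qs(urlsplit(request_url).query)["image_url"][0]
```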
Reading Order Algorithm
def sort_japanese_reading_order(lines):
    """Sort in Japanese classical text reading order"""
    return sorted(lines, key=lambda line: (
        -line["bbox"][0],  # x-coordinate descending (right to left)
        line["bbox"][1]    # y-coordinate ascending (top to bottom)
    ))
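A short usage example may make the ordering concrete. The function is repeated so the snippet runs on its own; the sample data is invented, and the `bbox` layout `[left, top, right, bottom]` is an assumption inferred from the indices used in the sort key.

```python
def sort_japanese_reading_order(lines):
    """Sort in Japanese classical text reading order"""
    return sorted(lines, key=lambda line: (
        -line["bbox"][0],  # x-coordinate descending (right to left)
        line["bbox"][1]    # y-coordinate ascending (top to bottom)
    ))


# Three vertical lines (columns) given in arbitrary order; values are made up
sample_lines = [
    {"text": "left column", "bbox": [1000, 1100, 1300, 2900]},
    {"text": "right column", "bbox": [3391, 1141, 3727, 2924]},
    {"text": "middle column", "bbox": [2200, 1120, 2500, 2910]},
]

ordered = [line["text"] for line in sort_japanese_reading_order(sample_lines)]
# ordered == ["right column", "middle column", "left column"]
```

Because each OCR line in vertical text is effectively one column, the x-coordinate decides the order and the y-coordinate only breaks ties between lines that start at exactly the same x position.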
TEI XML Output
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>桐壺</title>
      </titleStmt>
      <respStmt>
        <resp>Automated Transcription</resp>
        <name ref="https://github.com/ndl-lab/ndlkotenocr-lite">
          NDL古典籍OCR Lite
        </name>
      </respStmt>
    </fileDesc>
  </teiHeader>
  <facsimile>
    <surface xml:id="surface-1">
      <zone xml:id="zone-1-1" ulx="3391" uly="1141"
            lrx="3727" lry="2924" cert="0.799"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <div type="transcription">
        <pb n="1" facs="#surface-1"/>
        <lb n="1.1" corresp="#zone-1-1" cert="high"/>
        いづれの御時にか
      </div>
    </body>
  </text>
</TEI>
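The `facsimile` part of this output can be produced mechanically from OCR line data. The following is a minimal sketch with the standard library's `xml.etree.ElementTree`, not the article's actual generator: `build_surface` is a hypothetical helper, and the input keys `bbox` (as `[ulx, uly, lrx, lry]`) and `confidence` are assumptions about the OCR result format.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize TEI elements without a prefix (default namespace)
ET.register_namespace("", TEI_NS)


def build_surface(page_no, lines):
    """Build a TEI <surface> with one <zone> per OCR line."""
    surface = ET.Element(
        f"{{{TEI_NS}}}surface", {f"{{{XML_NS}}}id": f"surface-{page_no}"}
    )
    for index, line in enumerate(lines, start=1):
        ulx, uly, lrx, lry = line["bbox"]
        ET.SubElement(surface, f"{{{TEI_NS}}}zone", {
            f"{{{XML_NS}}}id": f"zone-{page_no}-{index}",
            "ulx": str(ulx), "uly": str(uly),
            "lrx": str(lrx), "lry": str(lry),
            # Confidence rounded to three decimals, as in the sample output
            "cert": f"{line['confidence']:.3f}",
        })
    return surface


surface = build_surface(1, [{"bbox": [3391, 1141, 3727, 2924], "confidence": 0.799}])
xml_text = ET.tostring(surface, encoding="unicode")
```

Building the tree with a library rather than string concatenation means that escaping and namespace declarations are handled by the serializer.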
Processing Results
Small-Scale Test (Kiritsubo)
- Target: “Kiritsubo” held by the University of Tokyo
- Pages: 32 pages
- Processing time: Approximately 30 seconds
- Success rate: 100%
- Concurrency: 10 parallel requests
- Cost: Approximately $0.05
Performance Characteristics
Processing time = ~1 second/page (with parallel processing)
Cost efficiency = $1.5-2.0/1000 pages
Scaling = 0 to 20 replicas in seconds
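The per-page figures can be cross-checked against the Kiritsubo test by simple division. The sketch below uses only the numbers reported above (32 pages, about 30 seconds, about $0.05); the function names are hypothetical, and the results are rough estimates because the inputs are themselves approximate.

```python
def cost_per_1000_pages(total_cost_usd, pages):
    """Extrapolate a measured cost to 1000 pages."""
    return total_cost_usd / pages * 1000


def seconds_per_page(total_seconds, pages):
    """Average wall-clock time per page, including parallelism."""
    return total_seconds / pages


# Kiritsubo test: 32 pages, ~30 seconds, ~$0.05
kiritsubo_cost = cost_per_1000_pages(0.05, 32)  # about 1.56 USD per 1000 pages
kiritsubo_pace = seconds_per_page(30, 32)       # a little under 1 second per page
```

Both values agree with the stated ranges of $1.5-2.0 per 1000 pages and roughly 1 second per page.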
Technical Features
1. Cold Start Handling
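import asyncio

# HTTPError and ocr_request are assumed to come from the HTTP client code (not shown)

async def process_with_retry(image_url, max_retries=3):
    """Automatic retry for cold starts"""
    for attempt in range(max_retries + 1):
        try:
            if attempt > 0:
                # Exponential backoff: wait 1, 2, 4, ... seconds before each retry
                wait_time = 2 ** (attempt - 1)
                await asyncio.sleep(wait_time)
            return await ocr_request(image_url)
        except (HTTPError, TimeoutError):
            if attempt == max_retries:
                # Out of retries: re-raise with the original traceback
                raise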
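To see the backoff schedule without waiting for real delays, the same pattern can be exercised with an injected sleep function. This is a self-contained sketch under stated assumptions: `retry_with_backoff`, `fake_sleep`, and `flaky` are hypothetical names, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever exceptions the real HTTP client raises.

```python
import asyncio


def backoff_delays(max_retries=3):
    """Delays in seconds before retry 1, 2, ..., max_retries."""
    return [2 ** (attempt - 1) for attempt in range(1, max_retries + 1)]


async def retry_with_backoff(operation, max_retries=3, sleep=asyncio.sleep):
    """Call operation(), retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        if attempt > 0:
            await sleep(2 ** (attempt - 1))
        try:
            return await operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise


recorded = []  # delays requested by the retry loop
calls = {"count": 0}


async def fake_sleep(seconds):
    # Record the delay instead of actually waiting
    recorded.append(seconds)


async def flaky():
    # Fails twice (simulating a cold start), then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("cold start")
    return "ok"


outcome = asyncio.run(retry_with_backoff(flaky, sleep=fake_sleep))
# outcome == "ok" after waits of 1 and 2 seconds; backoff_delays() == [1, 2, 4]
```

With the default of three retries, the worst case adds 1 + 2 + 4 = 7 seconds of waiting before the error is raised to the caller.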
2. Externalized Configuration
# Configuration via environment variables
OCR_API_URL=https://your-ocr-service.azurecontainerapps.io
DEFAULT_MAX_CONCURRENT=10
DEFAULT_CONFIDENCE_THRESHOLD=0.3
DEFAULT_OUTPUT_FORMAT=xml
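On the application side, these variables could be read with typed defaults along the following lines. This is only a sketch: `load_config` and the dictionary keys are hypothetical, and the defaults simply mirror the values listed above.

```python
import os


def load_config(env=None):
    """Read OCR client settings from environment variables with defaults."""
    env = os.environ if env is None else env
    return {
        "ocr_api_url": env.get(
            "OCR_API_URL", "https://your-ocr-service.azurecontainerapps.io"
        ),
        # Environment values are strings, so convert numeric settings explicitly
        "max_concurrent": int(env.get("DEFAULT_MAX_CONCURRENT", "10")),
        "confidence_threshold": float(env.get("DEFAULT_CONFIDENCE_THRESHOLD", "0.3")),
        "output_format": env.get("DEFAULT_OUTPUT_FORMAT", "xml"),
    }


# Passing a plain dict makes the function easy to try without touching os.environ
defaults = load_config({})
overridden = load_config({"DEFAULT_MAX_CONCURRENT": "5"})
```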
3. Swagger UI Integration
# Automatic API specification generation
api = Api(
    app,
    version='1.0',
    title='NDL Classical Japanese OCR API',
    description='OCR processing API specialized for Japanese classical texts',
    doc='/docs/'
)
Deployment
Azure Container Apps Deployment
# Create container app
az containerapp create \
  --name ocr-service \
  --resource-group rg-ocr \
  --environment container-env \
  --image registry.azurecr.io/ocr-app:latest \
  --target-port 80 \
  --ingress external \
  --min-replicas 0 \
  --max-replicas 30 \
  --cpu 2.0 \
  --memory 4Gi
Docker Configuration
FROM python:3.11-slim
# Place NDL OCR model
COPY model/ /app/model/
COPY config/ /app/config/
# Application setup
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 80
CMD ["gunicorn", "--bind", "0.0.0.0:80", "app:app"]
Operations and Monitoring
Performance Metrics
- Response time: Average 2-3 seconds/image
- Throughput: 10-15 images/second (with 20 replicas)
- Success rate: Over 99%
- Cost efficiency: $0 when idle, charged only during processing
Log Monitoring
# Check Container Apps logs
az containerapp logs show \
  --name ocr-service \
  --resource-group rg-ocr \
  --follow
Future Prospects
Technical Improvements
- Image caching: Reduction of duplicate processing
- Batch processing: Efficient bulk processing
- GPU support: Acceleration of OCR processing
- Enhanced metrics: Detailed performance analysis
Potential Applications
- Digital archives: Use in libraries and museums
- Research support: Digitization for humanities research
- Education: Creating teaching materials from classical texts
- Cultural preservation: Digital preservation of rare materials
Summary
By combining NDL Classical Japanese OCR Lite with Azure Container Apps, we built a classical text OCR system that achieves both cost efficiency and scalability. The serverless architecture enables pay-per-use billing and auto-scaling, making it a practical digital humanities tool.
Key Points
- Cost optimization: Charged only during use
- Auto-scaling: Resource adjustment based on demand
- TEI P5-compliant: Standardized XML output
- Classical text specialization: OCR optimized for Japanese classical texts
- API design: Simple and extensible architecture
This system was developed as a technical proof of concept. In production use, please give adequate consideration to the load on target servers, apply appropriate rate limiting, and comply with terms of service.