
Overview
Digitizing Tibetan manuscripts is one of the major challenges in the digital humanities. Precious Buddhist scriptures and historical documents are preserved in libraries around the world, yet most have never been converted to text data. Manual transcription requires enormous time and cost, and the number of researchers with the necessary expertise is limited.
This article introduces BDRC Tibetan OCR, an open-source Tibetan OCR system developed by the Buddhist Digital Resource Center (BDRC).
It also presents an implementation example from a project to digitize 114 Tibetan manuscript Kangyur volumes.
What is BDRC Tibetan OCR?
BDRC Tibetan OCR is a free, open-source tool that automatically extracts text from Tibetan images.
Key Features
1. Desktop Application
A GUI application that runs on Windows and macOS (Intel and Apple Silicon M1/M2).
Installation:
- Download the ZIP file for your OS from the releases page
- Simply extract and run the executable
2. Multiple Output Formats
- Plain text: Extracted Unicode Tibetan characters
- PageXML: XML with coordinate information (compatible with Transkribus)
- Wylie: Romanized transliteration format
3. Image Correction Features
- Dewarping: Corrects page curvature
- Rotation correction: Automatically detects and corrects page tilt
- Line detection: Line segmentation functionality
4. Batch Processing Support
- Batch processing of multiple image files
- Direct OCR from PDF files
- Automatic retrieval and processing from IIIF (International Image Interoperability Framework) manifests
Four Specialized OCR Models
One of the distinguishing features of BDRC Tibetan OCR is that it provides four specialized models optimized for different scripts and material types.
1. Uchen Model - For Modern Print
Use case: Printed scriptures, modern publications
Dataset: Unified Uchen model with 4.4 million samples
Characteristics: The most standard script in Tibetan
Application: Computer-font printed text, woodblock printing
Uchen means “with head” and is the most standard script in Tibetan. It is used in modern print materials and digital fonts.
2. Ume Model - For Handwritten Manuscripts
Use case: Handwritten manuscripts, cursive documents
Dataset: Two variants, Ume-Druma and Ume-Petsuk
Characteristics: More rounded and flowing than Uchen
Application: Traditional handwritten scriptures, decorative manuscripts
Ume means “headless letters” and is a script widely used in Buddhist manuscripts.
3. Woodblock Model - For Classical Block Prints
Use case: Traditional woodblock printed texts
Dataset: Approximately 30,000 reviewed sample lines from the 8th Karmapa's miscellaneous works
Characteristics: Handles irregularities and degradation specific to old block prints
Application: Traditional Tibetan Buddhist woodblock prints, including chipped, faded, and ink-blurred prints
4. Other Specialized Models
- Khyentse Wangpo dataset: Trained on approximately 13,000 sample lines from modern typeset editions
- Dunhuang manuscript model: A specialized model for ancient documents dating back to the 8th century
Training Data
These models were trained on datasets collected from the following sources:
- BDRC - Buddhist Digital Resource Center
- ALL - Asian Legacy Library
- Adarsha
- NorbuKetaka
The trained models and portions of the datasets are published as open access on BDRC’s HuggingFace account and OpenPecha.
Implementation Example: Tibetan Manuscript Kangyur Digitization Project
This section presents an implementation example from a project to digitize 114 Tibetan manuscript Kangyur volumes.
Project Overview
- Target materials: 114 Tibetan manuscript Kangyur volumes
- Processing method: Automatic image retrieval via IIIF Image API + batch OCR processing
- Output format: TEI/XML format (Text Encoding Initiative P5 compliant)
- Publication: Side-by-side display of images and text via a web viewer
Technical Architecture
1. Efficient Image Retrieval via IIIF Integration
# Main functions of batch_ocr_from_iiif.py
# Automatically extract image URLs from IIIF manifests
def extract_image_urls(manifest: dict) -> List[Tuple[str, str, int, int, str]]:
    """Extract image URLs from IIIF manifest
    Returns: List of (label, image_url, width, height, iiif_service_url)
    """
    images = []
    sequences = manifest.get('sequences', [])
    for sequence in sequences:
        canvases = sequence.get('canvases', [])
        for canvas in canvases:
            # Extract image URLs and metadata
            label = canvas.get('label', 'unknown')
            width = canvas.get('width', 0)
            height = canvas.get('height', 0)
            # ...
    return images
This automatically retrieves metadata and high-resolution images from image servers compliant with the IIIF (International Image Interoperability Framework) specification.
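Once a canvas's IIIF Image API service URL is known, a download URL for the full-resolution image can be derived from the Image API 2.x URI pattern: {service}/{region}/{size}/{rotation}/{quality}.{format}. The helper below is an illustrative sketch (the function name and example URL are not from the project's code):

```python
def build_image_url(service_url: str, size: str = "full") -> str:
    """Build a IIIF Image API 2.x URL for the whole image.

    'full' requests the original resolution; a size like '!2000,2000'
    would instead request a copy scaled to fit inside 2000x2000 pixels.
    """
    # region=full, rotation=0, quality=default, format=jpg
    return f"{service_url.rstrip('/')}/full/{size}/0/default.jpg"
```

For example, `build_image_url("https://img.example.org/iiif/2/abc")` yields `https://img.example.org/iiif/2/abc/full/full/0/default.jpg`.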
2. Batch OCR Processing
# Initialize OCR pipeline
pipeline = OCRPipeline(platform, ocr_config, line_config)

# Run OCR on each image
status, result = pipeline.run_ocr(
    image=image,
    k_factor=2.5,        # Line extraction adjustment parameter
    bbox_tolerance=4.0,  # Bounding box tolerance
    merge_lines=True,    # Merge line chunks
    use_tps=False,       # TPS dewarping toggle
    target_encoding=Encoding.Unicode
)
Key parameters:
- k_factor: Adjusts line detection sensitivity (2.5 used for woodblock prints)
- bbox_tolerance: Character bounding box tolerance (default: 4.0)
- merge_lines: Automatically merge split lines
- use_tps: Dewarping via TPS (Thin Plate Spline) transformation
3. TEI/XML Output
# Output in TEI/XML format (excerpt)
def create_tei_xml(manifest_label: str, page_results: List[dict],
                   output_path: str, identifier: str = None):
    """Create a TEI XML file for all pages with IIIF information"""
    # TEI header (metadata)
    title = etree.SubElement(title_stmt, f"{{{TEI_NS}}}title")
    title.text = manifest_label

    # Text body
    for page_data in page_results:
        # Page break (including IIIF image URL)
        pb = etree.SubElement(p, f"{{{TEI_NS}}}pb")
        pb.attrib["n"] = str(page_num)
        pb.attrib["facs"] = page_data['iiif_service_url']

        # Line data (links to coordinate information)
        for line_idx, line_text in enumerate(page_data['line_texts'], 1):
            lb = etree.SubElement(p, f"{{{TEI_NS}}}lb")
            lb.attrib["corresp"] = f"#z{page_num}_l{line_idx}"
            lb.attrib["n"] = str(line_idx)
            lb.tail = line_text

    # Facsimile (image and coordinate data)
    for page_data in page_results:
        # Save line coordinates as zone elements
        for line_idx, line_coords in enumerate(page_data['line_coords'], 1):
            zone = etree.SubElement(surface, f"{{{TEI_NS}}}zone")
            zone.attrib[f"{{{XML_NS}}}id"] = f"z{page_num}_l{line_idx}"
            zone.attrib["ulx"] = str(min(x_coords))
            zone.attrib["uly"] = str(min(y_coords))
            zone.attrib["lrx"] = str(max(x_coords))
            zone.attrib["lry"] = str(max(y_coords))
TEI/XML structure:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <titleStmt><title>Trisaṁvara-nirdeśa</title></titleStmt>
    <publicationStmt>
      <idno type="UUID">845065fa-1f31-11ec-9621-0242ac130002</idno>
    </publicationStmt>
  </teiHeader>
  <text>
    <body>
      <p>
        <pb n="1" facs="https://img.toyobunko-lab.jp/iiif/..."/>
        <lb corresp="#z1_l1" n="1" type="line"/>ཨ་ཀ།
      </p>
    </body>
  </text>
  <facsimile>
    <surface xml:id="page-0" n="1" ulx="0" uly="0" lrx="7360" lry="4912">
      <graphic url="https://img.toyobunko-lab.jp/iiif/..."/>
      <zone xml:id="z1_l1" ulx="2720" uly="678" lrx="3942" lry="774"/>
    </surface>
  </facsimile>
</TEI>
Processing Flow
1. Retrieve image URLs and metadata from IIIF manifests
|
2. Download high-resolution images (or use existing ones)
|
3. Process each page with BDRC Tibetan OCR (Woodblock model)
|
4. Output OCR results in the following formats:
- Individual page text files (page_0001.txt)
- PageXML format (with coordinate information)
- TEI/XML format (with coordinate information and IIIF URLs)
|
5. Publish via web viewer
- Side-by-side display of images (OpenSeadragon) and text
- Provide metadata via DTS Collections API
Usage
Single Image OCR
python run_ocr.py \
    --image /path/to/tibetan_page.jpg \
    --model Woodblock \
    --format all \
    --encoding unicode
Batch Processing from IIIF Manifests
python batch_ocr_from_iiif.py \
    --manifest "https://app.toyobunko-lab.jp/iiif/2/UUID/manifest" \
    --output-dir iiif_ocr_output \
    --model Woodblock \
    --format both \
    --identifier UUID
Key options:
- --model: OCR model to use (Modern, Ume_Druma, Ume_Petsuk, Woodblock, Woodblock-Stacks)
- --format: Output format (text, xml, json, all)
- --encoding: Character encoding (unicode, wylie)
- --dewarp: Apply dewarping correction
- --bbox-tolerance: Bounding box tolerance (default: 4.0)
Project Results
- Processed documents: 33 (in progress)
- TEI/XML output: Generated coordinate-enhanced XML for each manuscript
- IIIF integration: Achieved a web viewer integrating images and text
- DTS Collections API: Provided standardized metadata API
Technical Details
Architecture
BDRC Tibetan OCR consists of two main neural networks:
Line detection model (PhotiLines)
- Line region detection via semantic segmentation
- Patch size: 512x512
- Provided in ONNX format
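Patch-based segmentation of this kind typically tiles the page into fixed-size windows before inference. The generic sketch below (assuming grayscale images and edge padding; not BDRC's actual code) shows how a page can be cut into 512x512 patches:

```python
import numpy as np

def tile_image(img: np.ndarray, patch: int = 512):
    """Yield (row, col, tile) covering a 2-D grayscale image.

    Borders are edge-padded so every tile is exactly patch x patch,
    which is what a fixed-input segmentation network expects.
    """
    h, w = img.shape
    pad_h = (-h) % patch  # amount needed to reach the next multiple
    pad_w = (-w) % patch
    padded = np.pad(img, ((0, pad_h), (0, pad_w)), mode="edge")
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            yield y // patch, x // patch, padded[y:y + patch, x:x + patch]
```

The model's per-patch masks would then be stitched back at the same (row, col) offsets to form a full-page line mask.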
OCR model (Easter2 architecture)
- CRNN (Convolutional Recurrent Neural Network) based
- Input: Variable width x fixed height images
- Output: Unicode strings or Wylie transliteration
- Fast inference with ONNX Runtime
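A fixed-height, variable-width input convention usually means each detected line is rescaled to the model's height and padded to its maximum width. The following is a dependency-free illustration of that preprocessing step (the exact resizing and normalization BDRC uses may differ):

```python
import numpy as np

def prepare_line_image(img: np.ndarray, target_h: int = 96,
                       max_w: int = 1024) -> np.ndarray:
    """Scale a 2-D grayscale line image to a fixed height, keeping the
    aspect ratio, then right-pad with white to the model's max width."""
    h, w = img.shape
    new_w = min(max_w, max(1, round(w * target_h / h)))
    # Nearest-neighbour resize via index arrays (no external deps)
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(new_w) * w // new_w
    resized = img[rows][:, cols]
    out = np.full((target_h, max_w), 255, dtype=img.dtype)
    out[:, :new_w] = resized
    return out
```

A real pipeline would typically follow this with normalization to the value range the network was trained on.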
Programmatic Usage
Basic Python Usage
import sys
import cv2

sys.path.insert(0, '/path/to/tibetan-ocr-app')

from BDRC.Utils import get_platform
from BDRC.Data import (Encoding, LineDetectionConfig, OCRModelConfig,
                       CharsetEncoder, OCRArchitecture)
from BDRC.Inference import OCRPipeline

# Load model configuration
ocr_config = OCRModelConfig(
    model_file="/path/to/OCRModels/Woodblock/OCRModel.onnx",
    architecture=OCRArchitecture.Easter2,
    input_height=96,
    input_width=1024,
    charset="charset.json",
    encoder=CharsetEncoder.Stack
)

line_config = LineDetectionConfig(
    model_file="/path/to/Models/Lines/PhotiLines.onnx",
    patch_size=512
)

# Initialize pipeline
platform = get_platform()
pipeline = OCRPipeline(platform, ocr_config, line_config)

# Load image and run OCR
image = cv2.imread("tibetan_page.jpg")
status, result = pipeline.run_ocr(
    image=image,
    target_encoding=Encoding.Unicode
)

if status.name == "SUCCESS":
    rot_mask, lines, ocr_lines, angle = result
    for line in ocr_lines:
        print(line.text)
Output Format Details
1. Plain Text (.txt)
༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་ཞེས་བྱ་བ་ཐེག་པ་ཆེན་པོའི་མདོ།
ན་མོ་བུདྡྷཱ་ཡ། ན་མོ་དྷརྨཱ་ཡ། ན་མོ་སངྒྷཱ་ཡ།
...
2. PageXML (.xml)
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="7360" imageHeight="4912">
    <TextRegion id="page_1_region">
      <TextLine id="page_1_line_0">
        <Coords points="2720,678 3942,678 3942,774 2720,774"/>
        <TextEquiv><Unicode>༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
3. JSONL (.jsonl)
{"line": 0, "text": "༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་", "coords": [[2720, 678], [3942, 678], [3942, 774], [2720, 774]]}
{"line": 1, "text": "ན་མོ་བུདྡྷཱ་ཡ། ན་མོ་དྷརྨཱ་ཡ། ན་མོ་སངྒྷཱ་ཡ།", "coords": [[2650, 820], [4012, 820], [4012, 916], [2650, 916]]}
Performance
Values measured with the Woodblock model on a MacBook Pro (M1):
- Processing speed: Approximately 1 page per 15-20 seconds (7360x4912 pixel high-resolution images)
- Line detection accuracy: Over 95%
- Character recognition accuracy: 90-95% (varies depending on material condition)
- Memory usage: Approximately 2 GB
Comparison with Related Tools
Tesseract OCR
An open-source OCR engine developed by Google.
- Supported languages: Over 100 languages (including Tibetan)
- Accuracy: Low recognition accuracy for Tibetan (especially classical manuscripts)
- Use case: Suitable for general document OCR
Transkribus
A handwritten document recognition platform developed by READ-COOP.
- Features: Specialized in HTR (Handwritten Text Recognition)
- Custom models: Training project-specific recognition models is possible
- Compatibility: BDRC Tibetan OCR is compatible with Transkribus via PageXML format
- Limitation: Free version limited to 500 credits per month
Advantages of BDRC Tibetan OCR
- Provides four specialized models tailored to Tibetan
- Completely free and open source
- Supports woodblock prints and classical manuscripts
- Supports IIIF-integrated workflows
- Runs in a local environment
Summary
Key Features
- Specialized models: Four models optimized by script and material type
- Completely free: Available without restrictions as open source
- Two interfaces: Provides both a GUI app and CLI tools
- Standards compliance: Supports international standards including IIIF, TEI/XML, and PageXML
- Local processing: Processing is completed entirely in a local environment
Applicable Projects
- Building digital libraries
- Converting Buddhist scriptures to text
- Creating research corpora
- Archiving historical documents
- Developing digital educational materials
Future Possibilities
The following feature extensions are conceivable for projects like this:
- Automatic proofreading: Improving accuracy through post-processing of OCR results
- Parallel text display: Side-by-side comparison of multiple editions
- Full-text search: Searching against OCR text
- Annotation functionality: Adding comments and annotations by researchers
Resources
Official Links
- GitHub repository: https://github.com/buda-base/tibetan-ocr-app
- Releases page: https://github.com/buda-base/tibetan-ocr-app/releases
- Trained models (HuggingFace): https://huggingface.co/BDRC
- Training code: https://github.com/buda-base/tibetan-ocr-training
References
- Buddhist Digital Resource Center: https://www.bdrc.io/
- TEI (Text Encoding Initiative): https://tei-c.org/
- IIIF (International Image Interoperability Framework): https://iiif.io/
- DTS (Distributed Text Services): https://distributed-text-services.github.io/
Acknowledgments
BDRC Tibetan OCR is an open-source tool developed by the Buddhist Digital Resource Center (BDRC). We thank Eric Werner, the developer of the tool.
Published: 2025-11-13