
Overview
Digitizing Tibetan manuscripts is one of the major challenges in the digital humanities. Precious Buddhist scriptures and historical documents are preserved in libraries around the world, yet most have never been converted to text data. Manual transcription requires enormous time and cost, and the number of researchers with the necessary expertise is limited.
This article introduces BDRC Tibetan OCR, an open-source Tibetan OCR system developed by the Buddhist Digital Resource Center (BDRC).
It also presents an implementation example from a project to digitize 114 Tibetan manuscript Kangyur volumes.
What is BDRC Tibetan OCR?
BDRC Tibetan OCR is a free, open-source tool that automatically extracts text from Tibetan images.
Key Features
1. Desktop Application
A GUI application that runs on Windows and macOS (Intel and Apple Silicon M1/M2).
Installation:
- Download the ZIP file for your OS from the releases page
- Simply extract and run the executable
2. Multiple Output Formats
- Plain text: Extracted Unicode Tibetan characters
- PageXML: XML with coordinate information (compatible with Transkribus)
- Wylie: Romanized transliteration format
3. Image Correction Features
- Dewarping: Corrects page curvature
- Rotation correction: Automatically detects and corrects page tilt
- Line detection: Line segmentation functionality
4. Batch Processing Support
- Batch processing of multiple image files
- Direct OCR from PDF files
- Automatic retrieval and processing from IIIF (International Image Interoperability Framework) manifests
Four Specialized OCR Models
One of the distinguishing features of BDRC Tibetan OCR is that it provides four specialized models optimized for different scripts and material types.
1. Uchen Model - For Modern Print
Use case: Printed scriptures, modern publications
Dataset: Unified Uchen model with 4.4 million samples
Characteristics: The most standard script in Tibetan
Application: Computer-font printed text, woodblock printing
Uchen means “with head” and is the most standard script in Tibetan. It is used in modern print materials and digital fonts.
2. Ume Model - For Handwritten Manuscripts
Use case: Handwritten manuscripts, cursive documents
Dataset: Two variants, Ume-Druma and Ume-Petsuk
Characteristics: More rounded and flowing than Uchen
Application: Traditional handwritten scriptures, decorative manuscripts
Ume means “headless letters” and is a script widely used in Buddhist manuscripts.
3. Woodblock Model - For Classical Block Prints
Use case: Traditional woodblock printed texts
Dataset: Approximately 30,000 reviewed sample lines from the 8th Karmapa's miscellaneous works
Characteristics: Handles irregularities and degradation specific to old block prints
Application: Traditional Tibetan Buddhist woodblock prints, including chipped, faded, and ink-blurred prints
4. Other Specialized Models
- Khyentse Wangpo dataset: Trained on approximately 13,000 sample lines from modern typeset editions
- Dunhuang manuscript model: A specialized model for ancient documents dating back to the 8th century
Training Data
These models were trained on datasets collected from the following sources:
- BDRC - Buddhist Digital Resource Center
- ALL - Asian Legacy Library
- Adarsha
- NorbuKetaka
The trained models and portions of the datasets are published as open access on BDRC’s HuggingFace account and OpenPecha.
Implementation Example: Tibetan Manuscript Kangyur Digitization Project
This section presents an implementation example from a project to digitize 114 Tibetan manuscript Kangyur volumes.
Project Overview
- Target materials: 114 Tibetan manuscript Kangyur volumes
- Processing method: Automatic image retrieval via IIIF Image API + batch OCR processing
- Output format: TEI/XML format (Text Encoding Initiative P5 compliant)
- Publication: Side-by-side display of images and text via a web viewer
Technical Architecture
1. Efficient Image Retrieval via IIIF Integration
# Main functions of batch_ocr_from_iiif.py
# Automatically extract image URLs from IIIF manifests
def extract_image_urls(manifest: dict) -> List[Tuple[str, str, int, int, str]]:
    """Extract image URLs from IIIF manifest
    Returns: List of (label, image_url, width, height, iiif_service_url)
    """
    images = []
    sequences = manifest.get('sequences', [])
    for sequence in sequences:
        canvases = sequence.get('canvases', [])
        for canvas in canvases:
            # Extract image URLs and metadata
            label = canvas.get('label', 'unknown')
            width = canvas.get('width', 0)
            height = canvas.get('height', 0)
            # ...
    return images
This automatically retrieves metadata and high-resolution images from image servers compliant with the IIIF (International Image Interoperability Framework) specification.
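Once a canvas's IIIF Image API service URL is known, a download URL for the full-resolution image can be derived from the Image API 2.x URI pattern: {service}/{region}/{size}/{rotation}/{quality}.{format}. The helper below is an illustrative sketch (the function name and example URL are not from the project's code):

```python
def build_image_url(service_url: str, size: str = "full") -> str:
    """Build a IIIF Image API 2.x URL for the whole image.

    'full' requests the original resolution; a size like '!2000,2000'
    would instead request a copy scaled to fit inside 2000x2000 pixels.
    """
    # region=full, rotation=0, quality=default, format=jpg
    return f"{service_url.rstrip('/')}/full/{size}/0/default.jpg"
```

For example, `build_image_url("https://img.example.org/iiif/2/abc")` yields `https://img.example.org/iiif/2/abc/full/full/0/default.jpg`.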
2. Batch OCR Processing
# Initialize OCR pipeline
pipeline = OCRPipeline(platform, ocr_config, line_config)

# Run OCR on each image
status, result = pipeline.run_ocr(
    image=image,
    k_factor=2.5,        # Line extraction adjustment parameter
    bbox_tolerance=4.0,  # Bounding box tolerance
    merge_lines=True,    # Merge line chunks
    use_tps=False,       # TPS dewarping toggle
    target_encoding=Encoding.Unicode
)
Key parameters:
- k_factor: Adjusts line detection sensitivity (2.5 used for woodblock prints)
- bbox_tolerance: Character bounding box tolerance (default: 4.0)
- merge_lines: Automatically merge split lines
- use_tps: Dewarping via TPS (Thin Plate Spline) transformation
3. TEI/XML Output
# Output in TEI/XML format (excerpt)
def create_tei_xml(manifest_label: str, page_results: List[dict],
                   output_path: str, identifier: str = None):
    """Create a TEI XML file for all pages with IIIF information"""
    # TEI header (metadata)
    title = etree.SubElement(title_stmt, f"{{{TEI_NS}}}title")
    title.text = manifest_label

    # Text body
    for page_data in page_results:
        # Page break (including IIIF image URL)
        pb = etree.SubElement(p, f"{{{TEI_NS}}}pb")
        pb.attrib["n"] = str(page_num)
        pb.attrib["facs"] = page_data['iiif_service_url']

        # Line data (links to coordinate information)
        for line_idx, line_text in enumerate(page_data['line_texts'], 1):
            lb = etree.SubElement(p, f"{{{TEI_NS}}}lb")
            lb.attrib["corresp"] = f"#z{page_num}_l{line_idx}"
            lb.attrib["n"] = str(line_idx)
            lb.tail = line_text

    # Facsimile (image and coordinate data)
    for page_data in page_results:
        # Save line coordinates as zone elements
        for line_idx, line_coords in enumerate(page_data['line_coords'], 1):
            zone = etree.SubElement(surface, f"{{{TEI_NS}}}zone")
            zone.attrib[f"{{{XML_NS}}}id"] = f"z{page_num}_l{line_idx}"
            zone.attrib["ulx"] = str(min(x_coords))
            zone.attrib["uly"] = str(min(y_coords))
            zone.attrib["lrx"] = str(max(x_coords))
            zone.attrib["lry"] = str(max(y_coords))
TEI/XML structure:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <titleStmt><title>Trisaṁvara-nirdeśa</title></titleStmt>
    <publicationStmt>
      <idno type="UUID">845065fa-1f31-11ec-9621-0242ac130002</idno>
    </publicationStmt>
  </teiHeader>
  <text>
    <body>
      <p>
        <pb n="1" facs="https://img.toyobunko-lab.jp/iiif/..."/>
        <lb corresp="#z1_l1" n="1" type="line"/>ཨ་ཀ།
      </p>
    </body>
  </text>
  <facsimile>
    <surface xml:id="page-0" n="1" ulx="0" uly="0" lrx="7360" lry="4912">
      <graphic url="https://img.toyobunko-lab.jp/iiif/..."/>
      <zone xml:id="z1_l1" ulx="2720" uly="678" lrx="3942" lry="774"/>
    </surface>
  </facsimile>
</TEI>
Processing Flow
1. Retrieve image URLs and metadata from IIIF manifests
|
2. Download high-resolution images (or use existing ones)
|
3. Process each page with BDRC Tibetan OCR (Woodblock model)
|
4. Output OCR results in the following formats:
- Individual page text files (page_0001.txt)
- PageXML format (with coordinate information)
- TEI/XML format (with coordinate information and IIIF URLs)
|
5. Publish via web viewer
- Side-by-side display of images (OpenSeadragon) and text
- Provide metadata via DTS Collections API
Usage
Single Image OCR
python run_ocr.py \
    --image /path/to/tibetan_page.jpg \
    --model Woodblock \
    --format all \
    --encoding unicode
Batch Processing from IIIF Manifests
python batch_ocr_from_iiif.py \
    --manifest "https://app.toyobunko-lab.jp/iiif/2/UUID/manifest" \
    --output-dir iiif_ocr_output \
    --model Woodblock \
    --format both \
    --identifier UUID
Key options:
- --model: OCR model to use (Modern, Ume_Druma, Ume_Petsuk, Woodblock, Woodblock-Stacks)
- --format: Output format (text, xml, json, all)
- --encoding: Character encoding (unicode, wylie)
- --dewarp: Apply dewarping correction
- --bbox-tolerance: Bounding box tolerance (default: 4.0)
Project Results
- Processed documents: 33 (in progress)
- TEI/XML output: Generated coordinate-enhanced XML for each manuscript
- IIIF integration: Achieved a web viewer integrating images and text
- DTS Collections API: Provided standardized metadata API
Technical Details
Architecture
BDRC Tibetan OCR consists of two main neural networks:
Line detection model (PhotiLines)
- Line region detection via semantic segmentation
- Patch size: 512x512
- Provided in ONNX format
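Patch-based segmentation of this kind typically tiles the page into fixed-size windows before inference. The generic sketch below (assuming grayscale images and edge padding; not BDRC's actual code) shows how a page can be cut into 512x512 patches:

```python
import numpy as np

def tile_image(img: np.ndarray, patch: int = 512):
    """Yield (row, col, tile) covering a 2-D grayscale image.

    Borders are edge-padded so every tile is exactly patch x patch,
    which is what a fixed-input segmentation network expects.
    """
    h, w = img.shape
    pad_h = (-h) % patch  # amount needed to reach the next multiple
    pad_w = (-w) % patch
    padded = np.pad(img, ((0, pad_h), (0, pad_w)), mode="edge")
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            yield y // patch, x // patch, padded[y:y + patch, x:x + patch]
```

The model's per-patch masks would then be stitched back at the same (row, col) offsets to form a full-page line mask.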
OCR model (Easter2 architecture)
- CRNN (Convolutional Recurrent Neural Network) based
- Input: Variable width x fixed height images
- Output: Unicode strings or Wylie transliteration
- Fast inference with ONNX Runtime
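A fixed-height, variable-width input convention usually means each detected line is rescaled to the model's height and padded to its maximum width. The following is a dependency-free illustration of that preprocessing step (the exact resizing and normalization BDRC uses may differ):

```python
import numpy as np

def prepare_line_image(img: np.ndarray, target_h: int = 96,
                       max_w: int = 1024) -> np.ndarray:
    """Scale a 2-D grayscale line image to a fixed height, keeping the
    aspect ratio, then right-pad with white to the model's max width."""
    h, w = img.shape
    new_w = min(max_w, max(1, round(w * target_h / h)))
    # Nearest-neighbour resize via index arrays (no external deps)
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(new_w) * w // new_w
    resized = img[rows][:, cols]
    out = np.full((target_h, max_w), 255, dtype=img.dtype)
    out[:, :new_w] = resized
    return out
```

A real pipeline would typically follow this with normalization to the value range the network was trained on.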
Programmatic Usage
Basic Python Usage
import sys
import cv2

sys.path.insert(0, '/path/to/tibetan-ocr-app')

from BDRC.Utils import get_platform
from BDRC.Data import (Encoding, LineDetectionConfig, OCRModelConfig,
                       CharsetEncoder, OCRArchitecture)
from BDRC.Inference import OCRPipeline

# Load model configuration
ocr_config = OCRModelConfig(
    model_file="/path/to/OCRModels/Woodblock/OCRModel.onnx",
    architecture=OCRArchitecture.Easter2,
    input_height=96,
    input_width=1024,
    charset="charset.json",
    encoder=CharsetEncoder.Stack
)

line_config = LineDetectionConfig(
    model_file="/path/to/Models/Lines/PhotiLines.onnx",
    patch_size=512
)

# Initialize pipeline
platform = get_platform()
pipeline = OCRPipeline(platform, ocr_config, line_config)

# Load image and run OCR
image = cv2.imread("tibetan_page.jpg")
status, result = pipeline.run_ocr(
    image=image,
    target_encoding=Encoding.Unicode
)

if status.name == "SUCCESS":
    rot_mask, lines, ocr_lines, angle = result
    for line in ocr_lines:
        print(line.text)
Output Format Details
1. Plain Text (.txt)
༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་ཞེས་བྱ་བ་ཐེག་པ་ཆེན་པོའི་མདོ།
ན་མོ་བུདྡྷཱ་ཡ། ན་མོ་དྷརྨཱ་ཡ། ན་མོ་སངྒྷཱ་ཡ།
...
2. PageXML (.xml)
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="7360" imageHeight="4912">
    <TextRegion id="page_1_region">
      <TextLine id="page_1_line_0">
        <Coords points="2720,678 3942,678 3942,774 2720,774"/>
        <TextEquiv><Unicode>༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
3. JSONL (.jsonl)
{"line": 0, "text": "༄༅། །འཕགས་པ་གསང་བ་གསུམ་པ་བསྟན་པ་", "coords": [[2720, 678], [3942, 678], [3942, 774], [2720, 774]]}
{"line": 1, "text": "ན་མོ་བུདྡྷཱ་ཡ། ན་མོ་དྷརྨཱ་ཡ། ན་མོ་སངྒྷཱ་ཡ།", "coords": [[2650, 820], [4012, 820], [4012, 916], [2650, 916]]}
Performance
Values measured with the Woodblock model on a MacBook Pro (M1):
- Processing speed: Approximately 1 page per 15-20 seconds (7360x4912 pixel high-resolution images)
- Line detection accuracy: Over 95%
- Character recognition accuracy: 90-95% (varies depending on material condition)
- Memory usage: Approximately 2 GB
Comparison with Related Tools
Tesseract OCR
An open-source OCR engine developed by Google.
- Supported languages: Over 100 languages (including Tibetan)
- Accuracy: Low recognition accuracy for Tibetan (especially classical manuscripts)
- Use case: Suitable for general document OCR
Transkribus
A handwritten document recognition platform developed by READ-COOP.
- Features: Specialized in HTR (Handwritten Text Recognition)
- Custom models: Training project-specific recognition models is possible
- Compatibility: BDRC Tibetan OCR is compatible with Transkribus via PageXML format
- Limitation: Free version limited to 500 credits per month
Advantages of BDRC Tibetan OCR
- Provides four specialized models tailored to Tibetan
- Completely free and open source
- Supports woodblock prints and classical manuscripts
- Supports IIIF-integrated workflows
- Runs in a local environment
Summary
Key Features
- Specialized models: Four models optimized by script and material type
- Completely free: Available without restrictions as open source
- Two interfaces: Provides both a GUI app and CLI tools
- Standards compliance: Supports international standards including IIIF, TEI/XML, and PageXML
- Local processing: Processing is completed entirely in a local environment
Applicable Projects
- Building digital libraries
- Converting Buddhist scriptures to text
- Creating research corpora
- Archiving historical documents
- Developing digital educational materials
Future Possibilities
The following feature extensions are conceivable for projects like this:
- Automatic proofreading: Improving accuracy through post-processing of OCR results
- Parallel text display: Side-by-side comparison of multiple editions
- Full-text search: Searching against OCR text
- Annotation functionality: Adding comments and annotations by researchers
Resources
Official Links
- GitHub repository: https://github.com/buda-base/tibetan-ocr-app
- Releases page: https://github.com/buda-base/tibetan-ocr-app/releases
- Trained models (HuggingFace): https://huggingface.co/BDRC
- Training code: https://github.com/buda-base/tibetan-ocr-training
References
- Buddhist Digital Resource Center: https://www.bdrc.io/
- TEI (Text Encoding Initiative): https://tei-c.org/
- IIIF (International Image Interoperability Framework): https://iiif.io/
- DTS (Distributed Text Services): https://distributed-text-services.github.io/
Acknowledgments
BDRC Tibetan OCR is an open-source tool developed by the Buddhist Digital Resource Center (BDRC). We thank Eric Werner, the developer of the tool.
Published: 2025-11-13