TEI ODD File Customization: A Case Study with NDL Classical Book OCR

Overview

TEI (Text Encoding Initiative) is an international standard for digitizing and sharing texts in humanities research. This article introduces the process of customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application.

ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the elements and attributes you need.

Background: Developing the NDL Classical Book OCR-Lite Application

We are developing an application that outputs the results of NDL Classical Book OCR-Lite in TEI/XML format. The application is designed to perform OCR processing on Japanese classical books and output the results in standard TEI format.

We decided to include the following information in the output TEI XML:

- Text information: Character strings recognized by OCR
- Layout information: Coordinate information (bounding boxes) for each line
- Image references: IIIF (International Image Interoperability Framework) compatible image URLs
- Metadata: Document title, processing information, etc.

We wrote the schema used by this application in ODD. The following describes the customization process.

Customization Approaches

1. Initial Approach: Using Standard Modules

Initially, we created the ODD using TEI’s standard modules:

<schemaSpec ident="ndl_koten_ocr" start="TEI" prefix="tei_">
  <moduleRef key="tei"/>
  <moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
  <moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <moduleRef key="transcr" include="facsimile surface zone"/>
</schemaSpec>

Importance of the include Attribute

The include attribute of the moduleRef element is an important feature that selectively includes only specific elements from a module:

<moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>

Benefits of using the include attribute:

- You can explicitly specify only the elements you need
- The resulting schema is smaller than when the entire module is included
- It is clear which elements are being used

Without the include attribute:

<moduleRef key="header"/>

In this case, all elements from the header module (encodingDesc, profileDesc, revisionDesc, etc.) would be included.

Specifying multiple elements:

<moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>

Exclusion using the exclude attribute:

<moduleRef key="core" exclude="hi del add note"/>

The exclude attribute is the opposite of include, excluding specific elements from a module. It is useful when most elements are needed and only a few are unnecessary.

Criteria for choosing between include and exclude:

- When few elements are needed: use include
- When few elements are unnecessary: use exclude
- When clarity is important: use include (makes it clear what is being used)

However, even with this approach, related model classes and attribute classes are automatically included, making it impossible to fully minimize the schema.
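One way to decide which names belong in each include list is to count the elements that actually occur in a sample of the application's output. The sketch below does this with Python's standard xml.etree.ElementTree. The function name used_elements and the inline sample are illustrative assumptions and are not part of the application itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily shortened sample of the application's TEI output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <pb n="1"/><lb n="1.1"/>text<lb n="1.2"/>text
  </p></body></text>
</TEI>"""


def used_elements(tei_xml: str) -> Counter:
    """Count how often each element name (without namespace) occurs."""
    counts = Counter()
    for el in ET.fromstring(tei_xml).iter():
        # ElementTree stores names as "{namespace}local"; keep only the local part.
        counts[el.tag.split("}", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    for name, count in sorted(used_elements(SAMPLE).items()):
        print(name, count)
```

The resulting names still have to be sorted into their modules (header, core, textstructure, transcr) by hand, since the output document does not record which module an element comes from.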

2. Improved Approach: Deleting Unnecessary Elements

Next, we explicitly deleted unnecessary classes:

<classSpec ident="model.emphLike" type="model" mode="delete"/>
<classSpec ident="model.highlighted" type="model" mode="delete"/>

<classSpec ident="att.datable" type="atts" mode="delete"/>
<classSpec ident="att.editLike" type="atts" mode="delete"/>

3. Final Approach: Minimal Definition

Ultimately, we adopted an approach of explicitly defining only the necessary elements and attributes:

<schemaSpec ident="ndl_koten_ocr_minimal" start="TEI" prefix="tei_" docLang="ja">
  <!-- Global attributes, defined explicitly rather than pulled in from a module -->
  <classSpec ident="att.global" type="atts" mode="add">
    <attList>
      <attDef ident="xml:id" mode="add">
        <desc xml:lang="ja">一意識別子</desc>
        <!-- ID is a W3C Schema datatype, so it is referenced with name, not key -->
        <datatype><dataRef name="ID"/></datatype>
      </attDef>
      <attDef ident="xml:lang" mode="add">
        <desc xml:lang="ja">言語コード</desc>
        <datatype><dataRef key="teidata.language"/></datatype>
      </attDef>
    </attList>
  </classSpec>
</schemaSpec>

Implementation Details

Managing Coordinate Information

To manage OCR coordinate information, we defined a dedicated attribute class:

<classSpec ident="att.coordinated" type="atts" mode="add">
  <desc xml:lang="ja">座標属性</desc>
  <attList>
    <attDef ident="ulx" mode="add">
      <desc xml:lang="ja">左上X座標</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="uly" mode="add">
      <desc xml:lang="ja">左上Y座標</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lrx" mode="add">
      <desc xml:lang="ja">右下X座標</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lry" mode="add">
      <desc xml:lang="ja">右下Y座標</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
  </attList>
</classSpec>
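The four attributes describe a box by its upper-left and lower-right corners. If the OCR engine instead reports a box as x, y, width, and height (an assumption; the article does not state the raw OCR format), the conversion might look like the following sketch. The Box class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box expressed with the TEI corner attributes."""
    ulx: int  # upper-left x
    uly: int  # upper-left y
    lrx: int  # lower-right x
    lry: int  # lower-right y

    @classmethod
    def from_xywh(cls, x: int, y: int, width: int, height: int) -> "Box":
        # Assumed input format: origin plus size, as many OCR tools report it.
        return cls(x, y, x + width, y + height)

    @property
    def width(self) -> int:
        return self.lrx - self.ulx

    @property
    def height(self) -> int:
        return self.lry - self.uly

    def as_attributes(self) -> dict:
        """Return the attribute name/value pairs for a zone element."""
        return {name: str(getattr(self, name)) for name in ("ulx", "uly", "lrx", "lry")}


if __name__ == "__main__":
    print(Box.from_xywh(453, 55, 39, 689).as_attributes())
```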

IIIF Integration

To link with IIIF manifests, we added the sameAs attribute:

<elementSpec ident="facsimile" mode="add">
  <desc xml:lang="ja">ファクシミリ</desc>
  <attList>
    <attDef ident="sameAs" mode="add">
      <desc xml:lang="ja">IIIFマニフェストURL</desc>
      <datatype><dataRef key="teidata.pointer"/></datatype>
    </attDef>
  </attList>
</elementSpec>
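Because each line has pixel coordinates and the images are served through IIIF, a consumer of the TEI file could request just the image region for one line. The following is a minimal sketch using the IIIF Image API URL pattern (region/size/rotation/quality.format). It assumes the zone coordinates use the same pixel grid as the full image, which may not hold if the surface was scaled. The function name and the example base URL are hypothetical.

```python
def iiif_region_url(image_base: str, ulx: int, uly: int, lrx: int, lry: int) -> str:
    """Build an IIIF Image API URL that crops the given corner-based box."""
    # The IIIF region parameter is "x,y,w,h", so convert corners to origin plus size.
    region = f"{ulx},{uly},{lrx - ulx},{lry - uly}"
    # size=max, rotation=0, quality=default, format=jpg
    return f"{image_base.rstrip('/')}/{region}/max/0/default.jpg"


if __name__ == "__main__":
    # Hypothetical image service base URL (everything before the region segment).
    print(iiif_region_url("https://example.com/iiif/image1", 453, 55, 492, 744))
```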

Line Number Format Constraints

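We used Schematron to constrain the line number format. The matches() function belongs to XPath 2.0, so the rule needs a Schematron processor that runs with an XSLT 2.0 query binding: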

<constraintSpec ident="page-numbering" scheme="schematron"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <constraint>
    <sch:rule context="tei:lb[@n]">
      <sch:assert test="matches(@n, '^\d+\.\d+$')">
        Line numbers must follow this format: page.line (e.g., 1.1, 2.3)
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
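The same page.line pattern can also be checked in application code before a document reaches Schematron validation. The sketch below mirrors the regular expression from the assertion using Python's re module; the function name is a hypothetical helper, and the article does not say whether the application performs such a check.

```python
import re

# Same pattern as the Schematron assertion: digits, a dot, digits (page.line).
LINE_NUMBER = re.compile(r"\d+\.\d+")


def is_valid_line_number(n: str) -> bool:
    """Return True if n has the form page.line, for example "1.1" or "3.10"."""
    # fullmatch anchors both ends, so no explicit ^ and $ are needed.
    return LINE_NUMBER.fullmatch(n) is not None


if __name__ == "__main__":
    for value in ["1.1", "3.10", "1", "1.a", "1.2.3"]:
        print(value, is_valid_line_number(value))
```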

Writing Examples

Basic Structure of exemplum and egXML

In ODD, you use the exemplum and egXML elements to describe usage examples:

<elementSpec ident="pb" mode="change">
  <desc xml:lang="ja">ページ区切り</desc>
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <pb n="1" facs="https://catalog.lib.kyushu-u.ac.jp/image/iiif/1/820/411193/368828.tiff/full/max/0/default.jpg"/>
    </egXML>
  </exemplum>
</elementSpec>

Writing Complex Examples

When showing examples containing multiple elements:

<elementSpec ident="teiHeader" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>OCR処理結果</title>
            <respStmt>
              <resp>Automated Transcription</resp>
              <name ref="https://github.com/ndl-lab/ndlkotenocr-lite">
                NDL古典籍OCR-Liteアプリケーション
              </name>
            </respStmt>
          </titleStmt>
          <publicationStmt>
            <p>Converted from IIIF Manifest</p>
          </publicationStmt>
          <sourceDesc>
            <p>https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
    </egXML>
  </exemplum>
</elementSpec>

Namespace Issues and Solutions

Problem: TEI Elements Not Recognized

When using TEI elements (especially root elements) inside egXML, namespace issues can occur:

<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>...</teiHeader>
    </TEI>
  </egXML>
</exemplum>
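To see what these namespace declarations do, it can help to parse a fragment and print the expanded name of each element. The sketch below uses Python's standard xml.etree.ElementTree; the fragment is a shortened, hypothetical example. Unprefixed children of egXML inherit the Examples namespace, while an element that redeclares xmlns moves itself and its descendants into the real TEI namespace.

```python
import xml.etree.ElementTree as ET

EXAMPLES_NS = "http://www.tei-c.org/ns/Examples"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: pb inherits the default namespace of egXML,
# while TEI redeclares the default namespace for itself and its children.
FRAGMENT = f"""<egXML xmlns="{EXAMPLES_NS}">
  <pb n="1"/>
  <TEI xmlns="{TEI_NS}">
    <teiHeader/>
  </TEI>
</egXML>"""


def expanded_names(xml_text: str) -> list:
    """Return the "{namespace}local" name of every element in document order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


if __name__ == "__main__":
    for name in expanded_names(FRAGMENT):
        print(name)
```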

Solution 1: Use Namespace Prefixes

<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples" xmlns:tei="http://www.tei-c.org/ns/1.0">
    <tei:TEI>
      <tei:teiHeader>...</tei:teiHeader>
      <tei:text>...</tei:text>
    </tei:TEI>
  </egXML>
</exemplum>

Solution 2: Simplify with Comments

<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <!-- TEI root element containing teiHeader, text, and facsimile -->
  </egXML>
</exemplum>

Solution 3: Omit the Example

To avoid validation errors, completely omitting problematic examples is also an option.

Providing Language-Specific Examples

When providing multilingual examples:

<elementSpec ident="zone" mode="change">
  <exemplum xml:lang="ja">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <zone xml:id="zone-1-1" ulx="453" uly="55" lrx="492" lry="744"/>
    </egXML>
  </exemplum>

  <exemplum xml:lang="en">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <zone xml:id="zone-1-1" ulx="100" uly="100" lrx="500" lry="200"/>
    </egXML>
  </exemplum>
</elementSpec>

Demonstrating Attribute Usage

When showing various attribute values:

<elementSpec ident="lb" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <lb n="1.1" type="line" corresp="#zone-1-1"/>

      <lb n="2.5" type="line" corresp="#zone-2-5"/>

      <lb n="3.10" type="line" corresp="#zone-3-10"/>
    </egXML>
  </exemplum>
</elementSpec>

Display in Roma

These examples are automatically included in the HTML documentation generated by the Roma tool. Having examples:

- Clarifies how elements are used
- Shows actual attribute values
- Makes the schema easier for its users to implement

Japanese Language Support

Multilingual Descriptions

Descriptions can be provided in both Japanese and English within the ODD file:

<elementSpec ident="TEI" mode="change">
  <desc xml:lang="ja">NDL古典籍OCR TEIドキュメントのルート要素</desc>
  <desc xml:lang="en">Root element for NDL Koten OCR TEI documents</desc>
</elementSpec>

Setting the Document Language

To have Roma use Japanese for the generated documentation, set docLang on schemaSpec:

<schemaSpec ident="ndl_koten_ocr" start="TEI" prefix="tei_" docLang="ja">

Sample Output

An example of TEI XML generated from this ODD:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>OCR処理結果</title>
        <respStmt>
          <resp>Automated Transcription</resp>
          <name ref="https://github.com/ndl-lab/ndlkotenocr-lite">
            NDL古典籍OCR-Liteアプリケーション
          </name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <p>Converted from IIIF Manifest</p>
      </publicationStmt>
      <sourceDesc>
        <p>https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <pb n="1" facs="https://example.com/image1.jpg"/>
        <lb n="1.1" type="line" corresp="#zone-1-1"/>
        いつれの御時により女御更花あまたさふらひ給ける
        <lb n="1.2" type="line" corresp="#zone-1-2"/>
        中に。いとやんことなきはにはあらぬか。すくれ
      </p>
    </body>
  </text>
  <facsimile sameAs="https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json">
    <surface sameAs="https://example.com/image1.jpg" ulx="0" uly="0" lrx="563" lry="790">
      <graphic url="https://example.com/image1.jpg" width="563px" height="790px"/>
      <zone xml:id="zone-1-1" ulx="453" uly="55" lrx="492" lry="744"/>
      <zone xml:id="zone-1-2" ulx="412" uly="56" lrx="447" lry="733"/>
    </surface>
  </facsimile>
</TEI>
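A consumer of this output typically needs to pair each line of text with its bounding box by following corresp from lb to the matching zone. The following sketch does so with Python's standard xml.etree.ElementTree on a shortened copy of the sample. The function name lines_with_boxes is hypothetical, and the sketch assumes that the text of a line directly follows its lb element.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the sample output (header omitted, text abbreviated).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <lb n="1.1" type="line" corresp="#zone-1-1"/>いつれの御時により
    <lb n="1.2" type="line" corresp="#zone-1-2"/>中に。
  </p></body></text>
  <facsimile>
    <surface>
      <zone xml:id="zone-1-1" ulx="453" uly="55" lrx="492" lry="744"/>
      <zone xml:id="zone-1-2" ulx="412" uly="56" lrx="447" lry="733"/>
    </surface>
  </facsimile>
</TEI>"""


def lines_with_boxes(tei_xml: str) -> list:
    """Return (line number, text, (ulx, uly, lrx, lry)) for every lb with a zone."""
    root = ET.fromstring(tei_xml)
    # Index zones by xml:id so that corresp="#id" can be resolved directly.
    zones = {z.get(XML_ID): z for z in root.iterfind(".//tei:zone", NS)}
    result = []
    for lb in root.iterfind(".//tei:lb", NS):
        zone = zones.get((lb.get("corresp") or "").lstrip("#"))
        if zone is None:
            continue  # lb without a resolvable zone is skipped
        box = tuple(int(zone.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
        # lb is an empty element, so the line text is stored in its tail.
        result.append((lb.get("n"), (lb.tail or "").strip(), box))
    return result


if __name__ == "__main__":
    for n, text, box in lines_with_boxes(SAMPLE):
        print(n, text, box)
```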

Using the Roma Tool

Loading the ODD File

- Access Roma
- Upload your ODD file from “Upload ODD”
- Perform additional customizations as needed

Generating Schemas

Roma can generate schemas in the following formats:

- RelaxNG Schema
- W3C Schema (XSD)
- DTD
- Schematron

Generating HTML Documentation

From the “Documentation” tab in Roma, you can generate HTML documentation. With the minimal configuration, only the elements and attributes actually used are documented.

Troubleshooting

Common Issues and Solutions

Error with TEI elements inside egXML
- Problem: The <TEI> element causes an error inside <egXML>
- Solution: Use namespace prefixes or simplify the example
mode="keep" is invalid
- Problem: mode="keep" is not recognized in attDef
- Solution: Use mode="change" instead
Too many unnecessary classes
- Problem: Using standard modules includes unnecessary classes
- Solution: Use mode="add" to define only what is needed

Summary

There are multiple approaches to TEI ODD customization:

- Using standard modules: Easy but includes many unnecessary elements
- Deletion approach: Remove unnecessary items from the standard
- Addition approach: Explicitly add only what is needed (recommended)

It is important to choose the appropriate method based on project requirements. For NDL Classical Book OCR, the minimal definition approach achieved a clear and manageable schema.

License

This ODD file is provided under the Creative Commons Attribution 4.0 International License.
