Overview
The TEI (Text Encoding Initiative) Guidelines are an international standard for encoding and sharing texts in humanities research. This article walks through customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application.
ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the elements and attributes you need.
Background: Developing the NDL Classical Book OCR-Lite Application
We are developing an application that outputs the results of NDL Classical Book OCR-Lite in TEI/XML format. The application is designed to perform OCR processing on Japanese classical books and output the results in standard TEI format.
We decided to include the following information in the output TEI XML:
- Text information: Character strings recognized by OCR
- Layout information: Coordinate information (bounding boxes) for each line
- Image references: IIIF (International Image Interoperability Framework) compatible image URLs
- Metadata: Document title, processing information, etc.
We wrote the schema used by this application in ODD. The following describes the customization process.
Customization Approaches
1. Initial Approach: Using Standard Modules
Initially, we created the ODD using TEI’s standard modules:
<schemaSpec ident="ndl_koten_ocr" start="TEI" prefix="tei_">
  <moduleRef key="tei"/>
  <moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
  <moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <moduleRef key="transcr" include="facsimile surface zone"/>
</schemaSpec>
Importance of the include Attribute
The include attribute of the moduleRef element is an important feature that selectively includes only specific elements from a module:
<moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
Benefits of using the include attribute:
- You can explicitly specify only the elements you need
- The generated schema is smaller than when the entire module is included
- It is clear which elements are being used
Without the include attribute:
<moduleRef key="header"/>
In this case, all elements from the header module (encodingDesc, profileDesc, revisionDesc, etc.) would be included.
Specifying multiple elements:
<moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>
Exclusion using the except attribute:
<moduleRef key="core" except="hi del add note"/>
The except attribute is the opposite of include: it excludes specific elements from a module. It is useful when most elements are needed and only a few are unnecessary.
Criteria for choosing between include and except:
- When few elements are needed: use include
- When few elements are unnecessary: use except
- When clarity is important: use include (makes it clear what is being used)
However, even with this approach, related model classes and attribute classes are automatically included, making it impossible to fully minimize the schema.
2. Improved Approach: Deleting Unnecessary Elements
Next, we explicitly deleted unnecessary classes:
<classSpec ident="model.emphLike" type="model" mode="delete"/>
<classSpec ident="model.highlighted" type="model" mode="delete"/>
<classSpec ident="att.datable" type="atts" mode="delete"/>
<classSpec ident="att.editLike" type="atts" mode="delete"/>
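The same mode="delete" mechanism also works at the element level. The following is a minimal sketch, assuming the core module has been pulled in whole and that note and hi are not needed; both element choices are illustrative and are not taken from the project's ODD.

```xml
<!-- Hypothetical example: drop individual elements from a module that was included whole -->
<moduleRef key="core"/>
<elementSpec ident="note" module="core" mode="delete"/>
<elementSpec ident="hi" module="core" mode="delete"/>
```

The effect is comparable to filtering those elements out on the moduleRef itself; separate elementSpec entries can be convenient when each deletion needs its own comment.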
3. Final Approach: Minimal Definition
Ultimately, we adopted an approach of explicitly defining only the necessary elements and attributes:
<schemaSpec ident="ndl_koten_ocr_minimal" start="TEI" prefix="tei_" docLang="ja">
  <classSpec ident="att.global" type="atts" mode="add">
    <attList>
      <attDef ident="xml:id" mode="add">
        <desc xml:lang="ja">一意識別子</desc> <!-- "Unique identifier" -->
        <!-- ID is a W3C Schema datatype, so it is referenced with name rather than key -->
        <datatype><dataRef name="ID"/></datatype>
      </attDef>
      <attDef ident="xml:lang" mode="add">
        <desc xml:lang="ja">言語コード</desc> <!-- "Language code" -->
        <datatype><dataRef key="teidata.language"/></datatype>
      </attDef>
    </attList>
  </classSpec>
</schemaSpec>
Implementation Details
Managing Coordinate Information
To manage OCR coordinate information, we defined a dedicated attribute class:
<classSpec ident="att.coordinated" type="atts" mode="add">
  <desc xml:lang="ja">座標属性</desc> <!-- "Coordinate attributes" -->
  <attList>
    <attDef ident="ulx" mode="add">
      <desc xml:lang="ja">左上X座標</desc> <!-- "Upper-left X coordinate" -->
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="uly" mode="add">
      <desc xml:lang="ja">左上Y座標</desc> <!-- "Upper-left Y coordinate" -->
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lrx" mode="add">
      <desc xml:lang="ja">右下X座標</desc> <!-- "Lower-right X coordinate" -->
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lry" mode="add">
      <desc xml:lang="ja">右下Y座標</desc> <!-- "Lower-right Y coordinate" -->
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
  </attList>
</classSpec>
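Declaring the class does not by itself put the attributes on any element; an element receives them by being declared a member of the class. The following is a minimal sketch, assuming zone is defined from scratch as an empty element. The article does not show the project's actual zone definition, so the desc text and the content model here are assumptions.

```xml
<!-- Hypothetical definition: zone joins att.global and att.coordinated -->
<elementSpec ident="zone" mode="add">
  <desc xml:lang="en">A rectangular region on a surface, such as one OCR line</desc>
  <classes>
    <memberOf key="att.global"/>
    <memberOf key="att.coordinated"/>
  </classes>
  <content>
    <empty/>
  </content>
</elementSpec>
```

With this membership, a zone carrying xml:id, ulx, uly, lrx, and lry can validate without those attributes being redeclared on the element itself.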
IIIF Integration
To link with IIIF manifests, we added the sameAs attribute:
<elementSpec ident="facsimile" mode="add">
  <desc xml:lang="ja">ファクシミリ</desc> <!-- "Facsimile" -->
  <attList>
    <attDef ident="sameAs" mode="add">
      <desc xml:lang="ja">IIIFマニフェストURL</desc> <!-- "IIIF manifest URL" -->
      <datatype><dataRef key="teidata.pointer"/></datatype>
    </attDef>
  </attList>
</elementSpec>
Line Number Format Constraints
We used Schematron to constrain the line number format:
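<!-- matches() is an XPath 2.0 function, so the rule needs an XSLT 2.0 Schematron query binding -->
<constraintSpec ident="page-numbering" scheme="schematron"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <constraint>
    <sch:rule context="tei:lb[@n]">
      <sch:assert test="matches(@n, '^\d+\.\d+$')">
        Line numbers must follow this format: page.line (e.g., 1.1, 2.3)
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>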
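To make the pattern concrete, the following fragment lists n values that the assertion above would accept or reject. The values are illustrative and are not taken from real OCR output.

```xml
<!-- Accepted: one or more digits, a period, one or more digits -->
<lb n="1.1"/>
<lb n="12.30"/>

<!-- Rejected by the assertion -->
<lb n="1"/>      <!-- no line part -->
<lb n="1."/>     <!-- no digits after the period -->
<lb n="1-1"/>    <!-- wrong separator -->
<lb n="p1.1"/>   <!-- non-digit prefix -->

<!-- Not checked: the rule context is tei:lb[@n], so an lb without n never fires the rule -->
<lb/>
```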
Writing Examples
Basic Structure of exemplum and egXML
In ODD, you use the exemplum and egXML elements to describe usage examples:
<elementSpec ident="pb" mode="change">
  <desc xml:lang="ja">ページ区切り</desc> <!-- "Page break" -->
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <pb n="1" facs="https://catalog.lib.kyushu-u.ac.jp/image/iiif/1/820/411193/368828.tiff/full/max/0/default.jpg"/>
    </egXML>
  </exemplum>
</elementSpec>
Writing Complex Examples
When showing examples containing multiple elements:
<elementSpec ident="teiHeader" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>OCR処理結果</title>
            <respStmt>
              <resp>Automated Transcription</resp>
              <name ref="https://github.com/ndl-lab/ndlkotenocr-lite">
                NDL古典籍OCR-Liteアプリケーション
              </name>
            </respStmt>
          </titleStmt>
          <publicationStmt>
            <p>Converted from IIIF Manifest</p>
          </publicationStmt>
          <sourceDesc>
            <p>https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
    </egXML>
  </exemplum>
</elementSpec>
Namespace Issues and Solutions
Problem: TEI Elements Not Recognized
When using TEI elements (especially root elements) inside egXML, namespace issues can occur:
<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>...</teiHeader>
    </TEI>
  </egXML>
</exemplum>
Solution 1: Use Namespace Prefixes
<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples" xmlns:tei="http://www.tei-c.org/ns/1.0">
    <tei:TEI>
      <tei:teiHeader>...</tei:teiHeader>
      <tei:text>...</tei:text>
    </tei:TEI>
  </egXML>
</exemplum>
Solution 2: Simplify with Comments
<exemplum>
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <!-- The document structure is described in a comment instead of nested TEI markup -->
  </egXML>
</exemplum>
Solution 3: Omit the Example
To avoid validation errors, completely omitting problematic examples is also an option.
Providing Language-Specific Examples
When providing multilingual examples:
<elementSpec ident="zone" mode="change">
  <exemplum xml:lang="ja">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <zone xml:id="zone-1-1" ulx="453" uly="55" lrx="492" lry="744"/>
    </egXML>
  </exemplum>
  <exemplum xml:lang="en">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <zone xml:id="zone-1-1" ulx="100" uly="100" lrx="500" lry="200"/>
    </egXML>
  </exemplum>
</elementSpec>
Demonstrating Attribute Usage
When showing various attribute values:
<elementSpec ident="lb" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <lb n="1.1" type="line" corresp="#zone-1-1"/>
      <lb n="2.5" type="line" corresp="#zone-2-5"/>
      <lb n="3.10" type="line" corresp="#zone-3-10"/>
    </egXML>
  </exemplum>
</elementSpec>
Display in Roma
These examples are automatically included in the HTML documentation generated by the Roma tool. Having examples:
- Clarifies how elements are used
- Shows actual attribute values
- Makes it easier for schema users to implement
Japanese Language Support
Multilingual Descriptions
Descriptions can be provided in both Japanese and English within the ODD file:
<elementSpec ident="TEI" mode="change">
  <desc xml:lang="ja">NDL古典籍OCR TEIドキュメントのルート要素</desc>
  <desc xml:lang="en">Root element for NDL Koten OCR TEI documents</desc>
</elementSpec>
Setting the Document Language
To have the Japanese descriptions used in the documentation generated by the Roma tool, set docLang on schemaSpec:
<schemaSpec ident="ndl_koten_ocr" start="TEI" prefix="tei_" docLang="ja">
Sample Output
An example of TEI XML generated from this ODD:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>OCR処理結果</title>
        <respStmt>
          <resp>Automated Transcription</resp>
          <name ref="https://github.com/ndl-lab/ndlkotenocr-lite">
            NDL古典籍OCR-Liteアプリケーション
          </name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <p>Converted from IIIF Manifest</p>
      </publicationStmt>
      <sourceDesc>
        <p>https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <pb n="1" facs="https://example.com/image1.jpg"/>
        <lb n="1.1" type="line" corresp="#zone-1-1"/>
        いつれの御時により女御更花あまたさふらひ給ける
        <lb n="1.2" type="line" corresp="#zone-1-2"/>
        中に。いとやんことなきはにはあらぬか。すくれ
      </p>
    </body>
  </text>
  <facsimile sameAs="https://catalog.lib.kyushu-u.ac.jp/image/manifest/1/820/411193.json">
    <surface sameAs="https://example.com/image1.jpg" ulx="0" uly="0" lrx="563" lry="790">
      <graphic url="https://example.com/image1.jpg" width="563px" height="790px"/>
      <zone xml:id="zone-1-1" ulx="453" uly="55" lrx="492" lry="744"/>
      <zone xml:id="zone-1-2" ulx="412" uly="56" lrx="447" lry="733"/>
    </surface>
  </facsimile>
</TEI>
Using the Roma Tool
Loading the ODD File
- Access Roma
- Upload your ODD file from “Upload ODD”
- Perform additional customizations as needed
Generating Schemas
Roma can generate schemas in the following formats:
- RelaxNG Schema
- W3C Schema (XSD)
- DTD
- Schematron
Generating HTML Documentation
From the “Documentation” tab in Roma, you can generate HTML documentation. With the minimal configuration, only the elements and attributes actually used are documented.
Troubleshooting
Common Issues and Solutions
Error with TEI elements inside egXML
- Problem: The <TEI> element causes an error inside <egXML>
- Solution: Use namespace prefixes or simplify the example
mode="keep" is invalid
- Problem: mode="keep" is not recognized in attDef
- Solution: Use mode="change" instead
Too many unnecessary classes
- Problem: Using standard modules includes unnecessary classes
- Solution: Use mode="add" to define only what is needed
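For the mode="keep" item above, the following is a minimal sketch of the working form, assuming lb comes from the core module and the goal is to adjust an existing attribute rather than redefine it. Making n required is an illustrative choice, not something stated in the project's ODD.

```xml
<!-- Hypothetical example: mode="change" modifies an existing attribute declaration -->
<elementSpec ident="lb" module="core" mode="change">
  <attList>
    <attDef ident="n" mode="change" usage="req"/>
  </attList>
</elementSpec>
```

The values allowed for mode are add, delete, change, and replace, which is why keep is rejected.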
Summary
There are multiple approaches to TEI ODD customization:
- Using standard modules: Easy but includes many unnecessary elements
- Deletion approach: Remove unnecessary items from the standard
- Addition approach: Explicitly add only what is needed (recommended)
It is important to choose the appropriate method based on project requirements. For NDL Classical Book OCR, the minimal definition approach achieved a clear and manageable schema.
References
- TEI Guidelines
- ODD: One Document Does it all
- Roma: ODD customization tool
- NDL Classical Book OCR
- IIIF (International Image Interoperability Framework)
License
This ODD file is provided under the Creative Commons Attribution 4.0 International License.