Introduction

In the OpenITI (Open Islamicate Texts Initiative) project, which handles historical texts from the Islamicate world, texts can be tagged using a lightweight notation called mARkdown instead of TEI/XML.

While TEI/XML is a powerful international standard for structuring texts, it is awkward to use with right-to-left (RTL) languages such as Arabic: interleaving left-to-right XML tags with RTL text causes display problems in editors. mARkdown was designed to avoid this problem.

In this article, we will try running oitei, a Python tool that automatically converts mARkdown texts to TEI XML.

What is oitei?

  • A Python library for converting OpenITI mARkdown to TEI XML
  • Outputs XML conforming to the OpenITI TEI Schema
  • Published on PyPI and installable via pip install
  • Dependencies: oimdp (mARkdown parser), lxml

https://github.com/OpenITI/oitei

Installation

pip install oitei

Python 3.8 or later is required. oimdp (OpenITI mARkdown Parser) and lxml are automatically installed as dependencies.

OpenITI mARkdown Notation

mARkdown files consist of three parts:

  1. Magic value (line 1): ######OpenITI#
  2. Metadata: Lines starting with #META#
  3. Body text: Written after #META#Header#End#
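The three-part layout above can be checked with a few lines of plain Python. The following is a minimal sketch, not the oimdp parser: `split_markdown` is a hypothetical helper, and it assumes the magic value sits alone on line 1 and that the header ends at the first `#META#Header#End#` line. The real parser may be more lenient.

```python
MAGIC = "######OpenITI#"
HEADER_END = "#META#Header#End#"


def split_markdown(text):
    """Split a mARkdown document into (magic, metadata_lines, body).

    Hypothetical helper for illustration; it is not part of oitei or oimdp.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != MAGIC:
        raise ValueError("missing magic value on line 1")

    metadata = []
    for index, line in enumerate(lines[1:], start=1):
        stripped = line.strip()
        if stripped == HEADER_END:
            # Everything after the end-of-header marker is body text.
            body = "\n".join(lines[index + 1:]).strip("\n")
            return lines[0].strip(), metadata, body
        if stripped.startswith("#META#"):
            # Keep the metadata payload without the #META# prefix.
            metadata.append(stripped[len("#META#"):].strip())
    raise ValueError("missing #META#Header#End# line")
```

Running a check like this before conversion gives a clearer error message than a failed conversion when a file lacks the magic value or the end-of-header line.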

Main Tags

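Notation        Meaning
### |           Chapter heading (level 1)
### ||          Section heading (level 2)
### $           Biographical entry
#               Start of paragraph
@P02 name       Person name (no prefix, spans the following 2 words)
@T11 place      Place name (1-character prefix, spans the following 1 word)
@YB732          Birth year (Hijri year 732)
@YD808          Death year (Hijri year 808)
%~%             Hemistich (half-line of verse) separator

The two digits after a named entity tag (@P, @T, etc.) have separate roles: the first is the length of the prefix, that is, the number of leading characters (such as an attached Arabic particle) that do not belong to the name, and the second is the number of subsequent words to include in the name. For example, @P02 Ibn Khaldun means “no prefix; treat the following 2 words (Ibn Khaldun) as a person name.” With @T11 Tunis, the first character is treated as a prefix, which is why the converted output later in this article reads T<placeName>unis </placeName>. For names without a prefix, such as Latin-script or Japanese names, use 0 as the first digit (@T01).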
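To make the two digits concrete, here is a small sketch that reads the first digit as a prefix length and the second as a word count. This reading is an assumption, but it matches the converted output shown later in this article, where @T11 Tunis becomes T<placeName>unis </placeName>. `read_entity` is a hypothetical function for illustration and is not how oimdp is implemented.

```python
import re

# "@" + entity letters + exactly two digits, e.g. @P02 or @T11.
# Year tags such as @YB732 carry three digits and do not match.
ENTITY_TAG = re.compile(r"@([A-Z]+)(\d)(\d)")


def read_entity(tokens, index):
    """Interpret the tag at tokens[index]; return (kind, prefix, name).

    Assumption: digit 1 = number of leading characters outside the name,
    digit 2 = number of following words covered by the tag.
    """
    match = ENTITY_TAG.fullmatch(tokens[index])
    if match is None:
        raise ValueError(f"not a named entity tag: {tokens[index]!r}")
    kind = match.group(1)
    prefix_len = int(match.group(2))
    word_count = int(match.group(3))
    words = " ".join(tokens[index + 1:index + 1 + word_count])
    return kind, words[:prefix_len], words[prefix_len:]


tokens = "@P02 Ibn Khaldun was born in @T11 Tunis".split()
print(read_entity(tokens, 0))  # ('P', '', 'Ibn Khaldun')
print(read_entity(tokens, 6))  # ('T', 'T', 'unis')
```

Under this reading, @T01 Tunis would keep the whole word inside the name.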

Creating a Sample File

Create a file named sample_markdown.md with the following content.

######OpenITI#

#META# Title: Sample Text for OpenITI mARkdown Demo
#META# Author: Demo Author
#META# Language: Arabic
#META#Header#End#

### | Chapter One: Introduction

# This is the first paragraph of the sample text. It demonstrates how OpenITI mARkdown works for tagging biographical and geographical information.

### $ Ibn Khaldun

# @P02 Ibn Khaldun was born in @T11 Tunis in @YB732 and died in @YD808 in @T11 Cairo . He wrote the Muqaddimah.

### || Section on Geography

# The city of @T11 Damascus was an important center of learning. Many scholars traveled from @T11 Baghdad to @T11 Damascus .

### $ Abu Bakr al-Razi

# @P03 Abu Bakr al-Razi was a renowned physician. He was born in @T11 Rayy and later moved to @T11 Baghdad . He died in @YD313 .

### | Chapter Two: Poetry

# The following is a sample verse:

# %~% The morning light shines on the desert sand %~% And caravan bells echo through the land

# This concludes our sample text.

Running the Conversion

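import oitei

# Read the mARkdown source; state UTF-8 explicitly because the texts are non-ASCII
with open("sample_markdown.md", "r", encoding="utf-8") as reader:
    md = reader.read()

# Convert to TEI and serialize the result to an XML string
tei_string = oitei.convert(md).tostring()

with open("sample_tei.xml", "w", encoding="utf-8") as writer:
    writer.write(tei_string)

The conversion itself is a single call to oitei.convert(); the rest is file I/O.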
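When there are many files, the same steps can be wrapped in a loop. The sketch below is one possible arrangement: `convert_folder` is a hypothetical helper, and the converter is passed in as a callable so that the function does not depend on any particular library.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert):
    """Convert every *.md file in src_dir and write <name>.xml into dst_dir.

    `convert` is any callable that maps a mARkdown string to an XML string.
    Returns the list of written paths, sorted by file name.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(src_dir).glob("*.md")):
        markdown = source.read_text(encoding="utf-8")
        target = dst / (source.stem + ".xml")
        target.write_text(convert(markdown), encoding="utf-8")
        written.append(target)
    return written
```

With oitei installed, the call used in this article can be supplied as the converter, for example convert_folder("texts", "tei", lambda md: oitei.convert(md).tostring()), where the folder names are placeholders.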

Conversion Results

The generated TEI XML is as follows.

<?xml version='1.0' encoding='UTF-8'?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title/>
        <author/>
      </titleStmt>
      <publicationStmt>
        <publisher>Open Islamicate Texts Initiative (OpenITI)</publisher>
        <availability>
          <p>Creative Commons Attribution Non Commercial Share Alike
             4.0 International</p>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl/>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <calendarDesc>
        <calendar xml:id="ah">
          <p>Anno Hegirae</p>
        </calendar>
      </calendarDesc>
    </profileDesc>
    <xenoData xml:space="preserve">
Title: Sample Text for OpenITI mARkdown Demo
Author: Demo Author
Language: Arabic
</xenoData>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Chapter One: Introduction</head>
        <p>
          <lb/>This is the first paragraph of the sample text. ...
        </p>
        <div type="biography" subtype="man">
          <head>Ibn Khaldun</head>
          <p>
            <lb/>
            <persName>Ibn Khaldun </persName> was born in
            T<placeName>unis </placeName> in
            <date type="birth" calendar="#ah" when-custom="732"/>
            and died in
            <date type="death" calendar="#ah" when-custom="808"/>
            in C<placeName>airo </placeName> .
            He wrote the Muqaddimah.
          </p>
        </div>
        <!-- ... abbreviated below ... -->
      </div>
    </body>
  </text>
</TEI>

Key Conversion Points

Each mARkdown tag is converted to a TEI element as follows.

mARkdown                 TEI XML
### | chapter heading    <div> containing <head>
### $ biography          <div type="biography" subtype="man">
@P02 Ibn Khaldun         <persName>Ibn Khaldun</persName>
@T11 Tunis               T<placeName>unis</placeName> (the leading T falls outside the element)
@YB732                   <date type="birth" calendar="#ah" when-custom="732"/>
@YD808                   <date type="death" calendar="#ah" when-custom="808"/>
%~% verse separator      <caesura/>
#META# metadata          <xenoData>

Hijri years are emitted with calendar="#ah", which points to the <calendar xml:id="ah"> declaration in the teiHeader and makes the calendar system explicit.
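Once the text is in TEI, the tagged entities can be read back with any XML library. The following sketch uses the standard library's ElementTree on a shortened copy of the output above; `list_entities` is a hypothetical helper, and the snippet is abbreviated for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened copy of the biography paragraph from the generated output.
SNIPPET = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
<p><persName>Ibn Khaldun </persName> was born in
<date type="birth" calendar="#ah" when-custom="732"/> and died in
<date type="death" calendar="#ah" when-custom="808"/></p>
</div></body></text></TEI>"""


def list_entities(tei_xml):
    """Return (person_names, dates) found anywhere in a TEI string."""
    root = ET.fromstring(tei_xml)
    persons = [
        (element.text or "").strip()
        for element in root.iterfind(".//tei:persName", TEI_NS)
    ]
    dates = [
        (element.get("type"), element.get("calendar"), int(element.get("when-custom")))
        for element in root.iterfind(".//tei:date", TEI_NS)
    ]
    return persons, dates


persons, dates = list_entities(SNIPPET)
print(persons)  # ['Ibn Khaldun']
print(dates)    # [('birth', '#ah', 732), ('death', '#ah', 808)]
```

Note that every element lives in the TEI namespace, so the queries need the tei: prefix mapping.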

Applying to Japanese Text

Although oitei is designed for Islamicate texts, let’s try applying it to Japanese text.

Note: The Space-Delimited Problem

Named entity tags in mARkdown (@P, @T, etc.) capture the following N space-delimited words as the name. Since Japanese does not separate words with spaces, a workaround is needed.

  • @P02 Ibn Khaldun -> following 2 words = “Ibn Khaldun” (works for English and Arabic)
  • @P02 源 頼朝 -> following 2 words = “源” “頼朝”; this yields <persName>源 頼朝</persName>, but the space inside the name is unnatural in Japanese

Workaround: Write each Japanese name as a single word without internal spaces and use @P01 (following 1 word).

@P01 源頼朝    → <persName>源頼朝</persName> ✅
@T01 鎌倉      → <placeName>鎌倉</placeName> ✅
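If a text has already been tagged with multi-word names, the workaround can be applied mechanically. The sketch below is a hypothetical pre-processing helper, `join_name_words`. It assumes tags and words are separated by single ASCII spaces, handles only @P and @T, and leaves the first digit of the tag unchanged.

```python
import re

# @P or @T followed by exactly two digits.
NAME_TAG = re.compile(r"@([PT])(\d)(\d)")


def join_name_words(line):
    """Merge the words covered by each @P/@T tag into one word.

    "@P02 源 頼朝" becomes "@P01 源頼朝"; tags that already cover a
    single word are left unchanged.
    """
    tokens = line.split(" ")
    result = []
    index = 0
    while index < len(tokens):
        match = NAME_TAG.fullmatch(tokens[index])
        count = int(match.group(3)) if match else 0
        if match and count >= 1:
            covered = tokens[index + 1:index + 1 + count]
            # Rewrite the tag so that it covers exactly one word.
            result.append(f"@{match.group(1)}{match.group(2)}1")
            result.append("".join(covered))
            index += 1 + count
        else:
            result.append(tokens[index])
            index += 1
    return " ".join(result)


print(join_name_words("# @P02 源 頼朝 は @T01 鎌倉 に幕府を開いた。"))
# prints: # @P01 源頼朝 は @T01 鎌倉 に幕府を開いた。
```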

Japanese Sample

Let’s try a simple example with fictional people and places.

######OpenITI#

#META# Title: 日本語テキストサンプル
#META# Author: デモ著者
#META# Language: Japanese
#META#Header#End#

### | 第一章 人物紹介

# @P01 太郎 は @T01 東京 に住んでいる。

# @P01 花子 は @T01 京都 で生まれ、現在は @T01 大阪 に住んでいる。

### | 第二章 詩

# 以下はサンプルの詩である。

# %~% 春の風が街を吹き抜け %~% 桜の花びらが舞い散る

# これでサンプルテキストは終わりである。

Japanese Conversion Results

<div>
  <head>第一章 人物紹介</head>
  <p>
    <lb/>
    <persName>太郎 </persName><placeName>東京 </placeName> に住んでいる。
  </p>
  <p>
    <lb/>
    <persName>花子 </persName><placeName>京都 </placeName> で生まれ、
    現在は <placeName>大阪 </placeName> に住んでいる。
  </p>
</div>
<div>
  <head>第二章 詩</head>
  <p>
    <lb/>以下はサンプルの詩である。
  </p>
  <lg>
    <l>
      <caesura/> 春の風が街を吹き抜け
      <caesura/> 桜の花びらが舞い散る
    </l>
  </lg>
</div>

Person names, place names, and verse lines are converted to TEI elements, and I confirmed that this XML also passes validation with tei_all.rng.

Since oitei is a tool designed for Islamicate texts, the following points require attention:

  • The teiHeader always declares the Hijri calendar (<calendar xml:id="ah">) by default
  • The publisher is fixed as “Open Islamicate Texts Initiative (OpenITI)”
  • With Japanese text, spaces are required around each named entity tag and the name it captures

For serious TEI encoding of Japanese texts, consider modifying the oitei output headers or using a different tool.
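As one way of modifying the output headers, the empty <title/> and <author/> elements can be filled from the metadata that oitei places in <xenoData>, and the publisher can be replaced. The following is a sketch under stated assumptions: `fill_header` is a hypothetical post-processing helper, it relies on the "Key: value" lines seen in the output above, and the sample header is abbreviated.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}
# Serialize TEI elements without an ns0: prefix.
ET.register_namespace("", TEI)

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>
<titleStmt><title/><author/></titleStmt>
<publicationStmt><publisher>Open Islamicate Texts Initiative (OpenITI)</publisher></publicationStmt>
</fileDesc>
<xenoData xml:space="preserve">
Title: 日本語テキストサンプル
Author: デモ著者
Language: Japanese
</xenoData></teiHeader></TEI>"""


def fill_header(tei_xml, publisher=None):
    """Copy Title/Author from <xenoData> into empty header elements."""
    root = ET.fromstring(tei_xml)

    # Collect "Key: value" lines from the xenoData block.
    metadata = {}
    xeno = root.find(".//tei:xenoData", NS)
    if xeno is not None and xeno.text:
        for line in xeno.text.splitlines():
            key, separator, value = line.partition(":")
            if separator:
                metadata[key.strip()] = value.strip()

    # Fill only elements that are currently empty.
    for tag, key in (("title", "Title"), ("author", "Author")):
        element = root.find(f".//tei:titleStmt/tei:{tag}", NS)
        if element is not None and not (element.text or "").strip() and key in metadata:
            element.text = metadata[key]

    if publisher is not None:
        element = root.find(".//tei:publicationStmt/tei:publisher", NS)
        if element is not None:
            element.text = publisher

    return ET.tostring(root, encoding="unicode")
```

Calling fill_header(SAMPLE, publisher="Demo Publisher") returns the header with the title and author filled in and the publisher replaced; "Demo Publisher" is a placeholder value.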

TEI Schema Validation

Let’s verify whether the generated XML conforms to the TEI standard.

# Download the TEI All schema
curl -sL -o tei_all.rng \
  https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng

# Validate with xmllint (--noout suppresses echoing the document itself)
xmllint --noout --relaxng tei_all.rng sample_tei.xml
# Output: sample_tei.xml validates

I confirmed that the output passes validation with the official TEI RelaxNG schema (tei_all.rng).
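The same check can be run from Python with lxml, which is already installed as a dependency of oitei. This is a minimal sketch: `validate_with_relaxng` is a hypothetical wrapper around lxml's RelaxNG validator, and the file names in the usage note are the ones used in this article.

```python
from lxml import etree


def validate_with_relaxng(xml_path, rng_path):
    """Return (is_valid, messages) for xml_path checked against rng_path."""
    schema = etree.RelaxNG(etree.parse(rng_path))
    document = etree.parse(xml_path)
    is_valid = schema.validate(document)
    # The error log lists each violation with its line number.
    messages = [f"line {entry.line}: {entry.message}" for entry in schema.error_log]
    return is_valid, messages
```

For example, validate_with_relaxng("sample_tei.xml", "tei_all.rng") returns a boolean together with any validation messages, which is convenient when validation is one step of a larger script.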

I also attempted validation with the OpenITI custom schema (tei_openiti.rng), but compiling the 434KB schema took an extremely long time and did not complete in my local environment. Since tei_all is a superset of tei_openiti, passing tei_all validation confirms that the output is valid TEI, but it does not by itself guarantee conformance to the stricter OpenITI schema.

Summary

  • Using oitei, OpenITI mARkdown can be converted to TEI XML in just a few lines of Python
  • There is no need to hand-write XML tags, which avoids the editor display problems that arise with RTL languages
  • The generated XML passes validation against the standard TEI schema (tei_all.rng)
  • Elements such as <persName>, <placeName>, and <date> are generated automatically from the mARkdown tags

Writing in a lightweight notation and converting to TEI XML when needed is a particularly useful workflow for researchers working with historical texts from the Islamicate world.

References