Overview

I recently needed to convert Word files to TEI/XML. While investigating, I found that, in addition to official TEI tools such as TEIGarage Conversion, TEI Publisher provides a conversion example:

https://teipublisher.com/exist/apps/tei-publisher/test/test.docx.xml

The example above appears to map Word style information to TEI tags, so I tried the same approach. For this project I used the python-docx library, so that the conversion can run independently of TEI Publisher.

Word File

I created a prototype Word file like the one below. The styles are all provisional: I defined styles such as “tei:persName” and “tei:warichu” and gave each a distinct appearance, such as a text color. Applying these styles to the text provides simple structural markup.
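
To check which of these styles a given Word file defines, the style names can be listed with python-docx. The following is a minimal sketch, not the script used for this project: the helper names `tei_label` and `list_tei_styles` are hypothetical, and it assumes that every structural style carries the “tei:” prefix, which may not hold for all styles in the prototype.

```python
TEI_PREFIX = "tei:"  # assumed naming convention for the custom Word styles


def tei_label(style_name):
    """Return the part of a style name after the "tei:" prefix, or None."""
    if style_name and style_name.startswith(TEI_PREFIX):
        return style_name[len(TEI_PREFIX):]
    return None


def list_tei_styles(path):
    """List (style name, label) pairs for the prefixed styles defined in a .docx file."""
    # python-docx is imported here so that tei_label() has no dependency on it.
    from docx import Document

    doc = Document(path)
    return [(s.name, tei_label(s.name)) for s in doc.styles if tei_label(s.name)]
```

Calling `list_tei_styles("sample.docx")` (a hypothetical file name) returns one pair per prefixed style, which is a quick way to confirm that a template contains the expected styles before converting it.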

Conversion to TEI/XML

I wrote a script that takes the Word file above as input and converts it to TEI/XML, relying primarily on style information. I plan to publish it later, as a pip-installable package or similar.
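
The script itself is not shown in this post. As a rough illustration of style-driven conversion, the sketch below reads paragraphs and runs with python-docx and wraps each run according to its style name, using Python's standard xml.etree.ElementTree. The two mapping tables and the function names are assumptions modelled on the sample output shown in this post; the actual script may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: Word character-style name -> (TEI tag, attributes).
CHAR_STYLE_MAP = {
    "tei:persName": ("persName", {}),
    "tei:warichu": ("note", {"type": "割書"}),
    "red": ("seg", {"type": "red"}),
}
# Hypothetical mapping: Word paragraph-style name -> @type of the wrapping <seg>.
PARA_STYLE_MAP = {"dateLine": "dateline", "personLine": "personline"}


def paragraph_to_seg(para_style, runs):
    """Build one <seg> from a paragraph style name and (run_style, text) pairs."""
    attrs = {}
    if para_style in PARA_STYLE_MAP:
        attrs["type"] = PARA_STYLE_MAP[para_style]
    seg = ET.Element("seg", attrs)
    last = None  # most recently appended child element, if any
    for run_style, text in runs:
        if run_style in CHAR_STYLE_MAP:
            tag, tag_attrs = CHAR_STYLE_MAP[run_style]
            last = ET.SubElement(seg, tag, dict(tag_attrs))
            last.text = text
        elif last is None:
            # Unstyled text before any child element belongs to seg.text.
            seg.text = (seg.text or "") + text
        else:
            # Unstyled text after a child element belongs to that child's tail.
            last.tail = (last.tail or "") + text
    return seg


def docx_to_segs(path):
    """Convert every paragraph of a .docx file to a <seg> element."""
    # python-docx is imported here so paragraph_to_seg() can be used without it.
    from docx import Document

    doc = Document(path)
    segs = []
    for p in doc.paragraphs:
        # Runs without a character style report the document default,
        # typically "Default Paragraph Font", which is not in the map.
        runs = [(r.style.name, r.text) for r in p.runs]
        segs.append(paragraph_to_seg(p.style.name, runs))
    return segs
```

For example, `paragraph_to_seg("Normal", [("tei:persName", "中村覚")])` yields a `<seg>` containing a single `<persName>` element. Ruby, `<lb/>` and the line breaks inside warichu are left out of this sketch, and adjacent runs that share a style are not merged.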

An example of the converted TEI/XML is below. There is still much room for improvement, but I was able to convert it into a valid TEI/XML file.

    <lb/>
    <seg>
     ワードの入力サンプル
    </seg>
    <lb/>
    <lb/>
    <seg type="dateline">
     日付の行にスタイル「dateLine」を使用してください。先頭に2文字の空白が入ります。
    </seg>
    <lb/>
    <seg type="personline">
     名前の行にスタイル「personLine」を使用してください。末尾に2文字の空白が入ります。
    </seg>
    <lb/>
    <seg>
     <ruby>
      <rb>
       中村
      </rb>
      <rt>
       なかむら
      </rt>
      <rt place="left">
       さとる
      </rt>
     </ruby>
     の形で両側ルビを記述します。緑色が左ルビです。
    </seg>
    <lb/>
    <lb/>
    <seg>
     <seg type="red">
      朱書
     </seg>     はスタイル「
     <seg type="red">
      red
     </seg>     」を使用してください。
    </seg>
    <lb/>
    <lb/>
    <seg>
     文字のサイズについては検討中です。
    </seg>
    <lb/>
    <lb/>
    <seg>
     <persName>
      中村覚
     </persName>     のような人名には、スタイル「
     <persName>
      persName
     </persName>     」を使用してください。
    </seg>
    <lb/>
    <lb/>
    <seg>
     割注は
     <note type="割書">
      あああああ
      <milestone unit="wbr"/>
      いいい
     </note>
     のように入力してください。正しく改行されるまで、全角スペースを入力してください。
     <note type="割書">
      こんな
      <milestone unit="wbr"/>
      スタイル
     </note>
     もあります。「こんな」の後に全角スペースを入れています。
    </seg>
    <lb/>
    <lb/>
    <seg type="dateline">
     二〇二三年一月十七日
    </seg>
    <lb/>
    <seg type="personline">
     作成:中村覚
    </seg>
    <lb/>

Below is the result displayed in a TEI/XML viewer that I am developing separately. Rendering for elements such as <rt place="left"> and for red text is not yet implemented, but person names and interlinear notes (warichu) are displayed as intended.

Summary

Complex structures may be difficult to handle, but I believe that converting text written in Word to TEI/XML in a form reasonably close to what the author intended could help lower the barrier to adopting TEI/XML. I plan to continue experimenting with this approach.