Overview

I created a program that calculates the edit distance between witnesses in TEI/XML files containing <app> elements.

You can use it from the following Google Colab notebook:

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/編集距離を算出するプログラム.ipynb

Upload an XML file and the program will calculate the similarity between witnesses.
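Before any distance can be computed, each witness's text has to be reconstructed from the apparatus. The following is a minimal sketch of one way to do that, not the notebook's actual code. It assumes the file uses inline <app> elements whose <lem> and <rdg> children carry space-separated witness pointers in a wit attribute. The SAMPLE document, the witness IDs (#A, #B, #C), and the helper name witness_text are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature TEI document with three witnesses (#A, #B, #C).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    'あい<app><lem wit="#A #B">う</lem><rdg wit="#C">ゆ</rdg></app>えお'
    '</p></body></text></TEI>'
)


def witness_text(elem, wit_id):
    """Return the text under `elem` as read by the witness `wit_id`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI_NS + "app":
            # Inside <app>, keep only the readings attributed to this witness.
            for reading in child:
                is_reading = reading.tag in (TEI_NS + "lem", TEI_NS + "rdg")
                if is_reading and wit_id in reading.get("wit", "").split():
                    parts.append(witness_text(reading, wit_id))
        else:
            # Text outside <app> is shared by every witness.
            parts.append(witness_text(child, wit_id))
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


body = ET.fromstring(SAMPLE).find(f".//{TEI_NS}body")
texts = {wit: witness_text(body, wit) for wit in ("#A", "#B", "#C")}
print(texts)
```

Restricting the walk to the body element keeps header metadata out of the comparison. A real edition may list its witnesses in the header and use other encoding methods, so the traversal would need adjusting accordingly.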

Example

Let’s upload the following XML file:

https://tei-eaj.github.io/koui/data/nakamura.xml

The output is an Excel file like the following, which lists the edit distance and similarity ratio for each pair of witnesses.

| index | name1 | name2 | distance | ratio |
|---|---|---|---|---|
| 0 | 中村式五十音 | 中村式五十音又様 | 10 | 0.85 |
| 1 | 中村式五十音 | 中村式五十音欠損本 | 7 | 0.8947368421052632 |
| 2 | 中村式五十音又様 | 中村式五十音欠損本 | 8 | 0.868421052631579 |

The following library is used for calculating similarity:

https://pypi.org/project/python-Levenshtein/
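The library exposes Levenshtein.distance and Levenshtein.ratio, which presumably correspond to the distance and ratio columns above. To illustrate what the two values mean, here is a pure-Python sketch that does not depend on the library. It assumes, as I understand the library's behavior, that the ratio is (len1 + len2 - d) / (len1 + len2), where d is an edit distance in which a substitution costs 2. The function names and the sample strings are hypothetical stand-ins, not the notebook's code.

```python
from itertools import combinations


def edit_distance(a: str, b: str, sub_cost: int = 1) -> int:
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumed to mirror the library's ratio definition."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - edit_distance(a, b, sub_cost=2)) / total


# Hypothetical witness texts; in practice these come from the TEI/XML file.
witnesses = {"A": "kitten", "B": "sitting", "C": "kitchen"}

# One row per unordered pair of witnesses: (name1, name2, distance, ratio).
rows = [
    (n1, n2,
     edit_distance(witnesses[n1], witnesses[n2]),
     similarity_ratio(witnesses[n1], witnesses[n2]))
    for n1, n2 in combinations(witnesses, 2)
]
for row in rows:
    print(row)
```

Comparing every unordered pair gives n(n-1)/2 rows for n witnesses, which matches the three rows produced for the three witnesses in the example above.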

Summary

The text comparison method itself leaves room for further consideration, but I hope this serves as a useful example of quantitative comparison between witnesses.

Reference

The functionality has also been added to the “program for extracting differences between two texts” introduced below: