Overview
I created a program that calculates the edit distance between witnesses in TEI/XML files containing app elements.
You can use it from the following Google Colab notebook:
https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/編集距離を算出するプログラム.ipynb
Upload an XML file and the program will calculate the similarity between witnesses.
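To give a rough idea of what happens before the distance is calculated, here is a minimal sketch of rebuilding the text of each witness from app elements. It is not the notebook's actual code. It assumes that every reading is a `lem` or `rdg` element carrying a `wit` attribute with pointers such as `#A`, and that text outside `app` is shared by all witnesses. The function name `witness_text` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: two witnesses, A and B, that differ in one character.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>あい'
    '<app><lem wit="#A">う</lem><rdg wit="#B">ゆ</rdg></app>'
    'えお</p></body></text></TEI>'
)


def witness_text(node, wit_id):
    """Rebuild the text of one witness by walking the tree under node."""
    parts = [node.text or ""]
    for child in node:
        if child.tag == TEI + "app":
            # Inside app, keep only the reading attested by this witness.
            for reading in child:
                if reading.tag not in (TEI + "lem", TEI + "rdg"):
                    continue
                if wit_id in reading.get("wit", "").split():
                    parts.append(witness_text(reading, wit_id))
                    break
        else:
            # Any other element is treated as shared by all witnesses.
            parts.append(witness_text(child, wit_id))
        # Text following the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
body = root.find(f".//{TEI}body")
texts = {w: witness_text(body, "#" + w) for w in ("A", "B")}
print(texts)  # {'A': 'あいうえお', 'B': 'あいゆえお'}
```

Once each witness has been flattened to a plain string like this, every pair of witnesses can be compared with an ordinary string distance.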
Example
Let’s upload the following XML file:
https://tei-eaj.github.io/koui/data/nakamura.xml
The output is an Excel file like the following, which lists the edit distance and similarity ratio for each pair of witnesses.
| index | name1 | name2 | distance | ratio |
|---|---|---|---|---|
| 0 | 中村式五十音 | 中村式五十音又様 | 10 | 0.85 |
| 1 | 中村式五十音 | 中村式五十音欠損本 | 7 | 0.8947368421052632 |
| 2 | 中村式五十音又様 | 中村式五十音欠損本 | 8 | 0.868421052631579 |
The following library is used for calculating similarity:
https://pypi.org/project/python-Levenshtein/
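To illustrate what the distance and ratio columns mean, here is a minimal pure-Python sketch that does not depend on the library. The function names `levenshtein` and `ratio` are my own. As far as I understand, the library's `ratio` counts a substitution as two operations (one deletion plus one insertion) and normalizes by the combined length of both strings, so the sketch assumes that definition. Please treat it as an approximation of the idea rather than the library's implementation.

```python
def levenshtein(a, b, sub_cost=1):
    """Edit distance between a and b using row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                # deletion
                curr[j - 1] + 1,                            # insertion
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def ratio(a, b):
    """Similarity in [0, 1]; assumes a substitution is weighted as 2."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b, sub_cost=2)) / total


print(levenshtein("あいうえお", "あいゆえお"))  # 1
print(ratio("あいうえお", "あいゆえお"))        # 0.8
```

Under this definition, the ratio is not simply 1 minus the distance divided by the length, which is why a distance of 1 on two 5-character strings gives 0.8 rather than 0.9.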
Summary
There is still room to refine the text comparison method, but I hope this serves as a useful example of quantitative comparison between witnesses.
Reference
The functionality has also been added to the “program for extracting differences between two texts” introduced below: