Overview

“Digital Genji Monogatari” is a site that aims to provide an environment supporting research on The Tale of Genji, as well as education and research using classical texts, by collecting and creating related data about The Tale of Genji and linking it together.

https://genji.dl.itc.u-tokyo.ac.jp/

One feature of the site is the “alignment of the Collated Tale of Genji with modern Japanese translations.” Corresponding passages in the “Collated Tale of Genji” and in Yosano Akiko’s translation, as published on Aozora Bunko, are highlighted together.

This article explains the procedure for implementing the above functionality.

Data

The alignment is stored as XML data like the following.

https://genji.dl.itc.u-tokyo.ac.jp/data/tei/koui/54.xml

In the text data of the “Collated Tale of Genji,” anchor elements carry a corresp attribute that points to a file and ID pair in Yosano Akiko’s translation.

<text>
  <body>
    <p>
      <lb/>
      <pb facs="#zone_2055" n="2055"/>
      <lb/>
      <seg corresp="https://w3id.org/kouigenjimonogatari/api/items/2055-01.json">
        <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000300"/>
        やまにおはしてれいせさせ給やうに経仏なとくやうせさせ給
        <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000400"/>
        又の日はよかはに
      </seg>
      <lb/>
      ...

The following tool was developed and used for creating this data.

https://github.com/tei-eaj/parallel_text_editor

Unfortunately, the tool does not work as of 2024-01-07, but the following video shows how it is used. I plan to improve this tool in the future.

https://youtu.be/hOp_PxYUrZk

This work produces Google Docs documents like the following.

https://docs.google.com/document/d/1DxKItyugUIR3YYUxlwH5-SRVA_eTj7Gh_LJ4A0mzCg8/edit?usp=sharing

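Each line of the “Collated Tale of Genji” starts with its line ID (for example, 2055-01), and the IDs of the corresponding passages in Yosano Akiko’s translation are inserted into the text as markers matching the regular expression \[YG(\d+)\].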

2055-01 [YG5600000300]やまにおはしてれいせさせ給やうに経仏なとくやうせさせ給[YG5600000400]又の日はよかはに
2055-02 おはしたれはそうつおとろきかしこまりきこえ給[YG5600000500]としころ御いのりなとつけか
2055-03 たらひたまひけれとことにいとしたしきことはなかりけるをこのたひ一品の宮
2055-04 の御心ちのほとにさふらひ給へるにすくれたまへるけん物し給けりとみたまひ
2055-05 てよりこよなうたうとひたまひていますこしふかきちきりくはへ給てけれはお
2055-06 も〱しくおはするとのゝかくわさとおはしましたることゝもてさはきゝこえ
2055-07 給[YG5600000600]御物かたりなとこまやかにしておはすれは御ゆつけなとまいり給[YG5600000700]すこし人
2055-08 〱しつまりぬるに[YG5600000800]をのゝわたりにしり給へるやとりや侍とゝひ給へはしか侍
...
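
Lines in this format can be split into a line ID and marked-up text segments with a regular expression. The following is a minimal sketch, not the author's actual code; the function name parse_line and the shape of its return value are hypothetical.

```python
import re

# Marker format described above: "[YG" + digits + "]"
MARKER = re.compile(r"\[(YG\d+)\]")


def parse_line(line: str):
    """Split '2055-01 text[YG...]text' into (line_id, segments).

    segments is a list of (anchor_id_or_None, text) tuples; the first
    tuple has None when the line starts with text before any marker.
    """
    # The line ID is everything before the first space.
    line_id, _, body = line.partition(" ")
    # re.split with one capture group yields [text, id, text, id, text, ...]
    parts = MARKER.split(body)
    segments = []
    if parts[0]:
        segments.append((None, parts[0]))
    for anchor_id, text in zip(parts[1::2], parts[2::2]):
        segments.append((anchor_id, text))
    return line_id, segments


if __name__ == "__main__":
    print(parse_line("2055-02 おはしたれは[YG5600000500]としころ"))
```

Text that precedes the first marker on a line continues the passage opened on an earlier line, which is why it is returned with None instead of an ID.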

One Google Docs document per volume of The Tale of Genji is saved in the following Google Drive folder.

https://drive.google.com/drive/folders/1QgS4z_5vk8AEz95iA3q7j41-U3oDdfpx

Processing

Retrieving the List of File Names and IDs from Google Drive

Connecting to Google Drive

#| export
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

#| export
class GoogleDriveClient:

    def __init__(self, credential_path):

        # If modifying these scopes, delete the file token.json.
        SCOPES = [
            "https://www.googleapis.com/auth/drive.metadata.readonly"
        ]

        creds = None
        # The file token.json stores the user's access and refresh tokens, and is
        # created automatically when the authorization flow completes for the first
        # time.
        if os.path.exists("token.json"):
            creds = Credentials.from_authorized_user_file("token.json", SCOPES)
        # If there are no (valid) credentials available, let the user log in.
        if not creds or not creds.valid:
            if creds and creds.expired and creds.refresh_token:
                creds.refresh(Request())
            else:
                flow = InstalledAppFlow.from_client_secrets_file(
                    credential_path, SCOPES
                )
                creds = flow.run_local_server(port=0)
            # Save the credentials for the next run
            with open("token.json", "w") as token:
                token.write(creds.to_json())

        try:
            service = build('drive', 'v3', credentials=creds)
            self.drive_service = service

        except HttpError as e:
            print(e)
            print("Error while creating API client")
            raise e

Retrieving the list

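import json
import os

# Assumption: path to the OAuth client secrets file; adjust to your environment
credential_path = "credentials.json"
client = GoogleDriveClient(credential_path)

service = client.drive_service

# Call the Drive v3 API to list the files in the folder shown above
results = service.files().list(
    q="'1QgS4z_5vk8AEz95iA3q7j41-U3oDdfpx' in parents",
    pageSize=100, fields="nextPageToken, files(id, name)").execute()
items = results.get('files', [])

# Map each file name (volume number) to its Google Drive file ID
config = {}

if not items:
    print('No files found.')
else:
    for item in items:
        config[item['name']] = item['id']

# Make sure the output directory exists before writing
os.makedirs("data", exist_ok=True)
with open("data/config.json", "w") as f:
    json.dump(config, f, indent=4)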
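
The request above asks for nextPageToken but reads only the first page. That is enough for a folder with fewer than 100 files, but a larger folder would need pagination. The following sketch shows one way to follow the token; the function name list_all_files is hypothetical, and service is assumed to be a Drive v3 client such as client.drive_service.

```python
def list_all_files(service, folder_id: str) -> dict:
    """Return {file name: file ID} for every file in the folder.

    Keeps requesting pages until the response has no nextPageToken.
    """
    config = {}
    page_token = None
    while True:
        response = service.files().list(
            q=f"'{folder_id}' in parents",
            pageSize=100,
            fields="nextPageToken, files(id, name)",
            pageToken=page_token,  # None on the first request
        ).execute()
        for item in response.get("files", []):
            config[item["name"]] = item["id"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return config
```

It could be called as list_all_files(client.drive_service, "1QgS4z_5vk8AEz95iA3q7j41-U3oDdfpx") and the result written to data/config.json as before.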

Processing Each Google Document

Each Google Docs document is then processed, using the file names (volume numbers) and IDs obtained above.

This step produces the XML data of the Collated Tale of Genji with the anchor elements shown at the beginning of this article.

The original XML data of the Collated Tale of Genji is published at the following location.

https://kouigenjimonogatari.github.io/
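
As a rough illustration of this step, the sketch below turns one line of a Google Docs document into a seg element with anchor children, following the sample XML shown earlier. It is not the author's implementation: the function name line_to_seg is hypothetical, XML namespaces are ignored, and the rule that the two digits after "YG" give the Yosano file number is an assumption inferred from the single example (YG5600000300 in 56.xml).

```python
import re
import xml.etree.ElementTree as ET

# Base URLs copied from the sample XML earlier in this article
KOUI_BASE = "https://w3id.org/kouigenjimonogatari/api/items/"
YOSANO_BASE = "https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/"
MARKER = re.compile(r"\[(YG\d+)\]")


def line_to_seg(line: str) -> ET.Element:
    """Build <seg corresp="..."> with one <anchor> per [YG...] marker."""
    line_id, _, body = line.partition(" ")
    seg = ET.Element("seg", {"corresp": f"{KOUI_BASE}{line_id}.json"})
    # Split into [text, id, text, id, text, ...]
    parts = MARKER.split(body)
    seg.text = parts[0] or None  # text before the first marker, if any
    for anchor_id, text in zip(parts[1::2], parts[2::2]):
        # Assumption: digits 3-4 of the ID are the Yosano file number
        volume = anchor_id[2:4]
        anchor = ET.SubElement(
            seg, "anchor",
            {"corresp": f"{YOSANO_BASE}{volume}.xml#{anchor_id}"},
        )
        anchor.tail = text  # text that follows this anchor
    return seg


if __name__ == "__main__":
    seg = line_to_seg("2055-01 [YG5600000300]やまに[YG5600000400]又の日はよかはに")
    print(ET.tostring(seg, encoding="unicode"))
```

A real pipeline would insert the anchors into the existing seg elements of the published XML instead of building new ones.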

(Reference) Formatting XML Data

The following function was created for formatting the XML data.

from xml.dom import minidom

def pretty(self, xml_string: str) -> str:
    """
    Pretty prints the XML string.
    :param xml_string: XML string to pretty print.
    :return: Pretty printed XML string.
    """

    # Build a DOM tree from the string.
    dom = minidom.parseString(xml_string)

    # Get the pretty-printed XML.
    pretty_xml_as_string = dom.toprettyxml()

    # Remove blank lines.
    pretty_xml_as_string = '\n'.join([line for line in pretty_xml_as_string.split('\n') if line.strip()])

    return pretty_xml_as_string
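
The blank-line filter matters when the input is already indented: minidom keeps the existing whitespace as text nodes, so toprettyxml() emits whitespace-only lines around them. The standalone sketch below shows the effect on a small sample; the name pretty_xml and the sample string are made up for illustration.

```python
from xml.dom import minidom


def pretty_xml(xml_string: str) -> str:
    """Pretty-print XML and drop whitespace-only lines."""
    dom = minidom.parseString(xml_string)
    lines = dom.toprettyxml(indent="  ").split("\n")
    return "\n".join(line for line in lines if line.strip())


# Input that already contains line breaks and indentation
sample = "<p>\n  <seg>text</seg>\n</p>"

# Without filtering, the existing whitespace shows up as extra lines
raw = minidom.parseString(sample).toprettyxml(indent="  ")

if __name__ == "__main__":
    print(len(raw.split("\n")), "lines before filtering")
    print(pretty_xml(sample))
```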

When Beautiful Soup’s prettify() method was used instead, as shown below, the output appeared to contain unnecessary line breaks.

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")

# Get the prettified HTML and print it
pretty_html = soup.prettify()
print(pretty_html)

Summary

I have documented the steps needed for aligning the Collated Tale of Genji with modern Japanese translations in Digital Genji Monogatari.
