Building an API Server for Searching the Koui Genji Monogatari Text DB

Overview

I built an API server for searching the Koui Genji Monogatari (Collated Tale of Genji) Text DB. These are my notes on it.

https://genji-api.aws.ldas.jp/

Background

The following page publishes the text data of "Koui Genji Monogatari" in a TEI/XML-compliant format.

https://kouigenjimonogatari.github.io/

I registered this text data in Elasticsearch and built an API that searches it by text segment.

Usage

The API documentation, built with OpenAPI and Swagger UI, is available at the following URL:

https://genji-api.aws.ldas.jp/

Key Features

Query Expansion

For example, the following URL searches for the keyword "夕顔" (Yugao). The input and output formats follow JSON:API.

https://genji-api.aws.ldas.jp/search?q=夕顔&page[limit]=20&page[offset]=0&sort=page&filter[expandRepeatMarks]=true&filter[unifyKanjiKana]=true&filter[unifyHistoricalKana]=true&filter[unifyPhoneticChanges]=true&filter[unifyDakuon]=true&filter[vol_str]=04 夕顔

The following result is returned. Spelling variants are generated from the input keyword "夕顔" (Yugao), and the search runs against all of them.

{
  "data": [],
  "meta": {
    "query": "夕顔",
    "transformedQueries": [
      "夕顔",
      "ゆうかお",
      "ゆふかお",
      "ゆふかほ",
      "ゆうかほ",
      "夕かお",
      "夕かほ",
      "ゆう顔",
      "ゆふ顔"
    ],
    "transformOptions": {
      "expandRepeatMarks": true,
      "unifyKanjiKana": true,
      "unifyHistoricalKana": true,
      "unifyPhoneticChanges": true,
      "unifyDakuon": true
    },
    "filters": {
      "expandRepeatMarks": true,
      "unifyKanjiKana": true,
      "unifyHistoricalKana": true,
      "unifyPhoneticChanges": true,
      "unifyDakuon": true,
      "vol_str": "04 夕顔"
    },
    "sort": "page",
    "limit": 20,
    "offset": 0,
    "total": 7,
    "aggregations": {
      "vol_str": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "04 夕顔",
            "doc_count": 7
          }
        ]
      }
    }
  }
}

As a result, occurrences of "ゆふかほ," "夕かほ," and "夕顔" in the body text can all be found with a single search.

Each keyword expansion option can be toggled ON or OFF individually. For details, please check the Swagger UI mentioned above.
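As a rough illustration of toggling these options from a client, here is a minimal Python sketch that assembles a search URL like the one shown earlier. The parameter names come from that example URL. The helper name `build_search_url`, the choice to default every option to ON, and the use of the string "false" to switch an option OFF are my assumptions; the Swagger UI is the authority on the accepted values.

```python
from urllib.parse import urlencode

BASE_URL = "https://genji-api.aws.ldas.jp/search"

# Option names taken from the example URL in this article.
EXPANSION_OPTIONS = (
    "expandRepeatMarks",
    "unifyKanjiKana",
    "unifyHistoricalKana",
    "unifyPhoneticChanges",
    "unifyDakuon",
)


def build_search_url(keyword, volume=None, limit=20, offset=0, sort="page", **options):
    """Build a search URL (hypothetical helper, not part of the API itself).

    Every expansion option is ON unless passed as False. Sending "false"
    to turn an option OFF is an assumption of this sketch.
    """
    unknown = set(options) - set(EXPANSION_OPTIONS)
    if unknown:
        raise ValueError(f"unknown expansion option(s): {sorted(unknown)}")

    params = [
        ("q", keyword),
        ("page[limit]", limit),
        ("page[offset]", offset),
        ("sort", sort),
    ]
    for name in EXPANSION_OPTIONS:
        enabled = options.get(name, True)
        params.append((f"filter[{name}]", "true" if enabled else "false"))
    if volume is not None:
        params.append(("filter[vol_str]", volume))

    # urlencode percent-encodes the brackets and the Japanese text.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Same search as the example, but without voiced-consonant unification.
    print(build_search_url("夕顔", volume="04 夕顔", unifyDakuon=False))
```

The resulting URL is percent-encoded, so it looks different from the human-readable one above but carries the same parameters.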

The following OR search query is sent to Elasticsearch:

{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "original_text_lines.keyword": "*夕顔*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆうかお*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆふかお*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆふかほ*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆうかほ*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*夕かお*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*夕かほ*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆう顔*"
          }
        },
        {
          "wildcard": {
            "original_text_lines.keyword": "*ゆふ顔*"
          }
        }
      ],
      "minimum_should_match": 1,
      "filter": {
        "terms": {
          "vol_str": [
            "04 夕顔"
          ]
        }
      }
    }
  },
  "size": 20,
  "from": 0,
  "sort": [
    {
      "page": {
        "order": "asc"
      }
    }
  ]
}
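A query of this shape can be generated mechanically from the list of variants. The following Python sketch shows one way to do it. It is not the server's actual code: the function name `build_variant_query` and its parameters are hypothetical, and only the field names (`original_text_lines.keyword`, `vol_str`, `page`) are taken from the query above.

```python
def build_variant_query(variants, volumes=None, size=20, offset=0,
                        field="original_text_lines.keyword"):
    """Build an OR query with one wildcard clause per spelling variant.

    Hypothetical helper: it only mirrors the structure of the query
    shown in this article.
    """
    # One substring-match clause per variant; any single match is enough.
    should = [{"wildcard": {field: f"*{variant}*"}} for variant in variants]
    bool_query = {"should": should, "minimum_should_match": 1}

    # Restrict to the selected volumes only when a volume filter is given.
    if volumes:
        bool_query["filter"] = {"terms": {"vol_str": list(volumes)}}

    return {
        "query": {"bool": bool_query},
        "size": size,
        "from": offset,
        "sort": [{"page": {"order": "asc"}}],
    }


if __name__ == "__main__":
    import json

    query = build_variant_query(["夕顔", "ゆふかほ", "夕かほ"], volumes=["04 夕顔"])
    print(json.dumps(query, ensure_ascii=False, indent=2))
```

Note that this sketch passes the variants through unchanged, so a keyword containing `*` or `?` would be interpreted as a wildcard pattern.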

The rules used for conversion can be checked at the following URL:

https://genji-api.aws.ldas.jp/normalization/rules

{
  "data": {
    "type": "normalization-rules",
    "attributes": {
      "rules": {
        "historicalKana": {
          "ゐ": "い",
          "ゑ": "え",
          "を": "お",
          "ワ": "ワ",
          "ヰ": "イ",
          "ヱ": "エ",
          "ヲ": "オ",
          "くゎ": "か",
          "ぐゎ": "が",
          "クヮ": "カ",
          "グヮ": "ガ"
        },
        "dakuon": {
          "が": "か",
          "ぎ": "き",
          "ぐ": "く",
          "げ": "け",
          "ご": "こ",
          "ざ": "さ",
          "じ": "し",
          "ず": "す",
          "ぜ": "せ",
          "ぞ": "そ",
          "だ": "た",
          "ぢ": "ち",
          "づ": "つ",
          "で": "て",
          "ど": "と",
          "ば": "は",
          "び": "ひ",
          "ぶ": "ふ",
          "べ": "へ",
          "ぼ": "ほ",
          "ぱ": "は",
          "ぴ": "ひ",
          "ぷ": "ふ",
          "ぺ": "へ",
          "ぽ": "ほ",
          "ガ": "カ",
          "ギ": "キ",
          "グ": "ク",
          "ゲ": "ケ",
          "ゴ": "コ",
          "ザ": "サ",
          "ジ": "シ",
          "ズ": "ス",
          "ゼ": "セ",
          "ゾ": "ソ",
          "ダ": "タ",
          "ヂ": "チ",
          "ヅ": "ツ",
          "デ": "テ",
          "ド": "ト",
          "バ": "ハ",
          "ビ": "ヒ",
          "ブ": "フ",
          "ベ": "ヘ",
          "ボ": "ホ",
          "パ": "ハ",
          "ピ": "ヒ",
          "プ": "フ",
          "ペ": "ヘ",
          "ポ": "ホ"
        },
        "kanjiKana": {
          "桐壺": "きりつほ",
          "帚木": "ははきぎ",
          "空蝉": "うつせみ",
          "夕顔": "ゆうがお",
          "若紫": "わかむらさき",
          "末摘花": "すえつむはな",
          "紅葉賀": "もみじのが",
          "花宴": "はなのえん",
          "葵": "あおい",
          "賢木": "さかき",
          "花散里": "はなちるさと",
          "須磨": "すま",
          "明石": "あかし",
          "澪標": "みおつくし",
          "蓬生": "よもきふ",
          "関屋": "せきや",
          "絵合": "えあわせ",
          "松風": "まつかせ",
          "薄雲": "うすくも",
          "朝顔": "あさかお",
          "少女": "おとめ",
          "玉鬘": "たまかづら",
          "初音": "はつね",
          "胡蝶": "こちよう",
          "螢": "ほたる",
          "蛍": "ほたる",
          "常夏": "とこなつ",
          "篝火": "かかりひ",
          "野分": "のわき",
          "行幸": "みゆき",
          "藤袴": "ふちはかま",
          "真木柱": "まきはしら",
          "梅枝": "うめかえ",
          "藤裏葉": "ふちのうらは",
          "若菜上": "わかなじょう",
          "若菜下": "わかなげ",
          "若菜": "わかな",
          "柏木": "かしわき",
          "横笛": "よこふえ",
          "鈴虫": "すすむし",
          "夕霧": "ゆうきり",
          "御法": "みのり",
          "幻": "まほろし",
          "匂宮": "におうみや",
          "紅梅": "こうはい",
          "竹河": "たけかわ",
          "橋姫": "はしひめ",
          "椎本": "しいかもと",
          "総角": "あけまき",
          "早蕨": "さわらひ",
          "宿木": "やとりき",
          "東屋": "あすまや",
          "浮舟": "うきふね",
          "蜻蛉": "かけろう",
          "手習": "てならい",
          "夢浮橋": "ゆめのうきはし",
          "雲隠": "くもかくれ",
          "玉": "たま",
          "鬘": "かつら",
          "夕": "ゆう",
          "顔": "かお",
          "紫": "むらさき",
          "紅葉": "もみち",
          "朱雀": "すさく",
          "藤壺": "ふちつほ",
          "惟光": "これみつ",
          "源氏": "げんじ",
          "物語": "ものがたり",
          "紫式部": "むらさきしきぶ",
          "光源氏": "ひかるげんじ",
          "桐壺帝": "きりつぼてい",
          "更衣": "こうい",
          "御息所": "みやすどころ",
          "入道": "にゅうどう",
          "大臣": "だいじん",
          "中宮": "ちゅうぐう",
          "女院": "にょういん",
          "宮": "みや",
          "君": "きみ",
          "上": "うえ",
          "殿": "との",
          "御前": "おまえ",
          "姫君": "ひめぎみ",
          "若君": "わかぎみ",
          "内裏": "だいり",
          "御所": "ごしょ",
          "里": "さと",
          "六条": "ろくじょう",
          "二条": "にじょう",
          "三条": "さんじょう",
          "四条": "しじょう",
          "五条": "ごじょう",
          "七条": "しちじょう",
          "八条": "はちじょう",
          "九条": "くじょう",
          "十条": "じゅうじょう"
        },
        "kanaKanji": {
          "きりつほ": "桐壺",
          "ははきぎ": "帚木",
          "うつせみ": "空蝉",
          "ゆうがお": "夕顔",
          "わかむらさき": "若紫",
          "すえつむはな": "末摘花",
          "もみじのが": "紅葉賀",
          "はなのえん": "花宴",
          "あおい": "葵",
          "さかき": "賢木",
          "はなちるさと": "花散里",
          "すま": "須磨",
          "あかし": "明石",
          "みおつくし": "澪標",
          "よもきふ": "蓬生",
          "せきや": "関屋",
          "えあわせ": "絵合",
          "まつかせ": "松風",
          "うすくも": "薄雲",
          "あさかお": "朝顔",
          "おとめ": "少女",
          "たまかづら": "玉鬘",
          "はつね": "初音",
          "こちよう": "胡蝶",
          "ほたる": "蛍",
          "とこなつ": "常夏",
          "かかりひ": "篝火",
          "のわき": "野分",
          "みゆき": "行幸",
          "ふちはかま": "藤袴",
          "まきはしら": "真木柱",
          "うめかえ": "梅枝",
          "ふちのうらは": "藤裏葉",
          "わかなじょう": "若菜上",
          "わかなげ": "若菜下",
          "わかな": "若菜",
          "かしわき": "柏木",
          "よこふえ": "横笛",
          "すすむし": "鈴虫",
          "ゆうきり": "夕霧",
          "みのり": "御法",
          "まほろし": "幻",
          "におうみや": "匂宮",
          "こうはい": "紅梅",
          "たけかわ": "竹河",
          "はしひめ": "橋姫",
          "しいかもと": "椎本",
          "あけまき": "総角",
          "さわらひ": "早蕨",
          "やとりき": "宿木",
          "あすまや": "東屋",
          "うきふね": "浮舟",
          "かけろう": "蜻蛉",
          "てならい": "手習",
          "ゆめのうきはし": "夢浮橋",
          "くもかくれ": "雲隠",
          "たま": "玉",
          "かつら": "鬘",
          "ゆう": "夕",
          "かお": "顔",
          "むらさき": "紫",
          "もみち": "紅葉",
          "すさく": "朱雀",
          "ふちつほ": "藤壺",
          "これみつ": "惟光",
          "げんじ": "源氏",
          "ものがたり": "物語",
          "むらさきしきぶ": "紫式部",
          "ひかるげんじ": "光源氏",
          "きりつぼてい": "桐壺帝",
          "こうい": "更衣",
          "みやすどころ": "御息所",
          "にゅうどう": "入道",
          "だいじん": "大臣",
          "ちゅうぐう": "中宮",
          "にょういん": "女院",
          "みや": "宮",
          "きみ": "君",
          "うえ": "上",
          "との": "殿",
          "おまえ": "御前",
          "ひめぎみ": "姫君",
          "わかぎみ": "若君",
          "だいり": "内裏",
          "ごしょ": "御所",
          "さと": "里",
          "ろくじょう": "六条",
          "にじょう": "二条",
          "さんじょう": "三条",
          "しじょう": "四条",
          "ごじょう": "五条",
          "しちじょう": "七条",
          "はちじょう": "八条",
          "くじょう": "九条",
          "じゅうじょう": "十条"
        },
        "phoneticChange": {
          "ふ": "う",
          "む": "ん",
          "つ": "っ",
          "は": "わ",
          "へ": "え",
          "を": "お",
          "ひ": "い",
          "く": "う",
          "ぬ": "ん",
          "フ": "ウ",
          "ム": "ン",
          "ツ": "ッ",
          "ハ": "ワ",
          "ヘ": "エ",
          "ヲ": "オ",
          "ヒ": "イ",
          "ク": "ウ",
          "ヌ": "ン"
        }
      },
      "stats": {
        "historicalKanaRules": 11,
        "dakuonRules": 50,
        "kanjiKanaRules": 96,
        "kanaKanjiRules": 95,
        "phoneticChangeRules": 18,
        "totalRules": 270
      },
      "options": {
        "unifyHistoricalKana": "Historical kana unification (ゑ->え, ゐ->い)",
        "unifyDakuon": "Voiced consonant unification (が->か, ず->す)",
        "unifyKanjiKana": "Kanji-kana unification (玉->たま)",
        "unifyPhoneticChanges": "Phonetic change unification (ふ->う, は->わ)"
      },
      "description": {
        "historicalKana": "Unify historical kana usage to modern kana usage",
        "dakuon": "Unify voiced and semi-voiced consonants to voiceless consonants",
        "kanjiKana": "Convert kanji to corresponding kana",
        "kanaKanji": "Convert kana to corresponding kanji",
        "phoneticChange": "Unify phonetic changes (particles, etc.)"
      }
    }
  },
  "meta": {
    "version": "1.0.0",
    "lastUpdated": "2025-06-25T07:08:42.608Z"
  }
}
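To give a feel for how tables like these can be applied to a string, here is a small Python sketch. How the server actually uses the rules is not described in this article, and the search response above lists several variants rather than one normalized form, so the real expansion logic is evidently more involved. The function `apply_rules`, the longest-key-first matching strategy, and the small table excerpts are assumptions made for illustration.

```python
# Small excerpts of the rule tables shown above (not the full tables).
HISTORICAL_KANA = {"ゐ": "い", "ゑ": "え", "を": "お", "くゎ": "か", "ぐゎ": "が"}
DAKUON = {"が": "か", "じ": "し", "ず": "す", "ぼ": "ほ", "ぱ": "は"}


def apply_rules(text, rules):
    """Rewrite text in one left-to-right pass, trying longer keys first.

    Hypothetical helper. A single pass means replaced output is never
    rewritten again by the same table, and longest-key-first lets a
    two-character rule such as "くゎ" win over any one-character rule.
    """
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule starts at this position: keep the character as is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(apply_rules("ゆふがほ", DAKUON))
    print(apply_rules(apply_rules("ぐゎん", HISTORICAL_KANA), DAKUON))
```

Applying one table at a time like this also makes it easy to switch each table ON or OFF independently, in the spirit of the `unify...` options.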

Summary

Although some parts are still incomplete, this article introduced an example of building a search API server with a mechanism for absorbing orthographic variation in the original text.

I hope this serves as a useful reference.
