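Exporting Web Annotations with the Hypothes.is API and Converting Them to TEI/XML

Introduction

Hypothes.is is an open-source annotation tool for adding highlights and comments to web pages. It is easy to use through a browser extension or an embedded JavaScript client, but sometimes you want to back up the annotations you have accumulated, or reuse them in another format such as TEI/XML.

This article shows how to export annotations with the Hypothes.is API and convert them to TEI/XML.

Getting an API Key

1. Log in to Hypothes.is
2. Open Developer settings
3. Generate an API key with "Generate your API token"

Save the key in a .env file.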

cp .env.example .env
# Edit .env and set your API key

HYPOTHESIS_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Exporting Annotations

API Basics

The base URL of the Hypothes.is API is https://api.hypothes.is/api. Requests are authenticated with an Authorization: Bearer <API_KEY> header.

Main endpoints:

Endpoint	Purpose
`GET /api/profile`	Get the authenticated user's profile
`GET /api/search`	Search annotations
`GET /api/annotations/{id}`	Get a single annotation

Script

The whole pipeline, from export to TEI/XML conversion, lives in a single script, hypothes_export.py.

https://github.com/nakamura196/hypothes-export/blob/main/hypothes_export.py

The main parts are excerpted and explained below.

Loading .env and Calling the API

import json
import os
import urllib.parse
import urllib.request
from pathlib import Path


def load_env():
    # Read KEY=VALUE pairs from the .env file next to this script into os.environ.
    env_path = Path(__file__).parent / ".env"
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without "=".
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ[k.strip()] = v.strip()


def api_get(endpoint, params=None):
    # Send an authenticated GET request and return the decoded JSON body.
    api_key = os.environ["HYPOTHESIS_API_KEY"]
    url = f"https://api.hypothes.is/api/{endpoint}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

Fetching All Annotations (with Pagination)

The Search API returns at most 200 results per request, so the script advances offset to fetch everything.

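def fetch_all_annotations():
    # The profile endpoint tells us the userid of the token's owner.
    profile = api_get("profile")
    user = profile["userid"]

    all_annotations = []
    limit = 200  # maximum page size accepted by the Search API
    offset = 0

    # The first page also reports the total number of matching annotations.
    result = api_get("search", {"user": user, "limit": limit, "offset": 0})
    total = result["total"]
    all_annotations.extend(result["rows"])
    offset += limit

    # Request the remaining pages until every annotation has been collected.
    while offset < total:
        result = api_get("search", {"user": user, "limit": limit, "offset": offset})
        all_annotations.extend(result["rows"])
        offset += limit

    return all_annotations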

Running the Script

# Output JSON and TEI/XML
python hypothes_export.py

# Output JSON only
python hypothes_export.py --json-only

# Convert existing JSON to TEI/XML only
python hypothes_export.py --tei-only

User: acct:your_username@hypothes.is
Total: 6 annotations
Saved JSON: output/annotations.json (6 annotations)
Saved TEI/XML: output/annotations.xml

Annotation Data Structure

Each annotation in the exported JSON has a structure based on the W3C Web Annotation Data Model.

{
  "id": "a1lBUhPdEfG-Lk8iV7GT3w",
  "created": "2026-02-27T13:08:33.427772+00:00",
  "user": "acct:your_username@hypothes.is",
  "uri": "https://example.com/page",
  "text": "これは正しい？",
  "tags": ["メモ"],
  "target": [
    {
      "source": "https://example.com/page",
      "selector": [
        {
          "type": "RangeSelector",
          "startContainer": "/main[1]/div[1]/p[1]",
          "startOffset": 335,
          "endContainer": "/main[1]/div[1]/p[1]/span[4]",
          "endOffset": 0
        },
        {
          "type": "TextPositionSelector",
          "start": 1663,
          "end": 1667
        },
        {
          "type": "TextQuoteSelector",
          "exact": "此詩乃是",
          "prefix": "人樂太平無事日　　　鶯花無限日高眠 \n　　　　　　　　",
          "suffix": "宋太祖朝中一個名儒姓邵諱尭堯夫道號康節先生所作為"
        }
      ]
    }
  ]
}

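The Three Selector Types

Hypothes.is records the position of the annotated text with three kinds of selectors.

Selector	How it works	Robustness
RangeSelector	Locates the span by XPath in the DOM	Weak: breaks when the HTML structure changes
TextPositionSelector	Locates the span by character offsets	Weak: shifts when text is added or removed
TextQuoteSelector	Locates the span by the quoted text plus surrounding context	Strong: can be re-anchored by fuzzy matching

When the source document changes, Hypothes.is tries these selectors in turn as fallbacks. TextQuoteSelector is the most robust because it fuzzy-matches using prefix and suffix, but if the target text itself is deleted or heavily changed, the annotation becomes "orphaned".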
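
To make the fallback idea concrete, here is a simplified sketch of re-anchoring against plain text. It is not the algorithm Hypothes.is actually uses: it relies on exact substring search instead of fuzzy matching, ignores RangeSelector, and locate_quote is a hypothetical helper name.

```python
from typing import Optional, Tuple


def locate_quote(text: str, selectors: list) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the annotated span in `text`, or None if not found."""
    by_type = {sel.get("type"): sel for sel in selectors}
    quote = by_type.get("TextQuoteSelector")
    position = by_type.get("TextPositionSelector")
    exact = quote.get("exact", "") if quote else ""
    if not exact:
        return None  # nothing to verify against in this sketch

    # 1. Trust the character offsets only if they still point at the quoted text.
    if position:
        start, end = position["start"], position["end"]
        if text[start:end] == exact:
            return start, end

    # 2. Look for the quote together with its surrounding context.
    prefix = quote.get("prefix", "")
    suffix = quote.get("suffix", "")
    idx = text.find(prefix + exact + suffix)
    if idx != -1:
        start = idx + len(prefix)
        return start, start + len(exact)

    # 3. Fall back to the first occurrence of the quote alone.
    idx = text.find(exact)
    if idx != -1:
        return idx, idx + len(exact)

    return None  # treated as orphaned in this sketch


selectors = [
    {"type": "TextPositionSelector", "start": 11, "end": 16},
    {"type": "TextQuoteSelector", "exact": "gamma", "prefix": "beta ", "suffix": " delta"},
]
original = "alpha beta gamma delta"
edited = "intro: alpha beta gamma delta"   # offsets shift, quote survives
removed = "alpha beta delta"               # quote is gone

for doc in (original, edited, removed):
    print(repr(doc), "->", locate_quote(doc, selectors))
```

In the edited text the stored offsets no longer match, so the position check fails and the quote-plus-context search takes over. A real implementation would need approximate matching to tolerate small edits inside the quote or its context.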

Converting to TEI/XML

The exported JSON is then converted to TEI/XML.

Mapping

Hypothes.is	TEI/XML
Target document (URI and title)	`<sourceDesc><bibl>`
Per-document group	`<div>`
Each annotation	`<ab>`
Highlighted text (`TextQuoteSelector.exact`)	`<quote>`
Comment body	`<note type="annotation">`
Tags	`<note type="tag">`

Conversion Logic

The quoted text is extracted from the TextQuoteSelector and mapped to TEI elements.

def get_text_quote(annotation):
    """Return the TextQuoteSelector (exact/prefix/suffix), or None if absent."""
    for target in annotation.get("target", []):
        for sel in target.get("selector", []):
            if sel.get("type") == "TextQuoteSelector":
                return sel
    return None

Annotations are grouped by URI and written out as a <div> → <ab> → <quote> / <note> structure. See the source code for details.
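
As a rough illustration of that grouping step, the sketch below builds only the TEI body with xml.etree.ElementTree. It is an assumption about how such a conversion could look, not the code in hypothes_export.py: build_body and quote_of are hypothetical names, the teiHeader is omitted, and the xml:id uses the full annotation ID.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def quote_of(annotation):
    """Return the TextQuoteSelector's exact text, or None for page-level notes."""
    for target in annotation.get("target", []):
        for sel in target.get("selector", []):
            if sel.get("type") == "TextQuoteSelector":
                return sel.get("exact")
    return None


def build_body(annotations):
    """Build a TEI <body> with one <div> per URI and one <ab> per annotation."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann["uri"]].append(ann)

    body = ET.Element(f"{{{TEI_NS}}}body")
    for index, (uri, anns) in enumerate(groups.items()):
        # Each <div> points back at its source entry, e.g. #src-0.
        div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"corresp": f"#src-{index}"})
        for ann in anns:
            ab = ET.SubElement(div, f"{{{TEI_NS}}}ab", {f"{{{XML_NS}}}id": f"ann-{ann['id']}"})
            exact = quote_of(ann)
            if exact:
                ET.SubElement(ab, f"{{{TEI_NS}}}quote").text = exact
            if ann.get("text"):
                note = ET.SubElement(ab, f"{{{TEI_NS}}}note", {"type": "annotation"})
                note.text = ann["text"]
            for tag in ann.get("tags", []):
                ET.SubElement(ab, f"{{{TEI_NS}}}note", {"type": "tag"}).text = tag
    return body


sample = [{
    "id": "a1lBUhPdEfG-Lk8iV7GT3w",
    "uri": "https://example.com/page",
    "text": "これは正しい？",
    "tags": ["メモ"],
    "target": [{"selector": [{"type": "TextQuoteSelector", "exact": "此詩乃是"}]}],
}]

ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
body = build_body(sample)
print(ET.tostring(body, encoding="unicode"))
```

Because the text is set through ElementTree rather than string concatenation, characters such as & and < in comments are escaped automatically.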

Example Output

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Hypothes.is Annotations Export</title>
      </titleStmt>
      <publicationStmt>
        <p>Exported from Hypothes.is API</p>
      </publicationStmt>
      <sourceDesc>
        <bibl xml:id="src-0">
          <title>巻首題：新刻全像水滸傳</title>
          <ref target="https://example.com/page">https://example.com/page</ref>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div corresp="#src-0">
        <head>巻首題：新刻全像水滸傳</head>
        <ab xml:id="ann-a1lBUhPdEfG">
          <quote>此詩乃是</quote>
          <note type="annotation"
                corresp="https://hypothes.is/a/a1lBUhPdEfG"
                when="2026-02-27T13:08:33.427772+00:00">
            これは正しい？
          </note>
        </ab>
      </div>
    </body>
  </text>
</TEI>

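Source Document Changes and Annotation Consistency

Hypothes.is annotations are standoff annotations: they are stored separately from the source document. As a result, annotation positions can drift when the source document changes.

- Minor changes: usually re-anchored by TextQuoteSelector's fuzzy matching
- Major changes: the annotation becomes "orphaned" and is no longer tied to its target passage

Exporting to TEI/XML records the highlighted text in a <quote> element, so the correspondence with the source document survives at least as a record.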
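
One way to use that record is to check an exported file against the current text of the page. The sketch below assumes the page text has already been fetched and stripped to plain text; missing_quotes is a hypothetical helper, and it only tests for exact substring presence, so it reports more misses than a fuzzy matcher would.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def missing_quotes(tei_xml: str, page_text: str) -> list:
    """Return the xml:id of every <ab> whose <quote> no longer occurs in page_text."""
    root = ET.fromstring(tei_xml)
    missing = []
    for ab in root.iterfind(".//tei:ab", TEI):
        quote = ab.find("tei:quote", TEI)
        if quote is not None and quote.text and quote.text not in page_text:
            missing.append(ab.get(XML_ID))
    return missing


# Tiny hand-written TEI fragment for demonstration purposes.
tei_xml = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    '<ab xml:id="ann-1"><quote>此詩乃是</quote></ab>'
    '<ab xml:id="ann-2"><quote>deleted passage</quote></ab>'
    '</div></body></text></TEI>'
)
page_text = "鶯花無限日高眠 此詩乃是宋太祖朝中一個名儒"

print(missing_quotes(tei_xml, page_text))
```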

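Summary

- The Hypothes.is API lets you retrieve your own annotations programmatically
- The exact/prefix/suffix fields of TextQuoteSelector matter most for identifying the annotated text
- Converting to TEI/XML lets you store and reuse annotations in a format widely used in humanities research
- Anchoring drift caused by changes to the source document still requires attention

The source code is available on GitHub.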
