This article is co-authored with generative AI. While I have cross-checked facts against official documentation where possible, errors may remain. Please verify primary sources before making important decisions.

While poking around the Derge Tibetan Tripitaka database hosted on the Tohoku University Digital Archives (touda.tohoku.ac.jp/collection), I wondered whether any path returned JSON, and ended up checking the available public APIs one by one. OAI-PMH turned out to be the workable route, so this post records the procedure for harvesting records, one Excel file per setSpec, without any HTML scraping.

Endpoints I checked

The status I observed (as of 2026-04-30) is summarised below.

Type | Endpoint | Status
OAI-PMH | https://touda.tohoku.ac.jp/collection/oai | Available (3 metadata formats)
IIIF Presentation v3 manifest | https://touda.tohoku.ac.jp/collection/iiif/scripture/{ID}/manifest.json | Available (per-record)
IIIF Image API v2 | service URL inside the manifest | Available
Drupal JSON:API | /jsonapi | Not exposed in what I could see
Drupal REST | ?_format=json | Returns 500
Sitemap | https://touda.tohoku.ac.jp/collection/sitemap.xml | Available (partial coverage, see below)
OpenSearch / RSS / Atom / IIIF Search / direct Solr | - | All 404

The site is built on Drupal: a request with ?_format=hal_json came back as a 406 whose body was served as application/hal+json, which is a Drupal fingerprint. The ?_format=json / ?_format=hal_json query is handled by the Drupal core RESTful Web Services (rest) module, and /jsonapi by the core JSON:API module; both ship with standard Drupal, but since Drupal 8.4 they are off by default, and an administrator must enable them and configure REST resources and permissions per entity. The behaviour observed here (406 / 500 / no /jsonapi) reads as those modules not being enabled for end users, so for bulk retrieval OAI-PMH ends up being the practical route.
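The endpoint checks themselves are easy to script. A minimal sketch, with a probe helper of my own (not anything the site documents) that treats 4xx/5xx responses such as the 406 and 500 above as answers rather than failures:

```python
from __future__ import annotations

import urllib.error
import urllib.request


def probe(url: str, timeout: int = 30) -> int | None:
    """Return the HTTP status code for url, or None if the host is unreachable.

    4xx/5xx responses arrive as HTTPError; here the code itself is the
    answer we want, so it is returned rather than raised.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "api-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as r:
            return r.status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return None


# Usage (hits the network):
# for path in ("oai?verb=Identify", "jsonapi", "sitemap.xml"):
#     print(path, probe(f"https://touda.tohoku.ac.jp/collection/{path}"))
```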

Sitemap coverage

sitemap.xml is a sitemapindex split into 64 pages, listing 126,771 individual URLs in total. The sitemap index <lastmod> is fixed at 2024-05-30T02:06:54+09:00 for every one of the 64 pages, suggesting regeneration stopped on 2024-05-30 (the per-URL <lastmod> values inside each page mix dates from 2023–2024).

The path breakdown is dominated by database/library and database/archives:

Path | Count
database/library | 66,369
database/archives | 60,336
database/medlib | 51
en/database/* (English) | 12
database/tibet/* | 0

Against the 178,664 records exposed via OAI-PMH, the sitemap holds only 126,771 URLs, and none of them are under database/tibet/{ID} — the very area I was originally interested in. So using the sitemap as a way to enumerate Tibetan Tripitaka IDs is not viable in the current state.
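For completeness, the path breakdown above can be reproduced with the standard library alone: walk the sitemapindex, fetch each page, and tally the first two path segments of every <loc> URL. A sketch assuming the standard sitemap 0.9 namespace (the bucket / tally_sitemap helpers are mine):

```python
from collections import Counter
from urllib.parse import urlparse
import urllib.request
import xml.etree.ElementTree as ET

SM = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def bucket(url: str, depth: int = 2) -> str:
    """First `depth` path segments, e.g. .../database/library/123 -> database/library."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return "/".join(parts[:depth])


def tally_sitemap(index_url: str) -> Counter:
    """Follow a sitemapindex and count <loc> URLs per path bucket (network)."""
    tally: Counter = Counter()
    index = ET.fromstring(urllib.request.urlopen(index_url).read())
    for sm in index.findall("sm:sitemap/sm:loc", SM):
        page = ET.fromstring(urllib.request.urlopen(sm.text.strip()).read())
        tally.update(bucket(u.text) for u in page.findall("sm:url/sm:loc", SM))
    return tally


# Usage (network):
# tally_sitemap("https://touda.tohoku.ac.jp/collection/sitemap.xml")
```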

OAI-PMH overview

What verb=Identify returns:

baseURL                : https://touda.tohoku.ac.jp/collection/
repositoryName         : Tohoku University Digital Archives Collection Database
earliestDatestamp      : 2013-01-01
deletedRecord          : persistent
granularity            : YYYY-MM-DDThh:mm:ssZ

The standard verbs Identify / ListMetadataFormats / ListSets / ListRecords / GetRecord are all accepted.
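Identify responses are flat enough to fold into a dict in a few lines. A sketch assuming only the standard OAI 2.0 namespace (parse_identify is my name; repeated children such as adminEmail would overwrite each other in this simple version):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_identify(xml_bytes: bytes) -> dict:
    """Map each child of <Identify> (repositoryName, granularity, ...) to its text."""
    root = ET.fromstring(xml_bytes)
    ident = root.find(f"{OAI}Identify")
    if ident is None:
        return {}
    return {child.tag.removeprefix(OAI): (child.text or "") for child in ident}


# Usage (network):
# parse_identify(urllib.request.urlopen(
#     "https://touda.tohoku.ac.jp/collection/oai?verb=Identify").read())
```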

Comparing metadataPrefix values

Three formats are available. I compared them on the same record (identifier=10060010000001, a libshin book).

prefix | Size | Element count | Notes
oai_dc | 1.4 KB | 6 | DC15 only; minimal
jpcoar | 1.9 KB | 9 | Structured jpcoar:publisher, datacite:geoLocation
dcndl_simple | 2.3 KB | 13 | Carries xml:lang; multi-valued dcterms:description / extent / spatial

dcndl_simple follows the NDL (National Diet Library) vocabulary family and preserves notes (dcterms:description), physical form (multi-valued dcterms:extent like “B5 size” / “168p”), and bilingual place names (dcterms:spatial xml:lang="ja|en") without dropping information. For working with Japanese-language sources in Excel, dcndl_simple looked the easiest to handle in terms of information density.
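The size and element counts in the table came from comparisons along these lines: fetch the same identifier under each metadataPrefix with verb=GetRecord and count the elements inside <metadata>. A sketch (the helper and its counting convention, which counts every descendant except <metadata> itself, are mine and may differ slightly from the table's tally):

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def metadata_stats(xml_bytes: bytes) -> tuple:
    """(byte size, element count under <metadata>) for a GetRecord response."""
    root = ET.fromstring(xml_bytes)
    meta = root.find("oai:GetRecord/oai:record/oai:metadata", OAI_NS)
    # iter() yields <metadata> itself first, so subtract one.
    count = sum(1 for _ in meta.iter()) - 1 if meta is not None else 0
    return len(xml_bytes), count


# Usage (network):
# for prefix in ("oai_dc", "jpcoar", "dcndl_simple"):
#     url = ("https://touda.tohoku.ac.jp/collection/oai?verb=GetRecord"
#            f"&identifier=10060010000001&metadataPrefix={prefix}")
#     print(prefix, metadata_stats(urllib.request.urlopen(url).read()))
```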

setSpec behaviour — declared vs. actually retrievable

verb=ListSets returns 22 sets, but when you actually attach set=..., 7 of them come back empty as noRecordsMatch.
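Declared-but-empty sets can be detected cheaply: issue verb=ListIdentifiers&set=... and read completeListSize off the resumptionToken, or catch the noRecordsMatch error. A sketch of the response handling (set_size is my name):

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def set_size(xml_bytes: bytes) -> int:
    """completeListSize from a ListIdentifiers response; 0 on noRecordsMatch."""
    root = ET.fromstring(xml_bytes)
    err = root.find("oai:error", NS)
    if err is not None and err.get("code") == "noRecordsMatch":
        return 0
    token = root.find("oai:ListIdentifiers/oai:resumptionToken", NS)
    if token is not None and token.get("completeListSize"):
        return int(token.get("completeListSize"))
    # A small set may fit in one page with no token: count the headers.
    return len(root.findall("oai:ListIdentifiers/oai:header", NS))


# Usage (network): loop over the 22 setSpecs from ListSets and print set_size(...)
```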

The sets that actually returned records (as of 2026-04-30):

setSpec | Count | Contents
libkano | 46,217 | Kano Bunko
arckoji | 34,105 | Personal and affiliated organisation documents
arcreki | 29,101 | Historical official records
arckank | 22,080 | University in-house publications
libwasa | 16,015 | Japanese mathematics (wasan) materials
arcstdm | 7,936 | “The Long 1960s” Digital Archive
arcshas | 7,875 | University-related photographs
libshin | 5,189 | Online Earthquake Library
libhonk | 3,966 | Main Library classical materials
libsose | 3,144 | Soseki Bunko
libakit | 1,480 | Akita family historical documents
tokiwa | 593 | Daijo Tokiwa former collection
libhonkandoc | 442 | Main Library old documents
mlmaterials | 280 | Books, old documents, etc.
maibun | 181 | Sendai Castle ruins

Sets that are declared but returned nothing:

  • takayanagi (declared only, 0 records)
  • housawa (declared only, 0 records)
  • engold (declared only, 0 records)
  • canon / derge_scripture / derge_iconography (Derge Tibetan Tripitaka — noRecordsMatch)

The Tibetan Tripitaka database that motivated this whole exercise is, at the moment, not retrievable through OAI-PMH. The IIIF manifest at /collection/iiif/scripture/{ID}/manifest.json works once you have an ID, but I could not find a public API for enumerating IDs. As noted above, database/tibet/* URLs are also missing from the sitemap, so within an “API-only, no scraping” scope I have not found a route to enumerate IDs.
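Once you do have an ID from somewhere, the manifest is plain JSON over urllib. A sketch that pulls out a label string and the canvas count, assuming the Presentation v3 shape where label is a language map of string lists and items is the canvas array (manifest_summary is my name):

```python
import json
import urllib.request


def manifest_summary(manifest: dict) -> tuple:
    """(first label string, number of canvases) from a IIIF Presentation v3 manifest."""
    label = manifest.get("label", {})
    # v3 labels are language maps like {"ja": ["..."], "en": ["..."]}.
    first = next(iter(label.values()), [""])[0] if isinstance(label, dict) else str(label)
    return first, len(manifest.get("items", []))


# Usage (network; {ID} must come from elsewhere -- there is no enumeration API):
# url = "https://touda.tohoku.ac.jp/collection/iiif/scripture/{ID}/manifest.json"
# manifest_summary(json.load(urllib.request.urlopen(url)))
```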

If you want to stay in the browser

For the same touda OAI-PMH endpoint (and a few others), I have a separate web tool that lets you pick a setSpec in the browser and download the result as CSV.

It goes through a bundled CORS proxy, but if you don’t have Python ready or just want to peek at a single set, this is the lightest path. The Python script below is more useful when you want fine control over Excel-side handling (xml:lang suffixes, joining multi-values with newlines, etc.).

Implementation — one xlsx per setSpec

I framed the requirements as follows.

  • Use the API only (no scraping)
  • Harvest with dcndl_simple
  • Follow resumptionToken paging to the end
  • Join multiple occurrences of the same element with newlines
  • Reflect xml:lang attributes as a column-name suffix (e.g. dc:title@ja)
  • Allow narrowing by setSpec via CLI arguments

Dependencies

pip install openpyxl

The script uses the standard library urllib and xml.etree.ElementTree to talk to OAI, and openpyxl in write-only mode to flush rows into Excel. No pandas.

Script

"""
Generate one Excel file per setSpec from the OAI-PMH endpoint of the
Tohoku University Digital Archives (touda.tohoku.ac.jp/collection).

  pip install openpyxl
  python oai_to_xlsx.py                 # all known sets
  python oai_to_xlsx.py libkano libsose # specified sets only
"""

from __future__ import annotations

import re
import sys
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

from openpyxl import Workbook

BASE = "https://touda.tohoku.ac.jp/collection/oai"
PREFIX = "dcndl_simple"
OUT_DIR = Path("xlsx")
SLEEP_SEC = 0.3
TIMEOUT = 60

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

HARVESTABLE_SETS = [
    "maibun", "tokiwa", "libkano", "libwasa", "libsose", "libakit",
    "libhonk", "libhonkandoc", "libshin", "arcreki", "arckoji",
    "arckank", "arcshas", "arcstdm", "mlmaterials",
]

NS_SHORT = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://purl.org/dc/terms/": "dcterms",
    "http://ndl.go.jp/dcndl/terms/": "dcndl",
    "http://ndl.go.jp/dcndl/dcndl_simple/": "dcndl_simple",
    "http://xmlns.com/foaf/0.1/": "foaf",
    "http://www.w3.org/2002/07/owl#": "owl",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs",
    "https://irdb.nii.ac.jp/schema/jpcoar/1.0/": "jpcoar",
    "http://datacite.org/schema/kernel-4": "datacite",
}


def short_name(qname: str) -> str:
    m = re.match(r"\{([^}]+)\}(.+)", qname)
    if not m:
        return qname
    ns, local = m.group(1), m.group(2)
    return f"{NS_SHORT.get(ns, ns)}:{local}"


def fetch(url: str) -> bytes:
    req = urllib.request.Request(url, headers={"User-Agent": "oai-harvest/1.0"})
    with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
        return r.read()


def harvest_set(set_spec: str) -> list[dict]:
    rows: list[dict] = []
    params = {"verb": "ListRecords", "metadataPrefix": PREFIX, "set": set_spec}
    url = f"{BASE}?{urllib.parse.urlencode(params)}"
    page = 0
    while url:
        page += 1
        body = fetch(url)
        root = ET.fromstring(body)

        err = root.find("oai:error", NS)
        if err is not None:
            print(f"  [{set_spec}] error: {err.get('code')} {err.text}", file=sys.stderr)
            return rows

        for rec in root.findall("oai:ListRecords/oai:record", NS):
            rows.append(record_to_dict(rec))

        token_el = root.find("oai:ListRecords/oai:resumptionToken", NS)
        total = token_el.get("completeListSize") if token_el is not None else None
        print(f"  [{set_spec}] page {page} acquired={len(rows)} total={total}")

        if token_el is not None and (token_el.text or "").strip():
            params = {"verb": "ListRecords", "resumptionToken": token_el.text.strip()}
            url = f"{BASE}?{urllib.parse.urlencode(params)}"
            time.sleep(SLEEP_SEC)
        else:
            url = None
    return rows


def record_to_dict(rec: ET.Element) -> dict:
    out: dict[str, list[str]] = {}

    header = rec.find("oai:header", NS)
    if header is not None:
        out["_status"] = [header.get("status") or ""]
        ident = header.find("oai:identifier", NS)
        out["_identifier"] = [ident.text or ""] if ident is not None else [""]
        ds = header.find("oai:datestamp", NS)
        out["_datestamp"] = [ds.text or ""] if ds is not None else [""]
        sets = [s.text or "" for s in header.findall("oai:setSpec", NS)]
        out["_setSpec"] = sets

    metadata = rec.find("oai:metadata", NS)
    if metadata is not None:
        for wrapper in list(metadata):
            for child in list(wrapper):
                key = short_name(child.tag)
                lang = child.get(XML_LANG)
                if lang:
                    key = f"{key}@{lang}"
                value = (child.text or "").strip()
                if not value and len(list(child)) > 0:
                    value = " ".join(
                        (g.text or "").strip() for g in child.iter() if (g.text or "").strip()
                    )
                if value:
                    out.setdefault(key, []).append(value)
    return {k: "\n".join(v) for k, v in out.items()}


def write_xlsx(set_spec: str, rows: list[dict]) -> Path:
    OUT_DIR.mkdir(exist_ok=True)
    path = OUT_DIR / f"{set_spec}.xlsx"

    columns: list[str] = []
    seen: set[str] = set()
    for r in rows:
        for k in r:
            if k not in seen:
                seen.add(k)
                columns.append(k)
    header_cols = [c for c in columns if c.startswith("_")]
    body_cols = [c for c in columns if not c.startswith("_")]
    columns = header_cols + body_cols

    wb = Workbook(write_only=True)
    ws = wb.create_sheet(set_spec[:31] or "data")
    ws.append(columns)
    for r in rows:
        ws.append([r.get(c, "") for c in columns])
    wb.save(path)
    return path


def main(argv: list[str]) -> int:
    targets = argv[1:] if len(argv) > 1 else HARVESTABLE_SETS
    for s in targets:
        if s not in HARVESTABLE_SETS:
            print(f"warn: '{s}' is not in the known harvestable list; trying anyway")
        print(f"==> {s}")
        rows = harvest_set(s)
        if not rows:
            print(f"  [{s}] no records, skipping")
            continue
        path = write_xlsx(s, rows)
        print(f"  [{s}] wrote {len(rows)} rows -> {path}")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))

record_to_dict puts everything under header (identifier / datestamp / setSpec / status="deleted" etc.) into _-prefixed columns, and each element under metadata into a prefix:localname[@lang] column. Multiple values are joined with \n.

How columns are formed

For example, one libshin record produces a column layout like the following.

_identifier  _datestamp  _setSpec  _status
dc:identifier
dcndl:materialType
dcterms:language
dc:title@ja
dcndl:alternative@ja
dcterms:publisher@ja
dcndl:publicationPlace@ja
dcterms:date
dcterms:extent          ← multiple values per cell, e.g. "B5 size\n168p"
dcterms:spatial@ja
dcterms:spatial@en
dcterms:description@ja  ← notes joined with newlines if multiple
dcterms:rights

The header for the whole set is the union of every column that appeared in at least one record. That way the source dcndl_simple information lands in Excel without losing anything.

Running it

# All 15 sets
python oai_to_xlsx.py

# Specific sets only
python oai_to_xlsx.py libsose libhonkandoc

This produces xlsx/{setSpec}.xlsx.

Rough timing

The total is 178,664 records. With a 0.3 s sleep between requests and the default 100 records/page:

  • The largest set, libkano, is 46,217 records at 100 per page ≈ 463 pages, roughly 2.5 minutes
  • Going through all 15 sets in sequence takes roughly 10–15 minutes

The resumptionToken is good for around 12 hours, so even if a run is interrupted you can resume it later (just save the token).
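Resuming relies on the fact that a continuation request carries only the verb and the token, with metadataPrefix and set omitted. A minimal sketch of saving the token to disk and rebuilding the continuation URL (the file name and helper names are mine, not part of the script above):

```python
from __future__ import annotations

import urllib.parse
from pathlib import Path

BASE = "https://touda.tohoku.ac.jp/collection/oai"
TOKEN_FILE = Path("resume.token")  # hypothetical: one token per interrupted set


def resume_url(token: str) -> str:
    """ListRecords continuation URL: metadataPrefix/set must NOT be repeated."""
    qs = urllib.parse.urlencode({"verb": "ListRecords", "resumptionToken": token})
    return f"{BASE}?{qs}"


def save_token(token: str) -> None:
    TOKEN_FILE.write_text(token)


def load_token() -> str | None:
    return TOKEN_FILE.read_text().strip() if TOKEN_FILE.exists() else None
```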

Things that tripped me up

  • set=canon is empty — you cannot tell from ListSets alone whether a set is actually retrievable, so the safest thing is to hit verb=ListIdentifiers&set=... and look at completeListSize directly.
  • identifier is a bare number — record IDs come back without an oai: prefix (e.g. 10060010000001). This deviates from the OAI identifier convention, but GetRecord&identifier=... accepts the bare value.
  • Handling xml:lang — sibling elements like dcterms:spatial can appear as both ja and en. Without an @ja / @en suffix in the column name, the values collide.
  • Deleted records — there is some test data near the start, e.g. <header status="deleted"><identifier>test</identifier>. Filtering by the _status column drops it.
  • The Tibetan Tripitaka case — not covered by this approach. Since the sitemap also has no database/tibet/* URLs, I have not yet found a way to enumerate IDs without scraping.
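The deleted-record filtering mentioned above is a one-liner against the rows the script builds, since the header status lands in the _status column. A sketch (drop_deleted is my name):

```python
def drop_deleted(rows: list) -> list:
    """Keep only rows whose OAI header did not carry status="deleted"."""
    return [r for r in rows if r.get("_status", "") != "deleted"]


# Usage: rows = drop_deleted(harvest_set("libshin")) before write_xlsx(...)
```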

Wrap-up

  • The public APIs in active use are mostly OAI-PMH and the IIIF Presentation/Image API
  • For listing, OAI-PMH is the practical choice, and dcndl_simple is the richest of the three metadataPrefixes
  • ListSets declarations and what’s actually retrievable do not always match — the canon family (Tibetan Tripitaka) is currently outside what OAI-PMH returns
  • The script above produces one xlsx per setSpec
  • If you’d rather stay in the browser, the Digital Archive Downloader lets you download a CSV per setSpec