This article is co-authored with generative AI. While I have cross-checked facts against official documentation where possible, errors may remain. Please verify primary sources before making important decisions.
While poking around the Derge Tibetan Tripitaka database hosted on the Tohoku University Digital Archives (touda.tohoku.ac.jp/collection), I wondered whether any path returned JSON, and ended up checking the available public APIs one by one. OAI-PMH turned out to be the one workable route, so this post records the procedure for harvesting records per setSpec into Excel files. The whole approach avoids HTML scraping.
Endpoints I checked
The status I observed (as of 2026-04-30) is summarised below.
| Type | Endpoint | Status |
|---|---|---|
| OAI-PMH | https://touda.tohoku.ac.jp/collection/oai | Available (3 metadata formats) |
| IIIF Presentation v3 manifest | https://touda.tohoku.ac.jp/collection/iiif/scripture/{ID}/manifest.json | Available (per-record) |
| IIIF Image API v2 | service URL inside the manifest | Available |
| Sitemap | https://touda.tohoku.ac.jp/collection/sitemap.xml | Available (partial coverage, see below) |
| Drupal JSON:API (/jsonapi) | — | Not exposed in what I could see |
| Drupal REST (?_format=json) | — | Returns 500 |
| OpenSearch / RSS / Atom / IIIF Search / direct Solr | — | All 404 |
The site is built on Drupal: a request with ?_format=hal_json came back as a 406 whose Content-Type was application/hal+json. The ?_format=json / ?_format=hal_json query is provided by the Drupal core RESTful Web Services (rest) module, and /jsonapi by the core JSON:API module; both ship with standard Drupal. However, since Drupal 8.4 they are off by default, and the administrator must enable them and configure REST resources and permissions per entity. The behaviour observed here (406 / 500 / no /jsonapi) reads as those modules not being enabled for end users, so for bulk retrieval OAI-PMH ends up being the practical route.
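For reference, the probing itself was nothing fancier than sending GET requests and reading the status code and Content-Type. Below is a minimal sketch of that kind of check, not the exact commands I ran; the candidate list is abbreviated, and the exact JSON:API path under /collection is my assumption.

# Sketch: probe a few candidate endpoints and print status code and Content-Type.
import urllib.error
import urllib.request

CANDIDATES = [
    "https://touda.tohoku.ac.jp/collection/oai?verb=Identify",
    "https://touda.tohoku.ac.jp/collection/jsonapi",
    "https://touda.tohoku.ac.jp/collection/sitemap.xml",
]

for url in CANDIDATES:
    req = urllib.request.Request(url, headers={"User-Agent": "endpoint-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=30) as r:
            print(r.status, r.headers.get("Content-Type"), url)
    except urllib.error.HTTPError as e:
        # 404 / 406 / 500 responses land here rather than raising through
        print(e.code, e.headers.get("Content-Type"), url)
    except urllib.error.URLError as e:
        print("ERR", e.reason, url)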
Sitemap coverage
sitemap.xml is a sitemapindex split into 64 pages, listing 126,771 individual URLs in total. The sitemap index <lastmod> is fixed at 2024-05-30T02:06:54+09:00 for every one of the 64 pages, suggesting regeneration stopped on 2024-05-30 (the per-URL <lastmod> values inside each page mix dates from 2023–2024).
The path breakdown is dominated by database/library and database/archives:
| Path | Count |
|---|---|
| database/library | 66,369 |
| database/archives | 60,336 |
| database/medlib | 51 |
| en/database/* (English) | 12 |
| database/tibet/* | 0 |
Against the 178,664 records exposed via OAI-PMH, the sitemap holds only 126,771 URLs, and none of them are under database/tibet/{ID} — the very area I was originally interested in. So using the sitemap as a way to enumerate Tibetan Tripitaka IDs is not viable in the current state.
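For the record, the path counts above come from walking the sitemap index and bucketing each URL by its leading path segments. A sketch of that tally (it fetches all 64 pages, so give it a minute or two; whether the URLs carry a /collection prefix is handled defensively because I did not note it down):

# Sketch: walk the sitemap index and count URLs by their first two path segments.
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

SM_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
INDEX = "https://touda.tohoku.ac.jp/collection/sitemap.xml"

def fetch_xml(url: str) -> ET.Element:
    with urllib.request.urlopen(url, timeout=60) as r:
        return ET.fromstring(r.read())

counts: Counter[str] = Counter()
index = fetch_xml(INDEX)
for sm in index.findall("sm:sitemap/sm:loc", SM_NS):
    page = fetch_xml(sm.text)
    for loc in page.findall("sm:url/sm:loc", SM_NS):
        parts = urlparse(loc.text).path.strip("/").split("/")
        if parts and parts[0] == "collection":
            parts = parts[1:]  # drop the site prefix if present
        counts["/".join(parts[:2])] += 1  # e.g. "database/library"

for prefix, n in counts.most_common():
    print(f"{prefix}\t{n}")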
OAI-PMH overview
What verb=Identify returns:
baseURL : https://touda.tohoku.ac.jp/collection/
repositoryName : Tohoku University Digital Archives Collection Database
earliestDatestamp : 2013-01-01
deletedRecord : persistent
granularity : YYYY-MM-DDThh:mm:ssZ
The standard verbs Identify / ListMetadataFormats / ListSets / ListRecords / GetRecord are all accepted.
Comparing metadataPrefix values
Three formats are available. I compared them on the same record (identifier=10060010000001, a libshin book).
| prefix | Size | Element count | Notes |
|---|---|---|---|
| oai_dc | 1.4 KB | 6 | DC15 only. Minimal. |
| jpcoar | 1.9 KB | 9 | Structured jpcoar:publisher, datacite:geoLocation |
| dcndl_simple | 2.3 KB | 13 | Carries xml:lang, multi-valued dcterms:description / extent / spatial |
dcndl_simple follows the NDL (National Diet Library) vocabulary family and preserves notes (dcterms:description), physical form (multi-valued dcterms:extent such as “B5 size” / “168p”), and bilingual place names (dcterms:spatial with xml:lang="ja|en") without dropping information. For working with Japanese-language sources in Excel, dcndl_simple offered the best information density and looked the easiest to handle.
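The size and element-count figures were taken by pulling the same record with each prefix via GetRecord. A sketch of that comparison follows; counting leaf elements under the metadata wrapper is a crude proxy and may not reproduce the table's numbers exactly, since each format nests its metadata differently.

# Sketch: fetch one record in all three formats and compare size and leaf-element count.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://touda.tohoku.ac.jp/collection/oai"
OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}
IDENT = "10060010000001"  # the libshin record used in the table above

for prefix in ("oai_dc", "jpcoar", "dcndl_simple"):
    url = BASE + "?" + urllib.parse.urlencode(
        {"verb": "GetRecord", "identifier": IDENT, "metadataPrefix": prefix}
    )
    with urllib.request.urlopen(url, timeout=60) as r:
        body = r.read()
    metadata = ET.fromstring(body).find("oai:GetRecord/oai:record/oai:metadata", OAI)
    leaves = [el for el in metadata.iter() if len(el) == 0] if metadata is not None else []
    print(f"{prefix:13s} {len(body):6d} bytes  {len(leaves):3d} leaf elements")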
setSpec behaviour — declared vs. actually retrievable
verb=ListSets returns 22 sets, but when you actually attach set=..., 7 of them come back empty as noRecordsMatch.
The sets that actually returned records (as of 2026-04-30):
| setSpec | Count | Contents |
|---|---|---|
| libkano | 46,217 | Kano Bunko |
| arckoji | 34,105 | Personal and affiliated organisation documents |
| arcreki | 29,101 | Historical official records |
| arckank | 22,080 | University in-house publications |
| libwasa | 16,015 | Japanese mathematics (wasan) materials |
| arcstdm | 7,936 | “The Long 1960s” Digital Archive |
| arcshas | 7,875 | University-related photographs |
| libshin | 5,189 | Online Earthquake Library |
| libhonk | 3,966 | Main Library classical materials |
| libsose | 3,144 | Soseki Bunko |
| libakit | 1,480 | Akita family historical documents |
| tokiwa | 593 | Daijo Tokiwa former collection |
| libhonkandoc | 442 | Main Library old documents |
| mlmaterials | 280 | Books, old documents, etc. |
| maibun | 181 | Sendai Castle ruins |
Sets that are declared but returned nothing:
- takayanagi (declared only, 0 records)
- housawa (declared only, 0 records)
- engold (declared only, 0 records)
- canon / derge_scripture / derge_iconography (Derge Tibetan Tripitaka — noRecordsMatch)
The Tibetan Tripitaka database that motivated this whole exercise is, at the moment, not retrievable through OAI-PMH. The IIIF manifest at /collection/iiif/scripture/{ID}/manifest.json works once you have an ID, but I could not find a public API for enumerating IDs. As noted above, database/tibet/* URLs are also missing from the sitemap, so within an “API-only, no scraping” scope I have not found a route to enumerate IDs.
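Once an ID is known by some other means, the manifest itself is a plain GET. A sketch with a placeholder ID (there is no API route I know of to fill it in, which is exactly the problem):

# Sketch: fetch a IIIF Presentation v3 manifest for a known scripture ID.
# SCRIPTURE_ID is a placeholder; substitute an ID obtained by other means.
import json
import urllib.request

SCRIPTURE_ID = "REPLACE_WITH_A_REAL_ID"
url = f"https://touda.tohoku.ac.jp/collection/iiif/scripture/{SCRIPTURE_ID}/manifest.json"

with urllib.request.urlopen(url, timeout=60) as r:
    manifest = json.load(r)

# In Presentation v3 the label is a language map and the canvases sit under "items".
print(manifest.get("label"))
print(len(manifest.get("items", [])), "canvases")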
If you want to stay in the browser
For the same touda OAI-PMH endpoint (and a few others), I have a separate web tool that lets you pick a setSpec in the browser and download the result as CSV.
It goes through a bundled CORS proxy, but if you don’t have Python ready or just want to peek at a single set, this is the lightest path. The Python script below is more useful when you want fine control over Excel-side handling (xml:lang suffixes, joining multi-values with newlines, etc.).
Implementation — one xlsx per setSpec
I framed the requirements as follows.
- Use the API only (no scraping)
- Harvest with dcndl_simple
- Follow resumptionToken paging to the end
- Join multiple occurrences of the same element with newlines
- Reflect xml:lang attributes as a column-name suffix (e.g. dc:title@ja)
- Allow narrowing by setSpec via CLI arguments
Dependencies
pip install openpyxl
The script uses the standard library urllib and xml.etree.ElementTree to talk to OAI, and openpyxl in write-only mode to flush rows into Excel. No pandas.
Script
"""
Generate one Excel file per setSpec from the OAI-PMH endpoint of the
Tohoku University Digital Archives (touda.tohoku.ac.jp/collection).
pip install openpyxl
python oai_to_xlsx.py # all known sets
python oai_to_xlsx.py libkano libsose # specified sets only
"""
from __future__ import annotations
import re
import sys
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from openpyxl import Workbook
BASE = "https://touda.tohoku.ac.jp/collection/oai"
PREFIX = "dcndl_simple"
OUT_DIR = Path("xlsx")
SLEEP_SEC = 0.3
TIMEOUT = 60
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
HARVESTABLE_SETS = [
"maibun", "tokiwa", "libkano", "libwasa", "libsose", "libakit",
"libhonk", "libhonkandoc", "libshin", "arcreki", "arckoji",
"arckank", "arcshas", "arcstdm", "mlmaterials",
]
NS_SHORT = {
"http://purl.org/dc/elements/1.1/": "dc",
"http://purl.org/dc/terms/": "dcterms",
"http://ndl.go.jp/dcndl/terms/": "dcndl",
"http://ndl.go.jp/dcndl/dcndl_simple/": "dcndl_simple",
"http://xmlns.com/foaf/0.1/": "foaf",
"http://www.w3.org/2002/07/owl#": "owl",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
"http://www.w3.org/2000/01/rdf-schema#": "rdfs",
"https://irdb.nii.ac.jp/schema/jpcoar/1.0/": "jpcoar",
"http://datacite.org/schema/kernel-4": "datacite",
}
def short_name(qname: str) -> str:
m = re.match(r"\{([^}]+)\}(.+)", qname)
if not m:
return qname
ns, local = m.group(1), m.group(2)
return f"{NS_SHORT.get(ns, ns)}:{local}"
def fetch(url: str) -> bytes:
req = urllib.request.Request(url, headers={"User-Agent": "oai-harvest/1.0"})
with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
return r.read()
def harvest_set(set_spec: str) -> list[dict]:
rows: list[dict] = []
params = {"verb": "ListRecords", "metadataPrefix": PREFIX, "set": set_spec}
url = f"{BASE}?{urllib.parse.urlencode(params)}"
page = 0
while url:
page += 1
body = fetch(url)
root = ET.fromstring(body)
err = root.find("oai:error", NS)
if err is not None:
print(f" [{set_spec}] error: {err.get('code')} {err.text}", file=sys.stderr)
return rows
for rec in root.findall("oai:ListRecords/oai:record", NS):
rows.append(record_to_dict(rec))
token_el = root.find("oai:ListRecords/oai:resumptionToken", NS)
total = token_el.get("completeListSize") if token_el is not None else None
print(f" [{set_spec}] page {page} acquired={len(rows)} total={total}")
if token_el is not None and (token_el.text or "").strip():
params = {"verb": "ListRecords", "resumptionToken": token_el.text.strip()}
url = f"{BASE}?{urllib.parse.urlencode(params)}"
time.sleep(SLEEP_SEC)
else:
url = None
return rows
def record_to_dict(rec: ET.Element) -> dict:
out: dict[str, list[str]] = {}
header = rec.find("oai:header", NS)
if header is not None:
out["_status"] = [header.get("status") or ""]
ident = header.find("oai:identifier", NS)
out["_identifier"] = [ident.text or ""] if ident is not None else [""]
ds = header.find("oai:datestamp", NS)
out["_datestamp"] = [ds.text or ""] if ds is not None else [""]
sets = [s.text or "" for s in header.findall("oai:setSpec", NS)]
out["_setSpec"] = sets
metadata = rec.find("oai:metadata", NS)
if metadata is not None:
for wrapper in list(metadata):
for child in list(wrapper):
key = short_name(child.tag)
lang = child.get(XML_LANG)
if lang:
key = f"{key}@{lang}"
value = (child.text or "").strip()
if not value and len(list(child)) > 0:
value = " ".join(
(g.text or "").strip() for g in child.iter() if (g.text or "").strip()
)
if value:
out.setdefault(key, []).append(value)
return {k: "\n".join(v) for k, v in out.items()}
def write_xlsx(set_spec: str, rows: list[dict]) -> Path:
OUT_DIR.mkdir(exist_ok=True)
path = OUT_DIR / f"{set_spec}.xlsx"
columns: list[str] = []
seen: set[str] = set()
for r in rows:
for k in r:
if k not in seen:
seen.add(k)
columns.append(k)
header_cols = [c for c in columns if c.startswith("_")]
body_cols = [c for c in columns if not c.startswith("_")]
columns = header_cols + body_cols
wb = Workbook(write_only=True)
ws = wb.create_sheet(set_spec[:31] or "data")
ws.append(columns)
for r in rows:
ws.append([r.get(c, "") for c in columns])
wb.save(path)
return path
def main(argv: list[str]) -> int:
targets = argv[1:] if len(argv) > 1 else HARVESTABLE_SETS
for s in targets:
if s not in HARVESTABLE_SETS:
print(f"warn: '{s}' is not in the known harvestable list; trying anyway")
print(f"==> {s}")
rows = harvest_set(s)
if not rows:
print(f" [{s}] no records, skipping")
continue
path = write_xlsx(s, rows)
print(f" [{s}] wrote {len(rows)} rows -> {path}")
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv))
record_to_dict puts everything under header (identifier / datestamp / setSpec / status="deleted" etc.) into _-prefixed columns, and each element under metadata into a prefix:localname[@lang] column. Multiple values are joined with \n.
How columns are formed
For example, one libshin record produces a column layout like the following.
_identifier _datestamp _setSpec _status
dc:identifier
dcndl:materialType
dcterms:language
dc:title@ja
dcndl:alternative@ja
dcterms:publisher@ja
dcndl:publicationPlace@ja
dcterms:date
dcterms:extent ← multiple values per cell, e.g. "B5 size\n168p"
dcterms:spatial@ja
dcterms:spatial@en
dcterms:description@ja ← notes joined with newlines if multiple
dcterms:rights
The header for the whole set is the union of every column that appeared in at least one record. That way the source dcndl_simple information lands in Excel without losing anything.
Running it
# All 15 sets
python oai_to_xlsx.py
# Specific sets only
python oai_to_xlsx.py libsose libhonkandoc
This produces xlsx/{setSpec}.xlsx.
Rough timing
The total is 178,664 records. With a 0.3 s sleep between requests and the default 100 records/page:
- The largest set, libkano, is 46,217 / 100 = 463 pages, roughly 2.5 minutes
- Going through all 15 sets in sequence takes roughly 10–15 minutes
The resumptionToken is good for around 12 hours, so even if a run is interrupted you can resume it later (just save the token).
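harvest_set() above does not persist tokens, so resuming is a small add-on rather than a built-in feature: save the last resumptionToken to a file, then restart the ListRecords loop from it. A sketch, assuming an interrupted run left its token in token.txt:

# Sketch: resume a ListRecords harvest from a saved resumptionToken.
import urllib.parse
import urllib.request
from pathlib import Path

BASE = "https://touda.tohoku.ac.jp/collection/oai"
token = Path("token.txt").read_text(encoding="utf-8").strip()
url = BASE + "?" + urllib.parse.urlencode(
    {"verb": "ListRecords", "resumptionToken": token}
)
with urllib.request.urlopen(url, timeout=60) as r:
    body = r.read()
# From here, parse the records and follow the next token exactly as harvest_set() does.
print(len(body), "bytes in the resumed page")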
Things that tripped me up
- set=canon is empty — you cannot tell from ListSets alone whether a set is actually retrievable, so the safest thing is to hit verb=ListIdentifiers&set=... and look at completeListSize directly (see the sketch after this list).
- identifier is a bare number — record IDs come back without an oai: prefix (e.g. 10060010000001). This deviates from the OAI identifier convention, but GetRecord&identifier=... accepts the bare value.
- Handling xml:lang — sibling elements like dcterms:spatial can appear as both ja and en. Without an @ja / @en suffix in the column name, the values collide.
- Deleted records — there is some test data near the start, e.g. <header status="deleted"><identifier>test</identifier>. Filtering by the _status column drops it.
- The Tibetan Tripitaka case — not covered by this approach. Since the sitemap also has no database/tibet/* URLs, I have not yet found a way to enumerate IDs without scraping.
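A sketch of that completeListSize check, run against every setSpec that ListSets declares (it assumes the ListSets response fits on one page, which holds for 22 sets):

# Sketch: ask ListIdentifiers for every declared set and report completeListSize
# (or the OAI error code), so declared-but-empty sets show up immediately.
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://touda.tohoku.ac.jp/collection/oai"
OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def fetch_root(url: str) -> ET.Element:
    with urllib.request.urlopen(url, timeout=60) as r:
        return ET.fromstring(r.read())

sets_root = fetch_root(BASE + "?verb=ListSets")
for spec_el in sets_root.findall("oai:ListSets/oai:set/oai:setSpec", OAI):
    spec = spec_el.text
    url = BASE + "?" + urllib.parse.urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": "dcndl_simple", "set": spec}
    )
    root = fetch_root(url)
    err = root.find("oai:error", OAI)
    if err is not None:
        print(spec, err.get("code"))  # e.g. noRecordsMatch
        continue
    token = root.find("oai:ListIdentifiers/oai:resumptionToken", OAI)
    if token is not None and token.get("completeListSize"):
        print(spec, token.get("completeListSize"))
    else:
        # small sets may come back in one page with no resumptionToken
        print(spec, len(root.findall("oai:ListIdentifiers/oai:header", OAI)))
    time.sleep(0.3)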
Wrap-up
- The public APIs in active use are mostly OAI-PMH and the IIIF Presentation/Image API
- For listing, OAI-PMH is the practical choice, and dcndl_simple is the richest of the three metadataPrefixes
- ListSets declarations and what’s actually retrievable do not always match — the canon family (Tibetan Tripitaka) is currently outside what OAI-PMH returns
- The script above produces one xlsx per setSpec
- If you’d rather stay in the browser, the Digital Archive Downloader lets you download a CSV per setSpec