This article is co-authored with generative AI. While I have cross-checked facts against official documentation where possible, errors may remain. Please verify primary sources before making important decisions.

Background and disclaimer

The ROIS-DS Center for Open Data in the Humanities (hereafter "CODH") website at codh.rois.ac.jp is currently suspended for long-term maintenance (official notice from ROIS-DS dated 2026-02-24, with no announced reopening date). As a result, the browser-based tools served from that host stopped working on every site that embedded them directly:

  • vdiff.js / vdiff-seq.js
  • IIIF Curation Viewer / Manager / Editor / Player / Board
  • Soan

A separate post — Restoring CODH's vdiff.js from the Wayback Machine to Temporarily Bring Back The Tale of Genji's Face Comparison Feature — covers the basic technique: using Wayback Machine's id_ flag to fetch raw bytes without wombat injection, and analyzing each tool's minimal install. This post is the follow-up: what I ran into when packaging the extracted set as a standalone GitHub Pages repository called codh-mirror.
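
For reference, the core of that technique is just an id_-suffixed snapshot URL, which returns the raw archived bytes without the wombat.js rewriting; the timestamp below is only a placeholder, not an actual capture I used.

# Appending id_ to the 14-digit timestamp gives the original bytes as archived
curl -sL --compressed \
  "https://web.archive.org/web/20250101000000id_/https://codh.rois.ac.jp/soan/index.html" \
  -o soan/index.html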

To repeat: this is a stop-gap until CODH service resumes. CODH's software is MIT-licensed (re-distribution permitted), but freezing it on a Wayback snapshot means later upstream fixes won't be picked up. The plan is to retire the mirror once CODH is back.

Repository layout

I went with the simplest possible structure: one directory per tool.

codh-mirror/
├── README.md
├── .gitignore
├── vdiff/
├── vdiff-seq/
├── iiif-curation-viewer/
├── iiif-curation-manager/
├── iiif-curation-editor/
├── iiif-curation-player/
├── iiif-curation-board/
└── soan/
    ├── index.html
    ├── index.js
    ├── soan/{soan.bundle.css, soan.bundle.min.js}
    ├── dataset/001.json
    ├── dataset/001/<36869 PNG>
    └── kuromoji/dict/<12 .dat.gz>

The working tree is about 370 MB total (with Soan's 36,869 movable-type PNGs accounting for 186 MB; the served tree without .git is about 210 MB). Comfortably under the GitHub Pages 1 GB limit, and within GitHub's recommended repo-size guideline.
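
A quick way to check both figures before pushing (GNU du; the --exclude flag isn't available in BSD/macOS du):

du -sh codh-mirror                   # full clone, including .git
du -sh --exclude=.git codh-mirror    # roughly what GitHub Pages will serve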

GitHub Pages can be enabled in one CLI call:

gh api repos/<owner>/codh-mirror/pages -X POST \
  -F "source[branch]=main" -F "source[path]=/"

Build takes 1–2 minutes. Once gh api repos/<owner>/codh-mirror/pages reports "status":"built", you're live.
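
If you'd rather not refresh that by hand, a minimal polling loop (assuming gh api's built-in --jq flag) looks like this:

# Poll the Pages API until the build status flips to "built"
until [ "$(gh api repos/<owner>/codh-mirror/pages --jq '.status')" = "built" ]; do
  sleep 15
done
echo "GitHub Pages deployment is live"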

Pitfall 1: Soan's 36,869 movable-type PNGs reference absolute CODH URLs

Soan's dataset/001.json carries a URL for each character's movable-type PNG. The CODH demo version's JSON had absolute URLs pointing at codh.rois.ac.jp:

{
  "attribution": "『徒然草』 holding of the National Diet Library",
  "data": [
    {"url": "https://codh.rois.ac.jp/soan/dataset/001/001_001_2_01_01_000001.png",
     "char": "つれ〱", "jibo": "徒連〱"},
    ...
  ]
}

There are 36,869 entries. Pulling that many PNGs out of the Wayback Machine while CODH is down is impractical (the rate limit is harsh, and the Wayback doesn't archive binary images comprehensively anyway). I needed another route.
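
That coverage gap is easy to confirm with the Wayback Machine's CDX API; a prefix query over the image directory (a sketch) comes back essentially empty:

# List captures under the Soan image prefix; an (almost) empty response
# means the PNGs were never archived
curl -s "https://web.archive.org/cdx/search/cdx?url=codh.rois.ac.jp/soan/dataset/001/&matchType=prefix&limit=20"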

Fortunately, Soan's core contributor Jun HOMMA (@2SC1815J) runs a personal "Soan Pro" instance at https://dev.2sc1815j.net/soan/, and the same movable-type PNGs are there with the same filenames (using relative paths dataset/001/...png). I downloaded the image binaries only — 36,869 of them — from there:

# Build the download URL list from the JSON's relative paths,
# prefixing each with the Soan Pro host
grep -oE '"url":"dataset/001/[^"]+\.png"' dataset/001.json \
  | sed 's|"url":"|https://dev.2sc1815j.net/soan/|;s|"$||' \
  > /tmp/soan_image_urls.txt

# Parallel download with xargs -P 5 (skips files that already exist)
mkdir -p dataset/001
cat /tmp/soan_image_urls.txt | xargs -P 5 -I {} bash -c '
  out="dataset/001/$(basename "$1")"
  [[ -s "$out" ]] && exit 0
  curl -sL --compressed --max-time 30 "$1" -o "$out"
' _ {}

Five parallel connections took about 30 minutes for all 36,869 files (~186 MB). Going through a live third-party mirror instead of the Wayback Machine is a judgment call you have to make based on the volume involved.

To prevent the JSON's image URLs from still reaching out to CODH, strip the host prefix https://codh.rois.ac.jp/soan/ to make them relative:

import json, re

# Load the demo JSON, strip the CODH host from every image URL,
# and write the compact result into the mirror's soan/ directory
with open('demo_001.json', encoding='utf-8') as f:
    d = json.load(f)
for item in d['data']:
    item['url'] = re.sub(r'^https://codh\.rois\.ac\.jp/soan/', '', item['url'])
with open('codh-mirror/soan/dataset/001.json', 'w', encoding='utf-8') as f:
    json.dump(d, f, ensure_ascii=False, separators=(',', ':'))

Now the JSON references dataset/001/xxx.png relative paths, served from the same site (GitHub Pages).
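
A quick sanity check, run inside codh-mirror/soan/: the first grep should find nothing, and the two counts should both come out at 36,869.

# Absolute CODH URLs left in the rewritten JSON (should print 0)
grep -c 'codh\.rois\.ac\.jp' dataset/001.json

# PNGs referenced by the JSON vs. PNGs actually on disk
grep -oE '"url":"[^"]+\.png"' dataset/001.json | wc -l
find dataset/001 -name '*.png' | wc -l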

Pitfall 2: A runtime dependency invisible from HTML (kuromoji dictionary) — the loading spinner that never stops

After UI / bundle / dataset / images were all in place and deployed, I hit a symptom in the browser: the loading spinner kept spinning forever.

curl confirmed that index.html, soan.bundle.min.js, and dataset/001.json all returned 200 and the JSON was valid. Yet the JS-side initialization never finished. Re-reading index.html showed nothing suspicious in <script> or <link> — the dependencies looked satisfied.

The answer was in the DevTools Network panel. Multiple requests to kuromoji/dict/base.dat.gz etc. were 404'ing, and that was clearly the cause. Looking inside the bundle:

let e;
if (isNode) e = i;
else try {
  e = new URL("../kuromoji/dict/", SoanSrc).href;
} catch (e) { v(e); ... }
e && (f.fn.spin && h && f(h).spin(),
  r.builder({ dicPath: e }).build(function(e, n) {
    f.fn.spin && h && f(h).spin(!1);
    ...
  }))

Soan internally uses kuromoji.js (a pure-JS Japanese morphological analyzer), and the bundled kuromoji.builder({dicPath}).build(...) fetches the dictionary files from ../kuromoji/dict/ at runtime. $(h).spin() is stopped in the dictionary-load callback, so if the dict 404s, the callback never fires and the spinner runs forever.

The lesson is simple: HTML / <script> / <link> alone aren't enough to find every dependency. CSS pulls images via url(), and JS fetches data from dynamic URLs — if you stop your dependency analysis at the HTML, you miss things. The shortest path is the Network panel showing 404s / pendings.

The dictionary itself (the 12 standard kuromoji.js files, ~17.8 MB total) was originally hosted under https://codh.rois.ac.jp/soan/kuromoji/dict/. Wayback doesn't archive binary .dat.gz files comprehensively (a CDX search returned zero hits). As with the 36k PNGs, I grabbed them from the same path on dev.2sc1815j.net.
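
The fetch itself is a short loop. The filenames below are the standard kuromoji.js dictionary set, and I'm assuming the Soan Pro instance serves them under soan/kuromoji/dict/ exactly as CODH did; the final curl just confirms one of the previously 404'ing files now returns 200 from the deployed mirror.

# Download the 12 dictionary files from the Soan Pro host
mkdir -p kuromoji/dict
for f in base.dat.gz cc.dat.gz check.dat.gz tid.dat.gz tid_map.dat.gz tid_pos.dat.gz \
         unk.dat.gz unk_char.dat.gz unk_compat.dat.gz unk_invoke.dat.gz \
         unk_map.dat.gz unk_pos.dat.gz; do
  curl -sL --max-time 60 "https://dev.2sc1815j.net/soan/kuromoji/dict/$f" \
    -o "kuromoji/dict/$f"
  sleep 1
done

# After redeploying, the previously 404'ing path should return 200
curl -s -o /dev/null -w "%{http_code}\n" \
  "https://nakamura196.github.io/codh-mirror/soan/kuromoji/dict/base.dat.gz"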

Pitfall 3: A flood of missing images and fonts referenced via CSS url(...)

The IIIF Curation Viewer family in particular combines Bootstrap / jQuery UI / Leaflet / Leaflet.draw / Leaflet.fullscreen / FontAwesome and more, so there are many image and font files referenced via CSS url(...). A simple dependency analysis starting from index.html's (href|src)="..." attributes won't catch them.

/* bootstrap.min.css */
@font-face {
  ...
  src: url(../fonts/glyphicons-halflings-regular.eot);
  src: url(../fonts/glyphicons-halflings-regular.eot?#iefix) format('embedded-opentype'),
       url(../fonts/glyphicons-halflings-regular.woff2) format('woff2'), ...;
}
/* jquery-ui.css */
.ui-state-default { ... background-image: url("images/ui-icons_555555_256x240.png"); ... }
/* leaflet.css */
.leaflet-control-layers-toggle { background-image: url(images/layers.png); ... }

These need to be resolved relative to each CSS file's location. So I wrote a script that extracts each CSS file's url() references, normalizes paths relative to the CSS file, and downloads them:

# Extract url() refs from CSS, resolved against each CSS file's location
find . -name "*.css" -print0 | while IFS= read -r -d '' css; do
  css_dir=$(dirname "$css" | sed 's|^\./||')
  grep -oE "url\([^)]+\)" "$css" | while read -r line; do
    # Strip the url( ... ) wrapper and optional quotes
    raw=$(echo "$line" | sed -E "s|^url\(['\"]?||;s|['\"]?\)$||")
    case "$raw" in
      data:*|http://*|https://*|\#*) continue ;;   # skip data URIs, absolute URLs, fragments
    esac
    raw_clean="${raw%%\?*}"; raw_clean="${raw_clean%%#*}"   # drop ?query and #fragment
    target="${css_dir:+$css_dir/}$raw_clean"
    target=$(python3 -c "import os.path,sys; print(os.path.normpath(sys.argv[1]))" "$target")
    [[ "$target" == /* || "$target" == ../* ]] && continue  # ignore paths escaping the tree
    echo "$target"
  done
done | sort -u

Re-fetching this list through Wayback id_ often 404s for the .eot / .woff files (Wayback doesn't archive them). Workarounds:

  • For files shared across tools (Bootstrap glyphicons, Leaflet draw spritesheet), copy from a tool that already has them (e.g., IIIF Curation Viewer)
  • For widely distributed files still available on live CDNs (FontAwesome 5 webfonts, Leaflet 1.6.0 markers), grab them from jsDelivr / unpkg

Combining the two:

# Bootstrap fonts: copy from ICV
cp -n iiif-curation-viewer/bootstrap/bootstrap/fonts/* iiif-curation-manager/bootstrap/bootstrap/fonts/

# FontAwesome 5: from jsDelivr
for f in fa-{brands-400,regular-400,solid-900}.{eot,svg,ttf,woff,woff2}; do
  curl -sL --compressed -o iiif-curation-board/fontawesome/webfonts/$f \
    "https://cdn.jsdelivr.net/npm/@fortawesome/fontawesome-free@5.15.4/webfonts/$f"
done

# Leaflet 1.6.0 markers: from unpkg
for f in layers.png layers-2x.png marker-icon.png marker-icon-2x.png marker-shadow.png; do
  curl -sL --compressed -o iiif-curation-board/leaflet/leaflet-1.6.0/images/$f \
    "https://unpkg.com/leaflet@1.6.0/dist/images/$f"
done
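
Finally, once Pages has rebuilt, replaying the extracted reference list against the live site catches anything still missing. This assumes the output of the url() extraction script above was saved to /tmp/css_refs.txt (a filename used here only for illustration), with paths relative to one tool's directory:

# Re-check every CSS-referenced asset against the deployed mirror
SITE="https://<owner>.github.io/codh-mirror/iiif-curation-viewer"
while read -r path; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$SITE/$path")
  [ "$code" = "200" ] || echo "MISSING ($code): $path"
done < /tmp/css_refs.txt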

Pitfall 4: Wayback's 000 floods (rate limiting)

When you sequentially fetch dozens to ~100 files per tool (IIIF Curation Viewer is about 86 files), midway through you start getting HTTP 000 (connection failures). Wayback Machine rate limit, presumably.

I started with no sleep and got only 19 of 66 files for one tool (everything returned 000 past a certain point); retries didn't help. Adding a 2–3 second sleep between requests was enough to get clean runs.

# BASE is the Wayback id_ snapshot prefix; ASSETS holds each tool's relative file paths
for asset in "${ASSETS[@]}"; do
  mkdir -p "$(dirname "$asset")"
  http_code=$(curl -sL --compressed --max-time 90 -w "%{http_code}" \
    "${BASE}/${asset}" -o "${asset}")
  if [[ "$http_code" != "200" ]]; then
    echo "FAIL ($http_code): $asset"
    rm -f "${asset}"
  fi
  sleep 3   # ← Without this, requests start failing midway
done

Soan's 36,869 PNGs were the exception: those came from a personal server (dev.2sc1815j.net), so I used xargs -P 5 for parallel download (5 parallel, max-time 30, ~30 minutes total). The right concurrency depends on the source server's capacity and rate-limit policy, not just on what you'd like.

Pitfall 5: The auth/storage backends don't actually work

IIIF Curation Viewer / Manager / Editor / Board internally depend on Firebase Authentication and JSONkeeper (the Flask app CODH ran for storing curation JSON). The mirror is a static-asset-only stop-gap, so these don't work:

  • Firebase Auth: The bundle hardcodes CODH's Firebase project ID codh-81041 and authDomain codh-81041.firebaseapp.com. nakamura196.github.io isn't in Firebase's authorized domains list, so sign-in popups from a different domain fail with auth/unauthorized-domain-style errors
  • Save (JSONkeeper): Even if auth somehow worked, the storage API host is on CODH, which is down

"Could I just stand up my own Firebase project and swap authFirebase.js?" Technically yes, but:

  • Sign-in alone isn't useful without JSONkeeper (you'd need to deploy rois-codh/JSONkeeper to your own VM / Cloud Run, plus wire up Firebase Admin SDK)
  • That goes well beyond a stop-gap mirror's scope

So I deliberately don't ship working auth/save in this mirror.

On the upside, read-only usage works fine:

# View an existing curation JSON in the Viewer (no auth needed)
https://nakamura196.github.io/codh-mirror/iiif-curation-viewer/?curation=<curation_json_url>

# Display a Manifest with a highlighted region (no auth needed)
https://nakamura196.github.io/codh-mirror/iiif-curation-viewer/?manifest=<manifest_url>&canvas=<canvas_id>&xywh=<x,y,w,h>&xywh_highlight=border

So linking from a consumer site for "view-only" usage is fine. If you need write functionality, your options are: wait for CODH to come back, or stand up the full backend yourself.

The README has the same caveat.

License and attribution

The mirror's contents have non-uniform rights, so I split the license note by component. License/copyright headers in each file are kept intact.

  • vdiff / vdiff-seq / the IIIF Curation tools, plus Soan's UI / bundle / index.js / dataset/001.json. Origin: software provided by CODH. License: MIT, Copyright CODH (core contributor: Jun HOMMA (@2SC1815J)).
  • Soan's 36,869 movable-type PNGs (dataset/001/*.png). Origin: CODH's "Old Movable Type Dataset" (source: 『徒然草』, an NDL holding). License: per the dataset's own terms (the detail page is offline during CODH's maintenance; check again when service resumes).
  • Soan's 12 kuromoji dictionary files (kuromoji/dict/*.dat.gz). Origin: the standard dictionary bundled with kuromoji.js (source: mecab-ipadic). License: Apache 2.0 (kuromoji.js) / NAIST custom license plus ICOT Free Software conditions (mecab-ipadic itself).

Soan's image binaries and kuromoji dictionary were obtained from dev.2sc1815j.net (Jun HOMMA's personal host) as a workaround during CODH's outage. That host is acting as a relay; the underlying rights remain with CODH (images) and the kuromoji.js project (dict).

The Soan Pro bundle at dev.2sc1815j.net (Copyright 2023 Jun HOMMA) doesn't ship an explicit re-distribution clause, so the Pro bundle itself is not included in this mirror. The bundle in use is purely the CODH demo version under MIT.

Closing

Putting a 200-MB-ish mirror repo on GitHub Pages sounds simple, but in reality each tool needs a dependency analysis, and several gotchas hide where you can't see them just by reading the HTML — runtime-loaded data like the kuromoji dictionary, or chained CSS url() references. After CODH service resumes, I plan to archive this repository. It's not the kind of repo that solicits Issues / PRs, but I hope the procedure helps anyone doing similar mirror work.

Last but not least: this mirror only exists thanks to CODH's many years of open publication, and to core contributor Jun HOMMA (@2SC1815J) for keeping the related resources alive on a personal host as well. My sincere thanks.

References