This article is co-authored with a generative AI. Facts have been cross-checked against official documentation where possible, but errors may remain. Please verify against primary sources before making any important decisions.
Background: “We need to deliver IIIF for images we cannot publish openly”
Digital archives of historical materials almost always include a non-trivial slice of images that cannot be released to the general public — for reasons of copyright, personality rights, contracts with the holding institution, or ethical considerations. At the same time, internal use cases such as
- review by researchers and editorial staff,
- limited delivery to partner institutions, and
- proofreading or curation work
still create a strong demand for “don’t publish, but keep the IIIF benefits”. Concretely:
- High-resolution viewing (OpenSeadragon / Mirador) for detail inspection
- IIIF Presentation API manifests usable from IIIF-compliant tools (within an authenticated session)
- Persistent URLs for citation (assuming the host name and routing remain stable)
Note: the architecture in this article assumes same-organization, authenticated browser sessions for IIIF use. Non-interactive cross-host interoperability (crawlers, harvesters, etc.) requires a separate implementation of IIIF Auth API 2.0 (see the section “Direction for extension” near the end).
A fully public corpus could just be CloudFront + S3, but combining “access control” with “IIIF” takes a bit of design care. This article is the implementation log.
Architecture
┌──────────────┐
user ──TLS──────▶ Cloudflare │
│ ┌────────┐ │
│ │ Access │── auth gate (Email OTP / SSO)
│ └────────┘ │
│ │ │
│ ┌────────┐ │
│ │Tunnel │ │
│ └────┬───┘ │
└───────│─────────┘
│ outbound only
┌───────┴────────────────────────────┐
│ Origin (VPS / academic cloud / …) │
│ ┌────────────┐ ┌──────────────┐ │
│ │ cloudflared│──│ Next.js (web)│──┐
│ └────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │Elasticsearch │ │
│ │ └──────────────┘ │
│ │ ▲ │
│ └─path:/iiif───┐ │
│ ┌───────▼──────┐ │
│ │ Cantaloupe │────┼─→ S3-compatible
│ └──────────────┘ │ object storage
└────────────────────────────────────┘
Layer responsibilities:
| Layer | What it stores | When it is generated |
|---|---|---|
| S3-compatible storage | JPEG bytes only | Static (one-time sync) |
| Elasticsearch | Bibliographic metadata, search index | Static (one-time indexer) |
| Cantaloupe `/iiif/3/<id>/info.json` | Dimensions, tile layout, size list | Dynamic (HEAD against S3, JSON assembled, derivative cache) |
| Cantaloupe `/iiif/3/<id>/<region>/<size>/.../default.jpg` | Cropped tiles | Dynamic (S3 → decode → crop → encode) |
| Next.js `/api/iiif/[id]/manifest.json` | Presentation API 3.0 manifest | Dynamic (`_doc` from ES, JIT assembly, CDN cache) |
In other words, manifest and info.json are never pre-generated or stored — they are assembled on each request from ES + Cantaloupe (a JIT model).
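The JIT model is small enough to see end to end. Below is a minimal sketch of the assembly step (written in Python for brevity; the production handler runs in the Next.js route shown later). The field names on `doc` (`iiif_id`, `title`, `width`, `height`) are illustrative stand-ins for the ES document schema:

```python
def build_manifest(base_url: str, doc: dict, info: dict) -> dict:
    """Assemble a minimal IIIF Presentation 3.0 manifest from an ES
    document (bibliographic metadata) and a Cantaloupe info.json
    response. Nothing is persisted; this runs per request."""
    iiif_id = doc["iiif_id"]
    # Canvas dimensions come from info.json, with ES values as fallback.
    w = info.get("width") or doc.get("width") or 1000
    h = info.get("height") or doc.get("height") or 1000
    image_service = f"{base_url}/iiif/3/{iiif_id}"
    canvas_id = f"{base_url}/api/iiif/{iiif_id}/canvas/1"
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/api/iiif/{iiif_id}/manifest.json",
        "type": "Manifest",
        "label": {"ja": [doc.get("title", iiif_id)]},
        "items": [{
            "id": canvas_id, "type": "Canvas", "width": w, "height": h,
            "items": [{
                "id": f"{canvas_id}/page", "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/anno", "type": "Annotation",
                    "motivation": "painting", "target": canvas_id,
                    "body": {
                        "id": f"{image_service}/full/max/0/default.jpg",
                        "type": "Image", "format": "image/jpeg",
                        "width": w, "height": h,
                        "service": [{"id": image_service,
                                     "type": "ImageService3",
                                     "profile": "level2"}],
                    },
                }],
            }],
        }],
    }
```

Because everything is derived from the single host's URL space, the same `base_url` serves both the manifest IDs and the image service IDs, which is what makes the path-based routing in the next section work without rewriting.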
The pivot of access control: single host + path-based routing
Why we did not split into subdomains
Our first cut used two subdomains: archive.example.org (Web UI) and iiif.archive.example.org (IIIF). That separation feels natural at first, but two issues showed up:
- TLS certificate constraint: Cloudflare Universal SSL covers `*.example.org` only at one level deep. `iiif.archive.example.org` (two levels deep) is not in the SAN list, so the TLS handshake fails immediately. Advanced Certificate Manager (paid) covers `*.archive.example.org`, but adds a monthly cost.
- CORS / cookie distribution headaches: with cross-origin URLs, OpenSeadragon needs CORS headers to fetch tiles, and the `CF_Authorization` cookie that Cloudflare Access issues becomes harder to share across origins.
Switch to path-based routing
The final shape is a single host (archive.example.org) split by path, which clears both problems:
# Cloudflare Tunnel ingress (configure in dashboard or via the API)
- hostname: archive.example.org
path: ^/iiif(/.*)?$ # IIIF paths → Cantaloupe
service: http://cf-cantaloupe:8182
- hostname: archive.example.org
service: http://cf-web:3000 # everything else → Next.js
- service: http_status:404
Benefits:
- Same origin → no CORS, automatic cookie sharing
- Fits inside Universal SSL’s one-level wildcard → no extra cost
- One Access policy covers every path
- The path passes through verbatim to Cantaloupe's URL scheme (`/iiif/3/<id>/...`), so no rewriting logic is needed
Cloudflare Access setup
In Zero Trust → Access → Applications, create a Self-hosted app:
- Application Domain: `archive.example.org` (single host)
- Identity Provider: One-time PIN (six-digit OTP via email)
- Policy: `Allow` action, selector `Emails` or `Email ends in` (for example, `*.example.ac.jp`)
- Session duration: e.g., 24 hours
On first access the user authenticates with an OTP, and the CF_Authorization cookie is issued for the entire archive.example.org host. From then on both the Web UI and /iiif/... ride on the same session. Standard browser behavior carries the cookie on the viewer’s tile sub-requests as well, so no extra code is needed.
Why Cloudflare Tunnel (cf-cloudflared)
Whether the origin is a VPS or an academic cloud, you can publish without opening any inbound port.
Cloudflare Tunnel works by having the origin keep an outbound persistent connection to the edge. As a result:
- No public IP required (NAT, private network — fine)
- Inbound traffic stays sealed while still being publicly reachable
- DNS is just a CNAME to `*.cfargotunnel.com`
Replicating this on AWS would require CloudFront + Cognito (auth) + a private path (Direct Connect / Site-to-Site VPN / VPC Origin + PrivateLink). The fact that CDN + auth gate + private tunnel are bundled into a single SaaS is one of Cloudflare’s advantages for this kind of architecture. The feature coverage isn’t strictly equivalent across the stack, so the right choice depends on requirements.
Cantaloupe + S3-compatible storage
Why Cantaloupe
The IIIF Image API server space has IIPImageServer (C++, lightweight), Cantaloupe (Java), Loris (Python, easy to extend), and others; the right pick depends on use case. We adopted Cantaloupe for:
- Built-in S3Source: reads directly from S3-compatible storage and generates JPEG tiles (no extra proxy)
- Single properties file for configuration: keeps Docker Compose self-contained
- Derivative cache: tiles, once generated, are served as fast as static files thereafter
We used the Docker image distributed by the Islandora project, which is widely adopted in the community (islandora/cantaloupe:<version> on Docker Hub). It uses confd templates to render cantaloupe.properties from CANTALOUPE_* environment variables, which keeps everything inside Docker Compose.
Key configuration
# Mode
source.static = S3Source
endpoint.iiif.3.enabled = true
# S3-compatible storage
S3Source.endpoint = ${S3_ENDPOINT}
S3Source.region = ${S3_REGION}
S3Source.access_key_id = ${S3_ACCESS_KEY}
S3Source.secret_access_key = ${S3_SECRET_KEY}
S3Source.path_style_access = true # important for many S3-compatibles incl. MinIO
S3Source.lookup_strategy = BasicLookupStrategy
S3Source.BasicLookupStrategy.bucket.name = ${S3_BUCKET}
Without path_style_access = true, requests are formed against AWS S3’s virtual-hosted-style URLs (<bucket>.<host>/key), which fail on many S3-compatible implementations.
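The two addressing styles are easiest to compare side by side. This toy helper (not part of Cantaloupe; purely illustrative) just shows the URL shapes the SDK forms in each mode:

```python
def object_url(endpoint: str, bucket: str, key: str,
               path_style: bool = True) -> str:
    """Show the two S3 addressing styles. With path_style_access = true
    Cantaloupe's SDK forms the first shape, which is what most
    S3-compatible backends (e.g. MinIO) expect; the second shape
    requires DNS for <bucket>.<host>, which they often lack."""
    host = endpoint.removeprefix("https://")
    if path_style:
        return f"https://{host}/{bucket}/{key}"   # path-style
    return f"https://{bucket}.{host}/{key}"       # virtual-hosted-style
```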
A pothole: AWS SDK credentials chain
When using a custom S3 endpoint with Cantaloupe 5.x we observed cases where S3Source.access_key_id was effectively ignored and the AWS SDK fell through to its default credentials chain (likely environment-dependent — we did not isolate the precise repro conditions). The properties alone produced [endpoint: null] [accessKeyID: null] in the logs and the requests failed with SignatureDoesNotMatch.
Workaround: also pass the standard AWS environment variables, redundantly.
environment:
- CANTALOUPE_S3SOURCE_ACCESS_KEY_ID=${S3_ACCESS_KEY}
- CANTALOUPE_S3SOURCE_SECRET_ACCESS_KEY=${S3_SECRET_KEY}
# belt-and-suspenders
- AWS_ACCESS_KEY_ID=${S3_ACCESS_KEY}
- AWS_SECRET_ACCESS_KEY=${S3_SECRET_KEY}
- AWS_REGION=${S3_REGION}
Take canvas dimensions from info.json
Canvas dimensions inside the manifest should come from info.json, not from the bibliographic record, for IIIF spec consistency. Cantaloupe decodes the actual image and reports dimensions — that’s the source of truth:
// Next.js /api/iiif/[id]/manifest.json
async function fetchImageInfo(iiifId: string) {
const r = await fetch(`${IIIF_BASE}/${iiifId}/info.json`,
{ next: { revalidate: 300 } }); // Next.js fetch cache (5 min)
if (!r.ok) return null;
return await r.json(); // { width, height, ... }
}
const info = await fetchImageInfo(doc.iiif_id);
const w = info?.width ?? doc.width ?? 1000; // ES value as fallback
const h = info?.height ?? doc.height ?? 1000;
The three-layer cache (Cantaloupe derivative + Next.js fetch + Cloudflare edge) keeps the practical overhead near zero.
Implementation differences with S3-compatible storage
There are often small implementation differences between commercial AWS S3 and S3-compatible implementations. Concrete potholes we hit:
1. AWS CLI v2.23+ flexible checksums get rejected
Error:
InvalidArgument: x-amz-content-sha256 must be UNSIGNED-PAYLOAD,
STREAMING-AWS4-HMAC-SHA256-PAYLOAD, or a valid sha256 value.
Workaround:
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
2. SignatureDoesNotMatch on keys containing CJK characters
PUT/HEAD on keys containing CJK characters can fail signature verification. The cause sits across both client behavior (older mc URL-encoding rules, historical boto3 issues, …) and server-side URL normalization differences. Reproducibility depends on the version combination.
→ The safe answer for any combination is to flatten keys to ASCII-only:
photos/<basename>.jpg
Round IDs and other classifications live as ES metadata such as extraction_round, while the S3 key is reserved for “the immutable address”. As a side effect:
- `iiif_id` becomes truly immutable
- Re-digitizations of the same physical photo deduplicate naturally on S3 via last-write-wins
- The CJK compatibility problem disappears entirely
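One possible shape for the flattening step, assuming delivered basenames may still contain CJK characters. The function name and the hash fallback are our illustration, not the production script:

```python
import hashlib
import re
import unicodedata

def flat_key(basename: str) -> str:
    """Flatten an arbitrary basename into an ASCII-only S3 key under
    photos/. The key carries no classification info (that lives in ES)
    and no non-ASCII bytes (avoiding the signature problems above)."""
    stem, _, ext = basename.rpartition(".")
    # NFKC first, so full-width digits become half-width before the
    # non-ASCII bytes are dropped.
    ascii_stem = unicodedata.normalize("NFKC", stem) \
        .encode("ascii", "ignore").decode()
    ascii_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", ascii_stem).strip("_")
    if not ascii_stem:
        # Name was entirely non-ASCII: fall back to a stable hash so
        # the same basename always maps to the same key.
        ascii_stem = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:12]
    return f"photos/{ascii_stem}.{ext or 'jpg'}"
```

Determinism matters here: the same basename must always produce the same key, or the last-write-wins deduplication stops working.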
3. IsTruncated=true without NextContinuationToken
Some implementations of list-objects-v2 do not strictly follow the spec for pagination — listings can cap silently at 1000, and StartAfter can be ignored.
→ When managing many objects, a safe principle is “ES is the source of truth, S3 is just bytes, LIST is a last resort”. We track upload state with an s3_uploaded field in ES, populated by the sync script as a bulk update on each successful PUT:
# scripts/sync_images_to_s3.py
class ESUpdateQueue:
def enqueue_uploaded(self, basename, size):
# After a successful PUT, extract item_id from the basename
# and enqueue an ES update setting s3_uploaded=true.
...
A facet on s3_uploaded:false lets us instantly surface docs whose images have not landed yet — useful for garbage collection, incremental sync, and health checks.
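That facet is a one-line query in the DSL. A sketch, with the aggregation name chosen for illustration:

```python
# Surface documents whose images have not landed on S3 yet, bucketed
# by delivery batch. Field names (s3_uploaded, extraction_round)
# follow the schema described above; "pending_by_round" is arbitrary.
pending_query = {
    "size": 0,  # we only want the counts, not the hits
    "query": {"term": {"s3_uploaded": False}},
    "aggs": {"pending_by_round": {"terms": {"field": "extraction_round"}}},
}
```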
Elasticsearch: Japanese substring search without plugins
Our production Elasticsearch cluster is shared with other teams, so we could not install the analysis-kuromoji plugin. We achieved Japanese substring search using an n-gram bigram analyzer instead:
INDEX_SETTINGS = {
"settings": {
"analysis": {
"tokenizer": {
"bigram_tokenizer": {
"type": "ngram",
"min_gram": 2, "max_gram": 2,
"token_chars": ["letter", "digit"],
}
},
"analyzer": {
"ja_text": {
"type": "custom",
"tokenizer": "bigram_tokenizer",
"filter": ["lowercase", "cjk_width"],
}
},
},
},
...
}
The query side uses match_phrase:
const should = Object.entries(searchFields).map(([field, opts]) => ({
match_phrase: { [field]: { query: term, boost: opts.weight ?? 1 } },
}));
n-gram + match_phrase behaves like substring search. A query for 形態素解析 is tokenized into the bigrams 形態 / 態素 / 素解 / 解析; match_phrase requires those tokens consecutively and in order, so any field containing 形態素解析 as a substring produces the same token sequence and matches. cjk_width normalization absorbs full-width / half-width drift.
Pros: no plugin, runs anywhere. Cons: single-character searches don’t match (because min_gram=2), but for CJK we treat that as “less noise” rather than a regression.
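A toy emulation of the tokenizer makes the mechanics concrete (it ignores the `token_chars` splitting on punctuation that the real `ngram` tokenizer also performs):

```python
def bigrams(text: str) -> list[str]:
    """Emulate what the ngram tokenizer with min_gram=max_gram=2 emits
    for a single run of letters/digits."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# The query and any field containing it as a substring yield the same
# consecutive bigram sequence, which is exactly what match_phrase tests.
# bigrams("形態素解析") → ["形態", "態素", "素解", "解析"]
```

This also shows why single-character queries fail: a one-character string produces no bigrams at all.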
Data pipeline: raw → normalize → index → sync
When data arrives from a provider in multiple batches, the folder structure tends to drift batch by batch. For example:
- Batch identifier naming variation: with/without date suffix, full-width vs half-width digits, sub-numbers, naming convention changing mid-stream
- Image folder name variation: classification subfolders sometimes present, sometimes flat, full-width hyphens leaking in
- Excel file naming variation: `before-revision` / `revised`, `sorted`, request-form files (which are metadata-management artifacts) mixed in
If we tried to absorb all of this in the indexer, normalization logic would balloon. Instead we extracted an intermediate normalize_rounds.py stage:
raw data (as-delivered, read-only)
│
▼ scripts/normalize_rounds.py
│ ・Normalize batch IDs to a unified format (full→half-width, zero-pad)
│ ・Pick exactly one Excel per category by priority
│ (revised > sorted > before-revision; drop request-forms)
│ ・Drop reference-only images (envelopes/sleeves) that aren't the photos
│ ・Emit meta.json
│ ・Everything is symlinks → no extra disk usage
▼
clean tree (symlink only, drift absorbed)
│
▼ indexer/index.py
│ ・Read bibliography Excel and bulk-upsert into ES
│ ・update + upsert: indexed_at set only on first insert,
│ updated_at refreshed every run (idempotent)
▼
Elasticsearch (metadata only)
clean tree
│
▼ scripts/sync_images_to_s3.py
│ ・boto3 + ThreadPoolExecutor (16 workers) for parallel PUT
│ ・Skip duplicate uploads via existing-key snapshot + size match
│ ・Bulk-update ES with s3_uploaded=true on each successful PUT
▼
S3-compatible storage (JPEGs only)
Building the staging stage out of symlinks lets us rebuild tens of thousands of files in seconds. Re-runs are idempotent and robust to drift in the delivery format.
Design point: don’t pull drift absorption into the indexer or sync. If you isolate raw → clean as its own stage, only the normalization rules need to change when delivery formats shift; the indexer and sync can keep relying on a “unified clean tree” and stay long-lived.
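The skip-and-upload core of the sync stage can be sketched as follows. Here `upload` and the `existing` snapshot stand in for the boto3 `put_object` call and the one-time pre-listing of the bucket, so the shape is illustrative rather than the production script:

```python
from concurrent.futures import ThreadPoolExecutor

def sync_images(files: dict, upload, existing: dict,
                max_workers: int = 16) -> dict:
    """Parallel-PUT sketch: skip keys already present with a matching
    size, upload the rest concurrently.

    files:    {basename: size} from the clean tree
    upload:   callable performing the PUT for one basename
    existing: {key: size} snapshot taken once up front, so no per-file
              HEAD requests are needed
    """
    def put(item):
        basename, size = item
        key = f"photos/{basename}"
        if existing.get(key) == size:
            return key, "skipped"   # already on S3, nothing to do
        upload(basename)            # on success the production script
        return key, "uploaded"      # also queues an ES s3_uploaded update
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(put, files.items()))
```

Taking the bucket snapshot once and comparing sizes locally is what keeps the run idempotent without leaning on the unreliable LIST pagination discussed earlier.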
Common potholes
| Item | Symptom | Workaround |
|---|---|---|
| Cloudflare Universal SSL, two-deep subdomain | TLS handshake fail | Stay one level deep, or use path-based ingress |
| AWS CLI 2.23+ checksums | x-amz-content-sha256 must be … | AWS_REQUEST_CHECKSUM_CALCULATION=when_required |
| CJK in S3 keys | SignatureDoesNotMatch | Flat ASCII keys |
| 1000-cap on S3-compatible LIST | Cannot verify post-sync state | Make ES the source of truth |
| Cantaloupe v5 S3Source auth | [accessKeyID: null] | Also pass AWS_ACCESS_KEY_ID etc. as env |
| Next.js `NEXT_PUBLIC_*` | Not reflected in client bundle | Pass via Dockerfile `ARG` + compose `build.args` |
| `mc mirror` walking speed in bash | Tens of minutes to stat 73k symlinks | Parallelize with Python + boto3 (orders of magnitude faster) |
“Publishing without publishing”: an operational pattern
This architecture realizes a third option between
- fully public (anyone can access) and
- fully private (internal file server):
a member-limited delivery that still complies with IIIF. It supports operations such as
- editorial staff using the IIIF viewer to inspect detail during proofing,
- sharing manifest URLs with partner institutions for use in Mirador within authenticated sessions, and
- granting access dynamically by email domain in response to researcher requests.
A primary characteristic of this architecture is that it builds the “do not think of public-vs-private as binary” middle layer in a fully IIIF-compliant form.
The operational benefit: you do not need to maintain a separate “publishable subset” IIIF instance. The original data lives in a single pipeline and viewer access is regulated entirely by Access policies.
Direction for extension: relationship with IIIF Auth API 2.0
This section summarizes a design direction only — it is not implemented in the architecture above. Implementation steps and code examples are deferred to a follow-up article after actual verification.
The Cloudflare Access cookie authentication described above is essentially HTTP / network-layer control. The IIIF community has standardized IIIF Authorization Flow API 2.0 as an application-layer (IIIF) authentication spec. The two layers can be stacked, with each playing a different role.
Two-layer view
| Layer | What it protects | What it tells the client |
|---|---|---|
| HTTP / network(Cloudflare Access) | Requests to URL paths and host names | Only “is this request authenticated?”. Unauthenticated requests get login HTML / 302 |
| IIIF / application(Auth API 2.0) | IIIF resources (images, etc.) | A spec-compliant service[] block that tells the viewer where the auth endpoints are and how to authenticate |
Stacking patterns
| Pattern | Use case | Where this article sits |
|---|---|---|
| HTTP layer only | Same-organization browser use | This article (implemented) |
| IIIF layer only | Public archive with member-only access on selected resources | (Different design) |
| Both layers stacked | Required when external IIIF clients need to interoperate | (Future extension, not implemented) |
Stacking both would broadly look like (details deferred to a follow-up):
- Split the Cloudflare Access policy into bypass (manifest / info.json / auth endpoints) + allow (image bytes)
- Embed an Auth API 2.0 `service[]` into info.json (Cantaloupe's delegate-script mechanism supports this; the exact method names still need verification)
- Implement Probe / Access / Token endpoints in Next.js, internally validating the Cloudflare Access cookie / JWT
- Add a Bearer-token validation layer on the tile path (the delegate script can host this)
Scope of this article
This article covers the HTTP layer only (Cloudflare Access), which is sufficient and operating in production for same-organization browser use. Stacking the IIIF layer on top is the next step, planned as a separate article once implemented and verified.
Summary
| Component | Role |
|---|---|
| S3-compatible storage | JPEG byte store (flat ASCII keys) |
| Cantaloupe | IIIF Image API (tile generation, dynamic info.json) |
| Elasticsearch | Search + metadata SoR (also tracks sync state via s3_uploaded) |
| Next.js | UI + manifest API (Presentation 3.0 generated JIT) |
| Cloudflare Tunnel | No inbound, single outbound for publication |
| Cloudflare Access | Auth gate (Email OTP / SSO) — covers Web UI and IIIF on the same origin |
The defining characteristic of this stack is that access control reduces to “attach an Access policy to a single origin”, and one authentication session covers both the Web UI and IIIF delivery. As a result, even for historical materials that cannot be made fully public for copyright, contractual, or ethical reasons, research and editorial workflows can use the IIIF benefits (spec-compliant high-resolution viewer, manifest delivery) within an authorized member scope. Interoperability with external systems / non-interactive clients does not work with this stack alone — it requires also implementing IIIF Auth API 2.0.
This article uses Cloudflare Access cookie authentication, which suits same-organization, same-browser use. For scenarios requiring interoperability with external IIIF clients, the direction is to stack IIIF Auth API 2.0 (the application-layer spec) on top of this architecture. We plan to write that up in a follow-up after implementation and verification.
I hope this is useful as a reference.