This article is co-authored with a generative AI. Facts have been cross-checked against official documentation where possible, but errors may remain. Please verify against primary sources before making any important decisions.
Background: “We need to deliver IIIF for images we cannot publish openly”
Digital archives of historical materials almost always include a non-trivial slice of images that cannot be released to the general public — for reasons of copyright, personality rights, contracts with the holding institution, or ethical considerations. At the same time, internal use cases such as
- review by researchers and editorial staff,
- limited delivery to partner institutions, and
- proofreading or curation work
still create a strong demand for “don’t publish, but keep the IIIF benefits”. Concretely:
- High-resolution viewing (OpenSeadragon / Mirador) for detail inspection
- IIIF Presentation API manifests usable from IIIF-compliant tools (within an authenticated session)
- Persistent URLs for citation (assuming the host name and routing remain stable)
Note: the architecture in this article assumes same-organization, authenticated browser sessions for IIIF use. Non-interactive cross-host interoperability (crawlers, harvesters, etc.) requires a separate implementation of IIIF Auth API 2.0 (see the section “Direction for extension” near the end).
A fully public corpus could just be CloudFront + S3, but combining “access control” with “IIIF” takes a bit of design care. This article is the implementation log.
Architecture
┌──────────────┐
user ──TLS──────▶ Cloudflare │
│ ┌────────┐ │
│ │ Access │── auth gate (Email OTP / SSO)
│ └────────┘ │
│ │ │
│ ┌────────┐ │
│ │Tunnel │ │
│ └────┬───┘ │
└───────│─────────┘
│ outbound only
┌───────┴────────────────────────────┐
│ Origin (VPS / academic cloud / …) │
│ ┌────────────┐ ┌──────────────┐ │
│ │ cloudflared│──│ Next.js (web)│──┐
│ └────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │Elasticsearch │ │
│ │ └──────────────┘ │
│ │ ▲ │
│ └─path:/iiif───┐ │
│ ┌───────▼──────┐ │
│ │ Cantaloupe │────┼─→ S3-compatible
│ └──────────────┘ │ object storage
└────────────────────────────────────┘
Layer responsibilities:
| Layer | What it stores | When it is generated |
|---|---|---|
| S3-compatible storage | JPEG bytes only | Static (one-time sync) |
| Elasticsearch | Bibliographic metadata, search index | Static (one-time indexer) |
| Cantaloupe `/iiif/3/<id>/info.json` | Dimensions, tile layout, size list | Dynamic (HEAD against S3, JSON assembled, derivative cache) |
| Cantaloupe `/iiif/3/<id>/<region>/<size>/.../default.jpg` | Cropped tiles | Dynamic (S3 → decode → crop → encode) |
| Next.js `/api/iiif/[id]/manifest.json` | Presentation API 3.0 manifest | Dynamic (`_doc` from ES, JIT assembly, CDN cache) |
In other words, manifest and info.json are never pre-generated or stored — they are assembled on each request from ES + Cantaloupe (a JIT model).
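The JIT model is small enough to see end to end. Below is a minimal sketch of the assembly step (written in Python for brevity; the production handler runs in the Next.js route shown later). The field names on `doc` (`iiif_id`, `title`, `width`, `height`) are illustrative stand-ins for the ES document schema:

```python
def build_manifest(base_url: str, doc: dict, info: dict) -> dict:
    """Assemble a minimal IIIF Presentation 3.0 manifest from an ES
    document (bibliographic metadata) and a Cantaloupe info.json
    response. Nothing is persisted; this runs per request."""
    iiif_id = doc["iiif_id"]
    # Canvas dimensions come from info.json, with ES values as fallback.
    w = info.get("width") or doc.get("width") or 1000
    h = info.get("height") or doc.get("height") or 1000
    image_service = f"{base_url}/iiif/3/{iiif_id}"
    canvas_id = f"{base_url}/api/iiif/{iiif_id}/canvas/1"
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/api/iiif/{iiif_id}/manifest.json",
        "type": "Manifest",
        "label": {"ja": [doc.get("title", iiif_id)]},
        "items": [{
            "id": canvas_id, "type": "Canvas", "width": w, "height": h,
            "items": [{
                "id": f"{canvas_id}/page", "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/anno", "type": "Annotation",
                    "motivation": "painting", "target": canvas_id,
                    "body": {
                        "id": f"{image_service}/full/max/0/default.jpg",
                        "type": "Image", "format": "image/jpeg",
                        "width": w, "height": h,
                        "service": [{"id": image_service,
                                     "type": "ImageService3",
                                     "profile": "level2"}],
                    },
                }],
            }],
        }],
    }
```

Because everything is derived from the single host's URL space, the same `base_url` serves both the manifest IDs and the image service IDs, which is what makes the path-based routing in the next section work without rewriting.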
The pivot of access control: single host + path-based routing
Why we did not split into subdomains
Our first cut used two subdomains: archive.example.org (Web UI) and iiif.archive.example.org (IIIF). That separation feels natural at first, but two issues showed up:
- TLS certificate constraint: Cloudflare Universal SSL covers `*.example.org` only at one level deep. `iiif.archive.example.org` (two levels deep) is not in the SAN list, so the TLS handshake fails immediately. Advanced Certificate Manager (paid) covers `*.archive.example.org`, but adds a monthly cost.
- CORS / cookie distribution headaches: with cross-origin URLs, OpenSeadragon needs CORS headers to fetch tiles, and the `CF_Authorization` cookie that Cloudflare Access issues becomes harder to share across origins.
Switch to path-based routing
The final shape is a single host (archive.example.org) split by path, which clears both problems:
# Cloudflare Tunnel ingress (configure in dashboard or via the API)
- hostname: archive.example.org
path: ^/iiif(/.*)?$ # IIIF paths → Cantaloupe
service: http://cf-cantaloupe:8182
- hostname: archive.example.org
service: http://cf-web:3000 # everything else → Next.js
- service: http_status:404
Benefits:
- Same origin → no CORS, automatic cookie sharing
- Fits inside Universal SSL’s one-level wildcard → no extra cost
- One Access policy covers every path
- The path passes through verbatim to Cantaloupe's URL scheme (`/iiif/3/<id>/...`), so no rewriting logic is needed
Cloudflare Access setup
In Zero Trust → Access → Applications, create a Self-hosted app:
- Application Domain: `archive.example.org` (single host)
- Identity Provider: One-time PIN (six-digit OTP via email)
- Policy: `Allow` action, selector `Emails` or `Email ends in` (for example, `*.example.ac.jp`)
- Session duration: e.g., 24 hours
On first access the user authenticates with an OTP, and the CF_Authorization cookie is issued for the entire archive.example.org host. From then on both the Web UI and /iiif/... ride on the same session. Standard browser behavior carries the cookie on the viewer’s tile sub-requests as well, so no extra code is needed.
Why Cloudflare Tunnel (cf-cloudflared)
Whether the origin is a VPS or an academic cloud, you can publish without opening any inbound port.
Cloudflare Tunnel works by having the origin keep an outbound persistent connection to the edge. As a result:
- No public IP required (NAT, private network — fine)
- Inbound traffic stays sealed while still being publicly reachable
- DNS is just a CNAME to `*.cfargotunnel.com`
Replicating this on AWS would require CloudFront + Cognito (auth) + a private path (Direct Connect / Site-to-Site VPN / VPC Origin + PrivateLink). The fact that CDN + auth gate + private tunnel are bundled into a single SaaS is one of Cloudflare’s advantages for this kind of architecture. The feature coverage isn’t strictly equivalent across the stack, so the right choice depends on requirements.
Cantaloupe + S3-compatible storage
Why Cantaloupe
The IIIF Image API server space has IIPImageServer (C++, lightweight), Cantaloupe (Java), Loris (Python, easy to extend), and others; the right pick depends on use case. We adopted Cantaloupe for:
- Built-in S3Source: reads directly from S3-compatible storage and generates JPEG tiles (no extra proxy)
- Single properties file for configuration: keeps Docker Compose self-contained
- Derivative cache: tiles, once generated, are served as fast as static files thereafter
We used the Docker image distributed by the Islandora project, which is widely adopted in the community (islandora/cantaloupe:<version> on Docker Hub). It uses confd templates to render cantaloupe.properties from CANTALOUPE_* environment variables, which keeps everything inside Docker Compose.
Key configuration
# Mode
source.static = S3Source
endpoint.iiif.3.enabled = true
# S3-compatible storage
S3Source.endpoint = ${S3_ENDPOINT}
S3Source.region = ${S3_REGION}
S3Source.access_key_id = ${S3_ACCESS_KEY}
S3Source.secret_access_key = ${S3_SECRET_KEY}
S3Source.path_style_access = true # important for many S3-compatibles incl. MinIO
S3Source.lookup_strategy = BasicLookupStrategy
S3Source.BasicLookupStrategy.bucket.name = ${S3_BUCKET}
Without path_style_access = true, requests are formed against AWS S3’s virtual-hosted-style URLs (<bucket>.<host>/key), which fail on many S3-compatible implementations.
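The two addressing styles are easiest to compare side by side. This toy helper (not part of Cantaloupe; purely illustrative) just shows the URL shapes the SDK forms in each mode:

```python
def object_url(endpoint: str, bucket: str, key: str,
               path_style: bool = True) -> str:
    """Show the two S3 addressing styles. With path_style_access = true
    Cantaloupe's SDK forms the first shape, which is what most
    S3-compatible backends (e.g. MinIO) expect; the second shape
    requires DNS for <bucket>.<host>, which they often lack."""
    host = endpoint.removeprefix("https://")
    if path_style:
        return f"https://{host}/{bucket}/{key}"   # path-style
    return f"https://{bucket}.{host}/{key}"       # virtual-hosted-style
```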
A pothole: AWS SDK credentials chain
When using a custom S3 endpoint with Cantaloupe 5.x we observed cases where S3Source.access_key_id was effectively ignored and the AWS SDK fell through to its default credentials chain (likely environment-dependent — we did not isolate the precise repro conditions). The properties alone produced [endpoint: null] [accessKeyID: null] in the logs and the requests failed with SignatureDoesNotMatch.
Workaround: also pass the standard AWS environment variables, redundantly.
environment:
- CANTALOUPE_S3SOURCE_ACCESS_KEY_ID=${S3_ACCESS_KEY}
- CANTALOUPE_S3SOURCE_SECRET_ACCESS_KEY=${S3_SECRET_KEY}
# belt-and-suspenders
- AWS_ACCESS_KEY_ID=${S3_ACCESS_KEY}
- AWS_SECRET_ACCESS_KEY=${S3_SECRET_KEY}
- AWS_REGION=${S3_REGION}
Take canvas dimensions from info.json
Canvas dimensions inside the manifest should come from info.json, not from the bibliographic record, for IIIF spec consistency. Cantaloupe decodes the actual image and reports dimensions — that’s the source of truth:
// Next.js /api/iiif/[id]/manifest.json
async function fetchImageInfo(iiifId: string) {
const r = await fetch(`${IIIF_BASE}/${iiifId}/info.json`,
{ next: { revalidate: 300 } }); // Next.js fetch cache (5 min)
if (!r.ok) return null;
return await r.json(); // { width, height, ... }
}
const info = await fetchImageInfo(doc.iiif_id);
const w = info?.width ?? doc.width ?? 1000; // ES value as fallback
const h = info?.height ?? doc.height ?? 1000;
The three-layer cache (Cantaloupe derivative + Next.js fetch + Cloudflare edge) keeps the practical overhead near zero.
Implementation differences with S3-compatible storage
There are often small implementation differences between commercial AWS S3 and S3-compatible implementations. Concrete potholes we hit:
1. AWS CLI v2.23+ flexible checksums get rejected
Error:
InvalidArgument: x-amz-content-sha256 must be UNSIGNED-PAYLOAD,
STREAMING-AWS4-HMAC-SHA256-PAYLOAD, or a valid sha256 value.
Workaround:
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
2. SignatureDoesNotMatch on keys containing CJK characters
PUT/HEAD on keys containing CJK characters can fail signature verification. The cause sits across both client behavior (older mc URL-encoding rules, historical boto3 issues, …) and server-side URL normalization differences. Reproducibility depends on the version combination.
→ The safe answer for any combination is to flatten keys to ASCII-only:
photos/<basename>.jpg
Round IDs and other classifications live as ES metadata such as extraction_round, while the S3 key is reserved for “the immutable address”. As a side effect:
- `iiif_id` becomes truly immutable
- Re-digitizations of the same physical photo deduplicate naturally on S3 via last-write-wins
- The CJK compatibility problem disappears entirely
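One possible shape for the flattening step, assuming delivered basenames may still contain CJK characters. The function name and the hash fallback are our illustration, not the production script:

```python
import hashlib
import re
import unicodedata

def flat_key(basename: str) -> str:
    """Flatten an arbitrary basename into an ASCII-only S3 key under
    photos/. The key carries no classification info (that lives in ES)
    and no non-ASCII bytes (avoiding the signature problems above)."""
    stem, _, ext = basename.rpartition(".")
    # NFKC first, so full-width digits become half-width before the
    # non-ASCII bytes are dropped.
    ascii_stem = unicodedata.normalize("NFKC", stem) \
        .encode("ascii", "ignore").decode()
    ascii_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", ascii_stem).strip("_")
    if not ascii_stem:
        # Name was entirely non-ASCII: fall back to a stable hash so
        # the same basename always maps to the same key.
        ascii_stem = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:12]
    return f"photos/{ascii_stem}.{ext or 'jpg'}"
```

Determinism matters here: the same basename must always produce the same key, or the last-write-wins deduplication stops working.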
3. IsTruncated=true without NextContinuationToken
Some implementations of list-objects-v2 do not strictly follow the spec for pagination — listings can cap silently at 1000, and StartAfter can be ignored.
→ When managing many objects, a safe principle is “ES is the source of truth, S3 is just bytes, LIST is a last resort”. We track upload state with an s3_uploaded field in ES, populated by the sync script as a bulk update on each successful PUT:
# scripts/sync_images_to_s3.py
class ESUpdateQueue:
def enqueue_uploaded(self, basename, size):
# After a successful PUT, extract item_id from the basename
# and enqueue an ES update setting s3_uploaded=true.
...
A facet on s3_uploaded:false lets us instantly surface docs whose images have not landed yet — useful for garbage collection, incremental sync, and health checks.
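That facet is a one-line query in the DSL. A sketch, with the aggregation name chosen for illustration:

```python
# Surface documents whose images have not landed on S3 yet, bucketed
# by delivery batch. Field names (s3_uploaded, extraction_round)
# follow the schema described above; "pending_by_round" is arbitrary.
pending_query = {
    "size": 0,  # we only want the counts, not the hits
    "query": {"term": {"s3_uploaded": False}},
    "aggs": {"pending_by_round": {"terms": {"field": "extraction_round"}}},
}
```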
Elasticsearch: Japanese substring search without plugins
Our production Elasticsearch cluster is shared with other teams, so we could not install the analysis-kuromoji plugin. We achieved Japanese substring search using an n-gram bigram analyzer instead:
INDEX_SETTINGS = {
"settings": {
"analysis": {
"tokenizer": {
"bigram_tokenizer": {
"type": "ngram",
"min_gram": 2, "max_gram": 2,
"token_chars": ["letter", "digit"],
}
},
"analyzer": {
"ja_text": {
"type": "custom",
"tokenizer": "bigram_tokenizer",
"filter": ["lowercase", "cjk_width"],
}
},
},
},
...
}
The query side uses match_phrase:
const should = Object.entries(searchFields).map(([field, opts]) => ({
match_phrase: { [field]: { query: term, boost: opts.weight ?? 1 } },
}));
n-gram + match_phrase behaves like substring search. A query for 形態素解析 is tokenized into the bigrams 形態 / 態素 / 素解 / 解析; match_phrase requires those tokens consecutively and in order, so any field containing 形態素解析 as a substring produces the same token sequence and matches. cjk_width normalization absorbs full-width / half-width drift.
Pros: no plugin, runs anywhere. Cons: single-character searches don’t match (because min_gram=2), but for CJK we treat that as “less noise” rather than a regression.
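A toy emulation of the tokenizer makes the mechanics concrete (it ignores the `token_chars` splitting on punctuation that the real `ngram` tokenizer also performs):

```python
def bigrams(text: str) -> list[str]:
    """Emulate what the ngram tokenizer with min_gram=max_gram=2 emits
    for a single run of letters/digits."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# The query and any field containing it as a substring yield the same
# consecutive bigram sequence, which is exactly what match_phrase tests.
# bigrams("形態素解析") → ["形態", "態素", "素解", "解析"]
```

This also shows why single-character queries fail: a one-character string produces no bigrams at all.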
Data pipeline: raw → normalize → index → sync
When data arrives from a provider in multiple batches, the folder structure tends to drift batch by batch. For example:
- Batch identifier naming variation: with/without date suffix, full-width vs half-width digits, sub-numbers, naming convention changing mid-stream
- Image folder name variation: classification subfolders sometimes present, sometimes flat, full-width hyphens leaking in
- Excel file naming variation: `before-revision` / `revised`, `sorted`, request-form files (which are metadata-management artifacts) mixed in
If we tried to absorb all of this in the indexer, normalization logic would balloon. Instead we extracted an intermediate normalize_rounds.py stage:
raw data (as-delivered, read-only)
│
▼ scripts/normalize_rounds.py
│ ・Normalize batch IDs to a unified format (full→half-width, zero-pad)
│ ・Pick exactly one Excel per category by priority
│ (revised > sorted > before-revision; drop request-forms)
│ ・Drop reference-only images (envelopes/sleeves) that aren't the photos
│ ・Emit meta.json
│ ・Everything is symlinks → no extra disk usage
▼
clean tree (symlink only, drift absorbed)
│
▼ indexer/index.py
│ ・Read bibliography Excel and bulk-upsert into ES
│ ・update + upsert: indexed_at set only on first insert,
│ updated_at refreshed every run (idempotent)
▼
Elasticsearch (metadata only)
clean tree
│
▼ scripts/sync_images_to_s3.py
│ ・boto3 + ThreadPoolExecutor (16 workers) for parallel PUT
│ ・Skip duplicate uploads via existing-key snapshot + size match
│ ・Bulk-update ES with s3_uploaded=true on each successful PUT
▼
S3-compatible storage (JPEGs only)
Building the staging stage out of symlinks lets us rebuild tens of thousands of files in seconds. Re-runs are idempotent and robust to drift in the delivery format.
Design point: don’t pull drift absorption into the indexer or sync. If you isolate raw → clean as its own stage, only the normalization rules need to change when delivery formats shift; the indexer and sync can keep relying on a “unified clean tree” and stay long-lived.
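The skip-and-upload core of the sync stage can be sketched as follows. Here `upload` and the `existing` snapshot stand in for the boto3 `put_object` call and the one-time pre-listing of the bucket, so the shape is illustrative rather than the production script:

```python
from concurrent.futures import ThreadPoolExecutor

def sync_images(files: dict, upload, existing: dict,
                max_workers: int = 16) -> dict:
    """Parallel-PUT sketch: skip keys already present with a matching
    size, upload the rest concurrently.

    files:    {basename: size} from the clean tree
    upload:   callable performing the PUT for one basename
    existing: {key: size} snapshot taken once up front, so no per-file
              HEAD requests are needed
    """
    def put(item):
        basename, size = item
        key = f"photos/{basename}"
        if existing.get(key) == size:
            return key, "skipped"   # already on S3, nothing to do
        upload(basename)            # on success the production script
        return key, "uploaded"      # also queues an ES s3_uploaded update
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(put, files.items()))
```

Taking the bucket snapshot once and comparing sizes locally is what keeps the run idempotent without leaning on the unreliable LIST pagination discussed earlier.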
Common potholes
| Item | Symptom | Workaround |
|---|---|---|
| Cloudflare Universal SSL, two-deep subdomain | TLS handshake fail | Stay one level deep, or use path-based ingress |
| AWS CLI 2.23+ checksums | x-amz-content-sha256 must be … | AWS_REQUEST_CHECKSUM_CALCULATION=when_required |
| CJK in S3 keys | SignatureDoesNotMatch | Flat ASCII keys |
| 1000-cap on S3-compatible LIST | Cannot verify post-sync state | Make ES the source of truth |
| Cantaloupe v5 S3Source auth | [accessKeyID: null] | Also pass AWS_ACCESS_KEY_ID etc. as env |
| Next.js `NEXT_PUBLIC_*` | Not reflected in client bundle | Pass via Dockerfile `ARG` + compose `build.args` |
| `mc mirror` walking speed in bash | Tens of minutes to stat 73k symlinks | Parallelize with Python + boto3 (orders of magnitude faster) |
“Publishing without publishing”: an operational pattern
This architecture realizes a third option between
- fully public (anyone can access) and
- fully private (internal file server):
a member-limited delivery that still complies with IIIF. It supports operations such as
- editorial staff using the IIIF viewer to inspect detail during proofing,
- sharing manifest URLs with partner institutions for use in Mirador within authenticated sessions, and
- granting access dynamically by email domain in response to researcher requests.
A primary characteristic of this architecture is that it builds the “do not think of public-vs-private as binary” middle layer in a fully IIIF-compliant form.
The operational benefit: you do not need to maintain a separate “publishable subset” IIIF instance. The original data lives in a single pipeline and viewer access is regulated entirely by Access policies.
Direction for extension: relationship with IIIF Auth API 2.0
This section summarizes a design direction only — it is not implemented in the architecture above. Implementation steps and code examples are deferred to a follow-up article after actual verification.
The Cloudflare Access cookie authentication described above is essentially HTTP / network-layer control. The IIIF community has standardized IIIF Authorization Flow API 2.0 as an application-layer (IIIF) authentication spec. The two layers can be stacked, with each playing a different role.
Two-layer view
| Layer | What it protects | What it tells the client |
|---|---|---|
| HTTP / network(Cloudflare Access) | Requests to URL paths and host names | Only “is this request authenticated?”. Unauthenticated requests get login HTML / 302 |
| IIIF / application(Auth API 2.0) | IIIF resources (images, etc.) | A spec-compliant service[] block that tells the viewer where the auth endpoints are and how to authenticate |
Stacking patterns
| Pattern | Use case | Where this article sits |
|---|---|---|
| HTTP layer only | Same-organization browser use | This article (implemented) |
| IIIF layer only | Public archive with member-only access on selected resources | (Different design) |
| Both layers stacked | Required when external IIIF clients need to interoperate | (Future extension, not implemented) |
Stacking both would broadly look like (details deferred to a follow-up):
- Split the Cloudflare Access policy into bypass (manifest / info.json / auth endpoints) + allow (image bytes)
- Embed an Auth API 2.0 `service[]` into info.json (Cantaloupe's delegate-script mechanism supports this; the exact method names still need verification)
- Implement Probe / Access / Token endpoints in Next.js, internally validating the Cloudflare Access cookie / JWT
- Add a Bearer-token validation layer on the tile path (the delegate script can host this)
Scope of this article
This article covers the HTTP layer only (Cloudflare Access), which is sufficient and operating in production for same-organization browser use. Stacking the IIIF layer on top is the next step, planned as a separate article once implemented and verified.
Summary
| Component | Role |
|---|---|
| S3-compatible storage | JPEG byte store (flat ASCII keys) |
| Cantaloupe | IIIF Image API (tile generation, dynamic info.json) |
| Elasticsearch | Search + metadata SoR (also tracks sync state via s3_uploaded) |
| Next.js | UI + manifest API (Presentation 3.0 generated JIT) |
| Cloudflare Tunnel | No inbound, single outbound for publication |
| Cloudflare Access | Auth gate (Email OTP / SSO) — covers Web UI and IIIF on the same origin |
The defining characteristic of this stack is that access control reduces to “attach an Access policy to a single origin”, and one authentication session covers both the Web UI and IIIF delivery. As a result, even for historical materials that cannot be made fully public for copyright, contractual, or ethical reasons, research and editorial workflows can use the IIIF benefits (spec-compliant high-resolution viewer, manifest delivery) within an authorized member scope. Interoperability with external systems / non-interactive clients does not work with this stack alone — it requires also implementing IIIF Auth API 2.0.
This article uses Cloudflare Access cookie authentication, which suits same-organization, same-browser use. For scenarios requiring interoperability with external IIIF clients, the direction is to stack IIIF Auth API 2.0 (the application-layer spec) on top of this architecture. We plan to write that up in a follow-up after implementation and verification.
I hope this is useful as a reference.