This article is co-written with generative AI. Facts have been verified against official documentation where possible, but errors may remain. Please check primary sources before making important decisions. Organization names, domain names, bucket names, and various identifiers are anonymized to focus on the essence of the architecture and procedure.
Original setup
The IIIF image delivery service I was operating ran Cantaloupe inside Docker on a single AWS EC2 (t3.large) instance. Drupal, its MariaDB, and Traefik were co-tenants on the same EC2, so multiple services shared one host.
Source images lived in an S3 bucket in a different AWS account, accessed via Cantaloupe's S3Source. CloudFront sat in front for TLS termination and edge caching in Japan.
Client → CloudFront (NRT edge) → EC2 (us-east-1, Traefik → Cantaloupe)
                                   ↓ S3Source (us-east-1)
                                 <legacy-source-bucket>
Looking at CloudWatch metrics, traffic over the past 30 days was about 260,000 requests / 8.4 GB transferred — peaks didn’t even reach 0.5 req/s.
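For reference, the request count came from a query along these lines; the distribution ID is a placeholder, CloudFront metrics live only in us-east-1, and GNU date is assumed:

# Sum of requests over the last 30 days for one distribution
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/CloudFront \
  --metric-name Requests \
  --dimensions Name=DistributionId,Value=<dist-id> Name=Region,Value=Global \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 2592000 \
  --statistics Sum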
Why migrate
Mainly two things: security and the maintenance burden of the Docker host (EC2).
Security overhead
As long as the EC2 is exposed to the internet, there's a steady stream of things to keep up with: OS security patches, managing SSH access paths, tracking vulnerabilities in the Docker daemon and each container image, following JVM/Java CVEs, and keeping Cantaloupe and the co-tenant Drupal/MariaDB up to date. Any misconfiguration or missed update becomes directly reachable from outside.
Going serverless moves the OS and Java runtime into AWS’s managed scope, and removes the SSH path entirely. The attack surface narrows to the Lambda function code, IAM policies, and CloudFront/WAF settings — fewer places to look.
Maintaining the Docker host
Cantaloupe, Drupal, MariaDB, and Traefik all shared the same t3.large, so a problem in one could spill into the others.
- A kernel update requiring a reboot affects all four services
- Disk pressure (cache directory, container logs, Docker overlay) needed self-monitoring
- Memory leaks in Java or Drupal could trigger OOM in unrelated services
- Backup, monitoring, and log shipping scripts were all written per-host, so adding a service was a heavy operation
A “no host” architecture makes most of this go away.
And as a bonus, cost
Running a t3.large 24/7 for traffic that never reached 0.5 req/s was clearly overprovisioned, so going serverless was expected to also lower cost as a side effect. The breakdown is in the cost section below.
Choosing serverless-iiif
Samvera’s serverless-iiif packages a Node.js + sharp (libvips) IIIF Image API server as a Lambda function. It’s distributed via the AWS Serverless Application Repository (SAR) and can also be used from SAM/CDK.
A single command from SAR is enough to deploy:
aws serverlessrepo create-cloud-formation-change-set \
--application-id arn:aws:serverlessrepo:us-east-1:625046682746:applications/serverless-iiif \
--semantic-version 7.0.1 \
--stack-name serverless-iiif-apne1 \
--capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
--parameter-overrides \
'[{"Name":"SourceBucket","Value":"<images-tokyo-bucket>"},{"Name":"ResolverTemplate","Value":"%s"}]'
Setting ResolverTemplate=%s makes the IIIF ID map directly to the S3 key, which keeps the URL shape compatible with what Cantaloupe was serving (e.g. /iiif/2/clioimg%2Fsho.tif/info.json).
Lambda spec
The default Lambda configuration in the SAR template:
| Field | Value |
|---|---|
| Memory | 3008 MB |
| Timeout | 10s |
| Runtime | Node.js 22 |
| Architecture | x86_64 |
Lambda CPU scales linearly with allocated memory; 3008 MB works out to roughly 1.7 vCPU (one full vCPU at 1,769 MB). JPEG encoding and TIFF decoding via sharp/libvips are non-trivial, so dropping memory translates directly into perceptible slowdown. Note also that the SAR template doesn't expose Architecture as a parameter, so going arm64 (Graviton) requires forking the template.
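If you want to probe that memory/CPU trade-off on a deployed stack, resizing is one call away (function name is a placeholder; a SAR redeploy will overwrite the setting):

aws lambda update-function-configuration \
  --function-name <name> \
  --memory-size 2048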
Cantaloupe vs serverless-iiif: characteristics
Their design philosophies are quite different, so here is the comparison I worked through during evaluation. Beyond "both serve IIIF," their design assumptions and operational shape barely overlap.
Configurability and customization
Cantaloupe exposes 100+ parameters in cantaloupe.properties. You can tune processor selection, quality, memory, authorization hooks, etc., and you can inject per-request logic via a Ruby Delegate Script. It supports multiple sources (S3 / Filesystem / HTTP / JDBC / custom) so ID resolution and metadata transformations can be made dynamic.
serverless-iiif exposes about 10 SAR parameters and any deeper tuning needs code changes. Source resolution is a single S3 bucket plus a printf-style ResolverTemplate; complex routing isn’t possible. The engine is fixed to sharp (libvips) — TurboJpeg, Kakadu and friends aren’t options.
Caching strategy
Cantaloupe has independent Source, Derivative, and Heap caches, so the application can hold persistent caches itself. serverless-iiif is stateless by Lambda’s nature and assumes you put CloudFront in front for caching. First-hit experience therefore depends heavily on CloudFront Hit/Miss.
Cold starts and long-running processing
Lambda has cold starts of a few hundred ms to ~1 s, and execution has hard limits (15-minute timeout; 6 MB response payload by default, larger with Function URL response streaming). Full-resolution conversion of a very large image can be a tight fit for serverless-iiif. Cantaloupe has no execution time cap, and once the JVM is warm it stays consistent.
Operations and scale
serverless-iiif is fully managed, autoscales horizontally up to the concurrency limit, and needs no OS patching. Cost approaches zero when idle. Cantaloupe needs an instance running at all times, sized for peak, and you keep paying with JVM tuning, cache disk management, and OS/Java/Cantaloupe version chasing. Availability is also limited on a single instance.
AuthZ and observability
Cantaloupe is more flexible for authorization — Delegate Script can implement IIIF Authentication API and similar. There’s a /admin UI for monitoring. To do similar things on serverless-iiif you write Lambda code or roll your own CloudFront Function / Lambda@Edge, and observation is mostly via CloudWatch Logs.
Putting it together
| Axis | Cantaloupe | serverless-iiif |
|---|---|---|
| Philosophy | Image-conversion workstation | CDN-style delivery of static images via IIIF |
| Configurability | Deep (100+ params + Delegate) | Shallow (~10 SAR params) |
| Sources | S3 / FS / HTTP / JDBC / custom | Single S3 bucket |
| Processors | TurboJpeg / OpenJpeg / Kakadu / etc. | sharp/libvips fixed |
| Persistent cache | 3 layers in-app | Externalized (CloudFront) |
| Cold start | None | Yes |
| Execution time cap | None | 15 min |
| Operations load | OS/Java/JVM/cache | Near zero |
| Scaling | Manual | Automatic |
| Floor cost | EC2 always-on | Near zero |
If you want fine control over source resolution, authorization, caching strategy, and output quality, Cantaloupe wins. If your goal is to publish images sitting in S3 over IIIF, with bursty/skewed access, serverless-iiif fits cleanly. The case here was firmly in the latter camp, so I went with serverless-iiif.
Deploying to both us-east-1 and Tokyo, and comparing
The source bucket was in us-east-1, so my first deploy was also us-east-1. A Tokyo Lambda doing cross-region S3 reads against us-east-1 would add ~150 ms of RTT to every call, so along with the second stack in ap-northeast-1 I created a copy bucket there, and compared the two.
Cold timings (CloudFront bypassed, per-request cache busted) measured from Tokyo:
| Operation | Cantaloupe origin (us-east-1) | serverless us-east-1 | serverless Tokyo |
|---|---|---|---|
| info.json | 550–580 ms | 1207–1314 ms | 790–900 ms |
| thumbnail 260w | 740–1260 ms | 1825–1928 ms | 1310–1412 ms |
| mid 1024w | 1117–3100 ms | 2735–3245 ms | 1453–1533 ms |
| tile 512 (region) | 734–1823 ms | 2026–2117 ms | 1297–2509 ms |
Pure rendering speed favors Cantaloupe (Java + TurboJpeg + internal caches), but once you add the JP→Virginia RTT, serverless Tokyo lands in roughly the same ballpark. Through the CloudFront Tokyo edge, warm responses to the same URL are about 50 ms for both — practically indistinguishable.
Watch out for the existing CloudFront
In the first comparison, Cantaloupe numbers came out unrealistically fast. Inspecting the response headers showed x-cache: Hit from cloudfront and x-amz-cf-pop: NRT20-P5 — there was a CloudFront in front of Cantaloupe and the Tokyo edge was answering directly, so I was just measuring the CDN. Querystring cache-busting didn't help (this CloudFront didn't include the querystring in the cache key); only varying size or region offsets per request let me actually measure the cold path.
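Mechanically, the probe that finally measured the cold path looked roughly like this (hostname and image ID are placeholders): vary a parameter that is part of the cache key, here the width, and check who answered.

# Each width is a distinct derivative, so the edge can't answer from cache;
# the printed headers show whether CloudFront or the origin served it
for w in 511 512 513; do
  curl -sS -o /dev/null -D - \
    "https://<production-domain>/iiif/2/<image-id>/full/${w},/0/default.jpg" \
    | grep -iE '^(x-cache|x-amz-cf-pop|via):'
done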
Hardening
The Lambda Function URL is publicly accessible by default with AuthType=NONE. I locked it down step by step.
OAC for Lambda Function URL
CloudFront’s Origin Access Control has a Lambda variant (OriginAccessControlOriginType: lambda). With it, CloudFront SigV4-signs every request to the Lambda Function URL, so the URL can be set to require AWS_IAM auth and only CloudFront can reach it.
aws cloudfront create-origin-access-control \
--origin-access-control-config '{
"Name":"serverless-iiif-apne1-oac",
"SigningProtocol":"sigv4",
"SigningBehavior":"always",
"OriginAccessControlOriginType":"lambda"
}'
After flipping the Function URL’s AuthType to AWS_IAM, hitting https://<id>.lambda-url.ap-northeast-1.on.aws/... directly with curl returns HTTP 403.
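For reference, the two Lambda-side pieces of that flip (function name, account ID, and distribution ID are placeholders): require SigV4 on the Function URL, then let only CloudFront's service principal invoke it.

# Require IAM auth on the Function URL
aws lambda update-function-url-config \
  --function-name <name> \
  --auth-type AWS_IAM

# Allow only this CloudFront distribution to invoke it
aws lambda add-permission \
  --function-name <name> \
  --statement-id AllowCloudFrontServicePrincipal \
  --action lambda:InvokeFunctionUrl \
  --principal cloudfront.amazonaws.com \
  --source-arn arn:aws:cloudfront::<account-id>:distribution/<dist-id>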
Reserved concurrency as a cost ceiling
To cap any runaway scenario, I set reserved concurrency to 50. Anything exceeding 50 concurrent invocations is throttled immediately (HTTP 429), putting a physical lid on cost runaway.
aws lambda put-function-concurrency \
--function-name <name> \
--reserved-concurrent-executions 50
50 was chosen as comfortable headroom over current peak (0.5 req/s × ~1.5 s ≈ 0.75 concurrent) and as a reasonable ceiling under attack.
WAF
A shared WAF inside the organization already had a host-scoped rate-based rule, so attaching it to the new CloudFront automatically applied the right policy.
Priority 0 : AWSManagedRulesAmazonIpReputationList
Priority 1 : AWSManagedRulesKnownBadInputsRuleSet
Priority 2 : AWSManagedRulesCommonRuleSet
Priority 10 : RateLimit-cantaloupe (host=<production-domain>, 5000 req/5min/IP)
Security headers via CloudFront
Attaching the managed SecurityHeadersPolicy (67f7725c-6f97-4210-82d7-5512b31e9d03) adds HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy and friends to responses.
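Since update-distribution takes the full config, attaching the managed policy is the usual get → edit → put dance. A sketch, with jq assumed and the distribution ID as a placeholder; the ID below is the managed SecurityHeadersPolicy:

ETAG=$(aws cloudfront get-distribution-config --id <dist-id> --query ETag --output text)
aws cloudfront get-distribution-config --id <dist-id> --query DistributionConfig > cfg.json
jq '.DefaultCacheBehavior.ResponseHeadersPolicyId = "67f7725c-6f97-4210-82d7-5512b31e9d03"' \
  cfg.json > cfg-new.json
aws cloudfront update-distribution --id <dist-id> --if-match "$ETAG" \
  --distribution-config file://cfg-new.json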
Cost estimate
Using the past 30 days (~260k requests, 8.4 GB transferred) and assuming a conservative 80% CloudFront hit rate:
| Item | Calculation | Monthly |
|---|---|---|
| Lambda compute (52,600 invocations × 1 s × 3008 MB) | ~154.5k GB-s × $0.0000166667 | $2.58 |
| Lambda requests | 52,600 × $0.20/M | $0.01 |
| CF requests (HTTPS Japan) | 263k × $0.0090/10k | $0.24 |
| CF data transfer (Japan tier) | 8.35 GB × $0.114 | $0.95 |
| S3 GET | 52,600 × $0.0004/1000 | $0.02 |
| S3 storage | 8 GB × $0.025 | $0.20 |
| Total | | ~$4/mo |
Compared with $60.74/mo for the t3.large alone, that’s an order of magnitude lower. 10× the traffic still lands at $30–40/mo, and 100× under the standard guards stays around $200–300/mo.
Where associate-alias got stuck during cutover
The plan was to use aws cloudfront associate-alias to atomically move the production hostname from the old CloudFront to the new one — that’s exactly the API designed for “minimize downtime when migrating an alias”.
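The call itself is a one-liner (distribution ID and hostname are placeholders):

aws cloudfront associate-alias \
  --target-distribution-id <new-dist-id> \
  --alias <production-domain>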
But every attempt returned:
An error occurred (IllegalUpdate) when calling the AssociateAlias operation:
Invalid or missing alias DNS TXT records.
The plural "records" made me try several naming patterns — _cf-source-id.<alias>, _cf-target-id.<alias>, _cf-validation.<alias> — adding each as a TXT record via Route53. None of them satisfied the API. The AWS docs even contain wording suggesting same-account moves don't need DNS validation, which doesn't match the observed behavior, and I couldn't pin down the correct TXT format.
In the end I gave up on associate-alias and switched to manually issuing update-distribution against both old and new distributions:
1. Empty Aliases on the old CF and revert its cert to CloudFrontDefaultCertificate
2. Add the host to the new CF's Aliases
Step 2 then ran into another trap, repeatedly rejected with CNAMEAlreadyExists:
One or more aliases specified for the distribution includes an
incorrectly configured DNS record that points to another CloudFront distribution.
You must update the DNS record to correct the problem.
While DNS still resolves to the old CloudFront’s domain, CloudFront refuses to add the alias to a different distribution because the DNS “points elsewhere”. Switching DNS to the new CF first, then adding the alias, succeeded.
So the actual order ended up being:
1. Remove the alias from the old CF (CF takes a few minutes for edge propagation)
2. Route53: change the production hostname's CNAME to the new CF
3. Add the alias to the new CF (succeeds because DNS now points at the target)
4. Wait for the new CF to deploy (~1–3 minutes)
The actual user-visible outage was about 2–3 minutes. Lowering the DNS TTL from 300 s to 60 s in advance kept client-side caching short, so the window closed roughly on schedule.
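Step 2 in CLI form, as a sketch; the hosted zone ID and the new distribution's domain are placeholders:

# Point the production hostname at the new distribution
aws route53 change-resource-record-sets \
  --hosted-zone-id <zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<production-domain>",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<new-cf-domain>.cloudfront.net"}]
      }
    }]
  }'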
ForceHost and the info.json @id
serverless-iiif builds the @id URLs in info.json from the request’s Host header. By default that means the CloudFront domain dxxxx.cloudfront.net ends up in @id. The SAR ForceHost parameter (or the forceHost Lambda environment variable) overrides this with the canonical hostname.
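Redeploying with the SAR parameter is the clean route; for a quick flip, setting the environment variable directly also works (function name is a placeholder):

# Caution: --environment replaces the entire variable map,
# so carry over any existing variables alongside forceHost
aws lambda update-function-configuration \
  --function-name <name> \
  --environment 'Variables={forceHost=<production-domain>}'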
After flipping it, the old @id is still in the CloudFront cache, so I issued an invalidation for /iiif/*.
aws cloudfront create-invalidation --distribution-id <dist-id> --paths "/iiif/*"
Cleanup after migration
After cutover, I stopped the Cantaloupe container and deleted the related orphan resources in order:
- The old CloudFront distribution (no aliases, dead origin)
- The old ACM certificate (no longer referenced by the old CF)
- The us-east-1 SAR stack and S3 bucket I’d created for benchmarking
- The old Cantaloupe source S3 bucket — including in the other AWS account: 8 GB / 1159 objects deleted
- The IAM user, access key, and custom policy that the old Cantaloupe used
- TXT records (3 variants) and the <origin-direct> A record left over from cutover validation
- The interim WAF rate-limit rule (replaced by the permanent rule, no longer needed)
Deleting the old CloudFront required first redeploying with Enabled=false, which cost about 5–10 minutes of waiting. Disable, then delete-distribution:
# Update with Enabled=false → wait → delete
aws cloudfront wait distribution-deployed --id <old-id>
aws cloudfront delete-distribution --id <old-id> --if-match <etag>
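The Enabled=false update the comment refers to follows the same get → edit → put pattern shown earlier (jq assumed), and runs before the wait:

ETAG=$(aws cloudfront get-distribution-config --id <old-id> --query ETag --output text)
aws cloudfront get-distribution-config --id <old-id> --query DistributionConfig > cfg.json
jq '.Enabled = false' cfg.json > cfg-disabled.json
aws cloudfront update-distribution --id <old-id> --if-match "$ETAG" \
  --distribution-config file://cfg-disabled.json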
The cleanup on the other AWS account (the source bucket and IAM user) naturally needs admin rights in that account; the credentials I had at hand were just a read-only key for that IAM user, so a separate admin profile was activated for those steps.
“The shared WAF is unattached, so you can delete it”
A different agent doing a separate inventory reported that “the shared WAF has zero attachments, so deleting it would save cost” — but cross-checking found the same Web ACL was attached to 20 active CloudFront distributions. Deleting it would have stripped IP Reputation, OWASP Common, and rate limits off 20 production services at once.
The agent likely ran the call below and took the absence of results as "unused":
# For CLOUDFRONT-scoped Web ACLs this returns ValidationException
aws wafv2 list-resources-for-web-acl --resource-type CLOUDFRONT ...
list-resources-for-web-acl doesn't accept CLOUDFRONT as a resource type; the call errors out, and the lack of results was read as an empty attachment list. The reliable way is to filter on WebACLId from CloudFront's side instead.
aws cloudfront list-distributions \
--query 'DistributionList.Items[?WebACLId==`<arn>`].[Id,Aliases.Items[0]]'
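CloudFront also ships a purpose-built query for exactly this question, which avoids the hand-rolled JMESPath filter (the Web ACL ARN is a placeholder):

aws cloudfront list-distributions-by-web-acl-id \
  --web-acl-id <web-acl-arn>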
The lesson: even when a report claims “no risk”, verify attachments yourself before issuing destructive commands.
Takeaways
- An overlooked CloudFront skews benchmarks. Always check `Via` and `x-cache` to confirm the path.
- Lambda Function URL + OAC for Lambda exposes a Lambda to the public web securely without API Gateway.
- Reserved concurrency is an effective physical guard against cost blowouts.
- `associate-alias` doesn't always go through. Keep a manual `update-distribution` fallback ready.
- CloudFront verifies DNS consistency when adding an alias, so the migration order may need to be "switch DNS → add alias".
- While DNS points at a distribution that doesn't (yet) carry the alias, the hostname serves CloudFront errors for a few minutes. Lowering the DNS TTL in advance shortens the window.
- “Unused, safe to delete” advice should be cross-checked against the actual attachments using a separate query before acting on it.