This article is co-written with generative AI. Facts have been verified against official documentation where possible, but errors may remain. Please check primary sources before making important decisions. Organization names, domain names, bucket names, and various identifiers are anonymized to focus on the essence of the architecture and procedure.
Original setup
The IIIF image delivery service I was operating ran Cantaloupe inside Docker on a single AWS EC2 (t3.large) instance. Drupal, its MariaDB, and Traefik were co-tenants on the same EC2, so multiple services shared one host.
Source images lived in an S3 bucket in a different AWS account, accessed via Cantaloupe's S3Source. CloudFront sat in front for TLS termination and edge caching in Japan.
Client → CloudFront (NRT edge) → EC2 (us-east-1, Traefik → Cantaloupe)
                                   ↓ S3Source (us-east-1)
                                 <legacy-source-bucket>
Looking at CloudWatch metrics, traffic over the past 30 days was about 260,000 requests / 8.4 GB transferred — peaks didn’t even reach 0.5 req/s.
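For reference, the request count came from a query along these lines; the distribution ID is a placeholder, CloudFront metrics live only in us-east-1, and GNU date is assumed:

# Sum of requests over the last 30 days for one distribution
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/CloudFront \
  --metric-name Requests \
  --dimensions Name=DistributionId,Value=<dist-id> Name=Region,Value=Global \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 2592000 \
  --statistics Sum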
Why migrate
Mainly two things: security and the maintenance burden of the Docker host (EC2).
Security overhead
As long as the EC2 is exposed to the internet, there's a steady stream of things to keep up with: OS security patches, managing SSH access paths, tracking vulnerabilities in the Docker daemon and each container image, following JVM/Java CVEs, and keeping Cantaloupe and the co-tenant Drupal/MariaDB up to date. Any misconfiguration or missed update becomes directly reachable from outside.
Going serverless moves the OS and Java runtime into AWS’s managed scope, and removes the SSH path entirely. The attack surface narrows to the Lambda function code, IAM policies, and CloudFront/WAF settings — fewer places to look.
Maintaining the Docker host
Cantaloupe, Drupal, MariaDB, and Traefik all shared the same t3.large, so a problem in one could spill into the others.
- A kernel update requiring a reboot affects all four services
- Disk pressure (cache directory, container logs, Docker overlay) needed self-monitoring
- Memory leaks in Java or Drupal could trigger OOM in unrelated services
- Backup, monitoring, and log shipping scripts were all written per-host, so adding a service was a heavy operation
A “no host” architecture makes most of this go away.
And as a bonus, cost
Running a t3.large 24/7 for traffic that never reached 0.5 req/s was clearly overprovisioned, so going serverless was expected to also lower cost as a side effect. The breakdown is in the cost section below.
Choosing serverless-iiif
Samvera’s serverless-iiif packages a Node.js + sharp (libvips) IIIF Image API server as a Lambda function. It’s distributed via the AWS Serverless Application Repository (SAR) and can also be used from SAM/CDK.
A single command from SAR is enough to deploy:
aws serverlessrepo create-cloud-formation-change-set \
--application-id arn:aws:serverlessrepo:us-east-1:625046682746:applications/serverless-iiif \
--semantic-version 7.0.1 \
--stack-name serverless-iiif-apne1 \
--capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
--parameter-overrides \
'[{"Name":"SourceBucket","Value":"<images-tokyo-bucket>"},{"Name":"ResolverTemplate","Value":"%s"}]'
Setting ResolverTemplate=%s makes the IIIF ID map directly to the S3 key, which keeps the URL shape compatible with what Cantaloupe was serving (e.g. /iiif/2/clioimg%2Fsho.tif/info.json).
Lambda spec
The default Lambda configuration in the SAR template:
| Field | Value |
|---|---|
| Memory | 3008 MB |
| Timeout | 10s |
| Runtime | Node.js 22 |
| Architecture | x86_64 |
Lambda CPU scales linearly with allocated memory; 3008 MB works out to roughly 1.7 vCPU (one full vCPU at 1,769 MB). JPEG encoding and TIFF decoding via sharp/libvips are non-trivial, so dropping memory translates directly into perceptible slowdown. Note also that the SAR template doesn't expose Architecture as a parameter, so going arm64 (Graviton) requires forking the template.
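If you want to probe that memory/CPU trade-off on a deployed stack, resizing is one call away (function name is a placeholder; a SAR redeploy will overwrite the setting):

aws lambda update-function-configuration \
  --function-name <name> \
  --memory-size 2048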
Cantaloupe vs serverless-iiif: characteristics
Their design philosophies are quite different, so here is the comparison I worked through during evaluation. Beyond "both serve IIIF," their design assumptions and operational shape barely overlap.
Configurability and customization
Cantaloupe exposes 100+ parameters in cantaloupe.properties. You can tune processor selection, quality, memory, authorization hooks, etc., and you can inject per-request logic via a Ruby Delegate Script. It supports multiple sources (S3 / Filesystem / HTTP / JDBC / custom) so ID resolution and metadata transformations can be made dynamic.
serverless-iiif exposes about 10 SAR parameters and any deeper tuning needs code changes. Source resolution is a single S3 bucket plus a printf-style ResolverTemplate; complex routing isn’t possible. The engine is fixed to sharp (libvips) — TurboJpeg, Kakadu and friends aren’t options.
Caching strategy
Cantaloupe has independent Source, Derivative, and Heap caches, so the application can hold persistent caches itself. serverless-iiif is stateless by Lambda’s nature and assumes you put CloudFront in front for caching. First-hit experience therefore depends heavily on CloudFront Hit/Miss.
Cold starts and long-running processing
Lambda has cold starts of a few hundred ms to ~1 s, and execution has hard limits (15-minute timeout; 6 MB response payload by default, larger with Function URL response streaming). Full-resolution conversion of a very large image can be a tight fit for serverless-iiif. Cantaloupe has no execution time cap, and once the JVM is warm it stays consistent.
Operations and scale
serverless-iiif is fully managed, autoscales horizontally up to the concurrency limit, and needs no OS patching. Cost approaches zero when idle. Cantaloupe needs an instance running at all times, sized for peak, and you keep paying with JVM tuning, cache disk management, and OS/Java/Cantaloupe version chasing. Availability is also limited on a single instance.
AuthZ and observability
Cantaloupe is more flexible for authorization — Delegate Script can implement IIIF Authentication API and similar. There’s a /admin UI for monitoring. To do similar things on serverless-iiif you write Lambda code or roll your own CloudFront Function / Lambda@Edge, and observation is mostly via CloudWatch Logs.
Putting it together
| Axis | Cantaloupe | serverless-iiif |
|---|---|---|
| Philosophy | Image-conversion workstation | CDN-style delivery of static images via IIIF |
| Configurability | Deep (100+ params + Delegate) | Shallow (~10 SAR params) |
| Sources | S3 / FS / HTTP / JDBC / custom | Single S3 bucket |
| Processors | TurboJpeg / OpenJpeg / Kakadu / etc. | sharp/libvips fixed |
| Persistent cache | 3 layers in-app | Externalized (CloudFront) |
| Cold start | None | Yes |
| Execution time cap | None | 15 min |
| Operations load | OS/Java/JVM/cache | Near zero |
| Scaling | Manual | Automatic |
| Floor cost | EC2 always-on | Near zero |
If you want fine control over source resolution, authorization, caching strategy, and output quality, Cantaloupe wins. If your goal is to publish images sitting in S3 over IIIF, with bursty/skewed access, serverless-iiif fits cleanly. The case here was firmly in the latter camp, so I went with serverless-iiif.
Deploying to both us-east-1 and Tokyo, and comparing
The source bucket was in us-east-1, so my first deploy was also us-east-1. A Tokyo Lambda doing cross-region S3 reads against us-east-1 would add ~150 ms of RTT to every call, so along with the second stack in ap-northeast-1 I created a copy bucket there, and compared the two.
Cold timings (CloudFront bypassed, per-request cache busted) measured from Tokyo:
| Operation | Cantaloupe origin (us-east-1) | serverless us-east-1 | serverless Tokyo |
|---|---|---|---|
| info.json | 550–580 ms | 1207–1314 ms | 790–900 ms |
| thumbnail 260w | 740–1260 ms | 1825–1928 ms | 1310–1412 ms |
| mid 1024w | 1117–3100 ms | 2735–3245 ms | 1453–1533 ms |
| tile 512 (region) | 734–1823 ms | 2026–2117 ms | 1297–2509 ms |
Pure rendering speed favors Cantaloupe (Java + TurboJpeg + internal caches), but once you add the JP→Virginia RTT, serverless Tokyo lands in roughly the same ballpark. Through the CloudFront Tokyo edge, warm responses to the same URL are about 50 ms for both — practically indistinguishable.
Watch out for the existing CloudFront
In the first comparison, Cantaloupe numbers came out unrealistically fast. Inspecting the response headers showed x-cache: Hit from cloudfront and x-amz-cf-pop: NRT20-P5 — there was a CloudFront in front of Cantaloupe and the Tokyo edge was answering directly, so I was just measuring the CDN. Querystring cache-busting didn't help (this CloudFront didn't include the querystring in the cache key); only varying size or region offsets per request let me actually measure the cold path.
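Mechanically, the probe that finally measured the cold path looked roughly like this (hostname and image ID are placeholders): vary a parameter that is part of the cache key, here the width, and check who answered.

# Each width is a distinct derivative, so the edge can't answer from cache;
# the printed headers show whether CloudFront or the origin served it
for w in 511 512 513; do
  curl -sS -o /dev/null -D - \
    "https://<production-domain>/iiif/2/<image-id>/full/${w},/0/default.jpg" \
    | grep -iE '^(x-cache|x-amz-cf-pop|via):'
done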
Hardening
The Lambda Function URL is publicly accessible by default with AuthType=NONE. I locked it down step by step.
OAC for Lambda Function URL
CloudFront’s Origin Access Control has a Lambda variant (OriginAccessControlOriginType: lambda). With it, CloudFront SigV4-signs every request to the Lambda Function URL, so the URL can be set to require AWS_IAM auth and only CloudFront can reach it.
aws cloudfront create-origin-access-control \
--origin-access-control-config '{
"Name":"serverless-iiif-apne1-oac",
"SigningProtocol":"sigv4",
"SigningBehavior":"always",
"OriginAccessControlOriginType":"lambda"
}'
After flipping the Function URL’s AuthType to AWS_IAM, hitting https://<id>.lambda-url.ap-northeast-1.on.aws/... directly with curl returns HTTP 403.
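For reference, the two Lambda-side pieces of that flip (function name, account ID, and distribution ID are placeholders): require SigV4 on the Function URL, then let only CloudFront's service principal invoke it.

# Require IAM auth on the Function URL
aws lambda update-function-url-config \
  --function-name <name> \
  --auth-type AWS_IAM

# Allow only this CloudFront distribution to invoke it
aws lambda add-permission \
  --function-name <name> \
  --statement-id AllowCloudFrontServicePrincipal \
  --action lambda:InvokeFunctionUrl \
  --principal cloudfront.amazonaws.com \
  --source-arn arn:aws:cloudfront::<account-id>:distribution/<dist-id>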
Reserved concurrency as a cost ceiling
To cap any runaway scenario, I set reserved concurrency to 50. Anything exceeding 50 concurrent invocations is throttled immediately (HTTP 429), putting a physical lid on cost runaway.
aws lambda put-function-concurrency \
--function-name <name> \
--reserved-concurrent-executions 50
50 was chosen as comfortable headroom over current peak (0.5 req/s × ~1.5 s ≈ 0.75 concurrent) and as a reasonable ceiling under attack.
WAF
A shared WAF inside the organization already had a host-scoped rate-based rule, so attaching it to the new CloudFront automatically applied the right policy.
Priority 0 : AWSManagedRulesAmazonIpReputationList
Priority 1 : AWSManagedRulesKnownBadInputsRuleSet
Priority 2 : AWSManagedRulesCommonRuleSet
Priority 10 : RateLimit-cantaloupe (host=<production-domain>, 5000 req/5min/IP)
Security headers via CloudFront
Attaching the managed SecurityHeadersPolicy (67f7725c-6f97-4210-82d7-5512b31e9d03) adds HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy and friends to responses.
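Since update-distribution takes the full config, attaching the managed policy is the usual get → edit → put dance. A sketch, with jq assumed and the distribution ID as a placeholder; the ID below is the managed SecurityHeadersPolicy:

ETAG=$(aws cloudfront get-distribution-config --id <dist-id> --query ETag --output text)
aws cloudfront get-distribution-config --id <dist-id> --query DistributionConfig > cfg.json
jq '.DefaultCacheBehavior.ResponseHeadersPolicyId = "67f7725c-6f97-4210-82d7-5512b31e9d03"' \
  cfg.json > cfg-new.json
aws cloudfront update-distribution --id <dist-id> --if-match "$ETAG" \
  --distribution-config file://cfg-new.json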
Cost estimate
Using the past 30 days (~260k requests, 8.4 GB transferred) and assuming a conservative 80% CloudFront hit rate:
| Item | Calculation | Monthly |
|---|---|---|
| Lambda compute (52,600 invocations × 1 s × 3008 MB) | ~154.5k GB-s × $0.0000166667 | $2.58 |
| Lambda requests | 52,600 × $0.20/M | $0.01 |
| CF requests (HTTPS Japan) | 263k × $0.0090/10k | $0.24 |
| CF data transfer (Japan tier) | 8.35 GB × $0.114 | $0.95 |
| S3 GET | 52,600 × $0.0004/1000 | $0.02 |
| S3 storage | 8 GB × $0.025 | $0.20 |
| Total | | ~$4/mo |
Compared with $60.74/mo for the t3.large alone, that’s an order of magnitude lower. 10× the traffic still lands at $30–40/mo, and 100× under the standard guards stays around $200–300/mo.
Where associate-alias got stuck during cutover
The plan was to use aws cloudfront associate-alias to atomically move the production hostname from the old CloudFront to the new one — that’s exactly the API designed for “minimize downtime when migrating an alias”.
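The call itself is a one-liner (distribution ID and hostname are placeholders):

aws cloudfront associate-alias \
  --target-distribution-id <new-dist-id> \
  --alias <production-domain>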
But every attempt returned:
An error occurred (IllegalUpdate) when calling the AssociateAlias operation:
Invalid or missing alias DNS TXT records.
The plural "records" made me try several naming patterns — _cf-source-id.<alias>, _cf-target-id.<alias>, _cf-validation.<alias> — adding each as a TXT record via Route53. None of them satisfied the API. The AWS docs even contain wording suggesting same-account moves don't need DNS validation, which doesn't match the observed behavior, and I couldn't pin down the correct TXT format.
In the end I gave up on associate-alias and switched to manually issuing update-distribution against both old and new distributions:
1. Empty Aliases on the old CF and revert its cert to CloudFrontDefaultCertificate
2. Add the host to the new CF's Aliases
Step 2 then ran into another trap, repeatedly rejected with CNAMEAlreadyExists:
One or more aliases specified for the distribution includes an
incorrectly configured DNS record that points to another CloudFront distribution.
You must update the DNS record to correct the problem.
While DNS still resolves to the old CloudFront’s domain, CloudFront refuses to add the alias to a different distribution because the DNS “points elsewhere”. Switching DNS to the new CF first, then adding the alias, succeeded.
So the actual order ended up being:
1. Remove the alias from the old CF (CF takes a few minutes for edge propagation)
2. Route53: change the production hostname's CNAME to the new CF
3. Add the alias to the new CF (succeeds because DNS now points at the target)
4. Wait for the new CF to deploy (~1–3 minutes)
The actual user-visible outage was about 2–3 minutes. Lowering the DNS TTL from 300 s to 60 s in advance kept client-side caching short, so the window closed roughly on schedule.
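Step 2 in CLI form, as a sketch; the hosted zone ID and the new distribution's domain are placeholders:

# Point the production hostname at the new distribution
aws route53 change-resource-record-sets \
  --hosted-zone-id <zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<production-domain>",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<new-cf-domain>.cloudfront.net"}]
      }
    }]
  }'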
ForceHost and the info.json @id
serverless-iiif builds the @id URLs in info.json from the request’s Host header. By default that means the CloudFront domain dxxxx.cloudfront.net ends up in @id. The SAR ForceHost parameter (or the forceHost Lambda environment variable) overrides this with the canonical hostname.
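Redeploying with the SAR parameter is the clean route; for a quick flip, setting the environment variable directly also works (function name is a placeholder):

# Caution: --environment replaces the entire variable map,
# so carry over any existing variables alongside forceHost
aws lambda update-function-configuration \
  --function-name <name> \
  --environment 'Variables={forceHost=<production-domain>}'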
After flipping it, the old @id is still in the CloudFront cache, so I issued an invalidation for /iiif/*.
aws cloudfront create-invalidation --distribution-id <dist-id> --paths "/iiif/*"
Cleanup after migration
After cutover, I stopped the Cantaloupe container and deleted the related orphan resources in order:
- The old CloudFront distribution (no aliases, dead origin)
- The old ACM certificate (no longer referenced by the old CF)
- The us-east-1 SAR stack and S3 bucket I’d created for benchmarking
- The old Cantaloupe source S3 bucket — including in the other AWS account: 8 GB / 1159 objects deleted
- The IAM user, access key, and custom policy that the old Cantaloupe used
- TXT records (3 variants) and the <origin-direct> A record left over from cutover validation
- The interim WAF rate-limit rule (replaced by the permanent rule, no longer needed)
Deleting the old CloudFront required first redeploying with Enabled=false, which cost about 5–10 minutes of waiting. Disable, then delete-distribution:
# Update with Enabled=false → wait → delete
aws cloudfront wait distribution-deployed --id <old-id>
aws cloudfront delete-distribution --id <old-id> --if-match <etag>
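The Enabled=false update the comment refers to follows the same get → edit → put pattern shown earlier (jq assumed), and runs before the wait:

ETAG=$(aws cloudfront get-distribution-config --id <old-id> --query ETag --output text)
aws cloudfront get-distribution-config --id <old-id> --query DistributionConfig > cfg.json
jq '.Enabled = false' cfg.json > cfg-disabled.json
aws cloudfront update-distribution --id <old-id> --if-match "$ETAG" \
  --distribution-config file://cfg-disabled.json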
The cleanup on the other AWS account (the source bucket and IAM user) naturally needs admin rights in that account; the credentials I had at hand were just a read-only key for that IAM user, so a separate admin profile was activated for those steps.
“The shared WAF is unattached, so you can delete it”
A different agent doing a separate inventory reported that “the shared WAF has zero attachments, so deleting it would save cost” — but cross-checking found the same Web ACL was attached to 20 active CloudFront distributions. Deleting it would have stripped IP Reputation, OWASP Common, and rate limits off 20 production services at once.
The agent likely ran the call below and took the absence of results as "unused":
# For CLOUDFRONT-scoped Web ACLs this returns ValidationException
aws wafv2 list-resources-for-web-acl --resource-type CLOUDFRONT ...
list-resources-for-web-acl doesn't accept CLOUDFRONT as a resource type; the call errors out, and the lack of results was read as an empty attachment list. The reliable way is to filter on WebACLId from CloudFront's side instead.
aws cloudfront list-distributions \
--query 'DistributionList.Items[?WebACLId==`<arn>`].[Id,Aliases.Items[0]]'
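CloudFront also ships a purpose-built query for exactly this question, which avoids the hand-rolled JMESPath filter (the Web ACL ARN is a placeholder):

aws cloudfront list-distributions-by-web-acl-id \
  --web-acl-id <web-acl-arn>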
The lesson: even when a report claims “no risk”, verify attachments yourself before issuing destructive commands.
Takeaways
- An overlooked CloudFront skews benchmarks. Always check `Via` and `x-cache` to confirm the path.
- Lambda Function URL + OAC for Lambda exposes a Lambda to the public web securely without API Gateway.
- Reserved concurrency is an effective physical guard against cost blowouts.
- `associate-alias` doesn't always go through. Keep a manual `update-distribution` fallback ready.
- CloudFront verifies DNS consistency when adding an alias, so the migration order may need to be "switch DNS → add alias".
- While DNS points at a distribution that doesn't (yet) carry the alias, the hostname serves CloudFront errors for a few minutes. Lowering the DNS TTL in advance shortens the window.
- “Unused, safe to delete” advice should be cross-checked against the actual attachments using a separate query before acting on it.