This post documents the steps taken to improve tile delivery performance on a Cantaloupe IIIF image server running on AWS, starting from an initial cold tile fetch time of 8.8 seconds.

Environment

  • AWS EC2 us-east-1, 2 vCPU (t3.large), 7.6 GB RAM
  • Cantaloupe: islandora/cantaloupe:6.3.12 (Cantaloupe 5.0.7, final release)
  • Source storage: S3Source
  • Reverse proxy: Traefik with Let’s Encrypt TLS
  • Test image: clioimg/shyuga.tif (46825×28127 px)

Initial State

Cold tile requests were taking 8.8 seconds. Examining the setup revealed several contributing factors:

Item                 State
JVM heap limit       512 MB (default)
FilesystemCache      386 MB / 13,950 files (active)
Cache volume         Not persisted (started with docker-compose.yml instead of docker-compose.prod.yml)
Source image format  Strip TIFF (created with ImageMagick convert); each tile requires reading ~9 chunks (~2 MB each) from S3

Fix 1: Convert to Pyramid Tiled TIFF

The largest single improvement came from converting the source TIFF from strip layout to pyramid tiled format. A strip TIFF has no internal resolution levels; Cantaloupe must read multiple sequential chunks from S3 to reconstruct any given tile region. A pyramid tiled TIFF stores pre-computed resolution levels with 256×256 tiles, so each tile request reads only the relevant portion of the file.

vips tiffsave input.tif output.tif --tile --pyramid --tile-width 256 --tile-height 256 --compression jpeg --Q 85

Output size: 219 MB (Q85, pyramid tiled TIFF).
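As a rough sanity check on the result, the number of resolution levels in the pyramid can be estimated by halving the longest side until it fits in a single tile. This assumes vips halves each level and stops once the image fits in one tile; verify the actual file with tiffinfo.

```shell
# Estimate pyramid depth: halve the longest side until it fits in one
# tile. Assumption: vips halves per level and stops at a single tile.
levels() {
  side=$1; tile=${2:-256}; n=1
  while [ "$side" -gt "$tile" ]; do
    side=$(( (side + 1) / 2 ))
    n=$(( n + 1 ))
  done
  echo "$n"
}

levels 46825   # longest side of the 46825x28127 test image -> 9
```

Each of those levels lets Cantaloupe serve zoomed-out views without decoding the full-resolution plane.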

Metric                                Strip TIFF (before)   ptif Q85 (after)
First tile (cold)                     8.8s                  0.7–1.3s
Second tile (cold, different region)  1.7s                  0.7s
Cached tile                           0.04s                 0.01s
S3 chunk reads                        ~9                    few
Image processing time                 1073ms                510ms
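Timings like these are straightforward to reproduce with curl's -w timing variables; the URL below is a placeholder to be replaced with the real IIIF endpoint.

```shell
# Time one tile request end to end. The URL is a placeholder; -w prints
# curl's measured total transfer time for the request.
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  "https://example.com/iiif/2/clioimg%2Fshyuga.tif/0,0,256,256/full/0/default.jpg"
```

Run it twice to compare cold vs. cached times for the same tile.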

Fix 2: Increase JVM Heap

The container memory limit is 2 GB, but Cantaloupe was running with the JVM default heap of 512 MB. The islandora/cantaloupe:6.3.12 startup script supports CANTALOUPE_HEAP_MIN and CANTALOUPE_HEAP_MAX environment variables directly:

CANTALOUPE_HEAP_MIN: "512m"
CANTALOUPE_HEAP_MAX: "1536m"

Note that JAVA_OPTS is not referenced by the startup script and has no effect. In the older 2.0.10 image, these environment variables were not available, requiring a custom startup script to be mounted.
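In docker-compose terms, the relevant service section looks roughly like this (a sketch; the service name and the `mem_limit` key are assumptions based on the 2 GB container limit described above):

```yaml
services:
  cantaloupe:
    image: islandora/cantaloupe:6.3.12
    environment:
      CANTALOUPE_HEAP_MIN: "512m"
      CANTALOUPE_HEAP_MAX: "1536m"   # leaves headroom below the 2 GB container limit
    mem_limit: 2g
```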

Fix 3: Persist the Cache Volume

docker-compose.prod.yml had cantaloupe_cache:/data configured correctly, but the service had been started with the plain docker-compose.yml, which omitted the named volume. As a result, the on-disk tile cache was lost on every container restart. Switching to the production compose file ensured the FilesystemCache persisted across restarts.
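For reference, the compose fragment that makes the cache survive restarts is a named volume mapped to /data (a sketch; the service name is an assumption, the volume name follows the production file):

```yaml
services:
  cantaloupe:
    volumes:
      - cantaloupe_cache:/data   # FilesystemCache lives here

volumes:
  cantaloupe_cache:              # named volume: persists across container restarts
```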

Fix 4: Upgrade the Cantaloupe Image

The server was initially running islandora/cantaloupe:2.0.10 (Cantaloupe 5.0.5). Upgrading to islandora/cantaloupe:6.3.12 (Cantaloupe 5.0.7) produced a measurable improvement in image processing time:

Metric                 2.0.10 (5.0.5)   6.3.12 (5.0.7)
First tile (cold)      1.3s             0.84s
Image processing time  510ms            302ms

Fix 5: CloudFront CDN

The server is in us-east-1; most users access it from Japan (~146ms RTT). Adding CloudFront with a Tokyo edge location reduces perceived latency significantly for cache hits.

Before: User (Japan) ──146ms RTT──→ EC2 (us-east-1) → Cantaloupe → S3
After:  User (Japan) ──few ms──→ CloudFront Tokyo edge ──→ EC2 → Cantaloupe → S3
                                  ↑ cache hit: served here

CloudFront configuration:

  • Origin: origin domain on port 443, HTTPS only (via Traefik)
  • Cache TTL: min 1 day, default 30 days, max 1 year
  • Price Class: PriceClass_200
  • HTTP/2 + HTTP/3 enabled
  • ACM certificate: issued in us-east-1 (CloudFront requires certificates in that region)
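Whether a given response was served from the edge can be confirmed from the response headers: CloudFront reports cache status in an X-Cache header. The URL below is a placeholder.

```shell
# Inspect CloudFront cache status for one request (placeholder URL).
# "X-Cache: Hit from cloudfront" means the edge served it;
# "Miss from cloudfront" means the request went back to the origin.
curl -sI "https://example.com/iiif/2/clioimg%2Fshyuga.tif/info.json" \
  | grep -i '^x-cache'
```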

info.json @id issue

Cantaloupe generates the @id field in info.json from the incoming Host header. When CloudFront forwards a request to the origin, the Host header is the origin domain rather than the CloudFront domain. This causes Mirador to read the @id value and then bypass CloudFront, sending subsequent tile requests directly to the origin.

The fix is to set CANTALOUPE_BASE_URI to the CloudFront domain:

CANTALOUPE_BASE_URI: "https://<public-domain>"

With this set, Cantaloupe always embeds the CloudFront domain in @id, and Mirador routes all requests through the CDN.

Parallel tile fetch results (Mirador simulation, 20 tiles)

Scenario                  Total time
CloudFront miss (cold)    6.2s
CloudFront hit (cached)   1.3s from the US; ~0.1–0.2s from Japan
Direct localhost (cold)   4.4s

What Didn’t Help

CacheStrategy: Switching processor.stream_retrieval_strategy from StreamStrategy to CacheStrategy (downloads the full source file locally before processing) made cold requests slower — 4.8s vs. 0.84s. When the source is already a tiled TIFF on S3, downloading the entire file is unnecessary overhead. Reverted.
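For reference, the setting as left after reverting, in cantaloupe.properties:

```
processor.stream_retrieval_strategy = StreamStrategy
```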

TurboJpegProcessor: The config was set to use TurboJpegProcessor, but it falls back to Java2dProcessor in practice. In the 2.0.10 image, the bundled Java bindings are incompatible with the OS-level libjpeg-turbo 2.1.5 API:

Failed to initialize TurboJpegProcessor
(error: 'org.libjpegturbo.turbojpeg.TJScalingFactor[] org.libjpegturbo.turbojpeg.TJ.getScalingFactors()')

In the 6.3.12 image, the library compatibility issue is resolved, but TurboJpegProcessor does not support TIFF sources, so Java2dProcessor is used regardless.

Potential Further Improvements

  1. Cache pre-warm: pre-requesting all tiles for frequently accessed images would ensure most user requests are served from CloudFront edge. Once cached there, cold-start latency becomes a non-issue for those images.

  2. Larger instance: at 2 vCPU (t3.large), CPU reaches ~122% (of the 200% available across both cores) during Mirador parallel tile requests. Upgrading to t3.xlarge (4 vCPU / 16 GB) would roughly double cold parallel throughput.

  3. Tokyo region migration: moving EC2 and S3 to ap-northeast-1 would reduce origin latency on CloudFront misses. With a high CDN cache hit rate, the practical impact is limited.
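The pre-warm idea in item 1 can be sketched as a script that enumerates tile region URLs and requests each one. The base URL, identifier, and IIIF 2.x URL layout below are assumptions to adapt to the real deployment.

```shell
# Emit one IIIF region URL per 256x256 tile at full resolution. Pipe the
# output to xargs/curl to actually warm the CDN. Base URL and identifier
# are placeholders.
prewarm_urls() {
  base=$1; id=$2; width=$3; height=$4; tile=${5:-256}
  y=0
  while [ "$y" -lt "$height" ]; do
    x=0
    while [ "$x" -lt "$width" ]; do
      # Clamp edge tiles to the image bounds.
      w=$(( width - x ));  [ "$w" -gt "$tile" ] && w=$tile
      h=$(( height - y )); [ "$h" -gt "$tile" ] && h=$tile
      echo "${base}/${id}/${x},${y},${w},${h}/full/0/default.jpg"
      x=$(( x + tile ))
    done
    y=$(( y + tile ))
  done
}

# Example: warm every tile with 8 parallel requests.
# prewarm_urls "https://example.com/iiif/2" "clioimg%2Fshyuga.tif" 46825 28127 \
#   | xargs -n1 -P8 curl -s -o /dev/null
prewarm_urls "https://example.com/iiif/2" "clioimg%2Fshyuga.tif" 512 300
```

Lower-resolution levels have far fewer tiles, so warming them first gives the biggest perceived-latency win per request.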

Final Architecture

User → CloudFront (public domain)
          ↓ on miss
       Traefik (origin domain:443, Let's Encrypt TLS)
       Cantaloupe (islandora/cantaloupe:6.3.12, JVM heap 1536 MB)
       S3 (us-east-1)

Summary

Metric                           Before                    After
First tile (cold)                8.8s                      0.84s
Mirador 20-tile parallel (cold)  est. 15s+                 6.2s (CloudFront miss)
Cached tile (from Japan)         ~292ms RTT + processing   ~0.1–0.2s (edge)

The pyramid TIFF conversion had the largest effect on raw Cantaloupe processing time. CloudFront had the largest effect on perceived latency for users outside the origin region.