This project started from the idea that it might be interesting to experience cultural resource images published via IIIF (International Image Interoperability Framework) at actual physical scale. The result is a viewer that places IIIF images inside a VR recreation of an Edo-period townhouse (machiya), viewable in both a browser and a VR headset.

The tech stack is A-Frame 1.5.0 + THREE.js 0.158.0 + WebXR. The 3D model is based on the Japanese Machiya Set Kit published on Sketchfab, split by component and reassembled as needed.

This article covers not only what worked, but also the failures encountered along the way, with their causes and fixes.


Project Overview

Item                 | Details
Renderer             | A-Frame 1.5.0 / THREE.js 0.158.0
XR                   | WebXR (VR headset support)
Image standard       | IIIF Presentation API v3 / Image API v2
3D model             | Sketchfab: Japanese Machiya Set Kit (GLB)
Avatar               | VRM + Mixamo retargeting
Supporting libraries | three-vrm v2, aframe-extras
Tools                | gltf-transform, Blender CLI

Room Design

The room dimensions use TAT_SZ = 1.76 m (the short side of one tatami mat in the Edo-period standard) as the base unit. Tatami mats, walls, shoji screens, ceilings, and lanterns are arranged as tiled components, allowing the room size to scale flexibly based on the IIIF image dimensions.

When an IIIF Collection is specified, all images in the collection are displayed side by side and the room size is automatically expanded.
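As a rough sketch of that sizing logic (the function and layout rule here are illustrative, not the project's actual code), the required wall length can be computed from the physical image widths plus the gap and padding constants listed later, then rounded up to whole tatami tiles:

```javascript
// Sketch: derive the room width in tatami tiles (TAT_SZ = 1.76 m) from
// the physical widths of the images to display. Constants come from the
// project; roomTilesForImages() itself is an illustrative name.
const TAT_SZ = 1.76;         // short side of one tatami mat (m)
const IMAGE_GAP_M = 0.5;     // margin between images in a collection (m)
const ROOM_PADDING_M = 2.0;  // padding from wall to image (m)

function roomTilesForImages(widthsM) {
  // Total wall length: images side by side, gaps between them,
  // padding at both ends.
  const total =
    widthsM.reduce((a, b) => a + b, 0) +
    IMAGE_GAP_M * (widthsM.length - 1) +
    ROOM_PADDING_M * 2;
  // Round up to a whole number of tatami tiles.
  return Math.ceil(total / TAT_SZ);
}
```

For example, three 2 m-wide images need 6 + 1 + 4 = 11 m of wall, i.e. 7 tiles; a single 5 m fallback-width image needs 9 m, i.e. 6 tiles.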


Splitting the GLB Model

The distributed GLB had the entire machiya bundled into a single file. To place walls, tatami, and window walls independently, each component was split out using gltf-transform.

# Extract a specific mesh
npx gltf-transform filter input.glb SM_tatami.glb \
  --node "SM_tatami"

After splitting, the bounding box of each part needs to be measured to align pivot positions.

const box = new THREE.Box3().setFromObject(mesh);
const center = box.getCenter(new THREE.Vector3());
console.log('offset:', center);

This measurement step made it possible to resolve misalignment issues with numbers rather than guesswork.


Avatar Integration Challenges

Mixamo → Blender → GLB Conversion

Mixamo only allows FBX downloads, so Blender was invoked via CLI to convert to GLB.

blender --background --python convert_fbx_to_glb.py -- \
  --input avatar_walk.fbx \
  --output avatar.glb

Failure: scale=100 Produces a 181m Giant

Setting scale="100 100 100" on an A-Frame entity caused a massive red character to fill the screen.

The cause is the bind matrix of the skinned mesh. Skinned animations use the InverseBindMatrix, and when scale=100 is applied to it:

SkinMatrix = GlobalTransform × InverseBindMatrix
           = Scale(100) × ... ≈ 181m

This causes the avatar to become enormous.

Fix: Revert to scale="1 1 1" (the default) and set the model’s units to meters in Blender.

Hips Bone Offset Correction

Mixamo characters have the Hips bone at Z = -1.04 m (the character’s center of mass is offset forward from the model origin). Without correction, the origin ends up “floating in mid-air,” so the entity is offset by +1.04 m in the Z direction:

<a-entity id="avatar" position="0 0 1.04" ...></a-entity>

IIIF Tile Dynamic Loading System

How Real-Scale Display Works

Images are placed at real-world scale in the VR space based on the physical dimension service (physDim) information contained in the IIIF Image API info.json. If undefined, the width falls back to 5 m.
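A sketch of that lookup, assuming the service shape of the IIIF Physical Dimensions annex (physicalScale = physical units per pixel, physicalUnits = unit name); the function and the exact fallback handling here are illustrative:

```javascript
// Sketch: compute the real-world width in meters of an image from the
// physical-dimensions service in its info.json, falling back to 5 m.
const FALLBACK_WIDTH_M = 5;
const UNITS_TO_M = { mm: 0.001, cm: 0.01, m: 1, in: 0.0254 };

function physicalWidthM(info) {
  // info.service may be a single object or an array.
  const svc = [].concat(info.service || []).find(
    (s) => s.physicalScale != null && s.physicalUnits in UNITS_TO_M
  );
  if (!svc) return FALLBACK_WIDTH_M;
  // physicalScale is expressed in physicalUnits per pixel.
  return info.width * svc.physicalScale * UNITS_TO_M[svc.physicalUnits];
}
```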

LOD (Level of Detail)

Higher-resolution tiles are loaded progressively starting from the area closest to the camera. When crouching in a VR headset and leaning toward a floor map, that area is prioritized for high-resolution loading.

[Low-resolution base (Y=0.02)]  ← Always displayed
        ↑ Overlaid on top
[Tile grid (Y=0.025)]  ← Added to DOM after download completes
  1. Base plane: Low-resolution overview image placed immediately. Users see a blurry full image right away.
  2. Tile grid calculation: Grid computed from tiles definition in info.json. scaleFactor is auto-selected to stay within MAX_GRID_TILES = 150.
  3. Distance-based download: Camera position checked every 500ms. Unloaded tiles sorted by distance, up to 6 downloaded in parallel.
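Step 3 can be sketched as a pure function (the tile object shape and pickNextTiles() are illustrative; the constants are the project's):

```javascript
// Sketch: choose which pending tiles to download next, nearest to the
// camera first, without exceeding the concurrent-download limit.
const MAX_CONCURRENT = 6;

function pickNextTiles(tiles, camPos) {
  const dist2 = (t) =>
    (t.x - camPos.x) ** 2 + (t.y - camPos.y) ** 2 + (t.z - camPos.z) ** 2;
  const downloading = tiles.filter((t) => t.state === 'loading').length;
  return tiles
    .filter((t) => t.state === 'pending')
    .sort((a, b) => dist2(a) - dist2(b))
    .slice(0, Math.max(0, MAX_CONCURRENT - downloading));
}
```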

Tile URL Format

{baseId}/{x},{y},{w},{h}/{outW},/0/default.jpg

The size is specified as width only ({outW},). Some servers return 404 when height is included (level0 static serving).
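Building the URL is a one-liner (the example base URL is hypothetical):

```javascript
// Sketch: assemble a IIIF Image API tile URL in the format above.
// Width-only size ("{outW},") avoids 404s from level0 static servers.
function tileUrl(baseId, x, y, w, h, outW) {
  return `${baseId}/${x},${y},${w},${h}/${outW},/0/default.jpg`;
}
```

For example, `tileUrl('https://example.org/iiif/img', 0, 0, 1024, 1024, 256)` yields `https://example.org/iiif/img/0,0,1024,1024/256,/0/default.jpg`.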

scaleFactor Selection Logic

ScaleFactors are tried in ascending order to stay within MAX_GRID_TILES = 150.

Example: 49797×28435px image, tileWidth=1024

  • sf=1 → 49×28 = 1,372 tiles (too many)
  • sf=2 → 25×14 = 350 tiles (too many)
  • sf=4 → 13×7 = 91 tiles ← selected
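The selection loop above can be sketched as follows (selectScaleFactor() and the fallback branch are illustrative; MAX_GRID_TILES is the project's constant):

```javascript
// Sketch: walk the scaleFactors from info.json in ascending order and
// take the first one whose tile grid fits within the cap.
const MAX_GRID_TILES = 150;

function selectScaleFactor(width, height, tileWidth, scaleFactors) {
  for (const sf of [...scaleFactors].sort((a, b) => a - b)) {
    const cols = Math.ceil(width / (tileWidth * sf));
    const rows = Math.ceil(height / (tileWidth * sf));
    if (cols * rows <= MAX_GRID_TILES) return { sf, cols, rows };
  }
  // Nothing fits: fall back to the coarsest available level.
  const sf = Math.max(...scaleFactors);
  return {
    sf,
    cols: Math.ceil(width / (tileWidth * sf)),
    rows: Math.ceil(height / (tileWidth * sf)),
  };
}
```

With the example above, `selectScaleFactor(49797, 28435, 1024, [1, 2, 4, 8])` returns `{ sf: 4, cols: 13, rows: 7 }`, i.e. 91 tiles.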

Configuration Constants

Constant       | Value | Description
LOD_CHECK_MS   | 500   | Camera distance check interval (ms)
MAX_CONCURRENT | 6     | Concurrent download limit
MAX_GRID_TILES | 150   | Maximum total tile count
IIIF_MAX_PX    | 2048  | Maximum base image width (px)
IMAGE_GAP_M    | 0.5   | Margin between images in a collection (m)
ROOM_PADDING_M | 2.0   | Padding from wall to image (m)

Issue: Base Image Disappears During Tile Download

Pre-adding tile <a-plane> elements to the DOM with visible: false caused them to interfere with base plane rendering even while invisible, making the base image disappear during download.

Fix: Tile elements are created but only appended to the DOM after the image has downloaded.

// Only add to DOM after download completes
img.onload = () => {
  t.el.setAttribute('material', `src: ${t.url}; side: double`);
  t.container.appendChild(t.el);
  t.state = 'loaded';
};

Third-Person Camera Implementation

This was the most troublesome part. It took three failed attempts before achieving a “camera that follows the avatar from behind” in A-Frame.

Failure 1: Camera and Avatar Under the Same Parent

<a-entity id="rig">
  <a-camera .../>
  <a-entity id="avatar" .../>
</a-entity>

This looks simple, but because the camera and avatar keep a fixed relative position, the avatar never appears to move on screen, no matter how the rig travels through the world.

Failure 2: Syncing Position Every Frame

Separate #player and #camera-rig entities were used, with the position copied every frame in tick(). However, this interfered with A-Frame’s internal state (e.g. quaternion management in look-controls) and caused unstable behavior.

Failure 3: look-controls Interfering with Custom Movement

The a-camera primitive has the wasd-controls component enabled by default. Left running alongside the custom player-move component, it caused the avatar and camera to drift apart gradually.

Working Design

Scene (world)
 ├── #avatar     ← Moves directly in world coordinates via WASD (independent)
 └── #cam-rig    ← Follows #avatar position every frame (independent)
       └── #cam  ← Camera (look-controls and wasd-controls both disabled)

The key is keeping the avatar and camera rig completely independent, with the rig following the avatar each frame. The a-camera default components must be explicitly disabled:

<a-camera wasd-controls="enabled: false" look-controls="enabled: false">

The follow logic is straightforward:

// Follow in tick() (simplified)
const avatarPos = avatar.object3D.position;
camRig.object3D.position.set(avatarPos.x, avatarPos.y, avatarPos.z);

The third-person camera offset is (0, 1.6, 2.5): 2.5 m behind and 1.6 m above the avatar.

Aligning Avatar Orientation with Movement Direction

When rotation.y = θ, the local -Z axis points in the direction (-sin θ, 0, -cos θ) in world coordinates. The WASD movement vector is calculated using the same formula, so the avatar always faces its direction of travel.
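A minimal standalone version of that math (the function names are illustrative):

```javascript
// Sketch: with rotation.y = theta, the local -Z axis maps to
// (-sin θ, 0, -cos θ) in world space. Using the same formula for the
// WASD movement vector keeps the avatar facing its direction of travel.
function forwardFromYaw(theta) {
  return { x: -Math.sin(theta), y: 0, z: -Math.cos(theta) };
}

// Inverse: the yaw needed to face a desired world-space direction.
function yawFromDirection(dx, dz) {
  return Math.atan2(-dx, -dz);
}
```

At θ = 0 the forward vector is (0, 0, -1); at θ = π/2 it is (-1, 0, 0).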

Minecraft-Style Controls

On PC, Minecraft-style strafe movement (WASD) is implemented. On smartphones, a virtual joystick on the left side handles movement and dragging the right half rotates the camera. Three posture states (standing → crouching → prone) allow leaning close to floor-placed images for high-resolution viewing.


VRM Avatar Support

Retargeting Mixamo Animations

VRM models do not include animations, so Mixamo Walk animations are retargeted to the VRM bone structure and applied.

avatar.glb (Mixamo)  ──── provides animation data (hidden)
       │  retarget (bone name mapping + rest pose correction)
avatar1.glb (VRM)  ──── displayed on screen

Because Mixamo and VRM have different bone names and rest poses, a simple name substitution does not work. Mathematically, the conversion is:

retargeted = W_parent × animation × inv(W_bone)

Root Motion Removal

The Mixamo Walk animation includes root motion, where the Hips bone position varies each frame. Measurement showed approximately 1.68 m of vertical variation on the Y axis, causing the avatar to “bounce up and down.”

Game engines like Unity/Unreal have built-in root motion control, but A-Frame does not. The fix was to directly rewrite the animation tracks inside the GLB file using the gltf-transform API.

import { NodeIO } from '@gltf-transform/core';

const io = new NodeIO();
const document = await io.read('avatar.glb');

for (const anim of document.getRoot().listAnimations()) {
  for (const channel of anim.listChannels()) {
    // Only rewrite the Hips translation track; rotation tracks (VEC4)
    // and other bones must be left untouched.
    if (channel.getTargetPath() !== 'translation') continue;
    if (!/Hips$/.test(channel.getTargetNode()?.getName() ?? '')) continue;

    const output = channel.getSampler().getOutput();
    const arr = output.getArray().slice();

    // Calculate median Y (natural standing height)
    const yValues = [];
    for (let i = 1; i < arr.length; i += 3) yValues.push(arr[i]);
    yValues.sort((a, b) => a - b);
    const medianY = yValues[Math.floor(yValues.length / 2)];

    const firstX = arr[0], firstZ = arr[2];

    // Fix all Hips positions
    for (let i = 0; i < arr.length; i += 3) {
      arr[i]     = firstX;    // X: fixed to initial value
      arr[i + 1] = medianY;   // Y: fixed to median
      arr[i + 2] = firstZ;    // Z: fixed to initial value
    }
    output.setArray(new Float32Array(arr));
  }
}
await io.write('avatar_fixed.glb', document);

Fixing Y to the median rather than 0 is important — fixing to 0 causes the avatar to clip through the floor or float above it.

VRM0 Orientation Correction

VRM0 format treats Z+ as the forward direction, while A-Frame’s camera faces Z-. Rotating the scene 180° around Y fixes the facing direction, but it also inverts the X and Z components of the skinned deformation.

Fix: Apply a 180° Y conjugate quaternion to retargeting results.

if (isVrm0) {
  // 180° Y conjugate: compensates for deformation inversion from scene.rotation.y = PI
  values[i]   = -q.x;  // X flipped
  values[i+1] =  q.y;  // Y unchanged
  values[i+2] = -q.z;  // Z flipped
  values[i+3] =  q.w;  // W unchanged
}
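The sign flips fall out of conjugating each rotation by the 180° Y quaternion r = (0, 1, 0, 0): r · q · r⁻¹ = (-x, y, -z, w). A quick standalone check with plain quaternion math (no THREE.js; the function names are illustrative):

```javascript
// Verify that conjugating by a 180° rotation about Y flips the X and Z
// components of a quaternion (x, y, z, w) while leaving Y and W alone.
function qMul(a, b) {
  // Hamilton product, components ordered [x, y, z, w].
  return [
    a[3] * b[0] + a[0] * b[3] + a[1] * b[2] - a[2] * b[1],
    a[3] * b[1] - a[0] * b[2] + a[1] * b[3] + a[2] * b[0],
    a[3] * b[2] + a[0] * b[1] - a[1] * b[0] + a[2] * b[3],
    a[3] * b[3] - a[0] * b[0] - a[1] * b[1] - a[2] * b[2],
  ];
}

const r    = [0, 1, 0, 0];   // 180° about Y
const rInv = [0, -1, 0, 0];  // its inverse

function conjugateByY180(q) {
  return qMul(qMul(r, q), rInv);
}
```

For any input quaternion, the result is the input with X and Z negated, which is exactly the fix in the snippet above.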

VRMLoaderPlugin Pitfall

Initially, VRMLoaderPlugin from @pixiv/three-vrm was used, but on VRM0 files gltf.userData.vrm came back null (the plugin failed to process them), and by that point the plugin had already partially rewritten the scene graph.

Specifically, wrapper nodes for bone normalization were inserted, causing AnimationMixer to find bones by name while SkinnedMesh.skeleton references different nodes — resulting in animations “playing” with no visible mesh movement.

Fix: Load VRM files as plain GLTF without VRMLoaderPlugin. Handle VRM-specific processing (orientation correction, retargeting) manually.

Future Extensions

Adding more animations (run, jump, bow, etc.) is straightforward: obtain additional GLBs from Mixamo and retarget them. Because animation sources and character models are decoupled, swapping the VRM model applies the same animations to a different character.


3D Model Compression

Unused Texture Problem

Each component GLB had 11 textures embedded, but only 3 were actually used. The Sketchfab export had included all material textures in each file.

Optimization

npx @gltf-transform/cli dedup input.glb output.glb
npx @gltf-transform/cli prune output.glb output.glb

File                  | Before  | After  | Reduction
SM_tatami.glb         | 2.09 MB | 144 KB | 93%
SM_wall.glb           | 2.13 MB | 305 KB | 86%
SM_floorBeam.glb      | 2.08 MB | 111 KB | 95%
SM_windowWallHigh.glb | 2.19 MB | 475 KB | 78%
All parts (20 files)  | ~43 MB  | ~5 MB  | 88%

Caveats

  • Do not apply to VRM files: gltf-transform removes VRM extension metadata (extensionsUsed: ["VRM"] is stripped), breaking VRM0 detection (isVrm0 = extensionsUsed?.includes('VRM')) and causing the orientation correction and sign inversion to not apply — resulting in a reversed avatar.
  • WebP texture compression abandoned: A-Frame’s bundled GLTFLoader does not support the WebP extension; models stop displaying.

Environment Atmosphere: Sky and Fog

Procedural Sky

An attempt to display HDRI images (EXR → JPEG conversion) via a-sky was abandoned due to interference with fog and color mismatches. The final approach sets a canvas-drawn gradient as scene.background in THREE.js.

const canvas = document.createElement('canvas');
canvas.width = 1; canvas.height = 512;
const ctx = canvas.getContext('2d');
const grad = ctx.createLinearGradient(0, 0, 0, 512);
grad.addColorStop(0.0,  '#5a88b8');  // zenith
grad.addColorStop(0.75, '#c8d4b8');  // horizon = matches fog color
ctx.fillStyle = grad;
ctx.fillRect(0, 0, 1, 512);

const tex = new THREE.CanvasTexture(canvas);
tex.mapping = THREE.EquirectangularReflectionMapping;
scene.background = tex;

Using Fog for Visual Cohesion

The garden uses primitives (spheres, cylinders), which look out of place next to the detailed interior model. Exponential fog blurs the distant view, and matching the horizon color of the sky gradient to the fog color creates a more natural sense of depth.

<a-scene fog="type: exponential; color: #c8d4b8; density: 0.04">

Usage

URL Parameters

Parameter  | Description
collection | IIIF Collection URL. Displays all images in the collection side by side.
manifest   | IIIF Manifest URL. Displays a single image.
avatar     | Avatar number. Starts in third-person mode.
outside    | Start outside the building (for debugging).
debug      | Skip the overlay.

Controls

PC                           | Smartphone             | VR Headset
WASD: move                   | Virtual joystick: move | Left stick: move
Mouse drag: look             | Right half drag: look  | Right stick: horizontal rotation
V: first/third person toggle |                        |
C: crouch / prone / stand    | Button                 | Physically crouch

Summary

A-Frame is appealing for how quickly you can write VR scenes in HTML, but once you venture into camera control, skinned meshes, and animation, you end up working directly with THREE.js at a lower level.

Combining IIIF physical dimension data with VR suggests the possibility of reproducing a “viewing the real object in a museum” experience in a browser. As more cultural resources become digitally available, experiencing them at true scale may represent a new mode of appreciation.

Tech Stack

Category         | Tool / Library
Frontend         | A-Frame 1.5.0, THREE.js 0.158.0
VR               | WebXR API
Image standard   | IIIF Presentation API v3, Image API v2
Avatar           | three-vrm v2, Mixamo, aframe-extras
Model editing    | gltf-transform
Model conversion | Blender CLI (FBX → GLB)