This project started from the idea that it might be interesting to experience cultural resource images published via IIIF (International Image Interoperability Framework) at actual physical scale. The result is a viewer that places IIIF images inside a VR recreation of an Edo-period townhouse (machiya), viewable in both a browser and a VR headset.
The tech stack is A-Frame 1.5.0 + THREE.js 0.158.0 + WebXR. The 3D model is based on the Japanese Machiya Set Kit published on Sketchfab, split by component and reassembled as needed.
This article covers not only what worked, but also the failures encountered along the way, with their causes and fixes.
Project Overview
| Item | Details |
|---|---|
| Renderer | A-Frame 1.5.0 / THREE.js 0.158.0 |
| XR | WebXR (VR headset support) |
| Image standard | IIIF Presentation API v3 / Image API v2 |
| 3D model | Sketchfab: Japanese Machiya Set Kit (GLB) |
| Avatar | VRM + Mixamo retargeting |
| Supporting libraries | three-vrm v2, aframe-extras |
| Tools | gltf-transform, Blender CLI |
Room Design
The room dimensions use TAT_SZ = 1.76 m (the long side of one tatami mat in the Edo-ma standard, roughly 0.88 × 1.76 m) as the base unit. Tatami mats, walls, shoji screens, ceilings, and lanterns are arranged as tiled components, allowing the room size to scale flexibly based on the IIIF image dimensions.
When a IIIF Collection is specified, all images in the collection are displayed side by side and the room size is automatically expanded.
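As a rough illustration of how the room could grow with a collection, the sketch below sums the image widths, adds the gaps and wall padding, and rounds up to whole TAT_SZ units. The function name and the single-row layout are assumptions for illustration; the constant values are the ones quoted in this article's configuration table.

```javascript
// Values quoted in this article; the layout logic itself is an assumed sketch.
const TAT_SZ = 1.76;        // base unit (m)
const IMAGE_GAP_M = 0.5;    // margin between images (m)
const ROOM_PADDING_M = 2.0; // padding from wall to image (m)

// Hypothetical helper: number of TAT_SZ units needed to fit all images in one row.
function roomUnitsForImages(imageWidthsM) {
  if (imageWidthsM.length === 0) return 1; // keep at least one unit for an empty room
  const imagesTotal = imageWidthsM.reduce((sum, w) => sum + w, 0);
  const gapsTotal = IMAGE_GAP_M * (imageWidthsM.length - 1);
  const required = imagesTotal + gapsTotal + 2 * ROOM_PADDING_M;
  return Math.ceil(required / TAT_SZ);
}
```

For example, a single image that falls back to a 5 m width needs 9 m of floor including padding, which rounds up to 6 units.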
Splitting the GLB Model
The distributed GLB had the entire machiya bundled into a single file. To place walls, tatami, and window walls independently, each component was split out using gltf-transform.
# Extract a specific mesh
npx gltf-transform filter input.glb SM_tatami.glb \
--node "SM_tatami"
After splitting, the bounding box of each part needs to be measured to align pivot positions.
const box = new THREE.Box3().setFromObject(mesh);
const center = box.getCenter(new THREE.Vector3());
console.log('offset:', center);
This measurement step made it possible to resolve misalignment issues with numbers rather than guesswork.
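To make the measurement concrete without pulling in THREE.js, here is a minimal plain-JavaScript sketch of the same idea: compute the axis-aligned bounds of a flat vertex array and derive a translation that puts the bottom-center of the part at the origin. The helper names and the "bottom-center pivot" convention are assumptions, not the project's actual code.

```javascript
// Bounds of a flat [x0, y0, z0, x1, y1, z1, ...] vertex array,
// comparable to what Box3.setFromObject() reports for a single mesh.
function boundsOf(positions) {
  const min = [Infinity, Infinity, Infinity];
  const max = [-Infinity, -Infinity, -Infinity];
  for (let i = 0; i < positions.length; i += 3) {
    for (let axis = 0; axis < 3; axis++) {
      const v = positions[i + axis];
      if (v < min[axis]) min[axis] = v;
      if (v > max[axis]) max[axis] = v;
    }
  }
  const center = min.map((lo, axis) => (lo + max[axis]) / 2);
  const size = min.map((lo, axis) => max[axis] - lo);
  return { min, max, center, size };
}

// Hypothetical pivot correction: translation that moves the
// bottom-center of the part to the origin.
function pivotOffset(bounds) {
  return [-bounds.center[0], -bounds.min[1], -bounds.center[2]];
}
```

Logging `size` alongside `center` also shows at a glance whether a part actually spans one TAT_SZ unit.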
Avatar Integration Challenges
Mixamo → Blender → GLB Conversion
Mixamo does not export glTF (downloads are FBX or Collada), so Blender was invoked via CLI to convert the FBX to GLB.
blender --background --python convert_fbx_to_glb.py -- \
--input avatar_walk.fbx \
--output avatar.glb
Failure: scale=100 Produces a 181m Giant
Setting scale="100 100 100" on an A-Frame entity caused a massive red character to fill the screen.
The cause is the bind matrix of the skinned mesh. Skinned animations use the InverseBindMatrix, and when scale=100 is applied to it:
SkinMatrix = GlobalTransform × InverseBindMatrix
= Scale(100) × ... ≈ 181m
This causes the avatar to become enormous.
Fix: Revert to scale="1 1 1" (the default) and set the model’s units to meters in Blender.
Hips Bone Offset Correction
Mixamo characters have the Hips bone at Z = -1.04 m (the character’s center of mass is offset forward from the model origin). Without correction, the origin ends up “floating in mid-air,” so the entity is offset by +1.04 m in the Z direction:
<a-entity id="avatar" position="0 0 1.04" ...></a-entity>
IIIF Tile Dynamic Loading System
How Real-Scale Display Works
Images are placed at real-world scale in the VR space based on the physical dimension service (physDim) information contained in the IIIF Image API info.json. If undefined, the width falls back to 5 m.
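The conversion from pixels to meters can be sketched as follows. This assumes the info.json carries a IIIF Physical Dimensions service with `physicalScale` and `physicalUnits` fields, and that `service` may be either a single object or an array; the function name is hypothetical.

```javascript
const FALLBACK_WIDTH_M = 5; // fallback width stated in this article
const METERS_PER_UNIT = { mm: 0.001, cm: 0.01, in: 0.0254 };

// Hypothetical helper: real-world width in meters for a parsed info.json.
function physicalWidthMeters(info) {
  // Normalize `service` to an array (object, array, or missing).
  const services = [].concat(info.service || []);
  const physDim = services.find(
    (s) => s && typeof s.profile === 'string' && s.profile.includes('physdim')
  );
  const unit = physDim && METERS_PER_UNIT[physDim.physicalUnits];
  const scale = physDim && Number(physDim.physicalScale);
  if (!unit || !(scale > 0) || !(info.width > 0)) return FALLBACK_WIDTH_M;
  // physicalScale is "physical units per pixel", so width scales linearly.
  return info.width * scale * unit;
}
```

The height follows from the pixel aspect ratio, so only the width needs to be derived from the service.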
LOD (Level of Detail)
Higher-resolution tiles are loaded progressively starting from the area closest to the camera. When crouching in a VR headset and leaning toward a floor map, that area is prioritized for high-resolution loading.
[Low-resolution base (Y=0.02)] ← Always displayed
↑ Overlaid on top
[Tile grid (Y=0.025)] ← Added to DOM after download completes
- Base plane: Low-resolution overview image placed immediately. Users see a blurry full image right away.
- Tile grid calculation: Grid computed from the tiles definition in info.json. scaleFactor is auto-selected to stay within MAX_GRID_TILES = 150.
- Distance-based download: Camera position checked every 500 ms. Unloaded tiles sorted by distance, up to 6 downloaded in parallel.
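The distance-based step above could look roughly like this. The tile object shape (`cx`, `cy`, `cz`, `state`) and the function name are assumptions made for the sketch; only the limit of 6 parallel downloads comes from the article.

```javascript
const MAX_CONCURRENT = 6; // concurrent download limit from this article

// tiles: [{ cx, cy, cz, state }] with state 'idle' | 'loading' | 'loaded' (assumed shape).
// camera: { x, y, z } world position.
// Returns the idle tiles that should start downloading now, nearest first.
function pickTilesToLoad(tiles, camera) {
  const inFlight = tiles.filter((t) => t.state === 'loading').length;
  const slots = Math.max(0, MAX_CONCURRENT - inFlight);
  return tiles
    .filter((t) => t.state === 'idle')
    .map((t) => ({
      tile: t,
      dist: Math.hypot(t.cx - camera.x, t.cy - camera.y, t.cz - camera.z),
    }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, slots)
    .map((entry) => entry.tile);
}
```

A timer firing every 500 ms would call this with the current camera position, mark the returned tiles as 'loading', and start their downloads.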
Tile URL Format
{baseId}/{x},{y},{w},{h}/{outW},/0/default.jpg
The size is specified as width only ({outW},). Some servers return 404 when height is included (level0 static serving).
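A minimal sketch of building such a URL for one grid cell, assuming square tiles and clamping the region at the right and bottom edges of the image. The function and parameter names are illustrative, and the base URL in the example is a placeholder.

```javascript
// Hypothetical helper: IIIF Image API URL for the tile at (col, row).
function tileUrl(baseId, col, row, tileWidth, scaleFactor, imageWidth, imageHeight) {
  const span = tileWidth * scaleFactor;      // full-resolution pixels covered by one tile
  const x = col * span;
  const y = row * span;
  const w = Math.min(span, imageWidth - x);  // clamp edge tiles to the image bounds
  const h = Math.min(span, imageHeight - y);
  const outW = Math.ceil(w / scaleFactor);   // requested width only, height omitted
  return `${baseId}/${x},${y},${w},${h}/${outW},/0/default.jpg`;
}

// Example with a placeholder base URL:
// tileUrl('https://example.org/iiif/img', 0, 0, 1024, 4, 49797, 28435)
// → 'https://example.org/iiif/img/0,0,4096,4096/1024,/0/default.jpg'
```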
scaleFactor Selection Logic
scaleFactors are tried in ascending order, and the first one whose grid stays within MAX_GRID_TILES = 150 is selected.
Example: 49797×28435px image, tileWidth=1024
- sf=1 → 49×28 = 1,372 tiles (too many)
- sf=2 → 25×14 = 350 tiles (too many)
- sf=4 → 13×7 = 91 tiles ← selected
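The selection above can be reproduced with a few lines. This is a sketch under the assumption of square tiles; the function name and the fallback to the coarsest factor when nothing fits are illustrative choices.

```javascript
const MAX_GRID_TILES = 150; // maximum total tile count from this article

// Returns the first (smallest) scaleFactor whose grid fits within MAX_GRID_TILES.
function chooseScaleFactor(imageWidth, imageHeight, tileWidth, scaleFactors) {
  const sorted = [...scaleFactors].sort((a, b) => a - b);
  let last = null;
  for (const sf of sorted) {
    const cols = Math.ceil(imageWidth / (tileWidth * sf));
    const rows = Math.ceil(imageHeight / (tileWidth * sf));
    last = { scaleFactor: sf, cols, rows, total: cols * rows };
    if (last.total <= MAX_GRID_TILES) return last;
  }
  return last; // nothing fits: fall back to the coarsest factor tried (assumed behavior)
}

// chooseScaleFactor(49797, 28435, 1024, [1, 2, 4, 8, 16])
// → { scaleFactor: 4, cols: 13, rows: 7, total: 91 }
```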
Configuration Constants
| Constant | Value | Description |
|---|---|---|
| LOD_CHECK_MS | 500 | Camera distance check interval (ms) |
| MAX_CONCURRENT | 6 | Concurrent download limit |
| MAX_GRID_TILES | 150 | Maximum total tile count |
| IIIF_MAX_PX | 2048 | Maximum base image width (px) |
| IMAGE_GAP_M | 0.5 | Margin between images in a collection (m) |
| ROOM_PADDING_M | 2.0 | Padding from wall to image (m) |
Issue: Base Image Disappears During Tile Download
Pre-adding tile <a-plane> elements to the DOM with visible: false caused them to interfere with base plane rendering even while invisible, making the base image disappear during download.
Fix: Tile elements are created but only appended to the DOM after the image has downloaded.
// Only add to DOM after download completes
img.onload = () => {
t.el.setAttribute('material', `src: ${t.url}; side: double`);
t.container.appendChild(t.el);
t.state = 'loaded';
};
Third-Person Camera Implementation
This was the most troublesome part. It took three failed attempts before achieving a “camera that follows the avatar from behind” in A-Frame.
Failure 1: Camera and Avatar Under the Same Parent
<a-entity id="rig">
<a-camera .../>
<a-entity id="avatar" .../>
</a-entity>
This looks simple, but the camera and avatar have a fixed relative position, so the avatar appears to never actually move on screen even as the camera moves.
Failure 2: Syncing Position Every Frame
Separate #player and #camera-rig entities were used, with the position copied every frame in tick(). However, this interfered with A-Frame’s internal state (quaternion management in look-controls, etc.) and caused unstable behavior.
Failure 3: wasd-controls Interfering with Custom Movement
The a-camera has the wasd-controls component enabled by default. Running alongside a custom player-move component caused the avatar and camera to drift apart gradually.
Working Design
Scene (world)
├── #avatar ← Moves directly in world coordinates via WASD (independent)
└── #cam-rig ← Follows #avatar position every frame (independent)
└── #cam ← Camera (look-controls and wasd-controls both disabled)
The key is keeping the avatar and camera rig completely independent, with the rig following the avatar each frame. The a-camera default components must be explicitly disabled:
<a-camera wasd-controls="enabled: false" look-controls="enabled: false">
The follow logic is straightforward:
// Follow in tick() (simplified)
const avatarPos = avatar.object3D.position;
camRig.object3D.position.set(avatarPos.x, avatarPos.y, avatarPos.z);
The third-person camera offset is (0, 1.6, 2.5) (2.5m behind, 1.6m above).
Aligning Avatar Orientation with Movement Direction
When rotation.y = θ, the local -Z axis points in the direction (-sin θ, 0, -cos θ) in world coordinates. The WASD movement vector is calculated using the same formula, so the avatar always faces its direction of travel.
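A small sketch of that calculation, using the same convention (yaw θ about Y, forward along local -Z). The input mapping and function names are assumptions; the right vector is the local +X axis rotated by the same yaw.

```javascript
// forward / strafe are in [-1, 1]; assumed mapping: W = forward +1, S = -1, D = strafe +1, A = -1.
function moveVector(theta, forward, strafe) {
  const fx = -Math.sin(theta), fz = -Math.cos(theta); // local -Z in world space
  const rx = Math.cos(theta), rz = -Math.sin(theta);  // local +X in world space
  return { x: fx * forward + rx * strafe, z: fz * forward + rz * strafe };
}

// Inverse relation: the yaw whose forward direction matches a world-space movement (dx, dz).
function headingFor(dx, dz) {
  return Math.atan2(-dx, -dz);
}
```

Feeding the result of `moveVector(theta, 1, 0)` into `headingFor` returns θ again, which is why using one formula for both movement and orientation keeps them in agreement.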
Minecraft-Style Controls
On PC, Minecraft-style strafe movement (WASD) is implemented. On smartphones, a virtual joystick on the left side handles movement and dragging the right half rotates the camera. Three posture states (standing → crouching → prone) allow leaning close to floor-placed images for high-resolution viewing.
VRM Avatar Support
Retargeting Mixamo Animations
VRM models do not include animations, so Mixamo Walk animations are retargeted to the VRM bone structure and applied.
avatar.glb (Mixamo) ──── provides animation data (hidden)
│
│ retarget (bone name mapping + rest pose correction)
▼
avatar1.glb (VRM) ──── displayed on screen
Because Mixamo and VRM have different bone names and rest poses, a simple name substitution does not work. Mathematically, the conversion is:
retargeted = W_parent × animation × inv(W_bone)
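For rotation tracks, the formula can be written out with plain quaternion math. The sketch below assumes W_parent and W_bone are the world-space rest rotations of the source bone's parent and of the bone itself, stored as unit quaternions in [x, y, z, w] order; the helper names are illustrative.

```javascript
// Hamilton product a × b for [x, y, z, w] quaternions.
function qMul(a, b) {
  const [ax, ay, az, aw] = a, [bx, by, bz, bw] = b;
  return [
    aw * bx + ax * bw + ay * bz - az * by,
    aw * by - ax * bz + ay * bw + az * bx,
    aw * bz + ax * by - ay * bx + az * bw,
    aw * bw - ax * bx - ay * by - az * bz,
  ];
}

// Inverse of a unit quaternion is its conjugate.
function qInv(q) {
  return [-q[0], -q[1], -q[2], q[3]];
}

// retargeted = W_parent × animation × inv(W_bone)
function retargetRotation(parentRestWorld, animationQ, boneRestWorld) {
  return qMul(qMul(parentRestWorld, animationQ), qInv(boneRestWorld));
}
```

A useful sanity check: when the animation keyframe equals the bone's local rest rotation, the result is the identity quaternion, meaning the target bone stays in its own rest pose.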
Root Motion Removal
The Mixamo Walk animation includes root motion, where the Hips bone position varies each frame. Measurement showed approximately 1.68 m of vertical variation on the Y axis, causing the avatar to “bounce up and down.”
Game engines like Unity/Unreal have built-in root motion control, but A-Frame does not. The fix was to directly rewrite the animation tracks inside the GLB file using the gltf-transform API.
import { NodeIO } from '@gltf-transform/core';
const io = new NodeIO();
const document = await io.read('avatar.glb');
for (const anim of document.getRoot().listAnimations()) {
  for (const channel of anim.listChannels()) {
    // Only the Hips translation track carries root motion; rotation tracks and other bones stay untouched.
    // Assumption: the Hips node name contains "Hips" (e.g. "mixamorig:Hips").
    const node = channel.getTargetNode();
    if (channel.getTargetPath() !== 'translation' || !node?.getName().includes('Hips')) continue;
    const output = channel.getSampler()?.getOutput();
    if (!output) continue;
    const arr = output.getArray().slice();
    // Calculate median Y (natural standing height)
    const yValues = [];
    for (let i = 1; i < arr.length; i += 3) yValues.push(arr[i]);
    yValues.sort((a, b) => a - b);
    const medianY = yValues[Math.floor(yValues.length / 2)];
    const firstX = arr[0], firstZ = arr[2];
    // Fix all Hips positions
    for (let i = 0; i < arr.length; i += 3) {
      arr[i] = firstX;      // X: fixed to initial value
      arr[i + 1] = medianY; // Y: fixed to median
      arr[i + 2] = firstZ;  // Z: fixed to initial value
    }
    output.setArray(new Float32Array(arr));
  }
}
await io.write('avatar_fixed.glb', document);
Fixing Y to the median rather than 0 is important — fixing to 0 causes the avatar to clip through the floor or float above it.
VRM0 Orientation Correction
VRM0 format has Z+ as the forward direction, but A-Frame’s camera faces Z-. Rotating the scene 180° around Y also inverts the X and Z components of skinning.
Fix: Apply a 180° Y conjugate quaternion to retargeting results.
if (isVrm0) {
// 180° Y conjugate: compensates for deformation inversion from scene.rotation.y = PI
values[i] = -q.x; // X flipped
values[i+1] = q.y; // Y unchanged
values[i+2] = -q.z; // Z flipped
values[i+3] = q.w; // W unchanged
}
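The sign pattern can be checked numerically: conjugating a quaternion q by a 180° rotation about Y, r × q × r⁻¹ with r = (0, 1, 0, 0), flips exactly the X and Z components. The following is a verification sketch only; the helper names are illustrative.

```javascript
// Hamilton product a × b for [x, y, z, w] quaternions.
function qMul(a, b) {
  const [ax, ay, az, aw] = a, [bx, by, bz, bw] = b;
  return [
    aw * bx + ax * bw + ay * bz - az * by,
    aw * by - ax * bz + ay * bw + az * bx,
    aw * bz + ax * by - ay * bx + az * bw,
    aw * bw - ax * bx - ay * by - az * bz,
  ];
}

const R_Y180 = [0, 1, 0, 0];      // 180° rotation about Y
const R_Y180_INV = [0, -1, 0, 0]; // its inverse (conjugate)

// r × q × r⁻¹ computed the long way.
function conjugateByY180(q) {
  return qMul(qMul(R_Y180, q), R_Y180_INV);
}

// Shortcut used in the retargeting loop: flip X and Z, keep Y and W.
function flipXZ(q) {
  return [-q[0], q[1], -q[2], q[3]];
}
```

Both functions return the same result for any input, so the four assignments in the loop are equivalent to a full conjugation without the cost of two quaternion multiplications per keyframe.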
VRMLoaderPlugin Pitfall
Initially, VRMLoaderPlugin from @pixiv/three-vrm was used. With VRM0 files, however, the plugin could fail and leave gltf.userData.vrm as null, even though it had already partially rewritten the scene graph.
Specifically, wrapper nodes for bone normalization were inserted, causing AnimationMixer to find bones by name while SkinnedMesh.skeleton references different nodes — resulting in animations “playing” with no visible mesh movement.
Fix: Load VRM files as plain GLTF without VRMLoaderPlugin. Handle VRM-specific processing (orientation correction, retargeting) manually.
Future Extensions
Adding more animations (run, jump, bow, etc.) is straightforward: obtain additional GLBs from Mixamo and retarget them. Because animation sources and character models are decoupled, swapping the VRM model applies the same animations to a different character.
3D Model Compression
Unused Texture Problem
Each component GLB had 11 textures embedded, but only 3 were actually used. The Sketchfab export had included all material textures in each file.
Optimization
npx @gltf-transform/cli dedup input.glb output.glb
npx @gltf-transform/cli prune output.glb output.glb
| File | Before | After | Reduction |
|---|---|---|---|
| SM_tatami.glb | 2.09 MB | 144 KB | 93% |
| SM_wall.glb | 2.13 MB | 305 KB | 86% |
| SM_floorBeam.glb | 2.08 MB | 111 KB | 95% |
| SM_windowWallHigh.glb | 2.19 MB | 475 KB | 78% |
| All parts (20 files) | ~43 MB | ~5 MB | 88% |
Caveats
- Do not apply to VRM files: gltf-transform removes VRM extension metadata (extensionsUsed: ["VRM"] is stripped), breaking VRM0 detection (isVrm0 = extensionsUsed?.includes('VRM')). The orientation correction and sign inversion are then skipped, resulting in a reversed avatar.
- WebP texture compression abandoned: A-Frame’s bundled GLTFLoader does not support the WebP extension; models stop displaying.
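One way to catch the first caveat early is to compare the detected VRM version before and after any processing step. The sketch below works on the parsed glTF JSON; the function names are hypothetical, and it assumes VRM 0.x files declare the "VRM" extension while VRM 1.0 files declare "VRMC_vrm".

```javascript
// Returns 'vrm0', 'vrm1', or null for a parsed glTF JSON object.
function detectVrmVersion(gltfJson) {
  const used = gltfJson.extensionsUsed || [];
  if (used.includes('VRMC_vrm')) return 'vrm1';
  if (used.includes('VRM')) return 'vrm0';
  return null;
}

// Hypothetical guard: throws if an optimization step dropped the VRM metadata.
function assertVrmPreserved(beforeJson, afterJson) {
  const before = detectVrmVersion(beforeJson);
  const after = detectVrmVersion(afterJson);
  if (before !== after) {
    throw new Error(`VRM metadata changed during processing: ${before} -> ${after}`);
  }
}
```

Running such a check in the build script would turn a silently reversed avatar into an immediate, explicit failure.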
Environment Atmosphere: Sky and Fog
Procedural Sky
An attempt to display HDRI images (EXR → JPEG conversion) via a-sky was abandoned due to interference with fog and color mismatches. The final approach sets a canvas-drawn gradient as scene.background in THREE.js.
const canvas = document.createElement('canvas');
canvas.width = 1; canvas.height = 512;
const ctx = canvas.getContext('2d');
const grad = ctx.createLinearGradient(0, 0, 0, 512);
grad.addColorStop(0.0, '#5a88b8'); // zenith
grad.addColorStop(0.75, '#c8d4b8'); // horizon = matches fog color
ctx.fillStyle = grad;
ctx.fillRect(0, 0, 1, 512);
const tex = new THREE.CanvasTexture(canvas);
tex.mapping = THREE.EquirectangularReflectionMapping;
scene.background = tex;
Using Fog for Visual Cohesion
The garden uses primitives (spheres, cylinders), which look out of place next to the detailed interior model. Exponential fog blurs the distant view, and matching the horizon color of the sky gradient to the fog color creates a more natural sense of depth.
<a-scene fog="type: exponential; color: #c8d4b8; density: 0.04">
Usage
URL Parameters
| Parameter | Description |
|---|---|
| collection | IIIF Collection URL. Displays all images in the collection side by side. |
| manifest | IIIF Manifest URL. Displays a single image. |
| avatar | Avatar number. Starts in third-person mode. |
| outside | Start outside the building (for debugging). |
| debug | Skip the overlay. |
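Reading these parameters can be sketched with the standard URLSearchParams API. The function name is hypothetical, and treating `outside` and `debug` as presence-only flags is an assumption; the manifest URL in the example is a placeholder.

```javascript
// Hypothetical helper: parse the viewer's query string into an options object.
function parseViewerParams(search) {
  const p = new URLSearchParams(search);
  const avatar = p.get('avatar');
  return {
    collection: p.get('collection'),  // IIIF Collection URL or null
    manifest: p.get('manifest'),      // IIIF Manifest URL or null
    avatar: avatar === null ? null : Number.parseInt(avatar, 10),
    thirdPerson: avatar !== null,     // specifying an avatar starts third-person mode
    outside: p.has('outside'),        // assumed to be a presence flag
    debug: p.has('debug'),            // assumed to be a presence flag
  };
}

// In the browser: parseViewerParams(window.location.search)
// e.g. '?manifest=https%3A%2F%2Fexample.org%2Fiiif%2Fmanifest.json&avatar=1&debug'
```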
Controls
| PC | Smartphone | VR Headset |
|---|---|---|
| WASD: move | Virtual joystick: move | Left stick: move |
| Mouse drag: look | Right half drag: look | Right stick: horizontal rotation |
| V: first/third person toggle | — | — |
| C: crouch / prone / stand | Button | Physically crouch |
Summary
A-Frame is appealing for how quickly you can write VR scenes in HTML, but once you venture into camera control, skinned meshes, and animation, you end up working directly with THREE.js at a lower level.
Combining IIIF physical dimension data with VR suggests the possibility of reproducing a “viewing the real object in a museum” experience in a browser. As more cultural resources become digitally available, experiencing them at true scale may represent a new mode of appreciation.
Tech Stack
| Category | Tool / Library |
|---|---|
| Frontend | A-Frame 1.5.0, THREE.js 0.158.0 |
| VR | WebXR API |
| Image standard | IIIF Presentation API v3, Image API v2 |
| Avatar | three-vrm v2, Mixamo, aframe-extras |
| Model editing | gltf-transform |
| Model conversion | Blender CLI (FBX → GLB) |