Overview
I recently had the opportunity to display an audio file with subtitles in several IIIF viewers, so this is a memo of how I did it.
The audio is “Accents and Intonation of the Japanese Language (Part 2),” published in the National Diet Library Historical Sound Archive. The transcription was produced with OpenAI’s speech-to-text API, so please note that it may contain errors.
The following is a display example in Ramp.

The following is a display example in Clover.

The following is a display example in Aviary. Unfortunately, with the manifest file format used this time, the transcription text could not be displayed.

Below, I introduce how to create these manifest files.
Preparing the mp4 File
Obtain the mp4 file referring to the following article.
Creating the VTT File
Perform transcription using the OpenAI API.
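import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# output_mp4_path and output_vtt_path are defined elsewhere
with open(output_mp4_path, "rb") as audio_file:
    # With response_format="vtt", the API returns the WebVTT text as a string
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt",
    )

with open(output_vtt_path, "w", encoding="utf-8") as file:
    file.write(transcript)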
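Because the transcription may contain errors, it can help to read the generated cues back before publishing them. The following is a minimal sketch using only the standard library. The helper names (`to_seconds`, `parse_vtt`) and the sample text are hypothetical and not part of the original workflow, and the parser only handles simple cues like the ones shown here.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000".
# The hours component is optional in WebVTT.
TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(timestamp):
    """Convert "hh:mm:ss.ttt" or "mm:ss.ttt" to seconds."""
    parts = timestamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_vtt(text):
    """Return a list of (start, end, text) tuples, one per cue."""
    cues = []
    # Blocks (header, cues) are separated by blank lines
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                start, end = (to_seconds(g) for g in match.groups())
                cues.append((start, end, "\n".join(lines[i + 1:]).strip()))
                break
    return cues


# Hypothetical sample standing in for the contents of output_vtt_path
sample = """WEBVTT

00:00:00.000 --> 00:00:02.500
こんにちは

00:00:02.500 --> 00:00:05.000
日本語のアクセント
"""

cues = parse_vtt(sample)
for start, end, text in cues:
    print(f"{start:7.3f} - {end:7.3f}  {text}")
```

With the real file, the text passed to `parse_vtt` would be read from `output_vtt_path` with `encoding="utf-8"`.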
Creating the Manifest File
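The following program creates the manifest file. It is an excerpt rather than complete code: variables such as prefix, label, mp4_path, mp4_url, vtt_url, and output_path are assumed to be defined elsewhere.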
from iiif_prezi3 import Manifest, AnnotationPage, Annotation, ResourceItem, config
# MoviePy 1.x import path; in MoviePy 2.x use `from moviepy import VideoFileClip`
from moviepy.editor import VideoFileClip


def get_video_duration(filename):
    """Return the duration of the media file in seconds."""
    with VideoFileClip(filename) as video:
        return video.duration


# Use Japanese as the default language for labels
config.configs['helpers.auto_fields.AutoLang'].auto_lang = "ja"

duration = get_video_duration(mp4_path)

manifest = Manifest(id=f"{prefix}/manifest.json", label=label)
canvas = manifest.make_canvas(id=f"{prefix}/canvas", duration=duration)

# Audio resource painted onto the canvas
anno_body = ResourceItem(id=mp4_url,
                         type="Sound",
                         format="audio/mp4",
                         duration=duration)
anno_page = AnnotationPage(id=f"{prefix}/canvas/page")
anno = Annotation(id=f"{prefix}/canvas/page/annotation",
                  motivation="painting",
                  body=anno_body,
                  target=canvas.id)
anno_page.add_item(anno)
canvas.add_item(anno_page)

# Add VTT URL
vtt_body = ResourceItem(id=vtt_url, type="Text", format="text/vtt")
vtt_anno = Annotation(
    id=f"{prefix}/canvas/annotation/webvtt",
    motivation="supplementing",
    body=vtt_body,
    target=canvas.id,
    label="WebVTT Transcript (machine-generated)"
)
vtt_anno_page = AnnotationPage(id=f"{prefix}/canvas/page/2")
vtt_anno_page.add_item(vtt_anno)
canvas.annotations = [vtt_anno_page]

with open(output_path, "w", encoding="utf-8") as f:
    f.write(manifest.json(indent=2))
The manifest is built with the iiif-prezi3 library. Please also refer to the following article.
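To check that the output has the intended shape, with the audio as a painting annotation under the canvas's items and the WebVTT file as a supplementing annotation under the canvas's annotations, the JSON can be inspected with a few lines of standard-library code. This is only a sketch: `summarize_canvas` is a hypothetical helper, and `sample_manifest` is a hand-written stand-in with placeholder URLs, not the actual generated file.

```python
def summarize_canvas(manifest):
    """List (container, motivation, body type, body format) for the first canvas."""
    canvas = manifest["items"][0]
    found = []
    # Canvas "items" carries painting content; "annotations" carries the rest
    for container in ("items", "annotations"):
        for page in canvas.get(container, []):
            for anno in page.get("items", []):
                body = anno.get("body", {})
                if isinstance(body, list):  # body may also be a list of resources
                    body = body[0] if body else {}
                found.append((container, anno.get("motivation"),
                              body.get("type"), body.get("format")))
    return found


# Hand-written stand-in for the generated manifest; every URL is a placeholder
base = "https://example.org/iiif"
sample_manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": f"{base}/manifest.json",
    "type": "Manifest",
    "label": {"ja": ["sample"]},
    "items": [{
        "id": f"{base}/canvas",
        "type": "Canvas",
        "duration": 120.0,
        "items": [{
            "id": f"{base}/canvas/page",
            "type": "AnnotationPage",
            "items": [{
                "id": f"{base}/canvas/page/annotation",
                "type": "Annotation",
                "motivation": "painting",
                "body": {"id": "https://example.org/audio.mp4", "type": "Sound",
                         "format": "audio/mp4", "duration": 120.0},
                "target": f"{base}/canvas",
            }],
        }],
        "annotations": [{
            "id": f"{base}/canvas/page/2",
            "type": "AnnotationPage",
            "items": [{
                "id": f"{base}/canvas/annotation/webvtt",
                "type": "Annotation",
                "motivation": "supplementing",
                "body": {"id": "https://example.org/audio.vtt", "type": "Text",
                         "format": "text/vtt"},
                "target": f"{base}/canvas",
            }],
        }],
    }],
}

summary = summarize_canvas(sample_manifest)
for row in summary:
    print(row)
```

For the real output, the dictionary would come from `json.load` on the file written to `output_path` instead of `sample_manifest`.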
Summary
I hope this serves as a useful reference for applying IIIF to video and audio content.