Overview

I recently had the opportunity to display audio files with subtitles in an IIIF viewer, so this is a memo of the process.

The target is “Accents and Intonation of the Japanese Language (Part 2)”, published in the National Diet Library Historical Sound Archive. Transcription was performed with OpenAI’s speech-to-text API (Whisper). Please note that the transcription results may contain errors.

The following is a display example in Ramp.

https://ramp.avalonmediasystem.org/?iiif-content=https://nakamura196.github.io/ramp_data/demo/3571280/manifest.json

The following is a display example in Clover.

https://samvera-labs.github.io/clover-iiif/docs/viewer/demo?iiif-content=https://nakamura196.github.io/ramp_data/demo/3571280/manifest.json

The following is a display example in Aviary. Unfortunately, with the manifest file format used this time, the transcription text could not be displayed.

https://iiif.aviaryplatform.com/player?manifest=https://nakamura196.github.io/ramp_data/demo/3571280/manifest.json

Below, I introduce how to create these manifest files.

Preparing the mp4 File

Obtain the mp4 file by referring to the following article.

Creating the VTT File

Perform transcription using the OpenAI API.

import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Transcribe the audio with Whisper and request WebVTT output
with open(output_mp4_path, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt",
    )

# With response_format="vtt", the API returns the subtitles as a string
with open(output_vtt_path, "w", encoding="utf-8") as file:
    file.write(transcript)
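As a quick sanity check on the result, note that a valid WebVTT file must begin with the WEBVTT header. A minimal check along these lines (the helper function and sample cue below are illustrative, not part of the original script) can catch an empty or malformed API response before the file is published:

```python
def looks_like_vtt(text):
    """Return True if the text starts with the mandatory WEBVTT header."""
    first_line = text.lstrip("\ufeff").splitlines()[0]
    return first_line.strip().startswith("WEBVTT")

# A hypothetical cue, in the shape Whisper returns for Japanese audio
sample = """WEBVTT

00:00:00.000 --> 00:00:03.200
こんにちは。
"""
print(looks_like_vtt(sample))  # True
```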

Creating the Manifest File

The following program (partial code; variables such as prefix, label, mp4_path, mp4_url, vtt_url, and output_path must be defined beforehand) creates the manifest file.

from iiif_prezi3 import Manifest, AnnotationPage, Annotation, ResourceItem, config
from moviepy.editor import VideoFileClip

def get_video_duration(filename):
    """Return the duration of the media file in seconds."""
    with VideoFileClip(filename) as video:
        return video.duration

# Let plain string labels default to Japanese language maps
config.configs['helpers.auto_fields.AutoLang'].auto_lang = "ja"

duration = get_video_duration(mp4_path)

manifest = Manifest(id=f"{prefix}/manifest.json", label=label)
canvas = manifest.make_canvas(id=f"{prefix}/canvas", duration=duration)

# Painting annotation: the audio content itself
anno_body = ResourceItem(
    id=mp4_url,
    type="Sound",
    format="audio/mp4",
    duration=duration,
)
anno_page = AnnotationPage(id=f"{prefix}/canvas/page")
anno = Annotation(
    id=f"{prefix}/canvas/page/annotation",
    motivation="painting",
    body=anno_body,
    target=canvas.id,
)
anno_page.add_item(anno)
canvas.add_item(anno_page)

# Supplementing annotation: the WebVTT transcript
vtt_body = ResourceItem(id=vtt_url, type="Text", format="text/vtt")
vtt_anno = Annotation(
    id=f"{prefix}/canvas/annotation/webvtt",
    motivation="supplementing",
    body=vtt_body,
    target=canvas.id,
    label="WebVTT Transcript (machine-generated)",
)
vtt_anno_page = AnnotationPage(id=f"{prefix}/canvas/page/2")
vtt_anno_page.add_item(vtt_anno)

# Supplementing annotations go on the canvas's annotations property,
# not in its items
canvas.annotations = [vtt_anno_page]

with open(output_path, "w") as f:
    f.write(manifest.json(indent=2))
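For reference, the undefined names in the script above can be filled in along the following lines. The prefix matches the published manifest location shown earlier; the media file names (3571280.mp4 / 3571280.vtt) are assumptions for this sketch, not necessarily the actual published paths:

```python
# Illustrative values for the variables used in the manifest script.
# The file names below are placeholders for this sketch.
prefix = "https://nakamura196.github.io/ramp_data/demo/3571280"
label = "Accents and Intonation of the Japanese Language (Part 2)"
mp4_path = "3571280.mp4"            # local file used to measure duration
mp4_url = f"{prefix}/3571280.mp4"   # URL the manifest will point at
vtt_url = f"{prefix}/3571280.vtt"
output_path = "manifest.json"

print(mp4_url)
```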

The iiif-prezi3 library is used. Please also refer to the following article.

Summary

We hope this serves as a useful reference for applying IIIF to video and audio content.