Overview

This is a note on how to describe multiple VTT files for Audio/Visual materials using IIIF.

Here, we describe transcription text in both Japanese and English as shown below.

https://ramp.avalonmediasystem.org/?iiif-content=https://nakamura196.github.io/ramp_data/demo/3571280/manifest.json

Manifest File Description

An example is stored at the following location.

https://github.com/nakamura196/ramp_data/blob/main/docs/demo/3571280/manifest.json

Specifically, when the VTT files were described as multiple annotations, as shown below, the Ramp viewer processed them correctly.

...
"annotations": [
        {
          "id": "https://nakamura196.github.io/ramp_data/demo/3571280/canvas/page/2",
          "type": "AnnotationPage",
          "items": [
            {
              "id": "https://nakamura196.github.io/ramp_data/demo/3571280/canvas/annotation/webvtt",
              "type": "Annotation",
              "label": {
                "ja": [
                  "日本語 (machine-generated)"
                ]
              },
              "motivation": "supplementing",
              "body": {
                "id": "https://nakamura196.github.io/ramp_data/demo/3571280/3571280.vtt",
                "type": "Text",
                "format": "text/vtt",
                "label": {
                  "ja": [
                    "日本語 (machine-generated)"
                  ]
                }
              },
              "target": "https://nakamura196.github.io/ramp_data/demo/3571280/canvas"
            },
            {
              "id": "https://nakamura196.github.io/ramp_data/demo/3571280/canvas/annotation/webvtt/2",
              "type": "Annotation",
              "label": {
                "ja": [
                  "English (machine-generated)"
                ]
              },
              "motivation": "supplementing",
              "body": {
                "id": "https://nakamura196.github.io/ramp_data/demo/3571280/3571280_en.vtt",
                "type": "Text",
                "format": "text/vtt",
                "label": {
                  "ja": [
                    "English (machine-generated)"
                  ]
                }
              },
              "target": "https://nakamura196.github.io/ramp_data/demo/3571280/canvas"
            }
          ]
        }
      ]
...
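The AnnotationPage above can also be generated programmatically. The following is a minimal sketch that builds the same structure for an arbitrary list of VTT tracks; the base URL, file names, and labels are taken from the example manifest and should be replaced with your own identifiers.

```python
# Base URL of the example manifest; adapt to your own environment.
BASE = "https://nakamura196.github.io/ramp_data/demo/3571280"

def make_vtt_annotation(ann_id, vtt_url, label):
    """Build one 'supplementing' annotation pointing at a WebVTT file."""
    return {
        "id": ann_id,
        "type": "Annotation",
        "label": {"ja": [label]},
        "motivation": "supplementing",
        "body": {
            "id": vtt_url,
            "type": "Text",
            "format": "text/vtt",
            "label": {"ja": [label]},
        },
        "target": f"{BASE}/canvas",
    }

# (annotation id, VTT file URL, display label) for each language track.
tracks = [
    (f"{BASE}/canvas/annotation/webvtt",
     f"{BASE}/3571280.vtt", "日本語 (machine-generated)"),
    (f"{BASE}/canvas/annotation/webvtt/2",
     f"{BASE}/3571280_en.vtt", "English (machine-generated)"),
]

annotations = [
    {
        "id": f"{BASE}/canvas/page/2",
        "type": "AnnotationPage",
        "items": [make_vtt_annotation(*t) for t in tracks],
    }
]
```

Adding a third subtitle language is then just one more entry in tracks.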

Note that in Clover, the two transcription texts were displayed consecutively.

https://samvera-labs.github.io/clover-iiif/docs/viewer/demo?iiif-content=https://nakamura196.github.io/ramp_data/demo/3571280/manifest.json

(Reference) Creating English Transcription Text

The following program was used to create the English transcription text. This example uses the open-source (GitHub) version of Whisper.

https://github.com/openai/whisper

import whisper

def format_timestamp(seconds):
    """Convert a time in seconds to an 'HH:MM:SS.mmm' string."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02}:{minutes:02}:{seconds:06.3f}"

def write_vtt(transcription, file_path):
    """Write the segments of a Whisper result to a WebVTT file."""
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write("WEBVTT\n\n")
        for segment in transcription['segments']:
            start = format_timestamp(segment['start'])
            end = format_timestamp(segment['end'])
            text = segment['text'].strip()
            file.write(f"{start} --> {end}\n{text}\n\n")

def translate(input_path, output_path, verbose=False):
    """Transcribe Japanese audio and translate it to English, saving WebVTT."""
    model = whisper.load_model('medium')
    result = model.transcribe(input_path, verbose=verbose, language="ja", task="translate")
    write_vtt(result, output_path)
    return result
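As a quick check of the cue timing, format_timestamp produces zero-padded hours and minutes with millisecond precision, which is the HH:MM:SS.mmm format WebVTT expects. The helper is repeated here so the snippet runs standalone:

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an 'HH:MM:SS.mmm' string."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02}:{minutes:02}:{seconds:06.3f}"

print(format_timestamp(3661.5))  # → 01:01:01.500
print(format_timestamp(0.123))   # → 00:00:00.123
```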

Initially, I tried translation with the API version of Whisper, as shown below, but the output remained in Japanese and I could not produce English text.

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()
with open("3571280.mp3", "rb") as audio_file:  # hypothetical input file name
    transcript = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt",
    )

Summary

I hope this is helpful when describing multiple transcription and subtitle files.