This article summarizes how to automatically add English subtitles and English audio to a Japanese video, using Whisper on Azure OpenAI Service together with Azure Speech Services.
Overview
The goal is to turn a video with Japanese audio into a multilingual set:
- Japanese version: Original video (Japanese audio, no subtitles)
- English version: English audio + English subtitles
Services Used
| Service | Purpose |
|---|---|
| Azure OpenAI Service (Whisper) | Translation from Japanese audio to English text |
| Azure Speech Services (TTS) | Synthesis from English text to English audio |
| FFmpeg | Audio extraction and video merging |
Procedure
1. Environment Setup
Required Tools
# Install FFmpeg (macOS)
brew install ffmpeg
# Python libraries
pip install python-dotenv requests
Azure Configuration (.env)
AZURE_OPENAI_ENDPOINT=https://xxxxx.openai.azure.com
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT_NAME=whisper
AZURE_OPENAI_API_VERSION=2024-06-01
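These values can then be loaded into Python with python-dotenv (installed above); a minimal sketch, with the variable names used in the later snippets:

```python
import os

# Load .env into the environment if python-dotenv is available;
# plain environment variables work just as well.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

AZURE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "whisper")
API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01")
```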
2. Extract Audio from Video
Since the Azure Whisper API has a 25 MB file-size limit, the audio is compressed as it is extracted:
ffmpeg -i input.mp4 -vn -acodec libmp3lame -b:a 64k -ar 16000 audio.mp3
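Because the 25 MB limit is hard, it is worth verifying the compressed file before uploading. A minimal sketch (the helper name is mine):

```python
import os

WHISPER_LIMIT_BYTES = 25 * 1024 * 1024  # Azure Whisper API upload limit

def check_whisper_size(path: str) -> bool:
    """Return True if the audio file fits under the 25 MB Whisper limit."""
    return os.path.getsize(path) <= WHISPER_LIMIT_BYTES

# At 64 kbit/s, one minute of MP3 is roughly 480 KB,
# so around 50 minutes of audio still fits under the limit.
```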
3. Generate English Subtitles with Whisper
Using the Azure OpenAI Service’s Whisper API, Japanese audio is transcribed while being translated into English.
import requests

url = f"{AZURE_ENDPOINT}/openai/deployments/whisper/audio/translations?api-version=2024-06-01"
headers = {
    "api-key": AZURE_API_KEY,
}

with open("audio.mp3", "rb") as audio_file:
    files = {"file": audio_file}
    data = {"response_format": "srt"}  # Output in SRT format
    response = requests.post(url, headers=headers, files=files, data=data)
response.raise_for_status()

# Save as SRT file
with open("subtitles_en.srt", "w") as f:
    f.write(response.text)
Key point: Using the translations endpoint, Japanese audio is directly translated into English text.
4. Generate English Audio with Speech Services
The generated subtitle text is synthesized into audio using Azure Speech Services.
url = f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": API_KEY,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
}

ssml = f"""<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
    <voice name='en-US-JennyNeural'>
        {text}
    </voice>
</speak>"""

response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
# response.content now holds the synthesized MP3 bytes
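One caveat: the subtitle text is interpolated straight into XML, so characters such as & or < would break the SSML. Escaping first is safer; a small sketch (the helper name is mine):

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, voice: str = "en-US-JennyNeural") -> str:
    """Wrap XML-escaped text in the SSML envelope expected by the TTS endpoint."""
    safe = escape(text)  # & -> &amp;, < -> &lt;, > -> &gt;
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xml:lang='en-US'>"
        f"<voice name='{voice}'>{safe}</voice></speak>"
    )
```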
5. Create Timing-Synchronized Audio
To place audio according to subtitle timestamps, the following processing is performed:
- Generate individual audio for each subtitle segment
- Insert silence based on timestamps
- Concatenate all segments
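Computing the silence gaps requires the SRT timestamps as seconds. A minimal parsing sketch for the `HH:MM:SS,mmm` format (the helper name is mine):

```python
def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    hms, millis = ts.strip().split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s + int(millis) / 1000.0
```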
# Process each subtitle segment
for subtitle in subtitles:
    # Insert silence for the gap since the previous segment
    if gap > 0:
        create_silence(gap, silence_path)
    # Generate audio with TTS
    text_to_speech(subtitle['text'], speech_path)
    # Adjust tempo if the synthesized audio is too long
    if actual_duration > segment_duration:
        speed = actual_duration / segment_duration
        # Adjust speed with ffmpeg's atempo filter
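One practical detail with the atempo filter: older FFmpeg builds accept only factors in roughly the 0.5–2.0 range per filter instance, so larger corrections need a chain of stages. A sketch of building such a filter string (the helper name is mine):

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg -filter:a value for an arbitrary speed factor.

    A single atempo instance accepts roughly 0.5-2.0 on older FFmpeg
    builds, so factors outside that range are split into valid stages.
    """
    factors = []
    while speed > 2.0:
        factors.append(2.0)
        speed /= 2.0
    while speed < 0.5:
        factors.append(0.5)
        speed /= 0.5
    factors.append(speed)
    return ",".join(f"atempo={f:g}" for f in factors)
```

The result is passed as, e.g., `ffmpeg -i seg.mp3 -filter:a "atempo=2,atempo=1.5" seg_fast.mp3`.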
6. Create English Version Video
Finally, combine the original video’s visuals with the English audio.
ffmpeg -i original.mp4 -i english_audio.mp3 \
-c:v copy -map 0:v:0 -map 1:a:0 -shortest \
output_en.mp4
Generated Files
| File | Description |
|---|---|
| original.mp4 | Japanese version (original video) |
| output_en.mp4 | English version (with English audio) |
| subtitles_en.srt | English subtitle file |
Uploading to YouTube
Japanese Version
- Upload the original video as-is
English Version
- Upload the video with English audio
- Add the SRT file in YouTube Studio > Subtitles tab
Cost
| Service | Price (approximate) |
|---|---|
| Azure OpenAI Whisper | $0.006 / minute |
| Azure Speech Services | $16 / 1 million characters |
For a video of about 8 minutes, the cost is roughly a few dozen yen.
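As a rough worked example for an 8-minute video, using the rates above (the speech-density figures of ~150 spoken words per minute and ~5 characters per word are my own assumptions):

```python
MINUTES = 8
WHISPER_PER_MIN = 0.006           # USD per minute of audio
TTS_PER_MILLION_CHARS = 16.0      # USD per 1M characters

# Assumed speech density: 150 words/min * 5 chars/word
chars = MINUTES * 150 * 5         # ~6,000 characters

whisper_cost = MINUTES * WHISPER_PER_MIN              # ~$0.05
tts_cost = chars / 1_000_000 * TTS_PER_MILLION_CHARS  # ~$0.10
total = whisper_cost + tts_cost                       # ~$0.14, i.e. a few dozen yen
```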
Summary
By combining Azure OpenAI Service’s Whisper and Speech Services, you can automate the process of creating English versions of Japanese videos.
Benefits:
- High-accuracy translation (Whisper’s translation feature)
- Natural English audio (Neural TTS)
- Automatic generation of timestamped subtitles
Notes:
- Adjustment of audio length and subtitle timing is required
- Technical terms and proper nouns may need manual correction
Repository Structure
.
├── .env # Azure credentials
├── .gitignore # Exclude data/
├── generate_subtitles.py # Subtitle generation script
├── generate_audio.py # Audio generation script
├── generate_synced_audio.py # Synced audio generation script
├── viewer.html # Subtitle preview viewer
└── data/ # Media files (not tracked by git)
├── original.mp4
├── output_en.mp4
├── subtitles_en.srt
└── audio_en.mp3