This article is co-authored with generative AI. While I have cross-checked facts against official documentation where possible, errors may remain. Please verify primary sources before making important decisions.

I ran an experiment to narrate technical blog articles with a synthetic voice cloned from my own speech. The audio is generated with ElevenLabs Voice Cloning + the v3 model (eleven_v3, in alpha at the time of writing).

This post records an A/B comparison of v2 (eleven_multilingual_v2) and v3 on identical Japanese narration material, together with operational observations.

As a side effect, the resulting audio is wrapped as MP4 (cover image + audio + frequency-bar overlay) and placed on YouTube under a dedicated playlist.

Background

I’m interested in approaches that use AI to reproduce a specific person’s voice and speaking style, and then read written texts or interview transcripts in something close to that person’s voice. Similar efforts are being made in the context of digitally archiving historical or deceased figures, and I wanted to gather first-hand technical and ethical findings on what works and what does not.

Working with someone else’s voice raises rights, consent, and ethical concerns, so I am running this self-experiment first — putting my own voice through the same pipeline to assess synthesis quality, operational cost, and pitfalls.

Experimental pipeline

Article (Markdown)
  ↓ Narration script (.txt) — currently semi-manual via Claude Code
  ↓ ElevenLabs API (eleven_v3) → MP3
  ↓ Pillow → 1920x1080 cover image
  ↓ ffmpeg: still cover + audio + showfreqs bars → MP4
  ↓ YouTube Data API: publish + dedicated playlist + tag-based playlists
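The cover-image step is a single Pillow render. A minimal sketch, assuming a 1920x1080 canvas, a locally available Japanese-capable font file, and the article title as input (file names, the font path, and coordinates are placeholders, not the actual script):

```python
# Hypothetical cover generator: dark 1920x1080 canvas with the article title drawn on it.
from PIL import Image, ImageDraw, ImageFont

def make_cover(title: str, out_path: str = "cover.png") -> None:
    img = Image.new("RGB", (1920, 1080), color=(20, 24, 32))   # plain dark background
    draw = ImageDraw.Draw(img)
    # Placeholder font path; any Japanese-capable TrueType/OpenType font works.
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 72)
    draw.text((120, 480), title, font=font, fill=(240, 240, 240))
    img.save(out_path)
```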

A purely static cover looks visually flat, so the ffmpeg showfreqs filter overlays animated frequency bars along the bottom of the frame.
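Concretely, the muxing step can be one ffmpeg invocation driven from Python. A sketch under assumptions (placeholder file names; filter parameters such as the 1920x200 bar size are illustrative, not the exact values I use):

```python
# Hypothetical muxing step: still cover + narration MP3 + showfreqs bars -> MP4.
import subprocess

filter_graph = (
    "[1:a]showfreqs=s=1920x200:mode=bar[bars];"   # render frequency bars from the audio
    "[0:v][bars]overlay=0:H-h[v]"                 # pin the bars to the bottom edge of the cover
)

subprocess.run(
    [
        "ffmpeg",
        "-loop", "1", "-i", "cover.png",          # loop the still cover image
        "-i", "narration.mp3",                    # ElevenLabs output
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                              # stop when the audio ends
        "episode.mp4",
    ],
    check=True,
)
```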

Voice Cloning options

ElevenLabs offers two voice cloning paths:

  • IVC (Instant Voice Cloning) — generates a clone instantly from 1–5 minutes of sample audio, using inference-time conditioning
  • PVC (Professional Voice Cloning) — generates a fine-tuned model from 30+ minutes of audio, said to be more stable for long-form narration

This experiment uses IVC, with a short past public talk (a 2023 Digital Archives Society lightning talk) as the reference sample. PVC would likely be more appropriate for long-form stability, but I wanted to first see how far IVC goes for practical use.
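For reference, creating an IVC voice is a single API call with the sample recording attached. A hedged sketch using the public voices/add endpoint (the file name and voice name are placeholders; the same upload can also be done from the ElevenLabs dashboard):

```python
# Hypothetical IVC creation: upload one reference recording and get back a voice_id.
import os
import requests

resp = requests.post(
    "https://api.elevenlabs.io/v1/voices/add",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    data={"name": "my-cloned-voice"},                        # placeholder voice name
    files=[("files", ("lightning_talk.mp3",                  # the reference sample
                      open("lightning_talk.mp3", "rb"), "audio/mpeg"))],
    timeout=120,
)
resp.raise_for_status()
voice_id = resp.json()["voice_id"]                           # used for later TTS requests
print(voice_id)
```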

Writing the narration script

Unlike scripts for VOICEVOX, ElevenLabs input can be natural prose. However, English abbreviations and technical terms read by a Japanese TTS need explicit katakana spellings for consistent pronunciation.

Source term | Script (katakana) | Romanized
Next.js | ネクスト・ジェイエス | nekusuto jeiesu
WAF | ワッフ | waffu
SSR | エスエスアール | esu esu āru
JA3 fingerprint | ジェー・エー・スリー・フィンガープリント | jē ē surī fingāpurinto
$23 | 二十三ドル | nijūsan doru
ALPN | エー・エル・ピー・エヌ | ē eru pī enu

“WAF” can be read in several ways (“wafu”, “double-yu ē efu”); I keep it as “waffu” consistently across the series.
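The substitution itself is mechanical. A small sketch of how the table above could be applied before synthesis (the function name and the idea of doing it as plain string replacement are mine, not the actual generator):

```python
# Hypothetical term-normalization step: spell out English terms in katakana before TTS.
KATAKANA_TERMS = {
    "Next.js": "ネクスト・ジェイエス",
    "WAF": "ワッフ",
    "SSR": "エスエスアール",
    "JA3 fingerprint": "ジェー・エー・スリー・フィンガープリント",
    "$23": "二十三ドル",
    "ALPN": "エー・エル・ピー・エヌ",
}

def normalize_terms(script: str) -> str:
    # Replace longer keys first so "JA3 fingerprint" wins over any shorter substring.
    for term in sorted(KATAKANA_TERMS, key=len, reverse=True):
        script = script.replace(term, KATAKANA_TERMS[term])
    return script
```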

Both opening and closing are templated:

こんにちは、〇〇です。今回は、〜について、お話しします。
[body]
以上、ご清聴ありがとうございました。

(English: “Hello, this is [name]. Today I’ll talk about ~. … Thank you for listening.”)

Same script, v2 vs v3

I ran the same 284-character Japanese script — heavy with technical katakana and numbers — through eleven_multilingual_v2 and eleven_v3.
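In code, the A/B run is just the same script sent twice with different model IDs. A sketch assuming the text-to-speech endpoint with default MP3 output (the voice ID and file names are placeholders, and voice settings are left at their defaults):

```python
# Hypothetical A/B run: one script, two model IDs, two MP3 files.
import os
import requests

API = "https://api.elevenlabs.io/v1/text-to-speech"
HEADERS = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}

def synthesize(script: str, voice_id: str, model_id: str, out_path: str) -> None:
    resp = requests.post(
        f"{API}/{voice_id}",
        headers=HEADERS,
        json={"text": script, "model_id": model_id},
        timeout=300,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)                    # MP3 bytes by default

script = open("episode_script.txt", encoding="utf-8").read()
for model_id, out_path in [("eleven_multilingual_v2", "v2.mp3"), ("eleven_v3", "v3.mp3")]:
    synthesize(script, "YOUR_VOICE_ID", model_id, out_path)
```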

v2 (eleven_multilingual_v2), 45.3 sec:

v3 (eleven_v3, alpha), 42.2 sec:

Observed differences

These are subjective observations from listening side by side.

1. Intonation on katakana technical terms

v2 tends to flatten long compound katakana terms like “ネクスト・ジェイエス” — each unit gets a similar accent. v3 places stress more naturally, even on complex compounds like “ジェー・エー・スリー・フィンガープリント”.

2. Pause length at punctuation

v2 uses fairly uniform pauses at every Japanese comma (、). v3 follows the semantic grouping and varies short and long pauses. The phrase “月額、二十三ドルから、ゼロドル、というような” (several commas in succession) sounds more natural in v3.

3. Sentence-final delivery

In v2, sentence endings (“〜です”, “〜になりました”) sometimes trail off in a slightly unnatural way. v3 lands the endings more steadily, and noticeably unnatural moments are less frequent.

4. Synthesis duration

For the 284-char sample, output length was 45.3 sec on v2 vs 42.2 sec on v3, about 7% shorter. This does not mean v3 synthesizes faster; the shorter output appears to come from more natural pause timing rather than from padded silence.

Operational notes

Per-request character limit

According to ElevenLabs’ help center (as of April 2026):

Model | Per-request limit
eleven_v3 (alpha) | 5,000 chars
eleven_multilingual_v2 | 10,000 chars
eleven_flash_v2_5 | 40,000 chars

A typical 2,000-character tech-blog episode fits within v3’s limit, but longer-form content needs splitting and ffmpeg concatenation.
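A rough sketch of that splitting step, chunking at Japanese sentence boundaries so each request stays under the limit (the function is illustrative; the chunks would then be synthesized separately and joined with ffmpeg's concat demuxer):

```python
# Hypothetical chunking: split at 。 so each piece stays under the per-request limit.
def split_script(script: str, limit: int = 5000) -> list[str]:
    chunks, current = [], ""
    for sentence in script.replace("。", "。\n").splitlines():
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks   # note: a single sentence longer than the limit still needs manual handling
```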

Cases where v2 still fits

ElevenLabs documentation positions eleven_multilingual_v2 as the model recommended for long-form stability. v3 alpha was practical enough for the few-minute scripts in this experiment, but for longer content or when stability is the priority, v2 remains a valid choice.

Coexisting with the existing VOICEVOX series

When the same article exists as both a VOICEVOX video and a read-aloud audio on YouTube, viewers can confuse the two. I prefix the read-aloud titles with 【朗読】 (read-aloud) and route them into a dedicated playlist; tag-based playlists hold both.
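The playlist routing is a single YouTube Data API call once the video ID is known. A sketch using google-api-python-client, assuming OAuth credentials and the target playlist ID are already available from the upload step:

```python
# Hypothetical playlist step: put an uploaded read-aloud video into the dedicated playlist.
from googleapiclient.discovery import build

def add_to_reading_playlist(creds, video_id: str, playlist_id: str) -> None:
    youtube = build("youtube", "v3", credentials=creds)
    youtube.playlistItems().insert(
        part="snippet",
        body={
            "snippet": {
                "playlistId": playlist_id,
                "resourceId": {"kind": "youtube#video", "videoId": video_id},
            }
        },
    ).execute()
```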

Cost

I’m on the ElevenLabs Creator plan. Both v2 and v3 are available on the same tier. Each episode script is around 1,500–2,000 characters (3–4 min audio), so a few episodes per month fit comfortably within the API character allowance.

Summary

  • For Japanese tech-blog narration, ElevenLabs v3 (alpha) sounded more natural than v2 in the cases I tested
  • The biggest differences are on consecutive katakana terms, multi-comma sentences, and sentence-final delivery (subjective)
  • That said, v2 is the model officially recommended for long-form stability, so v2 may be preferable for longer or stability-critical use cases
  • This experiment is a self-administered preparation step toward the broader interest in AI-based reproduction of a specific person’s voice (the kind of work being done in the digital-archive context for historical or deceased figures); running my own voice through the pipeline first lets me gather technical and ethical findings safely
  • As a side effect, the resulting audio is wrapped as MP4 and placed under a dedicated YouTube playlist

Generating the narration script is currently semi-manual; I plan to migrate to a Claude Code-driven workflow that produces the script directly from the source Markdown.