Why I Cloned My Voice
I'm building an AI avatar course. Four hours of video content, 39 segments, delivered by a digital version of me through HeyGen.
The avatar looks right. But the default text-to-speech voices sound like a GPS giving directions to a conference room. Professional. Lifeless. Wrong.
If the voice doesn't sound like you, the whole thing collapses. Viewers clock it in seconds. So I cloned my voice with ElevenLabs, paired it with my HeyGen avatar, and used it to produce a full AI Leadership course. Then I used the same setup for a client project — converting hundreds of hours of professional lectures into avatar courseware.
Here's exactly how I did it, what it cost, and what I'd do differently.
Which Plan You Actually Need
Most people overspend or underspend here. The free plan has no voice cloning at all. The $5 Starter plan only gets you Instant cloning, which sounds decent but not great. If you're doing anything professional — a course, client work, content that represents your brand — you need Professional Voice Cloning.
That starts at the Creator plan ($22/month).
| Plan | Price | Voice Cloning | Credits (~minutes) | Commercial Rights |
|------|-------|--------------|-------------------|------------------|
| Free | $0 | None | 10K (~10 min) | No |
| Starter | $5/mo | Instant only | 30K (~30 min) | Yes |
| Creator | $22/mo | Instant + Professional | 100K (~100 min) | Yes |
| Pro | $99/mo | Full quality (44.1 kHz PCM via API) | 500K (~500 min) | Yes |
Annual billing saves you two months. I'm on the Creator plan. It handles everything I need for course production — about 100 minutes of generated audio per month, which covers roughly 20 segments.
If you're just experimenting, the $5 Starter with Instant cloning is fine. If you're producing anything that goes in front of clients or students, go Creator.
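If you'd rather work the plan choice out from your own numbers, here is a minimal sketch. It assumes the prices and approximate minutes in the table above are still current (check before relying on them), and the names `Plan`, `PLANS`, and `cheapest_plan` are made up for illustration.

```python
from dataclasses import dataclass

# Ranking of cloning tiers so plans can be compared; labels mirror the table above.
CLONE_RANK = {"none": 0, "instant": 1, "professional": 2}


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_usd: int
    minutes: int   # approximate minutes of generated audio per month
    cloning: str   # best cloning tier included: "none", "instant", or "professional"


# Figures copied from the table above, not fetched from ElevenLabs.
PLANS = [
    Plan("Free", 0, 10, "none"),
    Plan("Starter", 5, 30, "instant"),
    Plan("Creator", 22, 100, "professional"),
    Plan("Pro", 99, 500, "professional"),
]


def cheapest_plan(minutes_needed: float, cloning_needed: str) -> Plan:
    """Return the lowest-priced plan that covers both the minutes and the cloning tier."""
    for plan in sorted(PLANS, key=lambda p: p.monthly_usd):
        enough_minutes = plan.minutes >= minutes_needed
        enough_cloning = CLONE_RANK[plan.cloning] >= CLONE_RANK[cloning_needed]
        if enough_minutes and enough_cloning:
            return plan
    raise ValueError("No single plan in the table covers this workload")


print(cheapest_plan(100, "professional").name)
```

For 100 minutes a month with Professional cloning this prints `Creator`, which matches my own setup. A full four-hour course generated in a single month would push you to Pro, or to spreading production over several months.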
[Start your ElevenLabs clone here.][AFFILIATE_LINK]
Instant vs. Professional — The Real Difference
ElevenLabs offers two voice cloning methods. They are not the same product.
Instant Voice Clone (IVC):
- Upload 1-5 minutes of audio
- Ready immediately
- Available on Starter ($5) and up
- Can clone anyone's voice (with their consent)
- Quality: usable for demos and internal work. Slightly flat on sentences the model hasn't heard before. Fine for prototyping.
Professional Voice Clone (PVC):
- Upload 30+ minutes of audio (1-3 hours recommended)
- Requires a voice verification recording to prove it's your voice
- Processing takes 2-6 hours officially
- Available on Creator ($22) and up
- Restriction: can only clone YOUR OWN voice. For someone else's voice, use Instant.
- Quality: nearly indistinguishable from the real thing. Captures cadence, emphasis, breathing patterns.
I started with an Instant clone to test the concept. It was good enough to convince me the approach would work. Then I did the Professional clone for actual production.
The difference is immediately obvious. The Instant clone sounds like me reading a script. The Professional clone sounds like me talking.
Step 1: Prepare Your Audio
You need at least 30 minutes of clean audio. More is better — I used about 90 minutes pulled from recordings I already had. ElevenLabs recommends 1-3 hours for the best result.
Audio specs that matter:
- Sample rate: 44.1 kHz or 48 kHz
- Bit depth: at least 24-bit
- Level: between -23 dB and -18 dB RMS, true peak of -3 dB
- Format: WAV preferred
If you have existing podcast episodes, lecture recordings, or conference talks, those work. I used recordings from teacher training sessions I'd done. The content doesn't matter — the model only cares about the characteristics of your voice.
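A quick way to check the sample rate and bit depth before uploading is Python's standard `wave` module. This is a minimal sketch: `check_wav` is a hypothetical helper name, and `wave` reads uncompressed PCM WAV files, so other encodings may raise an error instead of returning a result.

```python
import wave

ACCEPTED_RATES = {44100, 48000}  # Hz, from the spec list above
MIN_BIT_DEPTH = 24               # bits, from the spec list above


def check_wav(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes

    problems = []
    if rate not in ACCEPTED_RATES:
        problems.append(f"sample rate is {rate} Hz (want 44100 or 48000)")
    if bits < MIN_BIT_DEPTH:
        problems.append(f"bit depth is {bits}-bit (want at least {MIN_BIT_DEPTH}-bit)")
    return problems
```

Run it over every file in your source folder and fix anything it flags in your editor's export settings rather than by upsampling after the fact.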
Step 2: Clean Up the Audio
This is where most people get sloppy. The AI will faithfully reproduce whatever you give it — including every "um," "ah," and room echo.
Remove filler words. Cut dead air. Strip out background noise. If your recording has a fan humming or traffic in the background, the clone will have a fan humming in the background. Use a pop filter if you're recording fresh audio. Maintain consistent distance from the mic.
I spent about two hours cleaning my source audio. That's boring work. It's also the single highest-impact thing you can do for clone quality.
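Part of that cleanup can be pointed in the right direction with a script. The sketch below measures RMS level and flags stretches of dead air so you know where to cut. It assumes the audio has already been decoded to floats between -1.0 and 1.0, and the -50 dB threshold and one-second minimum are my own placeholder values, not ElevenLabs guidance.

```python
import math


def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dB relative to full scale, for samples in the range -1.0..1.0."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)


def find_dead_air(samples, rate, threshold_db=-50.0, min_seconds=1.0, window_seconds=0.1):
    """Return (start_s, end_s) spans where the level stays under threshold_db."""
    window = max(1, int(rate * window_seconds))
    spans = []
    start = None  # sample index where the current quiet run began
    for i in range(0, len(samples), window):
        quiet = rms_dbfs(samples[i:i + window]) < threshold_db
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            if (i - start) / rate >= min_seconds:
                spans.append((start / rate, i / rate))
            start = None
    # Close a quiet run that lasts until the end of the recording.
    if start is not None and (len(samples) - start) / rate >= min_seconds:
        spans.append((start / rate, len(samples) / rate))
    return spans
```

`rms_dbfs` over a whole file also tells you whether you are inside the -23 dB to -18 dB window from Step 1. Filler words and room echo still need your ears and an editor.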
Step 3: Open ElevenLabs and Find the Voice Cloning Page
Go to elevenlabs.io and sign in. If you don't have an account yet, create one — you'll need at least the Creator plan ($22/month) for Professional cloning.
Once you're logged in, you'll land on the main dashboard. Look at the left sidebar. You'll see a list of navigation items — things like "Text to Speech," "Speech to Speech," "Voice Design," and "Voices." Click "Voices."
This takes you to your voice library. If you've never created a voice, it will be mostly empty except for ElevenLabs' default voices. At the top of this page, you'll see a button that says "Add a new voice." It's hard to miss — it's one of the primary action buttons on the page. Click it.
A menu will appear with several options: Instant Voice Clone, Voice Design, and Professional Voice Clone. Select "Professional Voice Clone."
If you don't see the Professional option, you're likely on the Free or Starter plan. You need Creator ($22/month) or higher. Upgrade first, then come back to this step.
Step 4: Upload Your Audio Files
You'll now see the Professional Voice Clone setup page. There's a large upload area in the center — a dashed-border box with text like "Drag and drop audio files here" or "Click to browse."
You can either drag your cleaned WAV files directly from your file explorer into this box, or click the box to open a file browser and select them manually. You can upload multiple files at once.
As your files upload, you'll see progress bars and file names listed below the upload area. ElevenLabs will show you the total duration of audio you've provided. You want at least 30 minutes here. If you have 60-90 minutes, even better.
Watch for upload errors. If a file gets rejected, it's usually a format issue. WAV and MP3 both work, but WAV is preferred. Files that are too short (under a few seconds) may be rejected too. If something fails, just re-export the audio from your editor and try again.
Once all your files show as uploaded successfully, look for a "Next" or "Continue" button at the bottom of the page. Click it to move to the verification step.
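If you want to know whether you clear the 30-minute minimum before you start uploading, you can total the durations locally. A minimal sketch, assuming your cleaned files are uncompressed PCM WAVs in one folder; `wav_seconds` and `total_minutes` are hypothetical helper names.

```python
import wave
from pathlib import Path

MIN_MINUTES = 30  # minimum for a Professional clone, as stated above


def wav_seconds(path) -> float:
    """Duration of one WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(folder) -> float:
    """Combined duration, in minutes, of every .wav file directly inside folder."""
    return sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav"))) / 60


def enough_audio(folder) -> bool:
    """True when the folder holds at least MIN_MINUTES of audio."""
    return total_minutes(folder) >= MIN_MINUTES
```

The figure ElevenLabs shows after upload is the one that counts. This only saves you an upload round trip when you are short.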
Step 5: Record Your Voice Verification
This is the step that makes Professional cloning different from Instant. ElevenLabs needs to confirm you're cloning your own voice, not someone else's.
The page will show you a short text passage — usually a few sentences of generic content. Below the text, there's a record button (a microphone icon or a red circle). You'll also see a 4-letter verification code displayed on screen. You need to read the passage out loud AND say the code clearly.
Click the record button. Your browser will ask for microphone permission — allow it. Read the text naturally, at your normal speaking pace. When you reach the part where you say the 4-letter code, say each letter clearly (for example, "A-B-C-D" not "abcd").
When you're done, click the stop button (it replaces the record button while recording). You'll see a playback option so you can listen to what you just recorded. If it sounds clear and you didn't stumble over anything, you're good.
If you messed up, there should be a "Re-record" or "Try again" option. Don't stress about perfection — this isn't part of your voice model. It's just identity verification. As long as it's clearly your voice saying the passage and the code, it will pass.
Click "Submit" or "Next" to send your verification.
Step 6: Name Your Voice and Submit
You'll be asked to name your voice clone. Pick something you'll recognize later — I named mine "Ben - Professional" to distinguish it from the Instant clone I'd made earlier. You may also see optional fields for a description and labels. These are for your own organization; they don't affect the clone quality.
Review the summary page. It should show your total uploaded audio duration, your verification status, and your voice name. If everything looks right, click "Create Voice" or "Submit."
Step 7: Wait for Processing
The official estimate is 2-6 hours. My first clone came back in about 4 hours.
You'll see a status indicator on your Voices page — something like "Processing" with a spinner or progress indicator. You don't need to keep the browser open. ElevenLabs will email you when your clone is ready.
Fair warning: in early 2026, some users reported 3-4 week backlogs during peak demand. I hit a 3-day wait on a re-clone I did in February. Plan accordingly — don't start this the night before your deadline.
Step 8: Test Your Clone Before You Produce Anything
Once processing is complete, your new voice will appear in your Voices library. Go back to "Voices" in the left sidebar and find it — it should be listed under your custom voices with the name you gave it.
Now go to "Text to Speech" in the left sidebar. In the voice selector dropdown (usually near the top of the page, labeled "Voice" or showing the name of the currently selected voice), click it and find your Professional clone. Select it.
Type a few paragraphs of text in the main text box. Use the kind of content you'll actually produce — if you're making educational content, test with educational scripts, not marketing copy. Click the "Generate" button.
Listen carefully. Test for:
- Pronunciation — does it handle your domain terminology? My clone initially butchered "SCORM" and "xAPI."
- Natural pauses — does it breathe where you'd breathe?
- Tone shifts — try a sentence that should sound emphatic, then one that should be calm. Does it vary?
- Long passages — generate a full paragraph. Does it stay consistent or drift?
If something sounds off, the settings panel (usually on the right side or below the text box) lets you adjust Stability (higher = more consistent, lower = more expressive) and Clarity + Similarity Enhancement (higher = closer to your voice, but can sound robotic at max). I keep Stability around 50-60% and Clarity around 75% for my production work.
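The same two sliders can be set per request if you generate audio through the API instead of the web page. The sketch below only builds the request and does not send it. The endpoint path, the `xi-api-key` header, the `similarity_boost` field name, and the `eleven_multilingual_v2` model ID reflect my understanding of the public REST API and should be checked against the current reference. `build_tts_request` is a hypothetical helper.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"  # assumption: public REST base URL


def build_tts_request(api_key, voice_id, text, stability=0.55, similarity=0.75,
                      model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request with explicit voice settings."""
    body = {
        "text": text,
        "model_id": model_id,  # assumption: ID of the v2 model
        "voice_settings": {
            "stability": stability,          # 0.0-1.0, the Stability slider
            "similarity_boost": similarity,  # 0.0-1.0, the similarity slider
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Sending is left commented out so the sketch runs without a key or network access:
# with urllib.request.urlopen(build_tts_request(KEY, VOICE_ID, "Hello")) as resp:
#     audio_bytes = resp.read()
```

The defaults mirror the values I use in the UI: Stability in the 50-60% range and similarity around 75%.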
Recording Tips That Actually Matter
After producing dozens of segments with my clone, here's what I wish someone had told me:
Record in the same voice you want the clone to use. If your source audio is you presenting to a room of 50 people, the clone will sound like you presenting to a room of 50 people. If you want a conversational tone, record conversationally. I had to re-do my source audio because my original recordings were too "stage voice."
Variety matters more than volume. 60 minutes of varied speech (questions, statements, emphasis, pauses) beats 3 hours of monotone lecturing. Give the model a full range of how you actually talk.
Don't try to sound "professional." The stiff, careful voice you use when you know you're being recorded? That's what you'll get back. Talk like you're explaining something to a colleague. Use contractions. Let your natural rhythm show.
Test pronunciation of domain terms early. Catch these before you generate 39 segments. You can add pronunciation guidance in the text itself — spelling out tricky words phonetically — but it's better to know upfront where the model struggles.
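Once you know which terms trip the model, swapping in respellings by hand for every script gets old. A minimal sketch of doing it automatically: the respellings shown are hypothetical examples to tune by ear, and `apply_respellings` is an illustrative helper, not an ElevenLabs feature.

```python
import re

# Hypothetical respellings for illustration; adjust until your clone says them right.
RESPELLINGS = {
    "SCORM": "skorm",
    "xAPI": "ex A P I",
}


def apply_respellings(script: str, respellings: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive occurrences of each term with its respelling."""
    if not respellings:
        return script
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: respellings[m.group(1)], script)


print(apply_respellings("Export the module to SCORM or xAPI.", RESPELLINGS))
```

Keep the original script untouched and apply the respellings only to the copy you send for generation, so captions and transcripts keep the correct spelling.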
What Surprised Me
The Professional clone handles emotion better than I expected. It picks up subtle shifts — slight emphasis when making a key point, a change in pace before a conclusion. The Instant clone doesn't do this. It's the main reason the Professional tier is worth the money.
The 5,000 character limit on Eleven v3 is annoying. The latest model (v3) supports 74 languages and audio tags like [excited] and [whispers], but caps each generation at 5,000 characters. Older models allow up to 40,000. A 5-minute segment script (~650 words) runs to roughly 4,000 characters, so it sits close to the cap, and anything longer has to be split into chunks. Not a dealbreaker, but a workflow consideration.
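Splitting by hand invites cuts in the middle of a sentence, which the voice will audibly stumble over. Here is a minimal sketch of a splitter that breaks only at sentence ends where it can. The 5,000 default comes from the limit above; `split_script` is a hypothetical helper, and the sentence detection is a simple punctuation rule that will misjudge abbreviations like "e.g."

```python
import re


def split_script(text: str, limit: int = 5000) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Last resort: a single sentence longer than the limit is cut mid-sentence.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Generate each chunk separately and join the audio in your editor. Listen to the seams, since pacing can shift slightly between generations.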
PVC isn't fully optimized for v3 yet. My Professional clone sounds best on the older Eleven v2 model. Instant clones work well on v3. ElevenLabs is actively improving this, but as of March 2026, I generate my production audio on v2 for consistency.
The "only your own voice" restriction on Professional cloning matters for client work. I'm producing avatar courseware for a client from their existing lecture recordings. For the client's voice, I had to use Instant cloning since the Professional tier only allows cloning your own voice. The quality gap is noticeable. Plan around this if you're doing work for others.
Budget for iteration. My first clone attempt was good. My second — after cleaning the source audio more aggressively — was significantly better. The credits you use on test generations are worth it.
The Bottom Line
Voice cloning with ElevenLabs works. It's not magic and it's not effortless, but the output is good enough for professional course production, client delivery, and public-facing content.
The minimum viable setup:
- ElevenLabs Creator plan: $22/month
- 60-90 minutes of clean source audio
- 2-3 hours of audio cleanup
- 2-6 hours of processing time (or longer during peak demand)
- A few rounds of test generation before you commit to production
Total out-of-pocket cost to get a production-ready voice clone: under $50, including your first month's subscription. The rest of the investment is the setup time listed above.
If you're building courses, creating content at scale, or converting existing material into avatar-delivered formats, this is the infrastructure layer that makes the whole thing sound human.
[Get started with ElevenLabs here.][AFFILIATE_LINK]
Want to see what a full AI avatar course looks like? I'm teaching everything I've learned about AI leadership — including how to build these production systems — in a free course at academy.kaiak.io. No fluff, no "what is a chatbot" filler. Just the practical stuff that works.

