Native Audio in Five Languages: When to Enable Voice Control

Kling 3.0 generates synced audio in CN, EN, JP, KR, and ES. Turning audio on raises the per-second price, and voice control adds another $0.028/s on top. Here is when each is worth it.

Technique · 4 min read

One of the biggest quiet wins in Kling 3.0 is that you no longer need a post-production audio pass for roughly 80 percent of what you ship. Native audio is baked into the render, synced to the video tokens, and available in Chinese, English, Japanese, Korean, and Spanish with regional accents. You turn it on with one flag. The question is which flag, and when the premium for voice control is worth paying.

Audio language spectrum

Three audio tiers, three prices

The v3 Standard I2V pricing sheet makes the tiers clear. Same structure on Pro, just scaled up.

  • Audio off: $0.084/s on Standard, $0.112/s on Pro. This is the baseline. No audio generated.
  • Audio on (audio_enabled: true): $0.126/s on Standard, $0.168/s on Pro. Going by these listed prices, the synced track costs $0.042/s over the baseline on Standard and $0.056/s on Pro. The model picks dialogue tone, foley, and light score on its own.
  • Audio on plus voice control (voice_control: true): $0.154/s on Standard, $0.196/s on Pro. Adds another $0.028/s on top. You get to specify speaker tone, pitch range, and timing anchors.

A five-second Pro clip with audio on costs 5 x $0.168 = $0.84. The same clip with voice control costs 5 x $0.196 = $0.98. The delta is $0.14 per five seconds. That is not much on one render, but at 1000 renders a month it is $140 spent on tighter voice control.
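To make the arithmetic above repeatable, here is a minimal sketch of a cost calculator. The per-second prices are copied from the tier list in this article. The names `PRICE_PER_SECOND`, `clipCost`, and `monthlyVoiceControlPremium` are hypothetical helpers for illustration and are not part of any SDK.

```typescript
// Per-second prices (USD) as quoted in this article's v3 I2V tier list.
type Tier = "standard" | "pro";
type AudioMode = "off" | "audio" | "audio+voice";

const PRICE_PER_SECOND: Record<Tier, Record<AudioMode, number>> = {
  standard: { off: 0.084, audio: 0.126, "audio+voice": 0.154 },
  pro: { off: 0.112, audio: 0.168, "audio+voice": 0.196 },
};

// Cost of one clip, rounded to a tenth of a cent to hide floating-point noise.
function clipCost(tier: Tier, mode: AudioMode, seconds: number): number {
  return Math.round(PRICE_PER_SECOND[tier][mode] * seconds * 1000) / 1000;
}

// Extra monthly spend if every render uses voice control instead of plain audio.
function monthlyVoiceControlPremium(tier: Tier, seconds: number, rendersPerMonth: number): number {
  const delta = clipCost(tier, "audio+voice", seconds) - clipCost(tier, "audio", seconds);
  return Math.round(delta * rendersPerMonth * 100) / 100;
}

console.log(clipCost("pro", "audio", 5)); // 0.84
console.log(clipCost("pro", "audio+voice", 5)); // 0.98
console.log(monthlyVoiceControlPremium("pro", 5, 1000)); // 140
```

Swapping in your own clip length and monthly volume gives a quick read on whether the voice control premium matters for a given project.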

When audio_enabled: true alone is enough

Most ad creative, product b-roll, and scenic content does not need voice control. The model picks a reasonable speaker and reasonable foley. If the clip has no spoken dialogue at all, voice control is wasted spend. Just pass audio_enabled: true and an audio_language and let the model render a soundscape.

Things where plain audio works:

  • Product shots with ambient music and light foley
  • Scenic nature clips with wind, water, foliage sounds
  • Action clips where the subject does not speak to camera
  • Clips with a single line of dialogue where any speaker the model picks is fine
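For these cases the request only needs the audio flag and a language. The sketch below builds such an input object without calling the API. The field names follow the request example in this article, while `buildPlainAudioInput`, the sample prompt, and the example.com image URL are placeholders assumed for illustration.

```typescript
// Input for a clip with model-chosen audio and no voice control.
// Field names mirror this article's request example; nothing here calls the API.
interface PlainAudioInput {
  prompt: string;
  image_url: string;
  duration: number;
  audio_enabled: true;
  audio_language: string;
}

function buildPlainAudioInput(
  prompt: string,
  imageUrl: string,
  audioLanguage: string,
  duration: number = 5
): PlainAudioInput {
  // voice_control and voice_tone are deliberately left out.
  return {
    prompt,
    image_url: imageUrl,
    duration,
    audio_enabled: true,
    audio_language: audioLanguage,
  };
}

const sceneInput = buildPlainAudioInput(
  "slow pan across a ceramic mug on a wooden table, soft morning light",
  "https://example.com/product.jpg", // placeholder URL
  "en"
);
console.log(sceneInput);
```

The resulting object would be passed as the `input` of the render call. Leaving the voice fields out keeps the clip on the cheaper audio-only tier.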

When you actually need voice_control: true

You enable voice control when the speaker identity matters. If the character needs to sound like your brand spokesperson, or the accent has to be Mexican Spanish rather than Castilian, or the speaker is a stern older man rather than a young woman, voice control is the flag to set.

It also matters when you are chaining shots and need the same voice across cuts. In a multi-shot storyboard with dialogue on shots 1, 3, and 5, voice control holds the voice identity across the cuts. Without it, the model may pick a different speaker per shot.

```typescript
import { fal } from "@fal-ai/client";

// Read the API key from the environment rather than hard-coding it.
fal.config({ credentials: process.env.FAL_KEY });

const result = await fal.subscribe("fal-ai/kling-video/v3/pro/image-to-video", {
  input: {
    prompt: "the woman looks at the camera and says, the lab opens tomorrow",
    image_url: "https://storage.googleapis.com/falserverless/example_inputs/woman_portrait.jpg",
    duration: 5,
    aspect_ratio: "16:9",
    cfg_scale: 0.5,
    audio_enabled: true, // synced audio track
    audio_language: "en",
    voice_control: true, // specify the speaker instead of letting the model pick
    voice_tone: "warm, mid-range female, confident",
    negative_prompt: "blur, distort, and low quality"
  },
  logs: true
});

console.log(result.data.video.url);
```

Lip sync quality

This is the part that actually justifies the whole audio feature. Lip sync on 3.0 holds through second 10 with audio on, and through second 12 with voice control. On 2.6 Pro plus MMAudio post, sync drifted past second 6 on almost every clip. If your shot has a speaker on camera, 3.0 with voice control is a step change, not an incremental one.

Regional accents at launch: CN supports Mandarin and Cantonese, EN supports US and UK, JP is Tokyo standard, KR is Seoul standard, ES supports Mexican and Iberian. If you need Brazilian Portuguese, French, or Arabic, 3.0 does not ship those yet and the model will fall back to English-accented delivery.
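Because an unsupported language silently falls back rather than failing, it can help to check the target before paying for a render. The following is a minimal sketch of such a preflight check. The table restates the launch list above using the article's own labels, and `LAUNCH_ACCENTS`, `isLaunchLanguage`, and `checkAudioTarget` are hypothetical names, not API values.

```typescript
// Launch languages and accents as listed in this article.
type LanguageLabel = "CN" | "EN" | "JP" | "KR" | "ES";

const LAUNCH_ACCENTS: Record<LanguageLabel, readonly string[]> = {
  CN: ["Mandarin", "Cantonese"],
  EN: ["US", "UK"],
  JP: ["Tokyo standard"],
  KR: ["Seoul standard"],
  ES: ["Mexican", "Iberian"],
};

function isLaunchLanguage(label: string): label is LanguageLabel {
  return Object.prototype.hasOwnProperty.call(LAUNCH_ACCENTS, label);
}

// Returns whether the requested language (and optional accent) is on the launch list.
function checkAudioTarget(label: string, accent?: string): { ok: boolean; reason: string } {
  if (!isLaunchLanguage(label)) {
    return { ok: false, reason: label + " is not on the launch list; expect English-accented fallback" };
  }
  if (accent !== undefined && LAUNCH_ACCENTS[label].indexOf(accent) === -1) {
    return { ok: false, reason: accent + " is not a listed accent for " + label };
  }
  return { ok: true, reason: "listed at launch" };
}

console.log(checkAudioTarget("ES", "Mexican"));
console.log(checkAudioTarget("PT"));
```

Running this in a batch script before submitting jobs flags clips that would otherwise come back with the fallback delivery.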

Lip sync quality at 10 seconds

Practical defaults

Turn audio on by default for anything that will be watched with sound. Leave voice control off unless you have a speaker on camera. Batch test five clips with voice control on and five with it off before committing to one pattern for a client project. The $0.028/s voice control premium is real money at scale, and for scenic content the model's default speaker choice is usually fine.
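These defaults can be written down as a small helper, sketched here under the assumption that two yes/no questions about the clip are enough. `ClipPlan` and `defaultAudioFlags` are hypothetical names. The two output flags match the ones used in this article's request example.

```typescript
// What we know about a clip before rendering it.
interface ClipPlan {
  watchedWithSound: boolean;
  speakerOnCamera: boolean;
}

// The two flags discussed in this article.
interface AudioFlags {
  audio_enabled: boolean;
  voice_control: boolean;
}

// Audio on whenever the clip will be heard; voice control only for on-camera speakers.
function defaultAudioFlags(plan: ClipPlan): AudioFlags {
  const audio_enabled = plan.watchedWithSound;
  const voice_control = audio_enabled && plan.speakerOnCamera;
  return { audio_enabled, voice_control };
}

console.log(defaultAudioFlags({ watchedWithSound: true, speakerOnCamera: false }));
console.log(defaultAudioFlags({ watchedWithSound: true, speakerOnCamera: true }));
```

Treat the result as a starting point and override it per project after the batch test described above.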
