Speech synthesis

Convert text into natural, studio-quality speech in over 50 languages.

What it does

Speech synthesis (text-to-speech) takes the text you provide and renders it as spoken audio in the voice you choose. It supports over 50 languages and a library of distinct voices, with low-latency generation suitable for both batch content and interactive use.

Voices & quality tiers

Pick a voice from the library, then choose a quality tier:

Standard — fast, cost-effective, and part of the monthly free-tier allowance.
High Fidelity — richer, more expressive output for premium content. Not included in the free tier.

Voice and tier are set per project. See pricing for current per-character rates.

Output formats

Synthesis can produce audio in any of five formats, so you can match your delivery target without re-encoding:

MP3 — universal, small files.
AAC — efficient, broadly supported.
OPUS — excellent quality at low bitrates.
FLAC — lossless.
WAV — uncompressed.

Pronunciation maps

Proper nouns, brand names, and technical terms don’t always come out the way you want. A pronunciation map lets you override how specific words or phrases are spoken, so names and jargon are consistent across every project.

Set it once, reuse everywhere

Build your map up over time. Entries apply to future synthesis runs, so recurring terms stay correct without re-editing your text.
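
To get a feel for what a pronunciation map does, here is a minimal Python sketch that treats the map as a dictionary of term-to-respelling overrides and applies it to text before synthesis. This is an illustration only: the function name, the respelling style, and the whole-word, case-insensitive matching are assumptions, not a description of how the product stores or applies its maps.

```python
import re


def apply_pronunciation_map(text: str, pronunciation_map: dict[str, str]) -> str:
    """Replace each mapped term with its spoken respelling.

    Hypothetical helper: matching is whole-word and case-insensitive,
    which is an assumption made for this sketch.
    """
    if not pronunciation_map:
        return text

    # Try longer terms first so a phrase wins over a shorter term it contains.
    terms = sorted(pronunciation_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(term) for term in terms) + r")(?!\w)",
        flags=re.IGNORECASE,
    )

    # Lower-cased lookup so "Nginx" and "nginx" resolve to the same entry.
    lookup = {term.lower(): spoken for term, spoken in pronunciation_map.items()}
    return pattern.sub(lambda match: lookup[match.group(1).lower()], text)


# Example entries (made up for illustration).
example_map = {"nginx": "engine x", "SQL": "sequel"}
example_output = apply_pronunciation_map("Deploy Nginx and MySQL with SQL.", example_map)
```

In this sketch "MySQL" is left alone because the match must start at a word boundary, while the standalone "SQL" is respelled. Keeping the overrides in a separate map, rather than editing the source text, is what lets the same entries carry over to later runs.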

Chunking rules

Long text is split into chunks before synthesis. You can tune the chunking behaviour — target chunk size, hard length limits, and how headings are handled — to control pacing and keep natural pauses where you want them.
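
The three settings above can be pictured with a small Python sketch. It packs paragraphs up to a target size, never lets a chunk exceed a hard limit, and optionally starts a new chunk at each heading. Everything here is an assumption made for illustration: the parameter names, the character-based sizes, and the rule for spotting a heading are hypothetical, and the product's own chunker may work differently.

```python
import re


def is_heading(paragraph: str) -> bool:
    # Assumption: a heading is a short single line with no closing punctuation.
    return (
        "\n" not in paragraph
        and len(paragraph) <= 60
        and paragraph[-1] not in ".!?:;,"
    )


def split_long(paragraph: str, hard_limit: int) -> list[str]:
    """Cut a paragraph into pieces no longer than hard_limit, preferring spaces."""
    pieces = []
    rest = paragraph
    while len(rest) > hard_limit:
        cut = rest.rfind(" ", 0, hard_limit + 1)
        if cut <= 0:
            cut = hard_limit  # no space to break on, so cut mid-word
        pieces.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        pieces.append(rest)
    return pieces


def chunk_text(text: str, target_size: int = 400, hard_limit: int = 600,
               heading_starts_chunk: bool = True) -> list[str]:
    """Group blank-line-separated paragraphs into chunks (sizes in characters)."""
    if not 0 < target_size <= hard_limit:
        raise ValueError("expected 0 < target_size <= hard_limit")

    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        # Optionally close the running chunk so a heading opens a new one.
        if heading_starts_chunk and is_heading(paragraph) and current:
            chunks.append(current)
            current = ""
        for piece in split_long(paragraph, hard_limit):
            candidate = f"{current}\n\n{piece}" if current else piece
            if current and len(candidate) > target_size:
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Breaking on paragraph boundaries first, and only cutting inside a paragraph when it exceeds the hard limit, keeps pauses at natural points. One limitation of this sketch: a heading can end up in a chunk by itself when the paragraph that follows it would push the chunk past the target size.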

Auto-alignment

Need word-level timings for the audio you generate? Synthesis can flow straight into forced alignment. The Dub-Prep workflow bundles both stages and charges a single upfront price.

Via the API

To synthesize from your own code, create an API key and call the synthesis endpoint. Full request and response schemas live in the API reference.
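
As a rough sketch of what such a call could look like, the Python below builds a JSON POST request with the standard library. The endpoint URL, the field names, the tier identifiers, and the bearer-token header are all placeholders invented for this example; take the real values from the API reference.

```python
import json
import urllib.request

# Placeholder endpoint: not the real URL. Check the API reference.
SYNTHESIS_URL = "https://api.example.com/v1/synthesize"

# The five output formats listed on this page; the lowercase identifiers are assumed.
AUDIO_FORMATS = {"mp3", "aac", "opus", "flac", "wav"}
# Hypothetical identifiers for the Standard and High Fidelity tiers.
QUALITY_TIERS = {"standard", "high_fidelity"}


def build_synthesis_request(api_key: str, text: str, voice: str,
                            tier: str = "standard",
                            audio_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) a synthesis request with assumed field names."""
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    if tier not in QUALITY_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")

    payload = {"text": text, "voice": voice, "tier": tier, "format": audio_format}
    return urllib.request.Request(
        SYNTHESIS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read the response body. Validating the format and tier before sending is a design choice in this sketch, so that a typo fails locally instead of costing a round trip.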