Forced alignment

Precise, deterministic word-level timestamps that match your transcript exactly.

What it does

Forced alignment takes an audio file and the transcript of what’s said in it, and produces the exact timing of every word and segment. Because you supply the transcript, the words in the output are guaranteed to match the words you put in; there’s no guessing.

Alignment is not transcription

Alignment needs a transcript as input. If you don’t have one, run transcription first (or use the Subtitling workflow, which chains the two).

Why deterministic alignment?

Unlike speech-to-text, which infers the words, forced alignment locks your known transcript to the acoustics. That makes it the most accurate way to get timings for content where you already have the script — your transcript in, the same transcript out, timestamped.
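The "same transcript out" guarantee can be checked mechanically. The sketch below is only an illustration: it assumes the JSON output exposes a list of word entries with `word`, `start`, and `end` fields. Those field names, and the helper `words_match`, are hypothetical; consult the API reference for the actual schema.

```python
def words_match(transcript: str, aligned_words: list[dict]) -> bool:
    """Return True if the aligned words, in order, equal the transcript's words."""
    expected = transcript.split()
    # "word" is an assumed field name, not a confirmed part of the output schema.
    actual = [entry["word"] for entry in aligned_words]
    return expected == actual


# Hypothetical input transcript and alignment output (times in seconds).
transcript = "Hello world again"
aligned = [
    {"word": "Hello", "start": 0.00, "end": 0.42},
    {"word": "world", "start": 0.48, "end": 0.91},
    {"word": "again", "start": 1.02, "end": 1.40},
]

matches = words_match(transcript, aligned)
```

With a speech-to-text result, a check like this can fail whenever the recognizer mishears a word; with forced alignment the words come from your transcript, so only the timings are new.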

Languages & file size

Alignment supports 15 languages.
Large files are supported: long-form audio of 100 MB and beyond is handled comfortably.

Output formats

Every alignment produces a complete bundle:

JSON — word- and segment-level timestamps for programmatic use.
SRT — ready-to-use subtitles.
WebVTT — captions for the web.
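To show how the JSON timings relate to the subtitle formats, here is a minimal sketch that renders segments as SRT cues. It assumes each segment carries `start` and `end` times in seconds plus a `text` field; those field names and both function names are hypothetical and not taken from the actual schema. The bundle already includes a finished SRT file, so this is only useful if you want to build custom cues from the JSON.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        # "start", "end", and "text" are assumed field names.
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text']}\n")
    return "\n".join(cues)


# Hypothetical segment-level output.
segments = [
    {"start": 0.0, "end": 1.25, "text": "Hello world"},
    {"start": 1.5, "end": 3.0, "text": "This is a test"},
]
srt_text = segments_to_srt(segments)
```

WebVTT cues are very similar: the file begins with a `WEBVTT` header line, and timestamps use a period instead of a comma before the milliseconds.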

Via the API

Authenticate with an API key to submit alignment jobs, and use presigned uploads for large audio files. See the API reference for endpoints and schemas.