Under The Hood

Your Script In = Your Script Out

Word-for-word precision through a multi-step alignment pipeline. Every contraction, every proper noun, every word—perfectly timed and exactly as you wrote it.

True Forced Alignment

Not ASR. Not Guessing.

Automatic Speech Recognition (ASR) transcribes what it thinks it heard. Forced Alignment times the words you provide. The difference is accuracy.

ASR (Speech Recognition)

System listens to audio and guesses what was said

Can hallucinate words that weren't spoken
May miss or swap words entirely
Struggles with names, brands, and technical terms
Variable results across different runs
Requires manual review and correction

Forced Alignment

System aligns your transcript to audio acoustically

You provide the transcript—no guessing involved
100% of your words appear in the output
Names, brands, and out-of-vocabulary (OOV) words handled via grapheme-to-phoneme (G2P) conversion
Deterministic: same input = same output
Ready for production—no manual fixes needed

VocaSync uses true forced alignment powered by acoustic models from the Montreal Forced Aligner. Your transcript in = your transcript out.

Your Input

"Don't forget the cat's food, François!"

+ audio file

Your Output

"Don't forget the cat's food, François!"

↳ with millisecond-precision timing for every word

Here's how we guarantee it

The Alignment Pipeline

Five distinct stages carry every word of your transcript from input to output

Normalize

Prepare your transcript for phonetic analysis while preserving everything needed for recovery.

Smart punctuation removal with recovery mapping
Contraction preservation (don't, it's, we've)
Unicode normalization (NFKC)
Whitespace and special character handling
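
As a rough illustration of the normalization step, here is a minimal Python sketch. The function name `normalize`, the two regular expressions, and the shape of the recovery list are assumptions made for this example, not VocaSync's actual code. It strips edge punctuation, keeps contractions intact, applies NFKC, and remembers each word's original surface form for later recovery.

```python
import re
import unicodedata

# Hypothetical helpers: names and exact rules are assumptions for illustration.
TOKEN_RE = re.compile(r"\S+")
# Strip anything at the edges that is not a word character or an apostrophe.
EDGE_PUNCT_RE = re.compile(r"^[^\w']+|[^\w']+$")

def normalize(transcript: str):
    """Return (tokens, recovery); recovery[i] is the original text of tokens[i]."""
    tokens, recovery = [], []
    for match in TOKEN_RE.finditer(transcript):
        surface = match.group()  # original text, kept untouched
        bare = unicodedata.normalize("NFKC", surface).replace("\u2019", "'")
        bare = EDGE_PUNCT_RE.sub("", bare).strip("'").lower()
        if not bare:
            # Pure punctuation: attach it to the previous word so it can be restored.
            if recovery:
                recovery[-1] += " " + surface
            continue
        tokens.append(bare)
        recovery.append(surface)
    return tokens, recovery

# Curly apostrophes are used here to show that they map to plain ones.
tokens, recovery = normalize("Don\u2019t forget the cat\u2019s food, François!")
print(tokens)
print(recovery)
```

The aligner only ever sees `tokens`; `recovery` is what lets a later stage put the comma, the exclamation mark, and the original casing back.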

G2P & OOV Handling

Generate pronunciations for every word, including names, brands, and technical terms not in standard dictionaries.

Grapheme-to-phoneme conversion for unknown words
Out-of-vocabulary (OOV) detection and resolution
Per-language dictionary with learned pronunciations
Pronunciation cache persists across your projects
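
The lookup order described above (dictionary first, then cached pronunciations, then G2P for anything still unknown) might be sketched as follows. Everything here is hypothetical: `PronunciationStore`, the toy `LEXICON`, and especially `placeholder_g2p`, which simply spells the word out and stands in for a trained G2P model.

```python
import json
import tempfile
from pathlib import Path

# Toy dictionary; a real per-language lexicon is far larger.
LEXICON = {"forget": ["F", "ER", "G", "EH", "T"], "the": ["DH", "AH"]}

def placeholder_g2p(word):
    # Stand-in for a real G2P model: one symbol per letter.
    return [ch.upper() for ch in word if ch.isalpha()]

class PronunciationStore:
    """Hypothetical store: lexicon, then persistent cache, then G2P."""

    def __init__(self, lexicon, cache_path):
        self.lexicon = dict(lexicon)
        self.cache_path = Path(cache_path)
        if self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text(encoding="utf-8"))
        else:
            self.cache = {}

    def lookup(self, word):
        if word in self.lexicon:
            return self.lexicon[word], "lexicon"
        if word in self.cache:
            return self.cache[word], "cache"
        phones = placeholder_g2p(word)  # word is out of vocabulary
        self.cache[word] = phones
        self.cache_path.write_text(
            json.dumps(self.cache, ensure_ascii=False), encoding="utf-8"
        )
        return phones, "g2p"

with tempfile.TemporaryDirectory() as tmp:
    store = PronunciationStore(LEXICON, Path(tmp) / "cache.json")
    first = store.lookup("vocasync")   # unknown word, goes through G2P
    second = store.lookup("vocasync")  # same word, now served from the cache
    known = store.lookup("the")        # ordinary dictionary hit
```

Writing the cache to disk is what would let a pronunciation learned in one project be reused in the next.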

Forced Alignment

Match your audio to your transcript at the phoneme level, achieving millisecond precision.

Acoustic model analysis of audio waveform
Phoneme-level matching against transcript
16 kHz mono audio processing for optimal accuracy
Language-specific acoustic models (15 languages)
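
Phoneme-level matching of this kind is classically solved with dynamic programming (a Viterbi search over a fixed phoneme sequence). The toy sketch below shows the idea on invented scores. It is not the Montreal Forced Aligner's code; `force_align`, the score matrix, and the 10 ms frame step are assumptions for illustration.

```python
def force_align(scores, frame_ms=10):
    """Toy forced alignment.

    scores[t][k] is a log-score for audio frame t matching the k-th phoneme of
    the expected sequence. Each frame either stays on the current phoneme or
    advances to the next one. Returns (phoneme_index, start_ms, end_ms) tuples.
    """
    T, K = len(scores), len(scores[0])
    if T < K:
        raise ValueError("fewer frames than phonemes")
    NEG = float("-inf")
    best = [[NEG] * K for _ in range(T)]
    advanced = [[False] * K for _ in range(T)]
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else NEG
            if advance > stay:
                best[t][k] = advance + scores[t][k]
                advanced[t][k] = True
            else:
                best[t][k] = stay + scores[t][k]
    # Trace back from the last frame, which must sit on the last phoneme.
    path, k = [0] * T, K - 1
    for t in range(T - 1, -1, -1):
        path[t] = k
        if advanced[t][k]:
            k -= 1
    # Collapse runs of identical phoneme indices into timed segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            segments.append((path[start], start * frame_ms, t * frame_ms))
            start = t
    return segments

# Invented scores: 6 frames, 3 expected phonemes.
scores = [
    [-0.1, -3.0, -3.0],
    [-0.2, -2.5, -3.0],
    [-2.0, -0.1, -3.0],
    [-3.0, -0.2, -2.0],
    [-3.0, -0.3, -1.5],
    [-3.0, -2.0, -0.1],
]
print(force_align(scores))
```

Because the phoneme sequence is fixed by the transcript, the search can only decide where each phoneme starts and ends. It cannot insert, drop, or swap words, which is the practical difference from ASR.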

Artifact Recovery

Clean up alignment artifacts and reconstruct words that were split during processing.

Filter processing artifacts from output
Merge split contractions: "don" + "'t" ➔ "don't"
Recover unrecognized tokens using transcript context
Drop unrecoverable artifacts (clean output only)
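
A minimal sketch of these four steps, under stated assumptions: the placeholder labels in `ARTIFACTS` and `UNKNOWN`, the token dictionaries, and both function names are hypothetical, and the timings are invented.

```python
ARTIFACTS = {"sil", "sp", "spn"}  # assumed silence / noise labels
UNKNOWN = "<unk>"                 # assumed label for an unrecognized token

def merge_contractions(tokens):
    """Join a token that starts with an apostrophe onto the previous token."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("'") and merged:
            prev = merged[-1]
            merged[-1] = {
                "word": prev["word"] + tok["word"],
                "start_ms": prev["start_ms"],
                "end_ms": tok["end_ms"],
            }
        else:
            merged.append(dict(tok))
    return merged

def recover(tokens, expected_words):
    """Filter artifacts and restore unknown tokens from the transcript order."""
    out, i = [], 0
    for tok in merge_contractions(tokens):
        if tok["word"] in ARTIFACTS:
            continue  # processing artifact, never shown to the user
        if i < len(expected_words) and tok["word"] in (expected_words[i], UNKNOWN):
            out.append({**tok, "word": expected_words[i]})
            i += 1
        # Anything else cannot be matched to the transcript and is dropped.
    return out

raw = [
    {"word": "don", "start_ms": 0, "end_ms": 210},
    {"word": "'t", "start_ms": 210, "end_ms": 300},
    {"word": "sil", "start_ms": 300, "end_ms": 340},
    {"word": "forget", "start_ms": 340, "end_ms": 700},
    {"word": UNKNOWN, "start_ms": 700, "end_ms": 1250},
]
clean = recover(raw, ["don't", "forget", "françois"])
```

Note that the merged contraction spans from the start of "don" to the end of "'t", so no audio time is lost in the merge.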

Reconcile

Final verification pass to ensure output matches your original transcript exactly.

Word-for-word verification against original
Apostrophe and punctuation reinsertion
Original formatting restoration
Timing metadata attachment to each word
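
The final check can be pictured as a strict comparison followed by a swap back to the original surface forms. In this sketch, `reconcile` and the field names are hypothetical, and the timings are invented for the example.

```python
def reconcile(aligned, tokens, surfaces):
    """Verify aligned words against the transcript, then restore original text.

    tokens are the normalized transcript words; surfaces are the same words
    exactly as the user wrote them (punctuation and casing included).
    """
    aligned_words = [a["word"] for a in aligned]
    if aligned_words != tokens:
        raise ValueError(f"alignment drifted: {aligned_words} != {tokens}")
    return [
        {"word": surface, "start_ms": a["start_ms"], "end_ms": a["end_ms"]}
        for a, surface in zip(aligned, surfaces)
    ]

tokens = ["don't", "forget", "the", "cat's", "food", "françois"]
surfaces = ["Don't", "forget", "the", "cat's", "food,", "François!"]
# Invented timings: each word gets a 280 ms slot every 300 ms.
aligned = [
    {"word": w, "start_ms": i * 300, "end_ms": i * 300 + 280}
    for i, w in enumerate(tokens)
]
final = reconcile(aligned, tokens, surfaces)
restored = " ".join(w["word"] for w in final)
print(restored)
```

Failing loudly on any mismatch, rather than returning a best effort, is what makes a word-for-word guarantee checkable.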

The VocaSync Guarantee

No dropped words. No junk tokens. No surprises.

100% Coverage

Every word in your transcript appears in the output with timing data.

Clean Output

No processing artifacts—only real words from your script.

Exact Match

Apostrophes, contractions, and formatting preserved exactly.

Learning Dictionary

Unknown words are learned and cached for faster future processing.

Export in Any Format

Get your alignment data in the format you need

.srt

SubRip

Universal subtitle format

• YouTube, Vimeo, social media
• Premiere Pro, Final Cut, DaVinci
• VLC, MPV, media players
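
For reference, word timings map onto SubRip cues roughly as shown below. The input field names (`word`, `start_ms`, `end_ms`) are assumptions for this sketch; the timestamp layout (comma before the milliseconds, a numbered cue, a blank line between cues) follows the SRT convention.

```python
def srt_time(ms: int) -> str:
    """Format milliseconds as HH:MM:SS,mmm (SRT uses a comma)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, frac = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{frac:03d}"

def to_srt(words) -> str:
    """One cue per word; real subtitles would usually group words into lines."""
    blocks = []
    for i, w in enumerate(words, start=1):
        blocks.append(
            f"{i}\n{srt_time(w['start_ms'])} --> {srt_time(w['end_ms'])}\n{w['word']}\n"
        )
    return "\n".join(blocks)

words = [
    {"word": "Don't", "start_ms": 0, "end_ms": 320},
    {"word": "forget", "start_ms": 320, "end_ms": 700},
]
print(to_srt(words))
```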

.vtt

WebVTT

Web-native captions

• HTML5 video players
• Streaming platforms
• Web accessibility
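
WebVTT can carry word timing inside a single cue through inline cue timestamps, which is one way to drive word-by-word highlighting in a browser. The sketch below assumes the same hypothetical word fields (`word`, `start_ms`, `end_ms`) and does not escape special characters such as `&` or `<`, which a production exporter would need to do.

```python
def vtt_time(ms: int) -> str:
    """Format milliseconds as HH:MM:SS.mmm (WebVTT uses a dot)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, frac = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{frac:03d}"

def to_vtt(words) -> str:
    """Build one cue for the phrase, with an inline timestamp before each later word."""
    first, last = words[0], words[-1]
    parts = [first["word"]]
    for w in words[1:]:
        parts.append(f"<{vtt_time(w['start_ms'])}>{w['word']}")
    cue = (
        f"{vtt_time(first['start_ms'])} --> {vtt_time(last['end_ms'])}\n"
        + " ".join(parts)
    )
    return "WEBVTT\n\n" + cue + "\n"

words = [
    {"word": "Don't", "start_ms": 0, "end_ms": 320},
    {"word": "forget", "start_ms": 320, "end_ms": 700},
    {"word": "the", "start_ms": 700, "end_ms": 820},
]
print(to_vtt(words))
```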

.json

JSON

Developer-friendly data

• Custom applications
• Karaoke systems
• Word-level highlighting
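
To show how word-level JSON could drive highlighting or karaoke, here is a small sketch. The schema (a `words` list with `word`, `start_ms`, `end_ms`) is an assumption for illustration, not the documented export format.

```python
import bisect
import json

words = [
    {"word": "Don't", "start_ms": 0, "end_ms": 320},
    {"word": "forget", "start_ms": 320, "end_ms": 700},
    {"word": "François!", "start_ms": 900, "end_ms": 1500},
]

# ensure_ascii=False keeps characters such as "ç" readable in the file.
payload = json.dumps({"words": words}, ensure_ascii=False, indent=2)

def word_at(words, t_ms):
    """Return the word being spoken at playback time t_ms, or None in a gap."""
    starts = [w["start_ms"] for w in words]
    i = bisect.bisect_right(starts, t_ms) - 1
    if i >= 0 and t_ms < words[i]["end_ms"]:
        return words[i]["word"]
    return None

print(word_at(json.loads(payload)["words"], 400))
```

A player would call something like `word_at` on every time update and highlight whichever word comes back.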

Workflow Paths

How teams feed the alignment pipeline

Alignment is still the core engine. These production workflows route synthesis or transcription projects into that same alignment pipeline.

Synthesis ➔ Alignment

Generate speech, then align it for timing outputs.

Run synthesis with auto-alignment enabled for supported languages, or create alignment manually later.
A linked alignment project is created with copied audio and transcript inputs.
Source and linked projects are independent so each can be managed separately.

Transcription ➔ Review ➔ Alignment

Edit the transcript before creating alignment.

Complete transcription, then open the transcript review editor.
Optionally edit text in any supported alignment language before creating the linked alignment project.
Original transcription TXT/JSON artifacts are never modified; alignment always runs on copied inputs.
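
The copy-on-link behavior can be pictured with a small sketch. The `Project` class, the file names, and `create_linked_alignment` are all hypothetical; the point is only that the linked project works on its own copies, so editing them leaves the source untouched.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

INPUT_FILES = ("audio.wav", "transcript.txt")  # assumed file names

@dataclass
class Project:
    kind: str                        # "synthesis", "transcription", or "alignment"
    directory: Path
    linked_from: Optional[str] = None

def create_linked_alignment(source: Project, root: Path) -> Project:
    """Copy the inputs into a new alignment project that records its origin."""
    target = root / (source.directory.name + "-alignment")
    target.mkdir(parents=True)
    for name in INPUT_FILES:
        shutil.copy2(source.directory / name, target / name)
    return Project("alignment", target, linked_from=source.directory.name)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src_dir = root / "interview"
    src_dir.mkdir()
    (src_dir / "audio.wav").write_bytes(b"RIFF")  # placeholder bytes, not real audio
    (src_dir / "transcript.txt").write_text("hello wrld", encoding="utf-8")

    source = Project("transcription", src_dir)
    linked = create_linked_alignment(source, root)

    # Fix the typo in the linked copy only.
    (linked.directory / "transcript.txt").write_text("hello world", encoding="utf-8")
    original_text = (src_dir / "transcript.txt").read_text(encoding="utf-8")
    edited_text = (linked.directory / "transcript.txt").read_text(encoding="utf-8")
```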

Alignment runs only on a supported subset of languages. Workflow actions are automatically disabled for languages outside that subset.

Ready for Perfect Alignment?

Upload your audio and transcript. Get back millisecond-precision timing for every word—guaranteed to match your original text.
