Under The Hood

Your Script In = Your Script Out

Word-for-word precision through a multi-step alignment pipeline. Every contraction, every proper noun, every word—perfectly timed and exactly as you wrote it.

True Forced Alignment

Not ASR. Not Guessing.

Automatic Speech Recognition (ASR) transcribes what it thinks it heard. Forced Alignment times the words you provide. The difference is accuracy.

ASR (Speech Recognition)

System listens to audio and guesses what was said

  • Can hallucinate words that weren't spoken
  • May miss or swap words entirely
  • Struggles with names, brands, and technical terms
  • Variable results across different runs
  • Requires manual review and correction

Forced Alignment

System aligns your transcript to audio acoustically

  • You provide the transcript—no guessing involved
  • 100% of your words appear in the output
  • Names, brands, and out-of-vocabulary (OOV) words handled via grapheme-to-phoneme (G2P) conversion
  • Deterministic: same input = same output
  • Ready for production—no manual fixes needed

VocaSync uses true forced alignment powered by acoustic models from the Montreal Forced Aligner. Your transcript in = your transcript out.

Your Input
"Don't forget the cat's food, François!"
+ audio file
Your Output
"Don't forget the cat's food, François!"
↳ with millisecond-precision timing for every word

Here's how we guarantee it

The Alignment Pipeline

Five distinct stages keep the output text identical to your input transcript

1. Normalize

Prepare your transcript for phonetic analysis while preserving everything needed for recovery.

  • Smart punctuation removal with recovery mapping
  • Contraction preservation (don't, it's, we've)
  • Unicode normalization (NFKC)
  • Whitespace and special character handling
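
The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not VocaSync's actual implementation: the function name `normalize`, the rule for what counts as a word character, and the shape of the recovery list are all assumptions made for the example. It applies NFKC, folds curly apostrophes, strips punctuation, and keeps each word's original surface form so it can be restored later.

```python
import unicodedata
from typing import List, Tuple


def _is_word_char(ch: str) -> bool:
    # Assumed rule: letters, digits, and apostrophes belong to a word.
    return ch.isalnum() or ch == "'"


def normalize(transcript: str) -> Tuple[List[str], List[str]]:
    """Return (tokens, recovery).

    tokens[i] is the cleaned, lowercased word passed on to alignment;
    recovery[i] is the untouched original word it came from.
    """
    tokens: List[str] = []
    recovery: List[str] = []
    for raw in transcript.split():
        text = unicodedata.normalize("NFKC", raw)
        # NFKC leaves the curly apostrophe (U+2019) alone, so fold it by hand.
        text = text.replace("\u2019", "'")
        core = "".join(ch for ch in text if _is_word_char(ch))
        core = core.strip("'").lower()
        if core:  # skip tokens that were pure punctuation
            tokens.append(core)
            recovery.append(raw)
    return tokens, recovery
```

Calling `normalize` on the sample sentence from earlier yields lowercase tokens with contractions intact, while the recovery list still holds "food," and "François!" exactly as written.
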

2. G2P & OOV Handling

Generate pronunciations for every word, including names, brands, and technical terms not in standard dictionaries.

  • Grapheme-to-phoneme conversion for unknown words
  • Out-of-vocabulary (OOV) detection and resolution
  • Per-language dictionary with learned pronunciations
  • Pronunciation cache persists across your projects
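
One way to picture the lookup order described above is a lexicon check, then a cache check, then a G2P fallback whose result is stored for next time. The sketch below is illustrative only: `resolve_pronunciations` and `toy_g2p` are hypothetical names, and the toy G2P simply emits one pseudo-phoneme per letter where a real system would use a trained model.

```python
from typing import Callable, Dict, List

Pronunciation = List[str]


def toy_g2p(word: str) -> Pronunciation:
    # Placeholder only: one pseudo-phoneme per letter.
    # A real G2P model is trained per language.
    return [ch.upper() for ch in word if ch.isalpha()]


def resolve_pronunciations(
    words: List[str],
    lexicon: Dict[str, Pronunciation],
    cache: Dict[str, Pronunciation],
    g2p: Callable[[str], Pronunciation] = toy_g2p,
) -> Dict[str, Pronunciation]:
    """Look up each word; run G2P only for words missing everywhere."""
    resolved: Dict[str, Pronunciation] = {}
    for word in words:
        if word in lexicon:          # known dictionary word
            resolved[word] = lexicon[word]
        elif word in cache:          # OOV word learned on an earlier run
            resolved[word] = cache[word]
        else:                        # new OOV word: generate and remember
            cache[word] = g2p(word)
            resolved[word] = cache[word]
    return resolved
```

Because the cache is passed in and mutated, persisting it between calls (for example by saving it to disk, which is not shown here) means a brand name is sent through G2P only once.
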

3. Forced Alignment

Match your audio to your transcript at the phoneme level, achieving millisecond precision.

  • Acoustic model analysis of audio waveform
  • Phoneme-level matching against transcript
  • 16 kHz mono audio processing for optimal accuracy
  • Language-specific acoustic models (15 languages)
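
To make "phoneme-level matching" concrete, here is a textbook Viterbi-style dynamic program, offered as a sketch of the general technique rather than the code VocaSync or the Montreal Forced Aligner runs. It assumes a hypothetical `scores` table where `scores[t][p]` is how well audio frame `t` matches the `p`-th phoneme of the known transcript. In a real system those scores would come from the acoustic model. The path may only stay on the current phoneme or advance to the next one, which is what forces the output to follow the transcript.

```python
from typing import List


def force_align(scores: List[List[float]]) -> List[int]:
    """Return the first frame index assigned to each phoneme.

    scores[t][p] is a log-score for frame t against phoneme p
    (illustrative input; higher means a better match).
    """
    if not scores or len(scores) < len(scores[0]):
        raise ValueError("need at least one frame per phoneme")
    n_frames, n_phones = len(scores), len(scores[0])
    neg_inf = float("-inf")

    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the path must start on the first phoneme

    for t in range(1, n_frames):
        for p in range(n_phones):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            if advance > stay:
                best[t][p] = advance + scores[t][p]
                back[t][p] = p - 1
            else:
                best[t][p] = stay + scores[t][p]
                back[t][p] = p

    # Trace back from the last phoneme at the last frame.
    path = [n_phones - 1]
    p = n_phones - 1
    for t in range(n_frames - 1, 0, -1):
        p = back[t][p]
        path.append(p)
    path.reverse()

    return [path.index(i) for i in range(n_phones)]
```

Multiplying each returned frame index by the frame hop (10 ms is a common choice, assumed here) gives phoneme start times, and a word's start time is the start of its first phoneme.
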

4. Artifact Recovery

Clean up alignment artifacts and reconstruct words that were split during processing.

  • Filter processing artifacts from output
  • Merge split contractions: "don" + "'t" → "don't"
  • Recover unrecognized tokens using transcript context
  • Drop unrecoverable artifacts (clean output only)
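
The contraction merge above is simple to illustrate. In this sketch, which assumes aligned words arrive as hypothetical `(text, start, end)` tuples in seconds, any token beginning with an apostrophe is glued onto the word before it, and the merged word spans from the first piece's start to the last piece's end.

```python
from typing import List, Tuple

# (text, start_seconds, end_seconds); an assumed shape for this example.
Word = Tuple[str, float, float]


def merge_split_contractions(words: List[Word]) -> List[Word]:
    """Rejoin pieces such as "don" + "'t" into a single timed word."""
    merged: List[Word] = []
    for text, start, end in words:
        if text.startswith("'") and merged:
            prev_text, prev_start, _ = merged[-1]
            # Keep the earlier start, extend to the later end.
            merged[-1] = (prev_text + text, prev_start, end)
        else:
            merged.append((text, start, end))
    return merged
```
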

5. Reconcile

Final verification pass to ensure output matches your original transcript exactly.

  • Word-for-word verification against original
  • Apostrophe and punctuation reinsertion
  • Original formatting restoration
  • Timing metadata attachment to each word
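
A verification pass of this kind could look like the sketch below. It is an illustration under assumptions, not the production logic: `reconcile` is a hypothetical name, aligned words are assumed to be `(token, start, end)` tuples, and two words are treated as the same if they match after lowercasing and dropping everything that is not a letter or digit. On success, each original word is emitted exactly as written with its timing attached. On any mismatch the function fails loudly instead of guessing.

```python
from typing import Dict, List, Tuple, Union

TimedWord = Dict[str, Union[str, float]]


def _key(text: str) -> str:
    # Comparison key: case-insensitive, punctuation ignored.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def reconcile(
    original_words: List[str],
    aligned: List[Tuple[str, float, float]],
) -> List[TimedWord]:
    """Check aligned tokens against the original and restore its formatting."""
    if len(original_words) != len(aligned):
        raise ValueError(
            f"expected {len(original_words)} words, got {len(aligned)}"
        )
    result: List[TimedWord] = []
    for surface, (token, start, end) in zip(original_words, aligned):
        if _key(surface) != _key(token):
            raise ValueError(f"mismatch: {surface!r} vs {token!r}")
        # Output the original surface form, punctuation and all.
        result.append({"word": surface, "start": start, "end": end})
    return result
```
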

The VocaSync Guarantee

No dropped words. No junk tokens. No surprises.

100% Coverage

Every word in your transcript appears in the output with timing data.

Clean Output

No processing artifacts—only real words from your script.

Exact Match

Apostrophes, contractions, and formatting preserved exactly.

Learning Dictionary

Unknown words are learned and cached for faster future processing.

Export in Any Format

Get your alignment data in the format you need

.srt

SubRip

Universal subtitle format

  • YouTube, Vimeo, social media
  • Premiere Pro, Final Cut, DaVinci
  • VLC, MPV, media players
.vtt

WebVTT

Web-native captions

  • HTML5 video players
  • Streaming platforms
  • Web accessibility
.json

JSON

Developer-friendly data

  • Custom applications
  • Karaoke systems
  • Word-level highlighting
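
For developers working from the JSON data, the two subtitle formats differ mainly in their timestamp separator and header. The sketch below assumes a hypothetical word list of `{"word", "start", "end"}` dictionaries with times in seconds, which may not match VocaSync's actual JSON schema, and writes one cue per word. SubRip uses a comma before the milliseconds and numbers each cue, while WebVTT uses a period and begins with a `WEBVTT` line.

```python
from typing import Dict, List, Union

TimedWord = Dict[str, Union[str, float]]


def format_timestamp(seconds: float, ms_separator: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_separator}{ms:03d}"


def to_srt(words: List[TimedWord]) -> str:
    """One numbered SubRip cue per word."""
    cues = []
    for index, w in enumerate(words, start=1):
        start = format_timestamp(w["start"], ",")
        end = format_timestamp(w["end"], ",")
        cues.append(f"{index}\n{start} --> {end}\n{w['word']}")
    return "\n\n".join(cues) + "\n"


def to_vtt(words: List[TimedWord]) -> str:
    """One WebVTT cue per word, preceded by the required header."""
    cues = []
    for w in words:
        start = format_timestamp(w["start"], ".")
        end = format_timestamp(w["end"], ".")
        cues.append(f"{start} --> {end}\n{w['word']}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Word-level cues suit karaoke and highlighting. Grouping several words into one cue for ordinary captions would be a small change to the loops.
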
Workflow Paths

How teams feed the alignment pipeline

Alignment remains the core engine. The production workflows below route synthesis and transcription projects into the same alignment pipeline.

Synthesis → Alignment

Generate speech, then align it for timing outputs.

  • Run synthesis with auto-alignment enabled for supported languages, or create the alignment project manually later.
  • A linked alignment project is created with copied audio and transcript inputs.
  • Source and linked projects are independent so each can be managed separately.

Transcription → Review → Alignment

Edit the transcript before creating alignment.

  • Complete transcription, then open the transcript review editor.
  • Optionally edit text in any supported alignment language before creating the linked alignment project.
  • Original transcription TXT/JSON artifacts are never modified; alignment always runs on copied inputs.

Ready for Perfect Alignment?

Upload your audio and transcript. Get back millisecond-precision timing for every word—guaranteed to match your original text.