Your Script In = Your Script Out
Word-for-word precision through a multi-step alignment pipeline. Every contraction, every proper noun, every word—perfectly timed and exactly as you wrote it.
Not ASR. Not Guessing.
Automatic Speech Recognition (ASR) transcribes what it thinks it heard. Forced Alignment times the words you provide. The difference is accuracy.
ASR (Speech Recognition)
System listens to audio and guesses what was said
- Can hallucinate words that weren't spoken
- May miss or swap words entirely
- Struggles with names, brands, and technical terms
- Variable results across different runs
- Requires manual review and correction
Forced Alignment
System aligns your transcript to audio acoustically
- You provide the transcript—no guessing involved
- 100% of your words appear in the output
- Names, brands, and out-of-vocabulary (OOV) words handled via grapheme-to-phoneme (G2P) conversion
- Deterministic: same input = same output
- Ready for production—no manual fixes needed
VocaSync uses true forced alignment powered by acoustic models from the Montreal Forced Aligner. Your transcript in = your transcript out.
Here's how we guarantee it
The Alignment Pipeline
Five distinct stages carry every word of your transcript from input to timed output
Normalize
Prepare your transcript for phonetic analysis while preserving everything needed for recovery.
- Smart punctuation removal with recovery mapping
- Contraction preservation (don't, it's, we've)
- Unicode normalization (NFKC)
- Whitespace and special character handling
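As an illustration of the Normalize stage, here is a minimal Python sketch. It assumes whitespace tokenization and a simple per-word recovery record; the function name `normalize` and the record fields are hypothetical, not VocaSync's actual implementation.

```python
import re
import unicodedata

# Leading punctuation, core word, trailing punctuation.
# Straight and curly apostrophes count as part of the word.
_EDGES = re.compile(r"^([^\w'’]*)(.*?)([^\w'’]*)$", re.DOTALL)


def normalize(transcript):
    """Return (clean_words, recovery) for a raw transcript.

    clean_words feeds the aligner; recovery keeps what was stripped
    so the original text can be restored afterwards.
    """
    clean_words = []
    recovery = []
    for raw in transcript.split():
        # NFKC folds compatibility characters (ligatures, full-width forms).
        token = unicodedata.normalize("NFKC", raw)
        lead, core, trail = _EDGES.match(token).groups()
        if not core:
            # Pure punctuation (e.g. a dash): keep it with the previous word.
            # If there is no previous word yet, this sketch drops it.
            if recovery:
                recovery[-1]["trail"] += " " + raw
            continue
        # NFKC leaves the curly apostrophe alone, so fold it explicitly.
        clean_words.append(core.replace("’", "'").lower())
        recovery.append({"original": raw, "lead": lead, "trail": trail})
    return clean_words, recovery
```

Contractions survive because apostrophes are treated as word characters, while surrounding quotes and commas move into the recovery record.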
G2P & OOV Handling
Generate pronunciations for every word, including names, brands, and technical terms not in standard dictionaries.
- Grapheme-to-phoneme conversion for unknown words
- Out-of-vocabulary (OOV) detection and resolution
- Per-language dictionary with learned pronunciations
- Pronunciation cache persists across your projects
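The OOV step can be pictured roughly as follows. This is a sketch under stated assumptions: the lexicon is a plain dict, the cache is an in-memory dict (a real system would persist it), and `spell_out_g2p` is a deliberately naive stand-in for a trained G2P model. All names are hypothetical.

```python
def find_oov(words, lexicon):
    """Return words missing from the lexicon, in first-seen order."""
    seen = set()
    oov = []
    for word in words:
        if word not in lexicon and word not in seen:
            seen.add(word)
            oov.append(word)
    return oov


def resolve(words, lexicon, g2p, cache):
    """Map every word to a pronunciation, calling g2p once per new OOV word."""
    pronunciations = {}
    for word in words:
        if word in lexicon:
            pronunciations[word] = lexicon[word]
        else:
            if word not in cache:
                cache[word] = g2p(word)  # learned once, reused afterwards
            pronunciations[word] = cache[word]
    return pronunciations


def spell_out_g2p(word):
    """Placeholder G2P: one symbol per letter. A real model predicts phonemes."""
    return [ch.upper() for ch in word if ch.isalpha()]
```

Passing the same `cache` dict to later calls means a brand name is only sent through G2P the first time it appears.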
Forced Alignment
Match your audio to your transcript at the phoneme level, achieving millisecond precision.
- Acoustic model analysis of audio waveform
- Phoneme-level matching against transcript
- 16 kHz mono audio processing for optimal accuracy
- Language-specific acoustic models (15 languages)
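To give a feel for phoneme-level matching, here is a toy dynamic-programming sketch in Python. It assumes a precomputed cost table where `cost[t][p]` says how poorly audio frame `t` fits the `p`-th phoneme of the transcript, and a 10 ms frame step. Both are illustrative assumptions; a production aligner derives its scores from a trained acoustic model and is considerably more involved.

```python
def align(cost):
    """Assign each frame to one phoneme, in transcript order, at minimum cost.

    cost[t][p] is the mismatch between frame t and phoneme p.
    Returns a list with the phoneme index chosen for every frame.
    """
    n_frames, n_phones = len(cost), len(cost[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    inf = float("inf")
    best = [[inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = cost[0][0]  # the first frame must belong to the first phoneme
    for t in range(1, n_frames):
        for p in range(min(t, n_phones - 1) + 1):
            stay = best[t - 1][p]                      # same phoneme continues
            move = best[t - 1][p - 1] if p > 0 else inf  # next phoneme starts
            if move < stay:
                best[t][p], back[t][p] = move + cost[t][p], p - 1
            else:
                best[t][p], back[t][p] = stay + cost[t][p], p
    # Trace back from the last frame, which must belong to the last phoneme.
    path = [0] * n_frames
    p = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = p
        p = back[t][p]
    return path


def to_segments(path, phonemes, frame_ms=10):
    """Collapse a per-frame path into (phoneme, start_ms, end_ms) tuples."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((phonemes[path[start]], start * frame_ms, t * frame_ms))
            start = t
    return segments
```

Because the transcript fixes the phoneme order, the search only decides where each boundary falls, which is why the same input yields the same output.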
Artifact Recovery
Clean up alignment artifacts and reconstruct words that were split during processing.
- Filter processing artifacts from output
- Merge split contractions: "don" + "'t" → "don't"
- Recover unrecognized tokens using transcript context
- Drop unrecoverable artifacts (clean output only)
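A rough sketch of the recovery step, assuming aligner output arrives as dicts with `word`, `start`, and `end` fields. The field names and the `ARTIFACTS` placeholder labels are hypothetical, chosen only for illustration.

```python
# Hypothetical placeholder labels an aligner might emit for non-words.
ARTIFACTS = {"", "<sil>"}


def recover(tokens, expected_words):
    """Drop artifact tokens and merge split contractions.

    A fragment starting with an apostrophe is merged into the previous
    token only when the merged form is a word the transcript contains.
    """
    expected = set(expected_words)
    merged = []
    for tok in tokens:
        if tok["word"] in ARTIFACTS:
            continue  # not a real word from the script
        if merged and tok["word"].startswith("'"):
            prev = merged[-1]
            candidate = prev["word"] + tok["word"]
            if candidate in expected:
                # The merged word spans both fragments' timings.
                merged[-1] = {"word": candidate,
                              "start": prev["start"],
                              "end": tok["end"]}
                continue
        merged.append(dict(tok))
    return merged
```

Checking the merged form against the transcript keeps a word like "'cause" from being glued onto whatever precedes it.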
Reconcile
Final verification pass to ensure output matches your original transcript exactly.
- Word-for-word verification against original
- Apostrophe and punctuation reinsertion
- Original formatting restoration
- Timing metadata attachment to each word
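The final check can be sketched as a one-to-one comparison between the original transcript and the aligned words. This is a minimal illustration: the `_clean` helper uses a simplified punctuation rule, and the field names are assumptions rather than VocaSync's actual schema.

```python
def _clean(token):
    """Simplified normalization: trim edge punctuation, lowercase."""
    return token.strip(".,!?;:\"()[]").lower()


def reconcile(transcript, aligned):
    """Verify word-for-word coverage, then restore the original text.

    aligned is a list of {"word", "start", "end"} dicts in transcript order.
    Raises ValueError if any word is missing, extra, or different.
    """
    originals = transcript.split()
    if len(originals) != len(aligned):
        raise ValueError(
            f"expected {len(originals)} words, aligner returned {len(aligned)}")
    result = []
    for i, (orig, tok) in enumerate(zip(originals, aligned)):
        if _clean(orig) != tok["word"]:
            raise ValueError(f"word {i}: expected {orig!r}, got {tok['word']!r}")
        # Output carries the author's exact text plus the aligner's timing.
        result.append({"text": orig, "start": tok["start"], "end": tok["end"]})
    return result
```

Failing loudly on any mismatch is the design choice that matters here: a silent fallback would let a dropped or altered word reach the output.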
The VocaSync Guarantee
No dropped words. No junk tokens. No surprises.
100% Coverage
Every word in your transcript appears in the output with timing data.
Clean Output
No processing artifacts—only real words from your script.
Exact Match
Apostrophes, contractions, and formatting preserved exactly.
Learning Dictionary
Unknown words are learned and cached for faster future processing.
Export in Any Format
Get your alignment data in the format you need
SubRip
Universal subtitle format
- YouTube, Vimeo, social media
- Premiere Pro, Final Cut, DaVinci
- VLC, MPV, media players
WebVTT
Web-native captions
- HTML5 video players
- Streaming platforms
- Web accessibility
JSON
Developer-friendly data
- Custom applications
- Karaoke systems
- Word-level highlighting
How teams feed the alignment pipeline
Alignment is the core engine. These production workflows route synthesis or transcription projects into that same alignment pipeline.
Synthesis → Alignment
Generate speech, then align it for timing outputs.
- Run synthesis with auto-alignment enabled for supported languages, or create alignment manually later.
- A linked alignment project is created with copied audio and transcript inputs.
- Source and linked projects are independent so each can be managed separately.
Transcription → Review → Alignment
Edit the transcript before creating alignment.
- Complete transcription, then open the transcript review editor.
- Optionally edit text in any supported alignment language before creating the linked alignment project.
- Original transcription TXT/JSON artifacts are never modified; alignment always runs on copied inputs.
Ready for Perfect Alignment?
Upload your audio and transcript. Get back millisecond-precision timing for every word—guaranteed to match your original text.