Frequently Asked Questions
Find answers to common questions about VocaSync's text-to-speech synthesis, forced alignment, transcription, workflow paths, and pricing.
General
VocaSync is an AI-powered platform that provides text-to-speech synthesis, forced alignment, and audio transcription (ASR) services via a simple REST API. It helps you generate studio-quality audio from text, create precise word-level timestamps for subtitle generation, and convert audio to text with segment-level timestamps.
VocaSync is designed for developers, content creators, educators, and businesses who need reliable audio processing. Common use cases include audiobook production, video subtitle generation, podcast transcription, e-learning content, accessibility features, and interactive voice applications.
No! VocaSync offers both a web dashboard and a REST API. You can create projects, upload files, and download results directly from the dashboard without writing any code. For automation and integration, the API is available with comprehensive documentation.
Visit the sign-up page and create an account using your email or social login. Once registered, you can access your dashboard to manage projects and API keys.
Yes! New users with a valid payment method on file receive free credits to explore the platform. This allows you to test both synthesis and alignment features before purchasing additional credits.
Speech Synthesis
Speech synthesis supports 57 languages including English, Spanish, French, German, Chinese, Japanese, Arabic, Portuguese, Russian, Hindi, and many more. See our API documentation for the complete list of supported languages.
VocaSync offers 9 distinct AI voices: Alloy (neutral and balanced), Ash (soft and refined), Coral (warm and friendly), Echo (smooth and calm), Fable (expressive and dramatic), Onyx (deep and authoritative), Nova (bright and energetic), Sage (wise and measured), and Shimmer (clear and optimistic).
Synthesis supports 5 output formats: MP3 (universal compatibility), AAC (Apple optimized), OPUS (web streaming), FLAC (lossless audio), and WAV (uncompressed). Choose the format that best fits your workflow.
Yes, you can control speech speed from 0.25x (very slow) to 4.0x (very fast). The default is 1.0x. This is useful for creating audiobooks at comfortable listening speeds or generating faster audio for time-constrained applications.
Standard synthesis is optimized for speed and lower latency, making it ideal for real-time applications and drafts. High Fidelity synthesis produces higher quality audio with better clarity and naturalness, best for final production content like audiobooks and podcasts. VocaSync uses OpenAI TTS for synthesis (Standard maps to tts-1, High Fidelity maps to tts-1-hd). See the API reference for technical details.
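As a rough illustration of how the voice, format, speed, and quality options above fit together, here is a minimal Python sketch that validates them before a request is sent. The function name, the option keys (`voice`, `format`, `speed`, `model`), and the quality labels `standard` and `high_fidelity` are assumptions made for this example, not the documented request schema; check the API reference for the real field names.

```python
# Values taken from the FAQ; the payload shape itself is hypothetical.
VOICES = {"alloy", "ash", "coral", "echo", "fable", "onyx", "nova", "sage", "shimmer"}
FORMATS = {"mp3", "aac", "opus", "flac", "wav"}
# Quality tier -> underlying OpenAI TTS model, as stated above.
MODEL_BY_QUALITY = {"standard": "tts-1", "high_fidelity": "tts-1-hd"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_synthesis_options(voice, fmt="mp3", speed=1.0, quality="standard"):
    """Validate synthesis options and return them as a plain dict."""
    voice = voice.lower()
    fmt = fmt.lower()
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    if quality not in MODEL_BY_QUALITY:
        raise ValueError(f"unknown quality tier: {quality}")
    return {
        "voice": voice,
        "format": fmt,
        "speed": speed,
        "model": MODEL_BY_QUALITY[quality],
    }
```

Validating on the client side like this catches an out-of-range speed or a misspelled voice before any credits are spent.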
Forced Alignment
Forced alignment times the words YOU provide against audio—your transcript in, your transcript out. Unlike ASR (Automatic Speech Recognition), which guesses what was said and can hallucinate or miss words, forced alignment is deterministic: same input always produces the same output. This makes it ideal for production workflows where accuracy matters.
Forced alignment supports 15 languages with specialized acoustic models: English (US & UK), French, German, Spanish, Portuguese (Portugal), Swedish, Czech, Polish, Turkish, Russian, Ukrainian, Japanese, Korean, and Mandarin (China). Each language uses optimized models for accurate word-level timestamp generation.
Alignment outputs include JSON with word-level timestamps, SRT subtitles, and WebVTT captions. The JSON format includes precise start/end times for each word, perfect for programmatic use.
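If you want to build your own captions from the JSON output rather than download the SRT file, the conversion might look like the Python sketch below. It assumes each word is an object with `word`, `start`, and `end` keys, with times in seconds; that shape is a guess for illustration, so adjust the key names to match the actual JSON you receive.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues of up to max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest possible policy; splitting on punctuation or pauses usually reads better on screen.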
VocaSync uses MFA (Montreal Forced Aligner) with language-specific acoustic models, achieving word boundary accuracy within 20-50 milliseconds for clear audio. Accuracy depends on audio quality, background noise, and how closely the transcript matches the spoken content.
Forced alignment is widely used for: generating precise subtitles and captions, creating karaoke-style lyric displays, syncing audio with text in e-learning, building interactive audiobooks, and powering read-along applications where text highlights as audio plays.
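For read-along or karaoke-style highlighting, the core step is finding which word is being spoken at the current playback time. Here is a minimal Python sketch of that lookup, assuming the same hypothetical word shape (`word`, `start`, `end` in seconds, sorted by start time) rather than a confirmed schema.

```python
from bisect import bisect_right


def active_word_index(words, t):
    """Return the index of the word spoken at time t, or None during a gap."""
    starts = [w["start"] for w in words]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["end"]:
        return i
    return None
```

In a real player you would compute `starts` once and reuse it, since the lookup runs on every playback tick.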
Transcription (ASR)
Transcription (ASR) converts audio files to text using state-of-the-art WhisperX AI. Upload your audio file, specify the language (or use auto-detect), and receive a text transcript with segment-level timestamps. It's ideal for podcasts, interviews, lectures, and any audio content.
Transcription supports 99 languages, with automatic language detection. Major languages include English, Spanish, French, German, Chinese, Japanese, Arabic, Hindi, Portuguese, Russian, and many more. If you don't specify a language, the WhisperX engine detects the spoken language automatically.
Transcription supports MP3, MP4, MPEG, MPGA, M4A, WAV, WebM, OGG, and FLAC formats. Files can be up to 100MB in size.
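A quick local check before uploading can save a failed request. The Python sketch below tests a file name and size against the formats and limit listed above. It treats 100MB as 100 × 1024 × 1024 bytes and judges the format by file extension only; both are assumptions for this example, and the service may apply different rules.

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {
    ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm", ".ogg", ".flac",
}
MAX_BYTES = 100 * 1024 * 1024  # assumption: 100MB interpreted as 100 MiB


def check_upload(filename, size_bytes):
    """Return a list of problems with the file; an empty list means it looks fine."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems
```

For a file on disk you could call it as `check_upload(path.name, path.stat().st_size)`.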
Transcription outputs plain text (TXT) and JSON with segment-level timestamps. For word-level timestamps and subtitle formats (SRT/VTT), run forced alignment on your edited transcript.
VocaSync uses WhisperX, a state-of-the-art speech recognition model. Accuracy varies by audio quality, accent, and background noise. For best results, use clear audio recordings. You can also provide a prompt to guide the model with context about the content.
Transcription is pay-as-you-go only, at £0.01 per minute of audio. Duration is rounded up to the next 15-second increment, with a minimum charge of 1 minute. Unlike synthesis and alignment, there is no free tier for transcription. This ensures sustainable, high-quality transcription services.
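To estimate a charge yourself, the rounding rule can be worked through in a few lines of Python. This sketch assumes the rounded duration is billed pro rata at the per-minute rate (so 75 seconds costs 1.25 minutes); the exact way fractional pence are handled is not stated here, so treat the result as an estimate.

```python
import math
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.01")  # GBP per minute of audio, from the FAQ


def billable_seconds(duration_seconds):
    """Round up to a multiple of 15 seconds, with a 60-second minimum."""
    rounded = math.ceil(duration_seconds / 15) * 15
    return max(rounded, 60)


def transcription_cost(duration_seconds):
    """Estimated cost in GBP, assuming pro-rata billing of the rounded duration."""
    return RATE_PER_MINUTE * billable_seconds(duration_seconds) / 60
```

For example, a 2 minute 5 second file (125 seconds) rounds up to 135 seconds, or 2.25 billable minutes, while a 20-second clip is billed as the 1-minute minimum.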
Transcription (ASR) uses AI to guess what was said in audio—great for content you don't have a transcript for. Alignment takes your existing transcript and precisely timestamps each word against the audio. Use transcription when you need to discover what was said; use alignment when you already know and need precise timing.
Speaker diarization (labeling different speakers) is on our roadmap but not yet available. If this is a priority for your use case, contact sales@vocasync.io and let us know.
Workflows
You can either enable auto-alignment during synthesis (for supported languages) or create alignment later from the synthesis project page. VocaSync creates a linked alignment project with copied inputs so both projects stay independent.
Open your completed transcription project, review/edit the transcript, then create a linked alignment project. This gives you AI-assisted transcription plus deterministic word-level alignment output.
Transcription is ASR-based, which is a best-match process and not guaranteed to be 100% accurate. Alignment requires the transcript to match the spoken audio closely, so use the manual transcript review editor before alignment. Helpful fixes include correcting brand names, coined terms, product names, proper nouns, and acronyms, then checking punctuation/casing and replaying uncertain segments before submitting.
Alignment supports a smaller language set than transcription. If a transcription is in an out-of-scope language, alignment actions are disabled for that project. If the language is supported, the review-to-alignment flow is available.
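If you are scripting around this rule, the check amounts to a set lookup. The Python sketch below uses the alignment language names from this FAQ as identifiers; the real API most likely uses language codes, so the strings and the function name here are placeholders for illustration only.

```python
# The 15 alignment languages listed in this FAQ, used here as plain labels.
ALIGNMENT_LANGUAGES = {
    "English (US)", "English (UK)", "French", "German", "Spanish",
    "Portuguese (Portugal)", "Swedish", "Czech", "Polish", "Turkish",
    "Russian", "Ukrainian", "Japanese", "Korean", "Mandarin (China)",
}


def alignment_available(transcription_language):
    """True if a transcription in this language can move on to alignment."""
    return transcription_language in ALIGNMENT_LANGUAGES
```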
Yes. The transcript review editor supports all languages that VocaSync alignment supports, including non-English languages. You can make edits before creating the linked alignment project.
No. VocaSync never mutates the original transcription TXT/JSON artifacts when you generate alignment. It creates copied alignment-owned inputs in a separate linked project, so each project remains independent.
Projects are intentionally independent, so deleting one does not cascade-delete the other. If a linked alignment project is removed, VocaSync clears stale linkage references so the source project no longer points to a non-existent linked project.
Billing & Credits
You purchase credits in advance, and they are consumed based on usage. Alignment and transcription are charged per minute of audio (rounded up to the next 15-second increment, minimum 1 minute). Synthesis is charged per 1,000 characters. There are no subscriptions or commitments.
No, credits never expire. Once purchased, they remain in your account until used, giving you complete flexibility in how you use the service.
Credits are non-refundable once purchased. We recommend starting with the free tier to evaluate the service before purchasing additional credits.
All payments are processed securely via Stripe. We accept major credit cards, debit cards, and other payment methods supported by Stripe in your region.
All prices are displayed in GBP (British Pounds). Taxes may apply based on your location.
API & Technical
After signing in, navigate to your dashboard and go to the API Keys section. You can generate new API keys there. Keep your keys secure and never share them publicly.
Synthesis outputs audio in 5 formats: MP3, AAC, OPUS, FLAC, and WAV. Alignment outputs JSON, SRT (subtitles), and VTT (WebVTT captions) with word-level timestamps. Transcription outputs TXT and JSON with segment-level timestamps.
We apply fair-use rate limits to protect platform stability. Current limits are documented in the API reference. If you need higher throughput for production workloads, email sales@vocasync.io and we'll work with you to accommodate your needs.
Your files are stored until you delete them or close your account. There is no automatic expiry. You can delete files anytime from your dashboard. When you close your account, all associated files are permanently removed.
Yes, VocaSync can be used for commercial projects. Review our Terms of Service for complete usage rights and acceptable use guidelines.
For transcription and alignment, we accept most common audio formats including MP3, MP4, MPEG, MPGA, M4A, WAV, WebM, OGG, and FLAC. Files are auto-normalized via ffmpeg. Maximum file size is 100MB.
Transcription uses AI to infer punctuation, casing, and number formatting. Results are generally good but not perfect. For alignment, the output matches your input transcript exactly. We recommend reviewing and editing transcripts before alignment for production use.
Yes! The REST API is designed for automation. Submit jobs programmatically, poll for completion (or use webhooks where available), and download results. Check the API documentation for examples and best practices.
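The polling step might look something like the Python sketch below. To avoid guessing at endpoints, it takes a `fetch_status` callable that you supply (for example, a function that calls the job status endpoint and returns its status string). The status values `completed` and `failed`, the function name, and the backoff settings are all assumptions for this example; the API documentation has the real values.

```python
import time

TERMINAL_STATUSES = ("completed", "failed")  # hypothetical status names


def wait_for_job(fetch_status, interval=2.0, max_interval=30.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() with exponential backoff until the job finishes.

    Returns the terminal status, or raises TimeoutError after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)
        # Double the wait each round, capped, to stay well within rate limits.
        interval = min(interval * 2, max_interval)
```

Backing off between polls keeps a long-running job from eating into your rate limit; where webhooks are available, they remove the need to poll at all.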
Support & Security
For technical support, billing questions, or account help, email support@vocasync.io. We aim to respond within 24–48 business hours. For enterprise inquiries, volume pricing, or custom SLAs, contact sales@vocasync.io.
We take data security seriously. All data is transmitted over encrypted connections (HTTPS), and we follow industry best practices for data storage and handling. Review our Privacy Policy for complete details.
Yes, you can request account deletion at any time by contacting support. We will remove your account and associated data in accordance with our Privacy Policy.
Notifications
VocaSync can send email notifications for: project completion (when your alignment, synthesis, or transcription job finishes), project failures (if something goes wrong), low balance alerts (when your credits drop below a threshold), API key activity (when keys are created, deleted, or status changes), and optional weekly usage digests summarizing your activity.
All VocaSync notification emails are sent from notifications@vocasync.io. To ensure you receive our emails, add this address to your contacts or safe senders list. If you're not receiving notifications, check your spam or junk folder.
Go to Settings in your dashboard and select the Notifications tab. From there, you can toggle each notification type on or off, set your preferred notification email address, and configure your low balance alert threshold.
Yes! In the Notifications settings, you can specify any email address for receiving notifications. This doesn't have to be the same as your account email. You can change it anytime.
When enabled, you'll receive an email notification when your account balance drops below your configured threshold (from £0.50 to £10.00). To prevent spam, you'll only receive one low balance alert per 24-hour period. Top up your account to continue uninterrupted service.
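The behaviour described above comes down to two checks: is the balance below the threshold, and has it been at least 24 hours since the last alert? The Python sketch below is an illustration of that logic under those stated rules, not VocaSync's actual implementation; the function and parameter names are made up for the example.

```python
from datetime import datetime, timedelta
from decimal import Decimal

MIN_THRESHOLD = Decimal("0.50")   # GBP, lowest configurable threshold
MAX_THRESHOLD = Decimal("10.00")  # GBP, highest configurable threshold
COOLDOWN = timedelta(hours=24)    # at most one alert per 24-hour period


def should_send_alert(balance, threshold, last_sent, now):
    """Decide whether a low balance alert is due.

    last_sent is the time of the previous alert, or None if none was sent.
    """
    if not MIN_THRESHOLD <= threshold <= MAX_THRESHOLD:
        raise ValueError("threshold must be between 0.50 and 10.00")
    if balance >= threshold:
        return False
    return last_sent is None or now - last_sent >= COOLDOWN
```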
The weekly digest is an optional email summary of your VocaSync activity. It includes the number of projects completed, alignment minutes and synthesis characters used, and your current credit balance. It's a great way to stay informed about your usage patterns.
API key activity notifications alert you when changes are made to your API keys. You'll be notified when a key is created, deleted, activated, or deactivated. This helps you stay aware of any security-related changes to your account, whether made by you or by someone with access to your dashboard.
Yes, you can disable all notification types individually from the Settings > Notifications page. Each notification type (project completed, project failed, low balance, weekly digest, API key activity) can be toggled independently. You can also click the 'Manage notification preferences' link in any email to update your settings.
Still have questions?
Can't find what you're looking for? Our support team is here to help.
Contact Support