Voxtral TTS converts text into lifelike speech across 9 languages.
Zero-shot voice cloning · 70ms latency · Open weights by Mistral AI.
A 4-billion-parameter text-to-speech model by Mistral AI, designed for production-grade voice generation with natural intonation and multilingual fluency.
Generate native-quality speech in English, French, German, Spanish, Portuguese, Italian, Dutch, Hindi, and Arabic — all from a single unified model.
Replicate any voice from as little as 3 seconds of reference audio. No fine-tuning required — just provide a sample and start generating.
Approximately 70 ms model latency with a real-time factor of about 9.7x, meaning audio is generated faster than it plays back. Built for streaming applications, conversational AI, and live voice agents.
Full model weights are available on Hugging Face under the CC BY-NC 4.0 license, which covers non-commercial use. Run on your own infrastructure with complete control over data privacy.
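To make the latency figures above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that "9.7x real-time factor" means about 9.7 seconds of audio are produced per second of compute, and that total generation time can be approximated as the quoted model latency plus that compute time. The function names are hypothetical and are not part of any Voxtral API.

```python
MODEL_LATENCY_S = 0.070  # "approximately 70 ms" as quoted on this page
SPEEDUP = 9.7            # assumption: seconds of audio produced per second of compute


def synthesis_time_s(audio_seconds: float, speedup: float = SPEEDUP) -> float:
    """Estimate compute time needed to generate `audio_seconds` of speech."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def total_time_s(
    audio_seconds: float,
    latency_s: float = MODEL_LATENCY_S,
    speedup: float = SPEEDUP,
) -> float:
    """Simplified estimate: fixed model latency plus compute time."""
    return latency_s + synthesis_time_s(audio_seconds, speedup)


if __name__ == "__main__":
    for seconds in (5, 30, 60):
        print(f"{seconds:>3}s of audio: ~{total_time_s(seconds):.2f}s to generate")
```

Under these assumptions, a clip finishes generating well before it would finish playing, which is what makes streaming playback feasible. Real-world numbers will also depend on hardware, input length, and network overhead.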
Enterprise-ready speech synthesis that delivers natural, expressive audio at scale.
Generate high-fidelity speech in three simple steps:
Enter any text, from a single sentence to a full article. Voxtral TTS handles punctuation, numbers, and mixed-language inputs automatically.
Select from built-in preset voices, or upload a short audio clip to clone any speaking style with zero-shot voice replication.
Click Generate to receive high-quality 24 kHz audio within moments. Stream it in real time or download it in your preferred format.
Everything you need for production-grade text-to-speech.
Native support for 9 languages including English, French, German, Spanish, and Arabic with accent-accurate pronunciation.
Clone any voice from just 3 seconds of audio. No training pipeline, no fine-tuning — instant replication.
About 70 ms model latency with a 9.7x real-time factor. Purpose-built for live conversations and real-time applications.
Export in WAV, MP3, FLAC, AAC, Opus, and PCM, all at a 24 kHz sample rate.
Run on your own hardware with open weights from Hugging Face. Full data sovereignty and offline capability.
Voice-as-an-instruction: the model captures intonation, rhythm, and emotion from reference audio without explicit SSML tags.
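For readers handling the PCM and WAV outputs listed above, the following minimal sketch shows how raw 24 kHz PCM samples can be wrapped in a WAV container using only the Python standard library. The 24 kHz rate comes from this page; the 16-bit mono layout is an assumption, and the sine tone is only a stand-in for real synthesized speech. All function names are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 24_000   # sample rate quoted on this page
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit samples
CHANNELS = 1              # assumption: mono output


def sine_pcm(freq_hz: float = 440.0, seconds: float = 1.0) -> bytes:
    """Generate a placeholder tone as little-endian 16-bit PCM."""
    n_samples = int(seconds * SAMPLE_RATE_HZ)
    return b"".join(
        struct.pack(
            "<h",
            int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ)),
        )
        for i in range(n_samples)
    )


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM frames in a WAV header and return the file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_s(pcm: bytes) -> float:
    """Playback duration implied by the PCM byte count."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


if __name__ == "__main__":
    pcm = sine_pcm(seconds=0.5)
    with open("example.wav", "wb") as f:
        f.write(pcm_to_wav_bytes(pcm))
    print(f"wrote {duration_s(pcm):.2f}s of audio")
```

At these assumed settings, one second of uncompressed audio occupies 48,000 bytes, which is a useful baseline when choosing between PCM/WAV and the compressed formats.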
Have more questions? Contact us at hello@aivoxtraltts.com.
Try Voxtral TTS now — no sign-up required.