TTS Tools
Last update: March 23, 2026
Introduction
TTS stands for Text-To-Speech: AI that converts any given text into spoken audio.
The tools listed here offer a decent variety of features & options, such as model training, fine-tuning, 0-shot voice cloning, or being mixed with RVC.
The TTS landscape moves fast — new state-of-the-art models appear every few months. Always check the linked GitHub repos and HuggingFace pages for the latest model versions, as entries here reflect what was current at the last update date above.
Here's an index of the best TTS tools out there:
ElevenLabs/11Labs
ElevenLabs is a freemium service that offers TTS, voice cloning, video translation & an AI voice agent platform.
Now on their Eleven v3 model, which introduced an audio tag system for fine-grained emotion and delivery control directly in your script (e.g. [nervous], [laughs], [whispering]), as well as a Text to Dialogue feature for multi-speaker conversations in a single generation.
Supports 32+ languages and offers both instant voice cloning (from ~1 min of audio) and professional voice cloning (30+ min of audio) for higher fidelity.
ElevenLabs' older v1 TTS models are deprecated as of December 2025. Make sure to use v2 or v3 models.
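A minimal sketch of calling the ElevenLabs REST endpoint with inline audio tags, using `requests`. The endpoint path and `xi-api-key` header follow the public API docs; the `eleven_v3` model id and the voice id are placeholders to verify against your own account:

```python
def build_tts_payload(text: str, model_id: str = "eleven_v3") -> dict:
    """Request body for /v1/text-to-speech/{voice_id}; audio tags like
    [nervous] or [whispering] go inline in the text itself."""
    return {"text": text, "model_id": model_id}

def synthesize(api_key: str, voice_id: str, text: str, out_path: str = "out.mp3") -> None:
    """Hedged sketch of the REST call; defined here but never invoked
    automatically, since it needs a real API key."""
    import requests  # imported here so build_tts_payload stays dependency-free

    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": api_key},
        json=build_tts_payload(text),
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

# Usage (with your own key & voice id):
#   synthesize(API_KEY, VOICE_ID, "[nervous] Is anyone there? [laughs] Just kidding.")
```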
Fish Speech / Fish Audio S2 Pro
Fish Speech is a 0-shot multilingual TTS developed by Fish Audio. The current flagship model is S2 Pro, released March 9, 2026.
S2 Pro uses a Dual Autoregressive (Dual-AR) architecture: a 4B-param Slow AR handles linguistic and prosodic structure, while a 400M-param Fast AR handles fine acoustic detail. Trained on 10M+ hours of audio across 80+ languages.
Supports free-form inline emotion control using natural language tags anywhere in the text (e.g. [whisper in small voice], [excited], [laugh], [sigh]) — no fixed tag set to memorize.
Achieves sub-150ms time-to-first-audio and an RTF of 0.195 on an H200. Natively supports multi-speaker and multi-turn generation in a single pass.
On benchmarks, S2 Pro outperforms all evaluated models including closed-source systems from Google and OpenAI on Seed-TTS Eval.
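For context, real-time factor (RTF) is generation time divided by audio duration, so anything under 1.0 is faster than realtime. A quick back-of-the-envelope check of the 0.195 figure:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: synthesis time per second of audio produced."""
    return generation_seconds / audio_seconds

# At RTF 0.195, a 60-second clip takes about 0.195 * 60 ≈ 11.7 s to generate.
print(round(0.195 * 60, 1))
```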
It can be used either locally or on the cloud.
S2 Pro is licensed under the Fish Audio Research License — free for research & non-commercial use. Commercial use is available via the Fish Audio API/platform.
Higgs Audio
Higgs Audio V2 is an open-source audio foundation model developed by Boson AI, released in July 2025 under the Apache 2.0 license, making it freely usable for commercial projects.
Pretrained on over 10 million hours of audio (speech, music, and sound events in a single unified system). Built on top of Llama-3.2-3B with a custom DualFFN adapter and a 24kHz audio tokenizer running at just 25 frames per second.
Goes beyond traditional TTS with several emergent capabilities rarely seen in open-source models: zero-shot multi-speaker dialogues, automatic prosody adaptation based on narrative context, melodic humming with the cloned voice, and simultaneous speech & background music generation.
On EmergentTTS-Eval, it achieves 75.7% win rate over GPT-4o-mini-TTS on emotion expressiveness — the best result of any open-source model at time of release.
The model family also includes a lightweight V2.5 (1B params, GRPO-aligned), which outperforms V2 on speed and accuracy while being smaller.
Can be used locally or on the cloud.
The full V2 (3B params) requires at least an RTX 4090 for efficient inference. V2.5 (1B params) is recommended for most users — it outperforms V2 on speed and accuracy and runs comfortably on 8GB+ VRAM GPUs, as well as via managed platforms (Microsoft Foundry, Eigen AI, Deep Infra) with no local GPU needed.
VibeVoice
VibeVoice is a family of open-source frontier voice AI models developed by Microsoft Research, released under the MIT License.
A core innovation is its use of continuous speech tokenizers (Acoustic & Semantic) operating at an ultra-low frame rate of 7.5 Hz, feeding into a next-token diffusion framework for high-fidelity acoustic detail.
The family currently has three active models:
- VibeVoice-TTS-1.5B — Long-form multi-speaker TTS. Synthesizes speech up to 90 minutes in a single pass, with up to 4 distinct speakers and natural turn-taking. Supports English and Chinese. Weights are on HuggingFace; community-maintained inference code is available.
- VibeVoice-Realtime-0.5B — Lightweight real-time streaming TTS (~300ms first-audio latency). Includes 9 experimental multilingual voices (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 English style voices.
- VibeVoice-ASR-7B — Long-form speech recognition (up to 60 minutes), generating structured transcriptions with speaker diarization and timestamps. Supports 50+ languages.
Can be used locally or on the cloud.
Microsoft pulled the original VibeVoice-TTS inference code on September 5, 2025 due to misuse. The 1.5B weights remain available on HuggingFace, and community Colabs using them exist. The Realtime-0.5B and ASR models are fully open with official code. The model card explicitly states VibeVoice is for research use only — not recommended for commercial deployment without further testing.
Chatterbox
Chatterbox is a production-grade open-source TTS family developed by Resemble AI, released under the MIT License.
Consistently outperforms ElevenLabs in blind side-by-side evaluations (63.75% preference rate in independent testing).
First open-source TTS model with emotion exaggeration control — dial expressiveness from monotone to dramatic with a single parameter.
The family includes three variants:
- Chatterbox — High-quality English TTS with 0-shot voice cloning & emotion control (0.5B params)
- Chatterbox Multilingual — Supports 23 languages with 0-shot voice cloning
- Chatterbox Turbo — Fastest variant (350M params), sub-200ms latency, designed for real-time voice agents. Natively supports paralinguistic tags like [laugh], [cough], [chuckle].
All outputs include an imperceptible neural watermark (Perth Watermarker) for responsible AI use.
It can be used locally or on the cloud.
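As a sketch of local use (assuming the `chatterbox-tts` package and its `ChatterboxTTS.from_pretrained`/`generate` entry points; check the repo for the current signatures), the expressiveness dial is a single float at generation time. The clamp helper below is our own guard rail, not part of the library:

```python
def clamp_exaggeration(value: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Keep the exaggeration dial in range: low values read monotone,
    high values dramatic. (Our own helper, not part of chatterbox-tts.)"""
    return max(lo, min(hi, value))

def synthesize(text: str, out_path: str = "chatterbox_out.wav") -> None:
    """Hedged sketch of the library call; needs a GPU and a model download,
    so it is defined here but never invoked automatically."""
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")
    wav = model.generate(text, exaggeration=clamp_exaggeration(0.7))
    torchaudio.save(out_path, wav, model.sr)
```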
Orpheus TTS
Orpheus is a Llama-3B based open-source TTS developed by Canopy AI, released in March 2025.
Designed for human-like speech — natural intonation, emotion and rhythm that rivals closed-source models.
Supports 0-shot voice cloning and guided emotion control via simple inline tags (e.g. <laugh>, <sigh>, <gasp>, <cough>).
Achieves ~200ms streaming latency for real-time applications, reducible to ~100ms with input streaming.
A family of multilingual models also exists, covering Chinese, Hindi, Korean, and Spanish.
It can be used locally or on the cloud.
IndexTTS
IndexTTS is an industrial-level zero-shot TTS developed by Bilibili (Index Team), built on top of XTTS/Tortoise with significant architectural improvements.
Uses a conformer-based speech conditioning encoder and BigVGAN2 decoder, giving it the lowest Word Error Rate of all evaluated models — ideal for audiobook production and content where accuracy is paramount.
The latest release is IndexTTS-2 (September 2025), which achieves state-of-the-art emotional fidelity, decouples timbre from emotion (independent control of voice identity and expression via style prompts or natural-language descriptions), and introduces the first precise duration control mechanism in an autoregressive TTS model — specifying audio length to the millisecond for video dubbing use cases.
The precise duration control feature exists in the research paper but is not yet enabled in the current public release code. The rest of the model (0-shot cloning, emotion control, timbre decoupling) is fully functional.
Can be used locally or on the cloud.
OuteTTS
OuteTTS is a novel TTS approach built on pure language modeling — no external adapters, encoders, or diffusion steps. Speech is generated directly from text and audio tokens using a standard LLM backbone (Qwen3).
Now on v1.0 (released September 2025), offering strong 0-shot voice cloning from a reference audio clip. Built-in default speaker profiles are currently English-only, but reference-based cloning works across languages including Chinese, Japanese, Korean, German, and French via zero-shot generalization from v0.3 onwards.
Particularly well-suited for local and edge deployment: supports GGUF, EXL2, Transformers, and vLLM backends out of the box, meaning you can run it with llama.cpp on modest hardware with no dedicated GPU required.
Available under the MIT License.
VoxCPM
VoxCPM is a tokenizer-free open-source TTS developed by OpenBMB (Tsinghua University / ModelBest), now on v1.5, released under the Apache 2.0 license.
Unlike mainstream TTS models that convert speech to discrete tokens, VoxCPM models speech directly in a continuous space via an end-to-end diffusion autoregressive architecture — eliminating the information loss of tokenization. Built on the MiniCPM-4 backbone (0.5B params), keeping it compact and efficient.
Trained on over 1.8 million hours of bilingual Chinese–English corpus, achieving state-of-the-art performance among open-source systems on multiple TTS benchmarks.
Two flagship capabilities:
- Context-Aware Speech Generation — automatically infers and generates appropriate prosody, tone, and pacing from the meaning of the text itself, with no need for emotion tags.
- 0-shot Voice Cloning — replicates timbre, speaking style, accent, and even background ambiance from just 3–10 seconds of reference audio, including cross-language cloning between Chinese and English.
Achieves an RTF of 0.17 on a consumer-grade NVIDIA RTX 4090, making real-time generation feasible. The v1.5 update halved the LM token rate (12.5Hz → 6.25Hz), reducing compute per second of audio and paving the way for longer-form generation.
Has a strong community ecosystem: ComfyUI nodes, ONNX export for CPU inference, Apple Neural Engine backend, and LoRA fine-tuning support.
Supports full fine-tuning and LoRA fine-tuning.
Currently supports Chinese and English, with multilingual support under active development.
Needs 12GB+ VRAM for full-precision inference. ONNX export (CPU) is available via VoxCPM-ONNX for lower-resource setups.
Dia TTS
Dia generates high-quality, English-only dialogue from a transcript.
Now on Dia2 (released November 19, 2025), with improved voice consistency, zero-shot voice cloning via audio prompts, and native HuggingFace Transformers support.
Can create nonverbal sounds like laughter, coughing, clearing throat, etc., using simple inline tags like (laughs) and (sighs).
Requires ~8–10GB VRAM. Runs at ~40 tokens/second on an A4000.
It can be used either locally or on the cloud.
Edge TTS
This is Microsoft Edge's TTS: good quality, multilingual, & it handles long sentences well.
It can only be used online: via their API, through the Edge browser, a HF/Colab space, or mixed with RVC.
BROWSER: Open Notepad & paste the following code:
```html
<!DOCTYPE html>
<html>
  <body style="background-color:#dddddd">
    <h3 aria-hidden="true">Browser TTS "Hack"</h3>
    <textarea rows="10" cols="50" id="ttsText" style="background-color:#eeeeee"></textarea>
    <br />
    <button aria-hidden="true" onclick="genText()"><font aria-hidden="true">Generate</font></button>
    <pre id="tts"></pre>
    <script>
      function genText() {
        var x = document.getElementById("ttsText").value;
        document.getElementById("tts").innerHTML = x;
      }
    </script>
  </body>
</html>
```
1. Save it as "Microsoft Edge TTS.txt".
2. Rename it to "Microsoft Edge TTS.html".
3. Open Microsoft Edge & drag the .html file into it.
4. In Audacity, set the recording mode to loopback to capture internal audio (the Realtek driver might be needed) & start recording.
5. In the TTS text box, enter the text you want & click Generate. Stop recording when the voice finishes.
6. You can also select Voice Options in the toolbar & change the speed for faster/slower speech.
SPACES / RVC FORKS: "Mixed with RVC" means the tool generates the speech, then runs the output through RVC to apply the chosen voice model.
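If you can run Python, the browser hack & loopback recording can be skipped entirely: the community `edge-tts` package (unaffiliated with Microsoft) drives the same Edge voices and writes straight to a file. A minimal sketch, assuming the package's `Communicate` API; the voice name is one example of many:

```python
import asyncio

async def edge_speak(text: str, voice: str = "en-US-AriaNeural", out: str = "edge_out.mp3") -> str:
    """Synthesize `text` with an Edge voice and save it as an mp3."""
    import edge_tts  # imported here so the file parses even without the package installed
    await edge_tts.Communicate(text, voice).save(out)
    return out

# Usage (not run automatically, since it needs network access):
#   asyncio.run(edge_speak("Hello from Edge TTS"))
```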
XTTS2
Built on 🐢 Tortoise TTS & developed by Coqui AI, which has unfortunately shut down.
Its model changes make cross-language 0-shot voice cloning & multilingual speech generation straightforward.
It also needs very little training data: as little as 2 minutes of audio for fine-tuning.
Full streaming support with 200ms time-to-first-chunk. Still a solid choice for fine-tuning workflows thanks to its mature training tooling — but newer models like XTTS community forks or Chatterbox are generally preferred for new projects.
Can use it either online or locally:
- Inference 0-Shot Training UI Colab (Run it & click the Public Link)
- Training & Inference UI Colab
- Inference 0-Shot Training HF Space
CosyVoice
Multilingual 0-shot TTS by Alibaba FunAudioLLM, now on its third generation (CosyVoice 3 / Fun-CosyVoice3).
Covers 9 major languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects & accents, and supports both multilingual and cross-lingual 0-shot voice cloning.
Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
Supports streaming inference with ultra-low latency (~150ms).
Can be used locally or online.
Qwen3-TTS
Qwen3-TTS is a family of advanced multilingual TTS models developed by the Qwen team at Alibaba Cloud, released on January 22, 2026 under the Apache 2.0 license.
Trained on over 5 million hours of speech data across 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian), plus multiple Chinese dialectal voice profiles.
The family ships three purpose-built model variants (available in 0.6B and 1.7B sizes):
- Base — 3-second voice cloning from a reference audio clip.
- CustomVoice — Style control over 9 premium preset timbres (various gender, age, language, and dialect combos) via user instructions.
- VoiceDesign — Create entirely new voices from a natural-language text description (e.g. "a warm, slightly husky young male voice speaking slowly").
Supports streaming generation with end-to-end synthesis latency as low as 97ms via its proprietary Qwen3-TTS-Tokenizer-12Hz — a 12.5 Hz, 16-layer multi-codebook codec with a lightweight causal ConvNet for fast waveform reconstruction.
Features strong contextual understanding: the model automatically adapts tone, speaking rate, and emotional expression based on the semantics of the input text.
Can be used locally or on the cloud.
Kokoro-TTS
Lightweight yet high-quality TTS model with just 82 million parameters — one of the most downloaded TTS models on HuggingFace.
Faster-than-realtime inference due to its StyleTTS2/ISTFTNet architecture (no encoders or diffusion steps). Processes text in under 0.3 seconds.
Only has premade voices — no voice cloning. Not the best emotion control.
Voice support for American & British English, French, Italian, Japanese and Chinese (with some voice bleeding between languages).
Fully open-source under the Apache 2.0 license, making it great for commercial use.
Some standalone websites claim to be the "official Kokoro TTS" — they are unaffiliated and potentially fraudulent. Always use the links below.
Piper
Fast TTS with great multilingual support — works for almost all languages.
Decent quality; primarily intended for edge / local deployment (runs on a Raspberry Pi with minimal latency).
Not recommended as a primary choice for quality-focused generation — better suited for offline assistants, embedded systems, and pipelines where speed matters more than expressiveness.
- GitHub Repo with Colabs (For training and inference)
- Several HuggingFace Spaces for each Language
GLM-TTS
GLM-TTS is an industrial-grade open-source TTS system developed by Zhipu AI (ZAI) — the team behind the GLM language model family — released December 11, 2025.
Uses a two-stage LLM + Flow Matching architecture. Its standout feature is a multi-reward reinforcement learning (GRPO) framework that trains the model across four simultaneous reward signals (speaker similarity, character error rate, emotion accuracy, and laughter naturalness), resulting in more expressive and emotionally consistent speech than most open-source alternatives.
On benchmarks, GLM-TTS_RL achieves the lowest Character Error Rate (0.89) of any open-source TTS while maintaining high speaker similarity — competitive with commercial systems like MiniMax.
Key features:
- 0-shot voice cloning from just 3–10 seconds of reference audio
- Phoneme-level pronunciation control via "Hybrid Phoneme + Text" input — critical for polyphone disambiguation in Chinese (e.g., the character "行" can be xíng or háng) and for professional audiobook/dubbing use cases
- Streaming inference with ~400ms first-frame latency on GPU
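The exact hybrid phoneme+text syntax is defined in the GLM-TTS repo; as an illustration of the polyphone problem it solves (the helper and names below are ours, not the library's):

```python
# "行" reads xing2 ("to walk / OK") or hang2 ("row / profession, as in 银行 'bank'").
POLYPHONE_READINGS = {"行": ["xing2", "hang2"]}

def annotate(text, overrides):
    """Pair each character with an explicit pinyin override where one is given;
    a hybrid phoneme+text input lets the TTS skip guessing the reading."""
    return [(ch, overrides.get(i)) for i, ch in enumerate(text)]

# In 银行 (bank), force the hang2 reading on the second character (index 1):
print(annotate("银行", {1: "hang2"}))  # [('银', None), ('行', 'hang2')]
```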
Primarily supports Chinese, with decent bilingual Chinese-English mixed-text support. Not a multilingual model.
Available under Apache 2.0 (code) and MIT (weights) — both permit commercial use.
Can be used locally or on the cloud.
The online demo (audio.z.ai) was not deployed at launch. Use the HuggingFace model weights + local inference script for now. Upcoming updates include RL-optimized weights and a 2D Vocos vocoder for improved audio quality.
GPT-SoVITS
GPT-SoVITS is a few-shot voice cloning & TTS system — just 1 minute of audio is enough to fine-tune a voice, and zero-shot works from a 5-second reference clip.
Very good with Chinese, and supports English, Japanese, Korean and Cantonese cross-language synthesis, though some noise artifacts remain in non-Chinese outputs.
Now on v4, which fixes the metallic artifacts of v3 caused by non-integer upsampling and natively outputs 48kHz audio. The new v2Pro/v2ProPlus variants offer v4-level quality at v2's hardware cost and speed — recommended for most users.
Can be used both locally & online:
- Colab Space (with fine-tuning, inference & UI)