TTS Tools

Last update: March 23, 2026


Introduction


  • TTS stands for Text To Speech: AI models that convert any given text into spoken audio.

  • The tools listed here offer a decent variety of features & options, such as model training, fine-tuning, 0-shot voice cloning, or combination with RVC.

  • The TTS landscape moves fast — new state-of-the-art models appear every few months. Always check the linked GitHub repos and HuggingFace pages for the latest model versions, as entries here reflect what was current at the last update date above.

  • Here's an index of the best TTS tools out there:


ElevenLabs/11Labs

  • ElevenLabs is a freemium service that offers TTS, voice cloning, video translation & an AI voice agent platform.

  • Now on their Eleven v3 model, which introduced an audio tag system for fine-grained emotion and delivery control directly in your script (e.g. [nervous], [laughs], [whispering]), as well as a Text to Dialogue feature for multi-speaker conversations in a single generation.

  • Supports 32+ languages and offers both instant voice cloning (from ~1 min of audio) and professional voice cloning (30+ min) for higher fidelity.
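
The inline tag syntax described above can be illustrated with a short helper. This is just a sketch of the script format, not an ElevenLabs API call, and the tag names come from the examples in this section:

```python
import re

# An Eleven v3-style script with inline delivery tags in square brackets.
script = "[whispering] Did you hear that? [nervous] I think so. [laughs]"

def extract_tags(text: str) -> list[str]:
    """List the delivery tags embedded in a v3-style script."""
    return re.findall(r"\[([a-z ]+)\]", text)

print(extract_tags(script))  # ['whispering', 'nervous', 'laughs']
```

The tags ride along inside the text itself, so no separate control channel or settings panel is needed per line of dialogue.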


Fish Speech / Fish Audio S2 Pro

  • Fish Speech is a 0-shot multilingual TTS developed by Fish Audio. The current flagship model is S2 Pro, released March 9, 2026.

  • S2 Pro uses a Dual Autoregressive (Dual-AR) architecture: a 4B-param Slow AR handles linguistic and prosodic structure, while a 400M-param Fast AR handles fine acoustic detail. Trained on 10M+ hours of audio across 80+ languages.

  • Supports free-form inline emotion control using natural language tags anywhere in the text (e.g. [whisper in small voice], [excited], [laugh], [sigh]) — no fixed tag set to memorize.

  • Achieves sub-150ms time-to-first-audio and an RTF of 0.195 on H200. Natively supports multi-speaker and multi-turn generation in a single pass.
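
For context, RTF (real-time factor) is wall-clock generation time divided by the duration of audio produced; values below 1.0 mean faster-than-realtime synthesis. A quick sketch using the figure quoted above:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of audio produced."""
    return generation_seconds / audio_seconds

# At the reported RTF of 0.195, one minute of audio takes about 11.7 s to make.
print(round(0.195 * 60, 1))  # 11.7
# An RTF below 1.0 always means faster-than-realtime synthesis:
print(real_time_factor(11.7, 60.0) < 1.0)  # True
```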

  • On benchmarks, S2 Pro outperforms all evaluated models including closed-source systems from Google and OpenAI on Seed-TTS Eval.

  • It can be used either locally or on the cloud.


Higgs Audio

  • Higgs Audio V2 is an open-source audio foundation model developed by Boson AI, released in July 2025 under the Apache 2.0 license, making it freely usable for commercial projects.

  • Pretrained on over 10 million hours of audio (speech, music, and sound events in a single unified system). Built on top of Llama-3.2-3B with a custom DualFFN adapter and a 24kHz audio tokenizer running at just 25 frames per second.
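
To put the tokenizer numbers in perspective: 24kHz audio at 25 frames per second means each token frame stands in for a sizeable chunk of raw waveform. Simple arithmetic on the figures above:

```python
# Each tokenizer frame covers sample_rate / frame_rate raw samples.
sample_rate_hz = 24_000   # 24 kHz audio
frame_rate_hz = 25        # 25 frames per second

samples_per_frame = sample_rate_hz // frame_rate_hz
ms_per_frame = 1000 / frame_rate_hz
print(samples_per_frame, ms_per_frame)  # 960 40.0
```

So each discrete token summarizes 40ms (960 samples) of audio, which is what keeps sequence lengths manageable for the Llama backbone.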

  • Goes beyond traditional TTS with several emergent capabilities rarely seen in open-source models: zero-shot multi-speaker dialogues, automatic prosody adaptation based on narrative context, melodic humming with the cloned voice, and simultaneous speech & background music generation.

  • On EmergentTTS-Eval, it achieves 75.7% win rate over GPT-4o-mini-TTS on emotion expressiveness — the best result of any open-source model at time of release.

  • The model family also includes a lightweight V2.5 (1B params, GRPO-aligned), which outperforms V2 on speed and accuracy while being smaller.

  • Can be used locally or on the cloud.


VibeVoice

  • VibeVoice is a family of open-source frontier voice AI models developed by Microsoft Research, released under the MIT License.

  • A core innovation is its use of continuous speech tokenizers (Acoustic & Semantic) operating at an ultra-low frame rate of 7.5 Hz, feeding into a next-token diffusion framework for high-fidelity acoustic detail.

  • The family currently has three active models:

    • VibeVoice-TTS-1.5B — Long-form multi-speaker TTS. Synthesizes speech up to 90 minutes in a single pass, with up to 4 distinct speakers and natural turn-taking. Supports English and Chinese. Weights are on HuggingFace; community-maintained inference code is available.
    • VibeVoice-Realtime-0.5B — Lightweight real-time streaming TTS (~300ms first-audio latency). Includes 9 experimental multilingual voices (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 English style voices.
    • VibeVoice-ASR-7B — Long-form speech recognition (up to 60 minutes), generating structured transcriptions with speaker diarization and timestamps. Supports 50+ languages.
  • Can be used locally or on the cloud.


Chatterbox

  • Chatterbox is a production-grade open-source TTS family developed by Resemble AI, released under the MIT License.

  • Consistently outperforms ElevenLabs in blind side-by-side evaluations (63.75% preference rate in independent testing).

  • First open-source TTS model with emotion exaggeration control — dial expressiveness from monotone to dramatic with a single parameter.

  • The family includes three variants:

    • Chatterbox — High-quality English TTS with 0-shot voice cloning & emotion control (0.5B params)
    • Chatterbox Multilingual — Supports 23 languages with 0-shot voice cloning
    • Chatterbox Turbo — Fastest variant (350M params), sub-200ms latency, designed for real-time voice agents. Supports paralinguistic tags like [laugh], [cough], [chuckle] natively.
  • All outputs include an imperceptible neural watermark (Perth Watermarker) for responsible AI use.

  • It can be used locally or on the cloud.


Orpheus TTS

  • Orpheus is a Llama-3B based open-source TTS developed by Canopy AI, released in March 2025.

  • Designed for human-like speech — natural intonation, emotion and rhythm that rivals closed-source models.

  • Supports 0-shot voice cloning and guided emotion control via simple inline tags (e.g. <laugh>, <sigh>, <gasp>, <cough>).

  • Achieves ~200ms streaming latency for real-time applications, reducible to ~100ms with input streaming.

  • A family of multilingual models also exists, covering Chinese, Hindi, Korean, and Spanish.

  • It can be used locally or on the cloud.


IndexTTS

  • IndexTTS is an industrial-level zero-shot TTS developed by Bilibili (Index Team), built on top of XTTS/Tortoise with significant architectural improvements.

  • Uses a conformer-based speech conditioning encoder and BigVGAN2 decoder, giving it the lowest Word Error Rate of all evaluated models — ideal for audiobook production and content where accuracy is paramount.

  • The latest release is IndexTTS-2 (September 2025). It achieves state-of-the-art emotional fidelity, decouples timbre from emotion (voice identity and expression are controlled independently via style prompts or natural-language descriptions), and introduces the first precise duration control mechanism in an autoregressive TTS model: audio length can be specified to the millisecond for video dubbing use cases.

  • Can be used locally or on the cloud.


OuteTTS

  • OuteTTS is a novel TTS approach built on pure language modeling — no external adapters, encoders, or diffusion steps. Speech is generated directly from text and audio tokens using a standard LLM backbone (Qwen3).

  • Now on v1.0 (released September 2025), offering strong 0-shot voice cloning from a reference audio clip. Built-in default speaker profiles are currently English-only, but reference-based cloning works across languages including Chinese, Japanese, Korean, German, and French via zero-shot generalization from v0.3 onwards.

  • Particularly well-suited for local and edge deployment: supports GGUF, EXL2, Transformers, and vLLM backends out of the box, meaning you can run it with llama.cpp on modest hardware with no dedicated GPU required.

  • Available under the MIT License.


VoxCPM

  • VoxCPM is a tokenizer-free open-source TTS developed by OpenBMB (Tsinghua University / ModelBest), now on v1.5, released under the Apache 2.0 license.

  • Unlike mainstream TTS models that convert speech to discrete tokens, VoxCPM models speech directly in a continuous space via an end-to-end diffusion autoregressive architecture — eliminating the information loss of tokenization. Built on the MiniCPM-4 backbone (0.5B params), keeping it compact and efficient.

  • Trained on over 1.8 million hours of bilingual Chinese–English corpus, achieving state-of-the-art performance among open-source systems on multiple TTS benchmarks.

  • Two flagship capabilities:

    • Context-Aware Speech Generation — automatically infers and generates appropriate prosody, tone, and pacing from the meaning of the text itself, with no need for emotion tags.
    • 0-shot Voice Cloning — replicates timbre, speaking style, accent, and even background ambiance from just 3–10 seconds of reference audio, including cross-language cloning between Chinese and English.
  • Achieves an RTF of 0.17 on a consumer-grade NVIDIA RTX 4090, making real-time generation feasible. The v1.5 update halved the LM token rate (12.5Hz → 6.25Hz), reducing compute per second of audio and paving the way for longer-form generation.
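
The practical effect of halving the token rate: for a fixed LM context budget, the maximum clip duration doubles. A sketch with a hypothetical 4096-token budget (the budget is illustrative only, not VoxCPM's actual context size):

```python
def max_audio_seconds(context_tokens: int, token_rate_hz: float) -> float:
    """Longest clip that fits in a fixed LM context budget."""
    return context_tokens / token_rate_hz

budget = 4096  # hypothetical context budget, for illustration only
print(round(max_audio_seconds(budget, 12.5), 1))  # 327.7 s at the old rate
print(round(max_audio_seconds(budget, 6.25), 1))  # 655.4 s after v1.5
```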

  • Has a strong community ecosystem: ComfyUI nodes, ONNX export for CPU inference, and an Apple Neural Engine backend.

  • Supports both full fine-tuning and LoRA fine-tuning.

  • Currently supports Chinese and English, with multilingual support under active development.


Dia TTS

  • Dia generates high-quality, English-only dialogue directly from a transcript.

  • Now on Dia2 (released November 19, 2025), with improved voice consistency, zero-shot voice cloning via audio prompts, and native HuggingFace Transformers support.

  • Can create nonverbal sounds like laughter, coughing, clearing throat, etc., using simple inline tags like (laughs) and (sighs).

  • Requires ~8–10GB VRAM. Runs at ~40 tokens/second on an A4000.

  • It can be used either locally or on the cloud.



Edge TTS


  • This is Microsoft Edge's built-in TTS (the Read Aloud feature), which offers good quality, is multilingual & handles long sentences well.

  • It can only be used online: via their API, through the Edge browser itself, a HF/Colab space, or mixed with RVC.

    1. Download the Microsoft Edge browser.

    2. Open your Notepad & paste the following code:

    <!DOCTYPE html>
    <html>
    <body style="background-color:#dddddd">

    <!-- aria-hidden keeps the heading & button out of the accessibility
         tree, so Read Aloud should skip them and read only your text. -->
    <h3 aria-hidden="true">Browser TTS "Hack"</h3>

    <textarea rows="10" cols="50" id="ttsText" style="background-color:#eeeeee"></textarea>
    <br />
    <button aria-hidden="true" onclick="genText()"><font aria-hidden="true">Generate</font></button>

    <pre id="tts"></pre>

    <script>
    // Copy the textarea contents into the visible <pre> for Read Aloud.
    function genText() {
      var x = document.getElementById("ttsText").value;
      document.getElementById("tts").innerHTML = x;
    }
    </script>

    </body>
    </html>
    3. Save it as "Microsoft Edge TTS.txt".

    4. Rename it to "Microsoft Edge TTS.html".

    5. Open Microsoft Edge & drag the .html file into it.

    6. Use Audacity to record the audio. Set the recording mode to loopback to capture the internal audio (a Realtek driver might be needed).

    7. In the TTS page, input the text you want, click Generate, then start Read Aloud. Stop recording when the voice is done.

    8. You can then select Voice Options in the Read Aloud toolbar & change the speed to faster/slower speech.



XTTS2


  • Built on 🐢 Tortoise TTS & developed by Coqui AI, which has unfortunately shut down.

  • Has important model changes that make cross-language 0-shot voice cloning & multilingual speech generation super easy.

  • Needs very little training data: as little as 2 minutes of audio is enough for fine-tuning.

  • Full streaming support with 200ms time-to-first-chunk. Still a solid choice for fine-tuning workflows thanks to its mature training tooling — but newer models like XTTS community forks or Chatterbox are generally preferred for new projects.

  • Can be used either online or locally.


CosyVoice


  • Multilingual 0-shot TTS by Alibaba FunAudioLLM, now on its third generation (CosyVoice 3 / Fun-CosyVoice3).

  • Covers 9 major languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects & accents, and supports both multilingual and cross-lingual 0-shot voice cloning.

  • Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.

  • Supports streaming inference with ultra-low latency (~150ms).

  • Can be used locally or online.


Qwen3-TTS


  • Qwen3-TTS is a family of advanced multilingual TTS models developed by the Qwen team at Alibaba Cloud, released on January 22, 2026 under the Apache 2.0 license.

  • Trained on over 5 million hours of speech data across 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian), plus multiple Chinese dialectal voice profiles.

  • The family ships three purpose-built model variants (available in 0.6B and 1.7B sizes):

    • Base — 3-second voice cloning from a reference audio clip.
    • CustomVoice — Style control over 9 premium preset timbres (various gender, age, language, and dialect combos) via user instructions.
    • VoiceDesign — Create entirely new voices from a natural-language text description (e.g. "a warm, slightly husky young male voice speaking slowly").
  • Supports streaming generation with end-to-end synthesis latency as low as 97ms via its proprietary Qwen3-TTS-Tokenizer-12Hz — a 12.5 Hz, 16-layer multi-codebook codec with a lightweight causal ConvNet for fast waveform reconstruction.
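
For a sense of what the 12.5 Hz, 16-layer codec implies, assuming the usual multi-codebook layout of one code per codebook per frame (an assumption about the layout, not a spec detail from the release):

```python
# One code per codebook per frame: codes/sec = frame rate x codebook layers.
frame_rate_hz = 12.5
codebook_layers = 16

codes_per_second = frame_rate_hz * codebook_layers
print(codes_per_second)  # 200.0 discrete codes per second of audio
```

Low frame rates like this are what make the sub-100ms streaming latency plausible: the decoder has very few frames to produce before the first chunk of waveform can be reconstructed.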

  • Features strong contextual understanding: the model automatically adapts tone, speaking rate, and emotional expression based on the semantics of the input text.

  • Can be used locally or on the cloud.


Kokoro-TTS


  • Lightweight yet high-quality TTS model with just 82 million parameters — one of the most downloaded TTS models on HuggingFace.

  • Faster-than-realtime inference due to its StyleTTS2/ISTFTNet architecture (no encoders or diffusion steps). Processes text in under 0.3 seconds.

  • Only has premade voices — no voice cloning. Not the best emotion control.

  • Voice support for American & British English, French, Italian, Japanese and Chinese (with some voice bleeding between languages).

  • Fully open-source under the Apache 2.0 license, making it great for commercial use.


Piper


  • Fast TTS with great multilingual support — works for almost all languages.

  • Decent quality; primarily intended for edge / local deployment (runs on a Raspberry Pi with minimal latency).

  • Not recommended as a primary choice for quality-focused generation — better suited for offline assistants, embedded systems, and pipelines where speed matters more than expressiveness.


GLM-TTS


  • GLM-TTS is an industrial-grade open-source TTS system developed by Zhipu AI (ZAI) — the team behind the GLM language model family — released December 11, 2025.

  • Uses a two-stage LLM + Flow Matching architecture. Its standout feature is a multi-reward reinforcement learning (GRPO) framework that trains the model across four simultaneous reward signals (speaker similarity, character error rate, emotion accuracy, and laughter naturalness), resulting in more expressive and emotionally consistent speech than most open-source alternatives.

  • On benchmarks, GLM-TTS_RL achieves the lowest Character Error Rate (0.89) of any open-source TTS while maintaining high speaker similarity — competitive with commercial systems like MiniMax.

  • Key features:

    • 0-shot voice cloning from just 3–10 seconds of reference audio
    • Phoneme-level pronunciation control via "Hybrid Phoneme + Text" input — critical for polyphone disambiguation in Chinese (e.g., the character "行" can be xíng or háng) and for professional audiobook/dubbing use cases
    • Streaming inference with ~400ms first-frame latency on GPU
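
A toy illustration of why polyphone disambiguation matters: one character, several valid readings, and plain text alone cannot tell a TTS frontend which to use. The reading table below is a tiny hand-made sample, not GLM-TTS data or its input format:

```python
# Minimal polyphone table: characters mapped to their possible pinyin readings.
READINGS = {
    "行": ["xíng", "háng"],  # "to walk / OK" vs "row / profession"
    "走": ["zǒu"],           # unambiguous: "to walk"
}

def is_polyphone(char: str) -> bool:
    """True if the character has more than one reading on record."""
    return len(READINGS.get(char, [])) > 1

print(is_polyphone("行"))  # True  -> needs an explicit phoneme hint
print(is_polyphone("走"))  # False -> plain text is enough
```

Hybrid phoneme + text input lets the author pin the intended reading for exactly these ambiguous characters while leaving the rest of the script as plain text.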
  • Primarily supports Chinese, with decent bilingual Chinese-English mixed-text support. Not a multilingual model.

  • Available under Apache 2.0 (code) and MIT (weights) — both permit commercial use.

  • Can be used locally or on the cloud.


GPT-SoVITS


  • GPT-SoVITS is a few-shot voice cloning & TTS system — just 1 minute of audio is enough to fine-tune a voice, and zero-shot works from a 5-second reference clip.

  • Very good with Chinese, and supports English, Japanese, Korean and Cantonese cross-language synthesis, though some noise artifacts remain in non-Chinese outputs.

  • Now on v4, which fixes the metallic artifacts of v3 caused by non-integer upsampling and natively outputs 48kHz audio. The new v2Pro/v2ProPlus variants offer v4-level quality at v2's hardware cost and speed — recommended for most users.

  • Can be used both locally & online.
