GPT-SoVITS

Last update: Mar 8, 2024


Introduction

See original guide

  • GPT-SoVITS is an open-source repository focused on TTS & cross-language inference, with a Colab port coming soon. Credits to RVC-Boss.

  • Currently it only supports Chinese, English & Japanese. More languages are coming soon.

  • You'll need a capable PC with an NVIDIA GPU that has at least 6 GB of VRAM to run it smoothly. Otherwise, use the Colab.

  • This guide is a translation of the original one with a few tweaks, made by Delik. [ Discord: @delik - Wechat: Dellikk ]
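
To check whether a local machine meets the VRAM requirement above, something like the following sketch may help. It assumes the nvidia-smi tool that ships with NVIDIA drivers is on the PATH; the function names and the 6 GB threshold constant are illustrative only and are not part of GPT-SoVITS.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 6 * 1024  # the guide's 6 GB minimum, expressed in MiB


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    values = []
    for line in smi_output.splitlines():
        token = line.strip()
        if token.isdigit():  # skip blanks and non-numeric entries such as "[N/A]"
            values.append(int(token))
    return values


def query_vram_mib() -> list[int]:
    """Return total VRAM per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_vram_mib(result.stdout)


if __name__ == "__main__":
    gpus = query_vram_mib()
    if not gpus:
        print("No NVIDIA GPU detected; consider the Colab instead.")
    for index, mib in enumerate(gpus):
        verdict = "meets" if mib >= MIN_VRAM_MIB else "is below"
        print(f"GPU {index}: {mib} MiB, {verdict} the 6 GB minimum")
```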


Installation

1. Download prezip

  • Download the prezip of the latest version here.

2. Extract

  • Unzip the folder, preferably with 7-Zip.

3. Launch

  • Open the folder & run go-web.bat to open the WebUI.

Training

1. Prepare dataset

  • The dataset should be between 1 and 30 minutes long, but prioritize quality over quantity.

  • For the best results, ensure the audio is properly cleaned, free of undesired noises & distortions.

  • GPT-SoVITS is made for TTS only, so it's also best to remove any singing or muffled voice parts.

  • Learn how to clean datasets.
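
As a rough way to confirm that a dataset falls inside the 1 to 30 minute range mentioned above, the total length can be summed with Python's standard library. This is only a sketch: it assumes the clips are uncompressed PCM .wav files sitting directly in one folder, and the function names are made up for illustration.

```python
import wave
from pathlib import Path


def total_minutes(folder: str) -> float:
    """Sum the duration of every .wav file directly inside `folder`, in minutes."""
    seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as clip:
            seconds += clip.getnframes() / clip.getframerate()
    return seconds / 60.0


def within_recommended_range(minutes: float) -> bool:
    """True when the total length is inside the guide's 1 to 30 minute range."""
    return 1.0 <= minutes <= 30.0
```

For example, calling within_recommended_range(total_minutes("my_dataset")) on a hypothetical my_dataset folder tells you whether to trim or record more audio. It says nothing about quality, which still has to be judged by ear.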


2. Audio Slicer

  1. Copy the file path of your dataset & paste it into the Audio slicer input bar.

  2. Create a new folder somewhere. Copy its folder path & paste it into Audio slicer output. This is where the outputs will be stored.

  3. Adjust the parameters if needed.

  4. Finally, click Start Audio Slicer to complete this step.


3. ASR

  1. The Input folder path should be the same as the Audio slicer output. Just copy that path & paste it into the bar.

  2. If the dataset is in English/Japanese, use Faster-Whisper large v3.

    If it's in Chinese, use 达摩ASR.

  3. Then click Start batch ASR.

    If this is your first time running GPT-SoVITS, you may need to wait a few minutes while it downloads the ASR model you selected.

  4. Finally, locate the .list file & copy its path. By default it is in output/asr_opt, unless you changed the Output folder path.
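
Before moving on, it can be useful to scan the .list file for clips that came back with an empty transcript. The sketch below assumes each line follows a `audio_path|speaker|language|text` layout; verify this against your own file, since the layout is an assumption here and not something stated in this guide. The class and function names are hypothetical.

```python
from typing import NamedTuple


class ListEntry(NamedTuple):
    audio_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line: str) -> ListEntry:
    """Split one line of the assumed `path|speaker|language|text` layout."""
    # maxsplit=3 keeps any further "|" characters inside the transcript itself.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    return ListEntry(*parts)


def find_empty_transcripts(lines: list[str]) -> list[str]:
    """Return audio paths whose transcript is blank, as candidates for manual review."""
    entries = [parse_list_line(line) for line in lines if line.strip()]
    return [entry.audio_path for entry in entries if not entry.text.strip()]
```

Reading the file with open(path, encoding="utf-8").readlines() and passing the result to find_empty_transcripts gives a short list of clips worth checking by hand.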


4. Text Labelling (optional)

  1. Paste the .list file path into .list annotation file path.

  2. Tick Open labelling WebUI to open Text Labelling WebUI. A new tab will open.

  • Listen to each clip & edit the text if it's not transcribed properly.

  • The functions are self-explanatory. Use next index & previous index to check the next/previous page.

  • If you make changes, remember to save file & submit text.


5. Formatting

  1. Click 1-GPT-SOVITS-TTS & 1A-Dataset formatting to enter the training page.

  2. Enter the name of your model in Experiment/model name, & the .list file path in Text labelling file.

  3. Scroll down to the end & start One-click formatting to begin formatting.


6. SoVITS Training

  1. Scroll up, then click 1B-Fine-tuned training & use these settings:

    • Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
    • Total epochs: 8
    • Text model learning rate weighting: <=0.4
    • Save frequency: 4

  2. After that, click Start SoVITS training.

7. GPT Training

  • Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
  • Total epochs: 10
  • Save frequency: 5
  • DPO training: disabled (explained later)

After that, click Start GPT training.


DPO training (optional)

  • DPO training greatly improves the performance (not audio quality) & stability of the model.

  • It can infer more text at once without slicing & it's less prone to errors (like repeating/skipping words) when inferring.

  • For this, you'll require:
    • A GPU with 12 GB of VRAM or more.
    • A very high-quality dataset (you need to do text labelling).
    • A batch size of 1. Keep the other settings the same as above.

  • If these requirements aren't met, enabling DPO will worsen the model.
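
The recommendations for SoVITS, GPT & DPO training can be summed up in a small helper. This is just a sketch of the guide's rules of thumb: the TrainingSettings record and suggest_settings function are hypothetical and are not part of GPT-SoVITS, and the values still have to be entered in the WebUI by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSettings:
    sovits_batch_size: int
    sovits_total_epochs: int
    sovits_save_frequency: int
    text_lr_weighting_max: float  # the guide suggests staying at or below this
    gpt_batch_size: int
    gpt_total_epochs: int
    gpt_save_frequency: int
    dpo_enabled: bool


def suggest_settings(vram_gb: float, want_dpo: bool = False,
                     labelled_dataset: bool = False) -> TrainingSettings:
    """Map the guide's recommendations onto a settings record (hypothetical helper)."""
    # Batch size 1 on 6 GB cards, 2 otherwise.
    base_batch = 1 if vram_gb <= 6 else 2
    # DPO only with 12 GB or more and a proofread, high-quality dataset.
    dpo = want_dpo and labelled_dataset and vram_gb >= 12
    return TrainingSettings(
        sovits_batch_size=base_batch,
        sovits_total_epochs=8,
        sovits_save_frequency=4,
        text_lr_weighting_max=0.4,
        gpt_batch_size=1 if dpo else base_batch,  # DPO calls for a batch size of 1
        gpt_total_epochs=10,
        gpt_save_frequency=5,
        dpo_enabled=dpo,
    )
```

For instance, suggest_settings(8, want_dpo=True, labelled_dataset=True) leaves DPO off because 8 GB is below the 12 GB requirement.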

Inference

  1. Go to the 1C-inference tab.

  2. Press refreshing model paths, then select your models from the respective dropdowns.

  3. Tick Open TTS inference WEBUI.

  4. Upload a clip for reference audio (must be 3-10 seconds). Then fill in the Text for reference audio with what the speaker says in the clip. Choose the language on the right.

    • The reference audio is very important as it determines the speed & emotion of the output. Try different ones to polish your output.

    • You can reopen the text proofreading tool to download the reference audio, and copy & paste the text for reference audio.

    • Hover above the "duration" to adjust the length of the reference audio, & hover above "it" to delete the current reference audio.

    • A no-reference-text mode exists, but using it isn't advisable, as it noticeably lowers the output quality.

  5. Fill in the Inference text & set the Inference language, then click Start inference.

    • If the text is too long, choose one of the options in How to slice the sentence.

    • If you did not get your desired output, you can infer it again or change reference audio and/or adjust GPT parameters.
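
Since the reference audio must be 3-10 seconds long & trying several clips is encouraged, a short script can list the clips in a folder that fit that window. This is a sketch that assumes uncompressed PCM .wav files, for example the Audio slicer output; the function names are illustrative.

```python
import wave
from pathlib import Path


def clip_seconds(path: Path) -> float:
    """Length of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def candidate_reference_clips(folder: str, low: float = 3.0,
                              high: float = 10.0) -> list[str]:
    """Names of .wav files in `folder` whose length is within [low, high] seconds."""
    return [
        path.name
        for path in sorted(Path(folder).glob("*.wav"))
        if low <= clip_seconds(path) <= high
    ]
```

Length is only a first filter. The clip's speed & emotion still carry over to the output, so listen to each candidate before picking one.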

