# Dataset & Isolation

Last update: Dec 24, 2024


# Introduction

  • This guide explains how to properly prepare a dataset for training an RVC model.

  • In the field of AI, a dataset is the collection of data used to train a model. It contains examples of the inputs the model is expected to handle, along with the correct outputs.

  • In the context of RVC, it's the audio given to RVC, containing the voice the model is going to replicate. It can be a speaking or singing voice, drums, sound effects, or noise.

  • The quality, variety & length of the dataset are the biggest determining factors for the final quality of the model. Let's explain Length and Variety.


# Length & Variety

  • For beginners we recommend a dataset of about 15 minutes of pure audio, not counting silence. If you want a more natural-sounding model, go for 40+ minutes. Just remember: quality over quantity.

  • Variety in your dataset is also important because without it RVC lacks the ability to generate diverse audio.

  • Some things to increase the generalization abilities of RVC and increase the diversity in your dataset include:

    • Removing repeated words. (You can go to the extreme and remove every single repeated word, but generally there is no need.)
    • Include speech in many ranges and pitches.
    • Longer datasets.
    • Expressive speech.
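
Since the recommended length does not count silence, it can help to measure how much of a recording is actually voiced. Below is a minimal sketch of one way to estimate that. The function name, the -40 dB threshold, and the 20 ms frame size are illustrative assumptions, not values used by RVC.

```python
import math

def voiced_seconds(samples, sample_rate, threshold_db=-40.0, frame_ms=20):
    """Estimate how many seconds of a mono signal are louder than threshold_db.

    samples: sequence of floats in [-1.0, 1.0].
    threshold_db and frame_ms are illustrative defaults (assumptions).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10 ** (threshold_db / 20)  # dB relative to full scale -> linear amplitude
    voiced_samples = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square level of this frame
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms > threshold:
            voiced_samples += len(frame)
    return voiced_samples / sample_rate
```

Dividing the result by 60 gives the minutes of usable audio, which you can compare against the 15 or 40+ minute targets above.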

# Quality

  • A quality dataset is super important for RVC since without one RVC will struggle to make anything good or believable.

# Here are some recommendations for a quality dataset.

#

# Clean vocals.

  • Ensure there isn't much background noise, reverb, overlapping voices, music, distortion, or small silences. Some quiet natural background noise is fine and won't ruin your model since the original pretrains for RVC were made with a noisy dataset, so RVC knows how to deal with noise. You'll learn more on cleaning vocals in the Vocal Isolation & Cleaning section below.

# Audio quality.

  • The higher the audio quality, the better. If possible have it in a lossless format like WAV or FLAC, not a lossy one like MP3. And no, converting an MP3 to FLAC or WAV won't undo the compression.

# No harsh sibilance/popping.

  • Additionally, don't include harsh sibilance (loud "S" & "SH" sounds) or popping (loud "P" sounds).
    • Robotic sibilants come from a dataset that is too short or from an overfitted model. You can fix this by making your dataset larger or by choosing an epoch where the sibilants aren't overfitted.
    • Harsh sibilants come from harsh sibilants in the dataset itself. You can fix this by de-essing or making your dataset larger.

# No Audio Damage.

  • This is the most important part of a clean dataset. If your audio is damaged, RVC will struggle with it and the model will sound worse overall, because RVC will create synthetic data and try to learn from it. Make sure your audio isn't damaged.

# Artifacts

In RVC, artifacting refers to an anomaly where the output voice sounds "robotic" & glitchy.
This occurs after the inference or model training process.

# Causes

It usually occurs when the dataset/vocal sample meets any of these criteria:

• Audio is low-quality
• Voice model was overfitted, undertrained or overtrained
• There are overlapping voices
• There is reverb
• There is noise

As you may have noticed, most of these issues boil down to the audio sample not being clean enough. RVC is built purely for working with voices, not other sounds.

Remember that the cleaner your input audio is, the better the results.

# Solutions

# 1. Use a lossless format:

  • If possible, it's best if your audio is in a lossless format like WAV or FLAC, preserving its original quality.

  • Avoid using lossy ones like MP3 or OGG.

# 2. If doing inference:

# 3. If training models:


# Vocal Isolation & Cleaning

  • A vocal isolation app is software designed to extract a person's vocals from an audio file, usually through the use of AI models.

  • They can remove undesired noises, like background noise, reverb, echo, music, etc.

  • The goal is to get an audio sample with clean and natural vocals, which is what RVC needs to give the most accurate & quality results.

  • For RVC users, the best app is Ultimate Vocal Remover 5 (or UVR). It can be used either locally or through the cloud.

  • If you want to remove noise manually to avoid AI artifacts, you can use RX 11, which is mentioned in this guide.

# Local UVR

# Installation


  1. Go to their official website & press Download UVR. If you want to use BS / Mel Roformer you are going to need to install this.

    image

#
  2. It will redirect you to their GitHub page. Click the download link for your operating system.
    UVR is available on both Windows & Mac.

  3. Once the installer finishes downloading, execute it & follow the instructions.
    Make sure to tick 🗹 Create a desktop shortcut for easier access to UVR.

    image

#
#

# How to Use


#

# 1. Input audio.

#
  • Click Select input to select your audio/s. Or just drag the files to it.

  • In Select output you can define the folder for the results.

    image


# 2. Select FLAC & GPU Conversion.

#
  1. At the right you can select the output format.
    We recommend picking FLAC. Learn why here.

  2. If your GPU is compatible with CUDA, toggle GPU Conversion on for a faster process.

    image

#

This step is not mandatory, but recommended for better results.


# 3. Extract vocals.

#
  1. In CHOOSE PROCESS METHOD select MDX-Net, and select either the BS Roformer-Viper-X 1296 or MDX23C model.

    image

  2. Now click the long Start Processing button.
#

# 4. Clean vocals.

#
  • Usually songs include reverb & backing vocals. These negatively impact the results in RVC.

  • So if the output has any undesired noises, follow the procedure on Cleaning Vocals.

# Cleaning Vocals

# 1. Input audio.

#
  • Click Select input to input the vocals. Or just drag the files to it.

  • In Select output you can define the folder for the results.

    image


# 2. Select FLAC & GPU Conversion.

#
  1. At the right you can select the output format.
    We recommend picking FLAC. Learn why here.
  2. If your GPU is compatible with CUDA, toggle GPU Conversion on for a faster process.

    image
#

This step is not mandatory, but recommended for better results.


# 3. Select model.

#
  1. In Process Method select VR.
  2. Set Window Size to 320. (optional)
    A lower Window Size yields higher output quality, but takes longer to process.
  3. Check the model list. In Select VR Model pick the one according to what you need to remove.

    If you need to remove multiple noises, follow this pipeline for the best results:
    Remove instrumental -> Remove reverb -> Extract main vocals -> Remove noise
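
The order matters more than the exact tool, so it can be handy to write the pipeline down once and reuse it. The sketch below only encodes the order given above; the function name and the idea of passing the steps as plain strings are assumptions for illustration, not part of UVR.

```python
# Order taken from the pipeline above.
PIPELINE_ORDER = (
    "Remove instrumental",
    "Remove reverb",
    "Extract main vocals",
    "Remove noise",
)

def plan_cleaning(needed):
    """Return the requested cleaning steps sorted into the recommended order."""
    unknown = set(needed) - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in PIPELINE_ORDER if step in needed]
```

For example, a clip that only needs de-noising and de-reverb would give `plan_cleaning(["Remove noise", "Remove reverb"])`, which returns the reverb step first.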
#

# 4. Process.

#

Click the Start processing button at the bottom. And that will be all.


#

# The best models for UVR are:

| Extraction | Process Method | Model |
| --- | --- | --- |
| Vocals | MDX-Net | Gabox's voc_fv4 |
| De-Reverb | VR | Anvuew mel dereverb v2 |
| Extract Backing Vocals | VR | Mel roformer karaoke |
| De-Noise | VR | Mel denoiser |

# Troubleshooting


#
  • Click the wrench (🔧) on the left & go to Download Center
  • Select the category of the model (MDX-NET or VR)
  • Unfold its dropdown & select the model that you need
  • Then click the download button (📥). The model will download, which will take a few minutes
#
  • Modify the Aggression Setting value on the right.
  • This determines the depth of the extraction. Only the VR method has it.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.
#

# RX 11

# Installation


  1. Go to their official website & buy it or sail the seven seas and find a treasure box which contains RX 11.

#

# Usage

  1. To use RX 11 it is STRONGLY recommended that you read this guide on RX 11.

# Cloud UVR

# How to Use


#

# 1. Set up Colab

#
  1. First access the Colab space here.
  2. Then log in to your Google account.
  3. Execute the Gdrive Connection cell by pressing the play button. Grant all the permissions.

    image
  • It'll finish once the logs say Mounted at /content/drive
  4. Then run the Install cell.

    image
  • Once it's done it will look like this:

    image

# 2. Set up folders

#
  • In Google Drive, make two folders, named input & output.

    image
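
If you prefer to create the folders from a notebook cell instead of the Drive web page, a small sketch like this could do it. It assumes Drive is already mounted and that your Drive root appears at `/content/drive/MyDrive`, which is the usual Colab layout but is not stated in this guide. The function name is hypothetical.

```python
from pathlib import Path

def make_io_folders(drive_root):
    """Create the 'input' and 'output' folders under drive_root if they are missing."""
    root = Path(drive_root)
    created = []
    for name in ("input", "output"):
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(folder)
    return created

# Assumed Colab path; adjust it if your Drive is mounted elsewhere:
# make_io_folders("/content/drive/MyDrive")
```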


# 3. Separate

#
  1. Select your model of choice and run the Separation cell. You can look here for the best models.

    image

  2. Download the result located in the output folder.

| Extraction | Model |
| --- | --- |
| Vocals | Gabox's voc_fv4 |
| De-Reverb | Anvuew mel dereverb v2 |
| Extract Backing Vocals | Mel roformer karaoke |
| De-Noise | Mel denoiser |

# Troubleshooting


# Cannot connect to GPU backend.

# UVR Zero GPU


# How to use


#

Access the space here. You don't need an account to use it.

# 2. Select vocals & options

#
  1. Tap the Input Audio box & select your audio, or simply drag & drop.

    image

  2. Once it's done uploading, in CHOOSE PROCESS METHOD, select BS/Mel Roformer. Under that you can change Segment Size and Overlap, but the defaults are fine.

    image

# 3. Select model

#
  1. Check the model list & in CHOOSE MODEL pick the one according to what you need to remove.

    If you need to remove multiple noises, follow this pipeline for the best results:
    Remove instrumental -> Remove reverb -> Extract main vocals -> Remove noise

# 4. Start Processing

#
  1. Click Separate! below. Wait a moment for the audio to process.
  2. Playable audios will then appear in the output boxes below. To download the output, click the little download icon in the top right.
  • If you're extracting lead vocals, remember to download the backing ones if you wish to keep them.

# Troubleshooting


#
  • Modify the Aggression Setting value on the right.
  • This determines the depth of the extraction. Only the VR method has it.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.
  • Run the audio through BVE. Modify the Aggression Setting if necessary.
#
  • Report your issue here.

# GPU task aborted:

ZeroGPU HuggingFace Spaces have a maximum inference duration. This is the time it takes to run the model, not the length of your audio file. By default it's around 1 minute, which is what Ilaria RVC uses. Retry with a shorter audio, or split your audio into parts.
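
One way to split a long file is sketched below using Python's standard `wave` module. The function name and the 45 second default are assumptions: the limit applies to processing time, not audio length, so you may need to experiment with the chunk size. Note that `wave` only reads PCM WAV files, so FLAC or 32-bit float WAV would need another tool.

```python
import wave

def split_wav(src_path, out_prefix, chunk_seconds=45):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds each."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width and rate
                dst.writeframes(frames)  # frame count in the header is corrected on close
            paths.append(out_path)
            index += 1
    return paths
```

Process each part separately and join the outputs afterwards in your audio editor.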


# You have exceeded your GPU quota ( NUMBER s left vs. 60s requested). Sign-up on Hugging Face to get more quotas or retry in Hour:Minutes:Seconds

ZeroGPU HuggingFace Spaces have a quota per account. If you aren't signed in you get less quota, so it's better to log in. You may see the 'Sign-up' message even if you are logged in. The ZeroGPU quota can't be viewed, but it isn't unlimited. You can either:

  • Log in so you get more quota
  • Wait
  • Pay for a HuggingFace PRO membership to get 5x more quota
# MVSEP

# Important Notes


  • MVSEP is a website for isolating vocals that works similarly to UVR.

  • The UVR Colab is much faster & more convenient for this task. Use MVSEP if you run out of GPU runtime or don't feel like converting your audio to WAV.

  • Free users can't convert audios in batches or audios longer than 10 minutes. If that's your case, trim the audio into several pieces.

  • There is a queue so make sure you make an account to skip most of it.


#
#

# How to Use


#

# 1. Log in.

#
  1. First, make an account here.
  2. Once logged in, go to the main page.

# 2. Select audio.

#
  1. Click Browse File & select your audio, or simply drag & drop. The audio will begin to upload.

    image


# 3. Extract vocals.

#
  1. In Separation type select BS Roformer

  2. In Output encoding select FLAC.
    We recommend selecting FLAC from now on. Learn more here.

  3. Once the audio is done uploading, click Separate

    image


# 4. Download output.

#
  • When it's done converting it will redirect you to a page where you can listen to the results.
  1. Tap the three dots of the Vocals audio and then Download.

  2. Same thing for the Instrumental, if you wish to keep it.

    image


# 5. Clean vocals

#
  • Usually songs include reverb & backing vocals. These negatively impact the results in RVC.
  • So if the output has any undesired noises, follow the procedure on Cleaning Vocals.
# Cleaning Vocals

# 1. Log in.

#
  1. First, make an account here.
  2. Once logged in, go to the main page.

# 2. Select audio & output format.

#
  1. Click Browse File & select your audio, or simply drag & drop. The audio will begin to upload.

    image

  2. In Output encoding select FLAC.
    We recommend selecting FLAC from now on. Learn more here.

    image


# 3. Select model.

#
  1. In Separation Type, select DeNoise by aufr33.
  2. Check the model list. Pick the one according to what you need to remove.

    If you need to remove multiple noises, follow this pipeline for the best results:
    Remove instrumental -> Remove reverb -> Extract main vocals -> Remove noise

# 4. Download output.

#
  1. Click Separate. When it's done converting it will redirect you to a page where you can listen to the results.

  2. Tap the three dots of the audio you need and then Download.
    If you wish to keep the backing vocals stem, remember to download it too.

    image


# The best models for MVSEP are:

| Extraction | Separation Type | Model |
| --- | --- | --- |
| Vocals | MelBand Roformer | unwa big beta v5e |
| Instrumental | MelBand Roformer | unwa instrumental v1e |
| De-Reverb | Reverb Removal | Reverb removal by anvuew V2 (MelRoformer) |
| Extract Backing Vocals | MelBand Karaoke | Extract from vocals |
| De-Noise | DeNoise by aufr33 | Aggressive |

# Troubleshooting


#
  • Using the Separation Type of DeNoise by aufr33, you can modify the Aggressiveness. This determines the depth of the extraction.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.
#
  • Try running the audio through MelBand Karaoke or BVE. Modify the Aggression Setting if necessary.

# X-Minus

# How to Use


#

# 1. Choose a Separator

#
  1. First go to X-minus's website and click "Vocal Remover" at the top right.
image
  2. Then select "Music and vocals" and choose "BS Roformer".
image

# 2. Upload Your Audio File

  1. Then click "select a file" and choose an audio file, or drag and drop one. When it's done it will look like this:
image
  2. You can now click "Vocals" to download the vocals and "Other" to download the instrumentals.
#

# 1. Choose a De-Noiser

#
  1. In "De-Noise" select "Mel-Roformer De-Noise". You can also check the model list to see what is the best model for your needs.
image

# 2. Upload Your Audio File

  1. Then click "select a file" and choose an audio file, or drag and drop one. When it's done it will look like this:
image
  2. You can now click "Vocals" to download the vocals and "Other" to download the instrumentals.
# The best models for X-Minus are:

| Extraction | Model |
| --- | --- |
| Vocals/Instrumental | BS Roformer |
| De-Reverb | MDX23C (De-Reverb) |
| Extract Backing Vocals | UVR BVE 2 |
| De-Noise | Mel Roformer De-Noise |

# Preparing the dataset

# Step 1: Find the Sample Rate

  • The sample rate is the number of samples (data points) stored per second of audio. It is measured in kilohertz (kHz).

  • The most common sample rates are 32, 40, 44.1, & 48 kHz. The higher the sample rate, the more information it stores, therefore the higher the quality.

  • While training in RVC, you'll have to set the target sample rate to match your dataset's. This value affects the final quality.

  • # A simple way to determine it is with Spek:
    • # STEP 1:

      Download and install Spek here.

    • # STEP 2:

      Open Spek and just drag & drop your audio into it.

      image
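
If you'd rather check from a script, the sample rate is stored in the header of every WAV file. The sketch below reads it directly with Python's `struct` module, so it should work for both PCM and 32-bit float WAV files. The function name and the returned dictionary keys are assumptions for illustration, and it does not handle FLAC or MP3.

```python
import struct

def read_wav_format(path):
    """Read the format tag, channels, sample rate and bit depth from a WAV header."""
    with open(path, "rb") as f:
        start = f.read(12)
        if len(start) < 12:
            raise ValueError("file too short to be a WAV")
        riff, _size, wave_id = struct.unpack("<4sI4s", start)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt = f.read(16)
                if len(fmt) < 16:
                    raise ValueError("truncated 'fmt ' chunk")
                tag, channels, rate, _byte_rate, _align, bits = struct.unpack("<HHIIHH", fmt)
                return {
                    "format_tag": tag,        # 1 = PCM, 3 = IEEE float
                    "channels": channels,
                    "sample_rate": rate,      # in Hz
                    "bits_per_sample": bits,
                }
            # Skip other chunks; chunk data is padded to an even number of bytes
            f.seek(chunk_size + (chunk_size & 1), 1)
```

Divide `sample_rate` by 1000 to get the value in kHz, for example 40000 Hz is 40 kHz.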


# Step 2: Truncating Silence

  • In Audacity, go to Effects -> Special -> Truncate Silence
  • Use the following values:

    image
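
To give an idea of what truncating silence does, here is a simplified sketch that shortens every long quiet stretch to a fixed maximum length. It is not how the editor's effect is implemented, and the threshold and maximum silence length are illustrative assumptions, not the values from the screenshot.

```python
def truncate_silence(samples, sample_rate, threshold=0.01, max_silence_s=0.5):
    """Shorten every run of quiet samples to at most max_silence_s seconds.

    samples: sequence of floats in [-1.0, 1.0] (mono).
    threshold and max_silence_s are illustrative assumptions.
    """
    max_run = int(sample_rate * max_silence_s)
    out = []
    run = 0  # length of the current quiet run
    for x in samples:
        if abs(x) < threshold:
            run += 1
            if run > max_run:
                continue  # drop quiet samples beyond the allowed length
        else:
            run = 0
        out.append(x)
    return out
```

Short pauses between words are kept, so the speech still sounds natural, while long gaps stop inflating the dataset length.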

# Step 3: Audio Normalization. (Optional)

  • Go to Effects -> Volume and Compression -> Loudness Normalization
  • Use these values:

    image

LUFS is used instead of dB because LUFS reflects perceived loudness, which HiFi-GAN needs and plain dB doesn't offer.


# Step 4: Export

  • In the upper left corner go to File and click Export Audio.

    image

    • And finally, enter these values:

      • Format: WAV
      • Encoding: 32-Bit Float

      image

# You have reached the end.
