# Dataset & Isolation

Last update: November 16, 2025

# Introduction

This guide explains how to properly prepare a dataset for making an RVC model.
In the field of AI, a dataset is the collection of data used to train a model. It contains examples of the inputs the model is expected to handle, along with the correct outputs.
In the context of RVC, it's an audio file given to RVC that contains the voice the model is going to replicate. It can be a speaking or singing voice, drums, sound effects or noise.
The quality, variety & length of the dataset are the biggest determining factors for the final quality of the model. Let's start with length and variety.

# Length & Variety

For beginners, we recommend a dataset of about 15 minutes of pure data, not counting silence. If you want a more natural-sounding model, go for 40+ minutes. Just remember: quality over quantity.
Variety in your dataset is also important, because without it RVC lacks the ability to generate diverse audio.
Some ways to improve RVC's ability to generalize and to increase the diversity of your dataset:
- Remove repeated words. (You can go to the extreme and remove every single repeated word, but generally there is no need to.)
- Include speech in many ranges and pitches.
- Longer datasets.
- Expressive speech.
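
To check how much "pure data" a file really contains, a rough measurement like the one below can help. It is only a minimal sketch, not part of RVC: it assumes a 16-bit PCM WAV file, and the function name `non_silent_seconds`, the -40 dBFS threshold and the 50 ms window are illustrative choices rather than values from this guide.

```python
import math
import struct
import wave


def non_silent_seconds(path, threshold_db=-40.0, window_ms=50):
    """Return the seconds of audio whose window RMS is above threshold_db (dBFS).

    Assumes a 16-bit PCM WAV file. threshold_db and window_ms are
    illustrative defaults, not values taken from RVC.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames_per_window = max(1, rate * window_ms // 1000)
        # Convert the dBFS threshold to a linear amplitude for 16-bit samples.
        threshold = 32768 * 10 ** (threshold_db / 20)
        voiced_frames = 0
        while True:
            raw = wav.readframes(frames_per_window)
            if not raw:
                break
            samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
            rms = math.sqrt(sum(s * s for s in samples) / len(samples))
            if rms > threshold:
                # Count frames (not samples) so stereo files are not doubled.
                voiced_frames += len(samples) // channels
    return voiced_frames / rate
```

Dividing the result by 60 gives the minutes to compare against the 15 and 40+ minute targets above. Quiet passages such as breaths may need a different threshold, so treat the number as an estimate.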

# Quality

A quality dataset is super important for RVC since without one RVC will struggle to make anything good or believable.

Here are some recommendations for a quality dataset:

# Clean vocals.

Ensure there isn't much background noise, reverb, overlapping voices, music, distortion, or small silences. Some quiet, natural background noise is fine and won't ruin your model: the original RVC pretrains were made with a noisy dataset, so RVC knows how to deal with noise. You'll learn more about cleaning vocals in the Vocal Isolation & Cleaning section below.

# Audio quality.

The higher the audio quality, the better. If possible, have it in a lossless format like WAV or FLAC, not a lossy one like MP3. And no, converting an MP3 to FLAC or WAV won't undo the compression.
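
One way to audit a dataset folder against this advice is to group its files by extension. The sketch below is an assumption-laden helper, not an RVC feature: `classify_audio_files` is a hypothetical name, and the extension sets only cover the formats named in this guide.

```python
from pathlib import Path

# Extensions named in this guide (assumed list, extend as needed).
LOSSLESS = {".wav", ".flac"}
LOSSY = {".mp3", ".ogg"}


def classify_audio_files(folder):
    """Group the files in `folder` by whether their extension is lossless or lossy.

    This only looks at the extension: a WAV that was converted from an MP3
    is still reported as lossless, even though its quality is not.
    """
    report = {"lossless": [], "lossy": [], "other": []}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in LOSSLESS:
            report["lossless"].append(path.name)
        elif ext in LOSSY:
            report["lossy"].append(path.name)
        else:
            report["other"].append(path.name)
    return report
```

Because the check is extension-based, it cannot detect a lossy file that was re-saved as WAV or FLAC. For those you still need to know where the audio came from.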

# No harsh sibilance/popping.

Additionally, don't include harsh sibilance (loud "S" & "SH" sounds) or popping sounds (loud "P" sounds).
- Robotic sibilance comes from a dataset that is too short or from an overfitted model. You can fix it by making your dataset larger or by choosing an epoch where the sibilants aren't overfitted.
- Harsh sibilance comes from harsh sibilants in your dataset. You can fix it by de-essing or by making your dataset larger.

# No Audio Damage.

This is the most important part of a clean dataset. If your audio is damaged, RVC will struggle with it and the model will sound worse overall, because RVC will create synthetic data and try to learn from it. Make sure your audio isn't damaged.
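
This guide doesn't define which kinds of damage to look for. As one illustrative case, the sketch below assumes clipping (samples stuck at full scale) counts as damage and measures how much of a file is clipped. The function name `clipped_ratio` is hypothetical, and it assumes a 16-bit PCM WAV file.

```python
import struct
import wave


def clipped_ratio(path, limit=32767):
    """Return the fraction of samples at or beyond full scale.

    Assumes a 16-bit PCM WAV file. Treating clipping as "damage" is an
    assumption of this sketch, not a definition from the guide.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / count
```

A ratio close to zero suggests the file is not clipped. It says nothing about other kinds of damage, so listening to the audio is still necessary.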

# Artifacts

In RVC, artifacting refers to an anomaly where the output voice sounds "robotic" & glitchy.
This occurs after the inference or model training process.

# Causes

It usually occurs when the dataset/vocal sample meets any of these criteria:

- Audio is low-quality
- Voice model was overfitted, undertrained or overtrained
- There are overlapping voices
- There is reverb
- There is noise

As you may have noticed, most of these issues boil down to the audio sample not being properly clean. RVC is built for purely working with voices, not other sounds.

Remember that the cleaner your input audio is, the better the results.

# Solutions

# 1. Use a lossless format:

If possible, it's best to have your audio in a lossless format like WAV or FLAC, which preserves its original quality.
Avoid lossy formats like MP3 or OGG.

# 2. If doing inference:

Remove undesired noises with vocal isolation software.
Lowering the search feature ratio can also minimize this issue.
If breathing sounds cause the artifacts, lower the Protection value.

# 3. If training models:

Make sure to clean your dataset properly. This includes removing silences and distortions.

# Vocal Isolation & Cleaning

A vocal isolation app is software designed to extract a person's vocals from an audio file, usually with the help of AI models.
It can remove undesired sounds such as background noise, reverb, echo, music, etc.
The goal is an audio sample with clean, natural vocals, which is what RVC needs to give the most accurate & highest-quality results.
For RVC users, the best app is Ultimate Vocal Remover 5 (UVR). It can be used either locally or through the cloud.
If you need to remove multiple noises, follow this pipeline for the best results: Remove instrumental -> Remove reverb -> Extract main vocals -> Remove noise
If you want to remove noise manually to avoid AI artifacts, you can use RX 11, which is mentioned in this guide.
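
The cleanup order above can be sketched as a small helper that applies only the stages a file needs, always in the recommended order. Everything here is hypothetical: `PIPELINE`, `run_pipeline` and the stage names are illustrative, and the stage callables are placeholders for what would in practice be separate UVR runs with the matching model.

```python
# Stage names follow the order recommended in this guide.
PIPELINE = [
    "remove_instrumental",
    "remove_reverb",
    "extract_main_vocals",
    "remove_noise",
]


def run_pipeline(audio, stages, needed):
    """Apply only the needed stages, always in the recommended order.

    `stages` maps a stage name to a callable that takes and returns audio.
    The callables are placeholders: this sketch does not call UVR itself.
    """
    for name in PIPELINE:
        if name in needed:
            audio = stages[name](audio)
    return audio
```

The point of the design is that the caller only states which problems the audio has. The order is fixed in one place, so a noisy song with no reverb still gets its instrumental removed before the noise.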


# Voice Restoration

For restoring and enhancing voice quality, especially from low-quality sources, dedicated restoration models (such as AI audio upscalers) can be used.

Using upscaled/restored audio as an RVC dataset can cause issues: it's not the best option for RVC training, because the model ends up training on fake data.

Apollo:
- By itself, it's an MP3 uncompressor. Can be found on: MVSEP | Jarredou's Apollo Audio Restoration Google Colab | Web UI Fork of Jarredou's Apollo Audio Restoration Google Colab
- Apollo by Lew is a vocal enhancer. Can be found on: MVSEP | X-minus | Jarredou's Apollo Audio Restoration Google Colab | Web UI Fork of Jarredou's Apollo Audio Restoration Google Colab
- Apollo Universal is for both music and vocals. Can be found on: MVSEP | X-minus | Jarredou's Apollo Audio Restoration Google Colab | Web UI Fork of Jarredou's Apollo Audio Restoration Google Colab
- Apollo by baicai1145 is a vocal enhancer for muddy vocals. Can be found on: MVSEP | Jarredou's Apollo Audio Restoration Google Colab | Web UI Fork of Jarredou's Apollo Audio Restoration Google Colab

ClearerVoice-Studio's Clear Voice: It is under active development. Can be found on: GitHub Repository | ModelScope Demo

A2SB: Nvidia's own upscaler. Can be found on: GitHub Repository | Google Colab by Sir Joseph

AudioSR: It isn't under active development. It has a basic model and a speech model. Can be found on: GitHub Repository | MVSEP (in the Experimental section; automatic cutoff search, but probably no mid/side processing) | Nick088's Huggingface ZeroGPU Space | Audacity Plugin (requires a CPU with SSE 4.2, or at least an Intel iGPU for acceleration, as it can't use Nvidia/AMD GPUs for acceleration) | Web UI Fork of Jarredou's Google Colab

These tools can be a separate step in your pipeline, used before or after initial noise reduction, depending on the audio source.

You can also check the Audio Separation community's AI Upscalers section, linked on their Discord. It's a good resource, but it's long and may contain some outdated info.


# Local UVR

Support UVR5

You'll need good specs & a good GPU to run it effectively. Otherwise, use either Eddy's UVR5 UI Google Colab or the HuggingFace Space.

# Installation

Go to their official website & press Download UVR. If you want to use BS / Mel Roformer models, you will also need to install this.

It will redirect you to their GitHub page. Click the download link for your operating system.
UVR is available on both Windows & Mac.

Once the installer finishes downloading, run it & follow the instructions.
Make sure to tick 🗹 Create a desktop shortcut for easier access to UVR.

# How to Use

# 1. Input audio.

Click Select input to choose your audio file(s), or just drag the files onto it.
In Select output you can define the folder for the results.

For better results, have the audio in a lossless format (WAV or FLAC), not MP3.

# 2. Select FLAC & GPU Conversion.

At the right you can select the output format.
We recommend picking FLAC. Learn why here.
If your GPU is compatible with CUDA, toggle GPU Conversion on for faster processing.

This step is not mandatory, but recommended for better results.

# 3. Extract vocals.

Select the Process Method and Model depending on your use case and the List of Best Models.

Now click the long Start Processing button.

TIP: To test models/options more efficiently, tick Sample Mode to only process 30 seconds of your sample.

# Best models for UVR:

Note: most of these are actually MelBand Roformer models, but there isn't a proper list for them yet.

| Extraction | Process Method | Model |
| --- | --- | --- |
| Vocals | MDX-Net | Gabox's `voc_fv4` |
| Instrumental | MDX-Net | Unwa's `Inst V1 Plus` or `Inst V1 (E) Plus` (the "E" stands for emphasis on fullness: the output sounds fuller, but noise may rise and bleedlessness may drop, so it might sound less clean) |
| De-Reverb | MDX-Net | Anvuew's `mel_dereverb_v2` or `room_mono_v1` |
| Extract Leading Vocals | MDX-Net | `bs roformer karaoke by anvuew` |
| De-Noise | MDX-Net | `Mel denoiser v2` |
| De-Echo | MDX-Net | `uvr de-echo` and `uvr de-echo aggressive` (use the aggressive one only if some echo is left after the first model; their output has a cutoff around 15–16 kHz) |

# Troubleshooting

# A model isn't there.

1. Click the wrench (🔧) on the left & go to Download Center.
2. Select the category of the model (Process Method).
3. Unfold its dropdown & select the model that you need.
4. Click the download button (📥). The model will download, which will take a few minutes.

# UVR extracted too little/too much.

Modify the Aggression Setting value on the right.
This determines the depth of the extraction. Only the VR method has it.
A higher value will deepen the extraction, and a lower one will soften it.
Each audio is different, so you'll have to test the ideal value.

# I can't remove some of the backing vocals.

Run the audio through BVE. Modify the Aggression Setting if necessary.