# Dataset & Isolation

Last update: August 18, 2025


# Introduction

  • This guide explains how to properly prepare a dataset for training an RVC model.

  • In the field of AI, a dataset is the collection of data used to train a model. It contains examples of the inputs the model is expected to handle, along with the correct outputs.

  • In the context of RVC, it's an audio file given to RVC containing the voice the model is going to replicate. It can be speech, singing, drums, sound effects, or noise.

  • The quality, variety & length of the dataset are the biggest determining factors for the final quality of the model. Let's explain Length and Variety.


# Length & Variety

  • For beginners we recommend sticking with a dataset length of 15 minutes of pure data (not counting silence), or 40+ minutes if you want a more natural-sounding model. Just remember: quality over quantity.

  • Variety in your dataset is also important because without it RVC lacks the ability to generate diverse audio.

  • Some things to increase the generalization abilities of RVC and increase the diversity in your dataset include:

    • Removing repeated words. (If you want to be extreme, you can remove every single repeated word, but generally there is no need to do this.)
    • Include speech in many ranges and pitches.
    • Longer datasets.
    • Expressive speech.

# Quality

  • A quality dataset is super important for RVC since without one RVC will struggle to make anything good or believable.

# Here are some recommendations for a quality dataset.


# Clean vocals.

  • Ensure there isn't much background noise, reverb, overlapping voices, music, distortion, or small silences. Some quiet natural background noise is fine and won't ruin your model since the original pretrains for RVC were made with a noisy dataset, so RVC knows how to deal with noise. You'll learn more on cleaning vocals in the Vocal Isolation & Cleaning section below.

# Audio quality.

  • The higher the audio quality, the better. If possible, keep it in a lossless format like WAV or FLAC, not a lossy one like MP3. Note that converting an MP3 to FLAC or WAV won't remove the lossy compression.

# No harsh sibilance/popping.

  • Additionally, don't include harsh sibilance (loud "S" & "SH" sounds) or popping sounds (loud "P" sounds).
    • Robotic sibilants are caused by a dataset that's too short, or by overfitting. You can fix this by making your dataset larger or by choosing an epoch where the sibilants aren't overfitted.
    • Harsh sibilants are caused by harsh sibilants in the dataset itself. You can fix this by de-essing or by making your dataset larger.

# No Audio Damage.

  • The most important part of a clean dataset. If your audio is damaged, RVC will struggle with it and sound worse overall, because RVC will create synthetic data and try to learn from it. Make sure your audio isn't damaged.

# Artifacts

In RVC, artifacting refers to an anomaly where the output voice sounds "robotic" & glitchy.
This occurs after the inference or model training process.

# Causes

It usually occurs when the dataset/vocal sample meets any of these criteria:

• Audio is low-quality
• Voice model was overfitted, undertrained or overtrained
• There are overlapping voices
• There is reverb
• There is noise

As you may have noticed, most of these issues boil down to the audio sample not being properly cleaned. RVC is built purely for working with voices, not other sounds.

Remember that the cleaner your input audio is, the better the results.

# Solutions

# 1. Use a lossless format:

  • If possible, it's best if your audio is in a lossless format like WAV or FLAC, preserving its original quality.

  • Avoid using lossy ones like MP3 or OGG.

# 2. If doing inference:

  • Make sure your vocal sample is clean: no overlapping voices, reverb, or background noise.

# 3. If training models:

  • Make sure to clean your dataset properly; this includes removing silences and distortions.

# Vocal Isolation & Cleaning

  • A vocal isolation app is software designed to extract a person's vocals from an audio file, usually through the use of AI models.

  • They can remove undesired noises, like background noise, reverb, echo, music, etc.

  • The goal is to get an audio sample with clean and natural vocals, which is what RVC needs to give the most accurate & quality results.

  • For RVC users, the best app is Ultimate Vocal Remover 5 (or UVR). It can be used either locally or through the cloud.

  • If you need to remove multiple noises, follow this pipeline for the best results: Remove instrumental -> Remove reverb -> Extract main vocals -> Remove noise

  • If you want to remove noise manually to avoid AI artifacts, you can use RX 11, which is covered later in this guide.
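The noise-removal order above can be sketched as an ordered pipeline. The stage functions below are hypothetical placeholders (the real work happens inside the separation models in UVR/MVSEP); the sketch only shows that the stages run in a fixed order:

```python
# Hypothetical placeholders for the separation stages; each one stands
# in for a model run in UVR/MVSEP and only records which step ran.
def remove_instrumental(steps): return steps + ["instrumental removed"]
def remove_reverb(steps):       return steps + ["reverb removed"]
def extract_main_vocals(steps): return steps + ["main vocals extracted"]
def remove_noise(steps):        return steps + ["noise removed"]

# Order follows the guide: instrumental -> reverb -> main vocals -> noise.
PIPELINE = [remove_instrumental, remove_reverb,
            extract_main_vocals, remove_noise]

def clean(steps):
    for stage in PIPELINE:
        steps = stage(steps)
    return steps
```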

# Local UVR

# Installation


  1. Go to their official website & press Download UVR. If you want to use BS / Mel Roformer you are going to need to install this.

    UVR Official Website Download Button

  2. It will redirect you to their GitHub page. Click the download link for your operating system.
    UVR is available on both Windows & Mac.

  3. Once the installer finishes downloading, run it & follow the instructions.
    Make sure to tick 🗹 Create a desktop shortcut for easier access to UVR.

    UVR Installation Desktop Shortcut


# How to Use


# 1. Input audio.

  • Click Select input to select your audio file(s), or just drag & drop the files onto it.

  • In Select output you can define the folder for the results.

    Select input and outputs and folders


# 2. Select FLAC & GPU Conversion.

  1. At the right you can select the output format.
    We recommend picking FLAC. Learn why here.

  2. If your GPU is compatible with CUDA, toggle GPU Conversion on for a faster process.

    Setting FLAC & GPU Conversion

This step is not mandatory, but recommended for better results.


# 3. Extract vocals.

  1. Select the Process Method and Model depending on your use case and the List of Best Models.

    UVR Process Method & Model Selection

  2. Now click the long Start Processing button.

# Best models for UVR:

Note: these are actually all MelBand Roformer models, but there isn't a proper list for them yet.

| Extraction | Process Method | Model |
| --- | --- | --- |
| Vocals | MDX-Net | Gabox's voc_fv4 |
| Instrumental | MDX-Net | INST Gabox V7 |
| De-Reverb | MDX-Net | Anvuew mel dereverb v2 |
| Extract Backing Vocals | MDX-Net | Mel roformer karaoke |
| De-Noise | MDX-Net | Mel denoiser v2 |

# Troubleshooting


# A model isn't there.

  • Click the wrench (🔧) on the left & go to Download Center
  • Select the category of the model (Process Method)
  • Unfold its dropdown & select the model that you need
  • Then click the download button (📥). The model will download, which will take a few minutes

# UVR extracted too little/too much.

  • Modify the Aggression Setting value on the right.
  • This determines the depth of the extraction. Only the VR method has it.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.

# I can't remove some of the backing vocals.

  • Run the audio through BVE. Modify the Aggression Setting if necessary.

# Local Eddy's UVR5 UI

# Installation


  1. Go to Eddy's UVR5 UI Latest Release & follow the installation steps (precompiled versions are suggested).

    Eddy's UVR5 UI Local Installation Steps


# How to Use

# 1. Select input & options

  1. Tap the Input Audio box & select your audio, or simply drag & drop.

    Input Audio

# 2. Select model

  1. Once it's done uploading, select a Model from the List of the Best Models. Below that you can change Segment Size and Overlap; the defaults are fine.
    Model Selection

# 3. Start Processing

  1. Click Separate! below. Wait a moment for the audio to process.

  2. Playable audios will then appear in the output boxes below. To download the output, click the little download icon in the top right.

# Best Models for Local Eddy UVR5 UI

| Extraction | Model |
| --- | --- |
| Vocals | MelBand Roformer \| Vocals FV4 by Gabox |
| Instrumental | MelBand Roformer \| INSTV7 by Gabox |
| De-Reverb | MelBand Roformer \| De-Reverb by anvuew |
| Extract Backing Vocals | Mel-Roformer-Karaoke-Aufr33-Viperx |
| De-Noise | Mel-Roformer-Denoise-Aufr33-Aggr |

# Troubleshooting


# UVR extracted too little/too much.

  • Modify the Aggression Setting value on the right.
  • This determines the depth of the extraction. Only the VR method has it.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.

# I can't remove some of the backing vocals.

  • Run the audio through BVE. Modify the Aggression Setting if necessary.

# I couldn't find my answer.

  • Report your issue here.

# Eddy's UVR5 UI HuggingFace ZeroGPU Space


# How to use

# 1. Access the HuggingFace Space

Access the space here. You don't need an account to use it, but making one will get you more free time, and paying for HuggingFace PRO gives you the most ZeroGPU time.

# 2. Select input & options

  1. Tap the Input Audio box & select your audio, or simply drag & drop.

    Input Audio

# 3. Select model

  1. Once it's done uploading, select a Model from the List of the Best Models. Below that you can change Segment Size and Overlap; the defaults are fine.
    Model Selection

# 4. Start Processing

  1. Click Separate! below. Wait a moment for the audio to process.

  2. Playable audios will then appear in the output boxes below. To download the output, click the little download icon in the top right.

# Best Models for UVR5-UI-HF

| Extraction | Model |
| --- | --- |
| Vocals | MelBand Roformer \| Vocals FV4 by Gabox |
| Instrumental | MelBand Roformer \| INSTV7 by Gabox |
| De-Reverb | MelBand Roformer \| De-Reverb by anvuew |
| Extract Backing Vocals | Mel-Roformer-Karaoke-Aufr33-Viperx |
| De-Noise | Mel-Roformer-Denoise-Aufr33-Aggr |

# Troubleshooting


# UVR extracted too little/too much.

  • Modify the Aggression Setting value on the right.
  • This determines the depth of the extraction. Only the VR method has it.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.

# I can't remove some of the backing vocals.

  • Run the audio through BVE. Modify the Aggression Setting if necessary.

# I couldn't find my answer.

  • Report your issue here.

# GPU task aborted:

ZeroGPU HuggingFace Spaces have a maximum inference duration: the time a single inference takes (running the model, not the length of your audio file). By default it's around 1 minute, which is what Ilaria RVC uses. Retry with a shorter audio, or split your audio into pieces.
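If you need to split a long file, a minimal stdlib sketch could look like this. It assumes an uncompressed WAV input, and the `split_wav` helper name is made up for illustration:

```python
# Minimal sketch: split a WAV file into pieces short enough that each
# inference call stays under the ~60 s ZeroGPU limit.
import wave

def split_wav(path, max_seconds=55, prefix="chunk"):
    """Write <prefix>_N.wav pieces no longer than max_seconds each."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * max_seconds)
        names = []
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{prefix}_{index}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)   # header is fixed up on close
                dst.writeframes(frames)
            names.append(name)
            index += 1
    return names
```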

# You have exceeded your GPU quota ( NUMBER s left vs. 60s requested). Sign-up on Hugging Face to get more quotas or retry in Hour:Minutes:Seconds

ZeroGPU HuggingFace Spaces have a quota per account. If you aren't signed in you get less quota, so it's better to log in. You may see the 'Sign-up' message even while logged in. The ZeroGPU quota can't be viewed, but it isn't unlimited. You can either:

  • Login so you get more quota
  • Wait
  • Pay for HuggingFace PRO membership to get 5× more quota

# Eddy's UVR5 UI Google Colab


# How to Use

# 1. Set up Colab

  1. First access Eddy's UVR UI Google Colab.

  2. Then Log in to your Google account.

  3. Execute the Installation cell by pressing the play button. Optionally check use_drive & grant all the permissions for Batch Separation.
    UVR5 UI Colab Installation Cell

  4. Then run the Run UI cell. You can choose different tunnels in case one is down.

    UVR5 UI Colab Run UI Cell
  • Once it's done, open the public URL; from there it works the same as Eddy's UVR5 UI Local/HF-Space.

# Best Models for UVR5-UI-Colab

| Extraction | Model |
| --- | --- |
| Vocals | MelBand Roformer \| Vocals FV4 by Gabox |
| Instrumental | MelBand Roformer \| INSTV7 by Gabox |
| De-Reverb | MelBand Roformer \| De-Reverb by anvuew |
| Extract Backing Vocals | Mel-Roformer-Karaoke-Aufr33-Viperx |
| De-Noise | Mel-Roformer-Denoise-Aufr33-Aggr |

# Troubleshooting


# Cannot connect to GPU backend.

  • Google Colab limits free GPU usage. Wait a while and try again, or upgrade your Colab plan.

# MSST Colab

This is jarredou's Music Source Separation Training (MSST) Colab Inference.

# How to Use

# 1. Set up Colab

  1. First access the Colab space here.

  2. Then Log in to your Google account.

  3. Execute the Gdrive Connection cell by pressing the play button. Grant all the permissions.

    GDrive Connection cell

  • It'll finish once the logs say Mounted at /content/drive
  4. Then run the Install cell.

    Installation Cell
  • Once it's done it will look like this:

    Finished Installation

# 2. Set up folders

  • In Google Drive, make two folders, named input & output.

    GDrive folders input & output


# 3. Separate

  1. Select your model of choice (see the List of the Best Models) and run the Separation cell.


image

  2. Download the result located in the output folder.

# Best models for MSST:

| Extraction | Model |
| --- | --- |
| Vocals | Gabox's voc_fv4 |
| Instrumental | INST Gabox V7 |
| De-Reverb | Anvuew mel dereverb v2 |
| Extract Backing Vocals | Mel roformer karaoke |
| De-Noise | Mel denoiser v2 |

# Troubleshooting


# Cannot connect to GPU backend.

  • Google Colab limits free GPU usage. Wait a while and try again, or upgrade your Colab plan.

# MVSEP

# Important Notes


  • MVSEP is a website for isolating vocals that works similarly to UVR.

  • Free users can't convert audios in batches or longer than 10 minutes. If that's your case, trim the audio into pieces.

  • There is a queue, so make an account to skip most of it.


# How to Use


# 1. Log in.

  1. First, log in.

  2. Once logged in, go to the main page.


# 2. Select audio.

  1. Click Browse File & select your audio, or simply drag & drop. The audio will begin to upload.

    Upload Input

# 3. Model usage.

  1. In Separation type, select a Model based on the Best Models List.

  2. In Output encoding select FLAC.
    We recommend selecting FLAC from now on. Learn more here.

  3. Once the audio is done uploading, click Separate

    Model Selection

    Flac as Output Encoding


# 4. Download output.

  • When it's done converting, it will redirect you to a page where you can listen to the results.
  1. Tap the three-dot menu on the Vocals audio, then Download.

  2. Same thing for the Instrumental, if you wish to keep it.

    Results of separated Stems


# Best Models for MVSEP:

| Extraction | Separation Type | Model |
| --- | --- | --- |
| Vocals | MelBand Roformer | either: unwa big beta v5e OR 2024.10 |
| Instrumental | MelBand Roformer | either: unwa instrumental v1e plus OR 2024.10 |
| De-Reverb | Reverb Removal | Reverb removal by either: Sucial V2 (MelRoformer) OR Anvuew V2 (MelRoformer) |
| Extract Backing Vocals | MelBand Karaoke | Model fuzed gabox & aufr33/viperx (SDR: 9.85) |
| De-Noise | DeNoise by aufr33 | Aggressive |

# Troubleshooting


# MVSEP extracted too much/too little.

  • Using the Separation Type of DeNoise by aufr33, you can modify the Aggressiveness. This determines the depth of the extraction.
  • A higher value will deepen the extraction, and a lower one will soften it.
  • Each audio is different, so you'll have to test the ideal value.

# I can't remove some of the backing vocals.

  • Try running the audio through MelBand Karaoke or BVE. Modify the Aggression Setting if necessary.

# What is SDR?

Signal-to-distortion ratio. Higher is technically better, but your ears are more trustworthy.

# X-Minus (aka UVROnline)

# How to use

# 1. Choose a Separator

  1. First go to X-Minus's website and click "Vocal Remover" at the top right.
X-Minus Homepage
  2. Then select the Separator type you need and the model based on the Best Models List.
Select a Model in X-Minus

# 2. Upload Your Audio File

  1. Then click "select a file" and choose an audio file, or drag and drop one. When it's done it will look like this:
Use a Model in X-Minus
  2. You can now click "Vocals" to download the vocals and "Other" to download the instrumentals.

# Best models for X-Minus:

| Extraction | Model |
| --- | --- |
| Vocals/Instrumental | Mel-RoFormer by Gabox Fv7z |
| De-Echo / De-Reverb | De-reverb & De-echo by Sucial v2 |
| Extract Backing Vocals | Mel-RoFormer Lead/Back (the only, invisible, available one) |
| De-Noise (found in Restoration) | Mel Roformer De-Noise |
| Restoration | Apollo Universal by Lew (to enhance MP3 and other low-quality files) |

# RX 11

Go to their Official Website & buy it.


# Getting a Dataset

Getting the highest quality audio works best for Izotope and will result in a better model. Ideally, you want to preserve the dynamic range, the frequencies, and the fidelity/clarity. If you're working with low-quality audio, RX cannot upscale or restore details that were never captured. The damage will be audible: the voice model's quality will be muddied with artifacts/tearing.

Mic proximity is another factor to consider, as you want the voice to be consistent: RVC does not handle varying frequency response well and will muffle the pronunciation of words. Keep this in mind for studio sessions and video game voice line rips, as they may have been bass-boosted, compressed, or EQ'd by default. Arguably, more variation of the voice will add to the vocal range of the model, but we want to keep the accent consistent as well.

# For Youtube ripping

You can use Cobalt, yt-dlp, or Loader.to. Overall, yt-dlp is the best for ripping. Preferably rip the audio in Opus format so the downloaded audio will be 48kHz, which can be resampled down to 44.1k in Izotope and trained at 40k in RVC. The quality depends on what was uploaded on the server side, so this might not always be the case.

For yt-dlp, the commands are:

yt-dlp -x "https://www.youtube.com/watch?v=5aYwU4nj5QA&t=2s" followed by ffmpeg -i audio.opus audio.wav to convert, or in one step: yt-dlp.exe -x --audio-format wav "https://www.youtube.com/watch?v=5aYwU4nj5QA&t=2s"
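Since the URL contains "&", it should be quoted on the command line. If you script the rip instead, passing the URL as a single list element sidesteps shell quoting entirely. A small sketch; `build_ytdlp_cmd` is a made-up helper name, and the result would be passed to `subprocess.run`:

```python
# Builds the yt-dlp argument list; the URL stays one argument, so "&"
# in it cannot be misread by a shell.
def build_ytdlp_cmd(url, audio_format="wav"):
    return [
        "yt-dlp",
        "-x",                             # extract audio only
        "--audio-format", audio_format,   # wav, opus, flac, ...
        url,
    ]

cmd = build_ytdlp_cmd("https://www.youtube.com/watch?v=5aYwU4nj5QA&t=2s")
# e.g. subprocess.run(cmd, check=True)
```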

Resampling down to 32k is also fine, since it results in less harsh sibilance and plosives in your model.

  • Sibilances are the hissing sounds when a person speaks, and plosives are the bursts of air released through the mouth. Both are considered consonant sounds by RVC.
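Resampling in this sense just means reproducing the waveform with fewer samples per second, discarding everything above the new Nyquist frequency (16 kHz at 32k). Here is a naive linear-interpolation sketch of the idea; real resamplers in RX or ffmpeg also low-pass filter first to avoid aliasing, which this toy version omits:

```python
# Naive linear-interpolation resampler (illustration only; no
# anti-aliasing filter, unlike real resamplers).
def resample(samples, src_rate, dst_rate):
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio          # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out
```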

Whenever you export the dataset, you can export it as WAV 32-bit. For FLAC files, use 24-bit. We never use MP3, since MP3 compression heavily degrades the audio.


# Loading The Audio And Changing Settings

Open the WAV or FLAC file.

Now that we have the file in RX, make sure to turn this slider to the right to show the Spectrogram only. Waveform

Now after that it should look like this: Spectrogram View

This view uses Mel scaling. If you right-click on the frequency numbers (20k and such), you can change the scaling. Mel is the best scaling in our case since it shows vocals better than Linear scaling would.


# Trying To Explain a Spectrogram

Spectrograms are graphical representations of frequencies, which can help determine whether your audio's sample rate is really 32k, 40k, or 48k.

frequency.png

There is a distinction between high-end and low-end frequencies in a spectrogram. The high-end frequencies (the "air" region in the chart, captured at 40k-48k sample rates) aren't audible to human ears, but they help handle aliasing. Aliasing is the effect of new frequencies appearing in the sampled signal after reconstruction that were not present in the original signal. In other words, it creates artifacts in your model.

Meanwhile, you can hear some of the low-end frequencies. If you were to resample from 40k to 32k, most of the high-end frequencies would be gone, in exchange for less harsh sibilance and plosive sounds, as mentioned before.
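Reading a spectrogram's cutoff can also be done numerically: find the highest frequency bin that still carries significant energy. This toy function uses a naive O(n²) DFT for illustration (real tools use an FFT), and the 1% threshold is an arbitrary choice:

```python
# Toy bandwidth estimator: reports the highest frequency bin whose
# magnitude is at least `threshold` of the peak magnitude.
import math

def highest_active_freq(samples, rate, threshold=0.01):
    n = len(samples)
    mags = []
    peak = 0.0
    for k in range(n // 2 + 1):          # bins up to Nyquist
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n)
                 for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n)
                  for t in range(n))
        mag = math.hypot(re, im)
        mags.append(mag)
        peak = max(peak, mag)
    cutoff = 0.0
    for k, mag in enumerate(mags):
        if peak and mag / peak >= threshold:
            cutoff = k * rate / n        # bin index -> Hz
    return cutoff
```

A file that was upsampled from a lower rate will show a cutoff well below its nominal Nyquist frequency, which is exactly what Spek makes visible.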

Chart.png

# Example of Audios Through a Spectrogram

For example, this is noise. The orange area may look like breathing masked under the artifacts, but RVC will consider it noise, as it's barely audible and you'll only hear static or dry air. The blue areas surrounding the audio are also noise. You can think of these as layers that need to be removed when we use the Spectral Denoise module later.

noiseprofile.png

Now this is breathing. Keep in mind that RVC will consider "soft breathing" to be white noise if the breath sounds are too quiet (around -40dB). Proper vocal breathing consists of grunts like "huffs", "hahhhhs", and "hoohhhhhs". There cannot be harsh inhaling sounds. Breath sounds by themselves, without a voice or tone behind them, will also cause RVC to think they are noise. Without breaths, the model will sound robotic. Breathing.png

And this is speech: Speech.png


# Preparing the Dataset Through Music/SFX Removal

The key principle behind preparing your dataset is doing as little audio processing as possible, as you want to keep the overall robustness of the model. Excessively stacking vocal separator models such as UVR Inst Voc, Kim Vocals, Denoise, ensemble mode, and so forth can introduce noise into your dataset as it rips frequencies away from your audio. This harms the model's fidelity and quality.


# RX De-Clipping

This helps remove most of the clipping occurring throughout the audio around the 0dB range. Do not touch the make-up gain, as it will alter the natural dynamic range of the audio. You can press "preview" to confirm it is working, as indicated by "clipped intervals repaired". If it does nothing, you can skip this part.

de-clip-settings.png

# Removing DC Offset

Before starting the separation process, make sure the audio has zero DC offset to prevent issues that would interrupt processing due to the bottom-line noise. Waveform statistics are under the Window tab in RX11. dcoffset.png

This can be done in Audacity by going to Effect > Volume and Compression > Normalize, then checking the box to remove the DC offset. Remove-dc-offset.png

If you prefer to EQ in RX 11, click the EQ module and enable only the HP bell curve. Keep the frequency precision at 6 and the frequency at 30-32Hz. Your DC offset will be at 0% if you've done it correctly.
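Numerically, DC offset is the waveform's mean sitting away from zero, so removing it amounts to subtracting that mean. A sketch of the idea (RX's EQ-based method additionally high-passes the very low frequencies):

```python
# DC offset is a constant shift of the whole waveform; subtracting the
# mean sample value recenters it on zero, like Audacity's checkbox.
def remove_dc_offset(samples):
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]
```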

Here's a small comparison with the changes after vocal separation. Without DC offset: resizing.png

With DC offset: With-dc-offset.png


# Dereverbing and De-echo The Audio

While keeping the least-processing rule in mind, you want to keep the audio as natural as possible for RVC, or HiFi-GAN. HiFi-GAN is the generative adversarial network that makes up the discriminator and generator for cloning sounds and training stability. Audio that has been damaged or reconstructed will affect the generalization needed to fully replicate the clarity and accent of the voice.

Reverb is multiple reflections of a sound returning, and echo is its delayed repetition. It is important to remove these from the dataset, otherwise your model will have artifacts, as HiFi-GAN cannot replicate the voice's clarity.

# Best Model for Dereverb And De-echo

The current best Dereverb plugin is the Clarity VX DeReverb Pro module, another paid software that you can get for free. The aggressiveness can be tweaked or finetuned to your liking as it cuts into the audio and does not reconstruct the frequencies. Clarity VX cannot de-echo the audio so you need to use UVR De-echo or RX Dialogue Isolate.

Clarity-VX.png

If you don't want to use Clarity VX, the most common method is to use UVR Dereverb and De-echo separately, as you have full control over what's needed for your audio. There are no predetermined settings, as each audio is different. The rule of thumb is to use an aggression of 3.0-5.0 and nothing more. If the audio does not have reverb or echo, do not use any of these models, as they can muffle the audio.

If you only rely on MVSEP, you're forced to use UVR-DeEcho-Dereverb, as there is no standalone option for the dereverb model. The cutoff frequency for this separation model is 17.5kHz.

Deecho.png

Here's what audio with mono reverb will look like on Izotope RX11: mono-reverb.png

And the result of DeEcho-Dereverb at 0.5 aggression: aggression.png

There may be leftover reverb or echo that weren't removed so that'll be addressed in the RX Dereverb module later in the Manual Denoising section. I will also troubleshoot the audio increasing in volume after dereverb and deecho in the FAQ Regarding Normalization and Compression section.


# RX11 Modules


# Spectral Denoising The Audio

Izotope's Spectral Denoise provides natural noise reduction and preserves the quality of your audio while minimizing artifacts. It analyzes the noise signal we select and modifies the frequency components. Essentially, UVR Denoise is not needed afterwards, since the cleaning has been handled in Spectral Denoise. The rest of the cleanup is done through manual denoising in RX11.

Here are the settings used in the previous guide, which are mostly fine if you can't recognize patterns in a noise profile. Artifact control, Whitening, Release [ms], Smoothing, and Reduction can be adjusted for particular datasets. Again, don't fix it if it ain't broke. Spectral-denoise.png

You only need one or two passes of spectral denoising. Further processing than that will compress or degrade your audio. Be cautious with the noise profile (don't select breathing or speech by accident), as RX might take away details and important aspects of a person's voice.

This is what the audio looks like before spectral denoising: Spectral-denoise-1.png

Now select the noise (click, then hold Shift while selecting to include multiple areas), press Learn in the Spectral Denoise module once the entire noise profile is captured, then deselect the audio and click Render. Spectral-denoise-2.png

Now in this case one pass of spectral denoise wasn't enough. Repeat the steps done in the first pass by selecting the areas with the blue noise. Spectral-denoise-3.png

The 2nd pass will remove the noise. 2Pass.png


# FAQ Regarding Normalization and Compression

Why does my audio sound like it's clipping or distorted? Should I use the gain tool in RX? Audio will usually have crackling and clicking noises introduced if it's over +1.0dB. We don't want to use the gain tool to address this, since it's a compressor. If the audio doesn't have any of those issues, leave the dialogue alone to maintain its natural range. too-loud.png

Again, try to maintain the natural dynamic range if possible. If you must decrease the volume of a particular dialogue because it's too loud for your ears and you still need to use the RX Dereverb module, consider normalizing to between -2dB and -3dB. Using RX Dereverb will bring it back to normal volume.
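Those dB figures map to linear amplitude via 10^(dB/20), so -2dB corresponds to a peak of roughly 0.794 of full scale. A sketch of peak-normalizing to a dBFS target (an illustrative helper, not the RX module):

```python
# Peak normalization: scale the waveform so its loudest sample sits at
# the target dBFS level. 10 ** (dB / 20) converts dB to linear gain.
def normalize_peak(samples, target_db=-2.0):
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return samples                    # pure silence: nothing to do
    target = 10 ** (target_db / 20)       # -2 dB -> ~0.794
    gain = target / peak
    return [s * gain for s in samples]
```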

When should I normalize the whole dataset? This module is optional, as it can be argued that RVC will do the normalization for you at -2dB. You can normalize the dataset at the end, once it has been thoroughly cleaned, if you want.

Normalization.png

# Modules overview

De-ess This tool is used so that the -ess /fff/z/ch sounds, or sibilances, are less harsh/robotic when the voice model tries to pronounce certain words. Only use this on the actual sibilances and not the entire dataset/dialogue. Automatic de-essing with Ctrl + A can select the wrong consonant sounds or skip audio that needs more de-essing, so keep that in mind. Adjust spectral shaping as needed if you know what you're doing. De-ess.png

De-crackle Only use this tool if you hear crackling noises in specific parts of your dialogue/speech. It's not a requirement to use it. De-crackle.png

Mouth De-clicking and De-click Only use mouth de-clicking when clicking noises are audible. It's a good practice to use mouth de-clicking on only the click and not the entire dialogue/audio. Adjust the sensitivity or frequency skew as needed, but do not go overboard as it can remove the finer details of our audio.

  • This module is specifically tailored to address mouth noise issues in vocal recordings. Mouth noises are typically caused by saliva, lip smacks, tongue clicks, or other oral sounds that can be distracting or unpleasant to listen to in vocal recordings. Mouth de-click module preserves the integrity and naturalness of the vocal performance while reducing these types of mouth noises.

We run the de-click module only as a last resort when mouth de-click doesn't work. Using both modules together may strip away the "k" consonants of our model. De-click.png

  • This module is designed to remove or reduce short impulse noises such as clicks, pops, and digital clipping artifacts from audio recordings. These noises can occur due to various reasons, including imperfections in the recording equipment, electrical interference, or flaws in the audio signal itself. The De-click module analyzes the audio waveform and identifies these short, transient noises, then applies processing algorithms to smooth out or remove them, restoring the audio to a cleaner state.

Dialogue Isolate Use this tool when you have audible room echo that could be removed on specific parts of your speech. Do not use it for the entire dataset because it may be inaccurate and strip away details. Or sometimes it doesn't work. Keep the settings the same here including sensitivity. dialogue-isolate.png

De-plosive Use De-plosive when there are audible thumps of air coming through the mouth. Again, plosives are consonant sounds that need attention, specifically the P, T, K, and CH sounds. There are no right settings for this, so adjust until the roughness of the plosives is gone. Deplosive.png

Plosives look like this. Select only the plosives, not the whole speech, as suggested by the red lines. plosive-examples.png

De-hum Use De-humming to take care of low or high frequency hums. There are no consistent settings for this as each situation is different for your audio.

  • Sensitivity will adjust the amount of hum that will be removed.
  • Bands, or notch filters will increase depending on the complexity of the noise.
  • Filter Q is the range selector.
dehum.png

For example, use the frequency selector tool, select the humming occurring below 100hz, then press learn and render. freq.png

FnwwDec.png

The red underline shows what you should be looking for. humming100hz.png

EQ This was already covered in the Removing DC Offset section. Do it if you haven't already. It takes care of low-end noise, but if your DC offset is already at 0%, skip this module.


# Manual Denoising

You should have noticed that RX11 moves the audio segments closer together whenever you delete a space or speech. Use the Lasso or Brush tool to delete dialogue precisely, which preserves the spacing in phonetics. Make sure it doesn't cause clipping on the waveform, which would look like a sharp spike. You can also clean up the leftover residuals from spectral denoising.

Audio with thumping, SFX sounds, ringing noises, or weird vocalizations and breaths that might affect the voice model should be removed from the dataset, as the model will pick up those characteristics. You can try to lasso out the likely source of the noise, but keep the audio as natural as possible without damaging the frequencies.

lasso.png lasso-3.png

The RX Dereverb module will remove reverb that's leftover from the audio and may help remove noises. With RX Dereverb, select the audio, press learn and render. Adjust the reduction if needed. Do not use a strong reduction as it may muffle the audio. You can always undo with Ctrl + Z.

Refer to the FAQ Regarding Normalization and Compression if the volume seems to be peaking or "introducing noises" (it isn't). Do note that RX Dereverb will raise the volume by 2dB each time you use it.

dereverb.png

When the dataset has been thoroughly cleaned, you can resample it to 32k or 44.1k and export as WAV 32-bit or FLAC 24-bit. resample.png

# Noise Gating and Audio Labeling

[Auburn Renegate](https://www.auburnsounds.com/products/Renegate.html) is basically the Audacity noise gate but better; the free version will be more than enough. Auburn-noise-gate.png

Open Audacity and run the dataset through Auburn Renegate. After that, convert your dataset to mono, since RVC works with mono audio, not stereo. There are two ways of neatly removing the silences in your dataset: Audio Labeling and Truncate.
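The mono conversion itself is just averaging the two channels, which is what Audacity's mix-down does. A sketch over interleaved stereo samples:

```python
# Downmix interleaved stereo [L0, R0, L1, R1, ...] to mono by
# averaging each L/R pair, as Audacity's stereo-to-mono mix does.
def stereo_to_mono(interleaved):
    return [(interleaved[i] + interleaved[i + 1]) / 2
            for i in range(0, len(interleaved) - 1, 2)]
```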

Follow these steps: Open the menu for labeling. audiolabel1.png audiolabel2.png

You will now have it like this: audiolabel3.png

Turn off shaped dither with Ctrl + P > Quality since we are exporting with WAV 32-Bit or FLAC 24-bit anyways. preferences-quality.png

Now we go to export our audio. audiolabel4.png audiolabel5.png

The output will be like this: audiolabel6.png

Now go to your RVC folder and place all these files in the datasets folder. Zip it up if needed.

You can find an extremely long and complex guide by the Audio Separation Discord here, but it's not recommended and may contain outdated info, as it has hundreds of pages.


# Preparing the dataset

  • To do these next steps you are going to need Spek and Audacity.

# Step 1: Find the Sample Rate

  • This is a unit that defines the total number of samples (data points) that fit within 1 second of audio. It is measured in hertz (Hz) or kilohertz (kHz).

  • The most common sample rates are 32, 40, 44.1, & 48 kHz. The higher the sample rate, the more information is stored, and therefore the higher the quality.

  • While training in RVC, you'll have to set the target sample rate as your dataset's. This value affects the final quality.

  • # A simple way to determine it is with Spek:
    • # STEP 1:

      Download and install Spek here.

    • # STEP 2:

      Open spek and just drag & drop audio into it.

      image
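For WAV files, the container's nominal sample rate can also be read with Python's stdlib, though Spek stays useful for spotting upsampled audio, where the spectrogram's real cutoff sits below the nominal rate:

```python
# Read the sample rate stored in a WAV file's header.
import wave

def wav_sample_rate(path):
    with wave.open(path, "rb") as w:
        return w.getframerate()
```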


# Step 2: Truncating Silence

  • In Audacity import your audio
  • Go to Effects -> Special -> Truncate Silence
  • Use the following values:

    image
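What Truncate Silence does can be sketched in a few lines: runs of near-silent samples are capped at a maximum length rather than deleted outright, which keeps phrasing natural. The threshold and run length below are illustrative values, not Audacity's defaults:

```python
# Cap every run of near-silent samples at max_run samples instead of
# removing silence entirely, roughly what Truncate Silence does.
def truncate_silence(samples, threshold=0.01, max_run=1000):
    out = []
    run = 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= max_run:      # keep only the first max_run of a run
                out.append(s)
        else:
            run = 0
            out.append(s)
    return out
```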


    # Step 2.1: Audio Normalization (Optional)

  • Go to Effects -> Volume and Compression -> Loudness Normalization
  • Use these values:

    image

LUFS is used over dB because HiFi-GAN benefits from perceptual loudness measurement, which dB doesn't offer.
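Real LUFS measurement (ITU-R BS.1770) adds K-weighting filters and gating, which is out of scope here, but the core difference from a plain dB peak reading is that loudness is energy-based. A minimal RMS-level sketch:

```python
# RMS level in dB: an energy-based measurement, unlike a peak reading.
# True LUFS additionally applies K-weighting and gating (BS.1770).
import math

def rms_db(samples):
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")
```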


# Step 2.2: Combining Audio

  • Go to Tracks -> Align Tracks -> Align End to End
image

# Step 3: Export

  • On the upper right corner go to File and click Export Audio.

    image
    • And finally, enter these values:

      • Format: WAV
      • Encoding: 32-Bit Float

      image

# Communities


# You have reached the end.

Report Issues