Note: This application is currently under development and still needs improvement. Features and functionality may change.
Swaps voices in source mp4 videos with a voice in a target audio file.
- Perform zero-shot voice conversion, including singing voice conversion, using Seed-VC models.
- Estimate Room Impulse Responses (RIR) directly from reverberant speech in the source video using Speech2RIR.
- Scene cut detection for automatic RIR markers.
- Calculate the RT60 (reverberation time) of audio using BlindRT60.
- Automatically identify and segment different speakers in an audio track through speaker diarization (powered by pyannote.audio).
- Manually define and apply simulated room acoustics to audio segments.
- Mark audio segments for targeted voice conversion and RIR application.
- Merge the processed audio back into the original video.
- Seed-VC Zero-Shot Voice Conversion: Swap voices (and other human sounds) in audio segments without needing pre-trained models for specific target speakers. Supports both speech and singing.
- Speech2RIR Integration: Automatically estimate the acoustic characteristics (RIR) of the environment from the original video's audio.
- BlindRT60 Estimation: Calculate the reverberation time (RT60) from an audio file.
- PySceneDetect: Automatically set RIR RT60 value markers based on detected scene changes (see the sketch after this list).
- ClearVoice: Remove background noise from the source and target voice audio.
- Pyannote Speaker Diarization 3.1: Identify different speakers and their speech segments in the source audio.
- Segment-based Processing: Apply voice conversion and RIR effects to specific, user-defined time segments.
- Manual RIR Simulation: Define custom room dimensions and target RT60 values to simulate and apply reverberation.
- Parameter Control: Fine-tune voice conversion parameters (e.g., diffusion steps, pitch shift) and RIR settings.
- Video Merging: Combine the final processed audio with the original video frames.
- Mix with original background: Mix the final audio with the original background. Thanks to Asmirald for the code!
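For readers curious how the scene-cut feature maps to code: the application wires this up internally, but a minimal sketch of detecting cuts with PySceneDetect's content detector looks roughly like the following (`input.mp4` is a placeholder file name):

```python
# Illustrative only: detect scene cuts with PySceneDetect, as the app does
# internally when placing automatic RIR markers. "input.mp4" is a placeholder.
from scenedetect import detect, ContentDetector

scene_list = detect("input.mp4", ContentDetector())
for start, end in scene_list:
    # Each scene is a (start, end) pair of FrameTimecode objects.
    print(f"Scene: {start.get_seconds():.2f}s -> {end.get_seconds():.2f}s")
```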
- Conda: It is recommended to use Conda for managing the environment. Ensure you have Anaconda or Miniconda installed.
- CUDA (for GPU acceleration): The `requirements-cu124.txt` file is set up for CUDA 12.4. If you have a different CUDA version or want to run on CPU, you might need to adjust dependencies (particularly `torch`).
- Clone the repository (if you haven't already):

  ```bash
  git clone <your-repository-url>
  cd <repository-folder>
  ```
- Create a Conda environment:

  ```bash
  conda create -n VideoVoiceSwapper python=3.10 -y
  ```

  (Python 3.10 is used as a commonly compatible version; adjust if needed.)
- Activate the Conda environment:

  ```bash
  conda activate VideoVoiceSwapper
  ```
- Install dependencies: Make sure you have a `requirements-cu124.txt` file in the root of your project directory. This file lists all necessary Python packages.

  ```bash
  pip install -r requirements-cu124.txt
  ```

  Note: If `ffmpeg` is not found by the application, you might need to install it system-wide and ensure it is on your system's PATH, or install the `ffmpeg-python` package if it is not already in the requirements.
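If you are unsure whether `ffmpeg` is reachable, a quick check from the same environment (standard library only, purely illustrative):

```python
# Quick check that the ffmpeg binary is on PATH for the current environment.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH - install it system-wide or add it to PATH.")
else:
    print(f"ffmpeg found at: {ffmpeg_path}")
```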
- Run the application:

  ```bash
  python main_app.py
  ```
- Hugging Face Token (Optional):
- For speaker diarization, a Hugging Face User Access Token is required.
- Enter your token in the "HuggingFace Token" field under "Global Configuration".
- If you don't have one, you can create it on the Hugging Face website (Settings -> Access Tokens).
- You also need to accept the usage conditions for the pyannote models on Hugging Face.
- Select Input MP4: Click "Select Input MP4" to load your source video. The audio will be automatically extracted and displayed in the "Source Audio" waveform.
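The extraction happens automatically, but conceptually it is equivalent to asking ffmpeg to drop the video stream and write a WAV file; a sketch (file names and sample rate are placeholder assumptions):

```python
# Sketch of the extraction the app performs for you: pull the audio track out
# of the MP4 as an uncompressed WAV. File names and sample rate are placeholders.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp4", "-vn",
     "-acodec", "pcm_s16le", "-ar", "44100", "source_audio.wav"],
    check=True,
)
```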
- Select Target WAV: Click "Select Target WAV" to choose the voice/singing style you want to convert to.
- Select Output Directory: Click "Select Output Directory" to specify where the final processed video will be saved.
- Waveform Interaction:
- Click on the waveform to set the playback cursor.
- Drag on the waveform to select a region.
- Speaker Diarization (Optional):
- Click "Detect Speakers". This process may take some time.
- Once complete, detected speakers will appear in the dropdown list.
- Select a speaker from the list.
- Click "Apply Selected Speaker's Markers" to add time markers for all segments spoken by that speaker.
- Manual Time Markers for Voice Conversion:
- Select a region on the source audio waveform.
- Click "Add Current Waveform Selection as Marker".
- These markers define the segments that will undergo voice conversion.
- Manage Markers: Use "Remove Selected Source Markers" or "Clear All Source Markers" as needed.
- Adjust parameters like "Diffusion Steps", "Length Adjustment", "Inference CFG Rate", "F0 Condition", and "Pitch Shift" to fine-tune the voice conversion process. It is recommended to change only the diffusion steps and leave the other parameters at their defaults.
- Once source audio markers are set and VC parameters are configured, click the "Swap Voice (Process Marked Segments)" button.
- The application will process only the marked segments from the source audio, convert their voice, and then stitch them back into the full-length audio track, leaving non-marked segments unchanged (see the sketch after this list).
- The resulting "swapped" audio will be loaded into the "Swapped Audio" waveform display (Right Pane).
- Waveform Interaction & Playback: Similar to the source audio pane, you can interact with and play the swapped audio.
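The stitching itself amounts to splicing arrays; a simplified sketch of how converted segments could be written back over the original track, assuming both signals share the same sample rate (the application's actual implementation may differ):

```python
# Simplified sketch of segment stitching: copy converted audio back into the
# original track only where markers were set. Assumes a shared sample rate;
# the application's real implementation may differ.
import numpy as np

def stitch_segments(original, converted_segments, markers, sr):
    """markers: list of (start_sec, end_sec); converted_segments: matching arrays."""
    output = original.copy()
    for (start_sec, end_sec), segment in zip(markers, converted_segments):
        start, end = int(start_sec * sr), int(end_sec * sr)
        length = min(end - start, len(segment))
        output[start:start + length] = segment[:length]
    return output
```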
- RIR Configuration Mode:
- Automatic (Speech2RIR):
- Check the "Automatically use RIR from Source Video (Speech2RIR)" box.
- If Speech2RIR models are correctly set up and source audio is loaded, the system will attempt to estimate an RIR from the original video's audio.
- This estimated RIR will be applied globally to the entire swapped audio track during final processing.
- Manual RIR marker controls below will be disabled.
- Manual/Simulated RIR Markers:
- Ensure the "Automatically use RIR from Source Video (Speech2RIR)" box is unchecked.
- (Optional) Estimate Source RT60: In the "Global Configuration" section, click "Calculate RT60 from Source Audio". The result can be used as a reference.
- Select a region on the swapped audio waveform where you want to apply specific reverberation.
- (Optional) Use Estimated RT60: Check "Use Estimated Source RT60 for RIR markers" if you want the RT60 value from the global calculation to be used for the new marker.
- Target RT60: If not using the estimated source RT60, manually enter a "Target RT60 (s)" value.
- Room Dimensions: The "Default Room Dimensions" from "Global Configuration" (influenced by the "Environment Preset") will be used to simulate the RIR for this segment (see the sketch after this list).
- Click "Add Current Swapped Waveform Selection as RIR Marker".
- Repeat for other segments if you want different RIR characteristics in different parts of the audio.
- Manage RIR Markers: Use "Remove Selected RIR Markers" or "Clear All RIR Markers" as needed for manual/simulated markers.
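The app's internal simulation code may differ, but the idea of turning room dimensions plus a target RT60 into an RIR can be sketched with pyroomacoustics (all values below are placeholders):

```python
# Illustrative RIR simulation from room dimensions and a target RT60 using
# pyroomacoustics; the app's own simulation may differ. Values are placeholders.
import pyroomacoustics as pra

room_dim = [6.0, 4.0, 3.0]   # width, depth, height in metres
rt60_target = 0.4            # seconds

# Sabine's formula gives an absorption coefficient and reflection order that
# approximately reproduce the requested RT60 in this room.
e_absorption, max_order = pra.inverse_sabine(rt60_target, room_dim)
room = pra.ShoeBox(
    room_dim, fs=16000,
    materials=pra.Material(e_absorption), max_order=max_order,
)
room.add_source([2.0, 1.5, 1.5])
room.add_microphone([4.0, 2.5, 1.5])
room.compute_rir()
rir = room.rir[0][0]  # impulse response from source 0 to microphone 0
```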
- Once you are satisfied with the voice swap and RIR configuration (either Speech2RIR mode or manual RIR markers), click the "Final Process" button.
- The application will:
- Apply the chosen RIRs to the swapped audio (either the global Speech2RIR estimate or the segmented simulated RIRs); see the convolution sketch after this list.
- Merge this final audio track with the original video frames.
- The path to the output video will be displayed in the "Output Video Path" field.
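Applying an RIR, whether estimated by Speech2RIR or simulated, boils down to convolving the audio with the impulse response; a minimal sketch with SciPy (the app may post-process differently):

```python
# Minimal sketch of applying an RIR: convolve the swapped audio with the
# impulse response and renormalise to avoid clipping.
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(audio, rir):
    wet = fftconvolve(audio, rir, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet
```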
- FFmpeg: Ensure FFmpeg is installed and accessible in your system's PATH for the final video creation step.
- CUDA Errors: If you encounter CUDA-related errors, ensure your NVIDIA drivers, CUDA toolkit, and PyTorch version are compatible. The provided `requirements-cu124.txt` targets CUDA 12.4.
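A quick way to confirm that PyTorch sees your GPU and which CUDA version it was built against:

```python
# Quick sanity check of the PyTorch / CUDA setup.
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built for CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```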