Note: This application is currently under development and still needs improvement. Features and functionality may change.
Swaps voices in source mp4 videos with a voice in a target audio file.
- Perform zero-shot voice conversion, including singing voice conversion, using Seed-VC models.
- Estimate Room Impulse Responses (RIR) directly from reverberant speech in the source video using Speech2RIR.
- Scene cut detection for automatic RIR markers.
- Calculate the RT60 (reverberation time) of audio using BlindRT60.
- Automatically identify and segment different speakers in an audio track through speaker diarization (powered by pyannote.audio).
- Manually define and apply simulated room acoustics to audio segments.
- Mark audio segments for targeted voice conversion and RIR application.
- Merge the processed audio back into the original video.
- Seed-VC Zero-Shot Voice Conversion: Swap voices (and other human sounds) in audio segments without needing pre-trained models for specific target speakers. Supports both speech and singing.
- Speech2RIR Integration: Automatically estimate the acoustic characteristics (RIR) of the environment from the original video's audio.
- BlindRT60 Estimation: Calculate the reverberation time (RT60) from an audio file.
- PySceneDetect: Automatically set RIR RT60 value markers based on detected scene changes (see the sketch after this list).
- ClearVoice: Remove background noise from the source and target voice audio.
- Pyannote Speaker Diarization 3.1: Identify different speakers and their speech segments in the source audio.
- Segment-based Processing: Apply voice conversion and RIR effects to specific, user-defined time segments.
- Manual RIR Simulation: Define custom room dimensions and target RT60 values to simulate and apply reverberation.
- Parameter Control: Fine-tune voice conversion parameters (e.g., diffusion steps, pitch shift) and RIR settings.
- Video Merging: Combine the final processed audio with the original video frames.
- Mix with original background: Mix the final audio with the original background. Thanks to Asmirald for the code!
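For readers curious how the scene-cut feature maps to code: the application wires this up internally, but a minimal sketch of detecting cuts with PySceneDetect's content detector looks roughly like the following (`input.mp4` is a placeholder file name):

```python
# Illustrative only: detect scene cuts with PySceneDetect, as the app does
# internally when placing automatic RIR markers. "input.mp4" is a placeholder.
from scenedetect import detect, ContentDetector

scene_list = detect("input.mp4", ContentDetector())
for start, end in scene_list:
    # Each scene is a (start, end) pair of FrameTimecode objects.
    print(f"Scene: {start.get_seconds():.2f}s -> {end.get_seconds():.2f}s")
```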
- Conda: It is recommended to use Conda for managing the environment. Ensure you have Anaconda or Miniconda installed.
- CUDA (for GPU acceleration): The `requirements-cu124.txt` file is set up for CUDA 12.4. If you have a different CUDA version or want to run on CPU, you might need to adjust dependencies (particularly `torch`).
- Clone the repository (if you haven't already):

  ```bash
  git clone <your-repository-url>
  cd <repository-folder>
  ```
- Create a Conda environment:

  ```bash
  conda create -n VideoVoiceSwapper python=3.10 -y
  ```

  (Python 3.10 is used as a commonly compatible version; adjust if needed.)
- Activate the Conda environment:

  ```bash
  conda activate VideoVoiceSwapper
  ```
- Install dependencies: Make sure you have a `requirements-cu124.txt` file in the root of your project directory. This file lists all necessary Python packages.

  ```bash
  pip install -r requirements-cu124.txt
  ```

  Note: If `ffmpeg` is not found by the application, you might need to install it system-wide and ensure it is on your system's PATH, or install the `ffmpeg-python` package if it is not already in the requirements.
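If you are unsure whether `ffmpeg` is reachable, a quick check from the same environment (standard library only, purely illustrative):

```python
# Quick check that the ffmpeg binary is on PATH for the current environment.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH - install it system-wide or add it to PATH.")
else:
    print(f"ffmpeg found at: {ffmpeg_path}")
```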
- Run the application:

  ```bash
  python main_app.py
  ```
- Hugging Face Token (Optional):
- For speaker diarization, a Hugging Face User Access Token is required.
- Enter your token in the "HuggingFace Token" field under "Global Configuration".
- If you don't have one, you can create it on the Hugging Face website (Settings -> Access Tokens).
- You also need to accept the usage conditions for the pyannote models on Hugging Face.
- Select Input MP4: Click "Select Input MP4" to load your source video. The audio will be automatically extracted and displayed in the "Source Audio" waveform.
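The extraction happens automatically, but conceptually it is equivalent to asking ffmpeg to drop the video stream and write a WAV file; a sketch (file names and sample rate are placeholder assumptions):

```python
# Sketch of the extraction the app performs for you: pull the audio track out
# of the MP4 as an uncompressed WAV. File names and sample rate are placeholders.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp4", "-vn",
     "-acodec", "pcm_s16le", "-ar", "44100", "source_audio.wav"],
    check=True,
)
```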
- Select Target WAV: Click "Select Target WAV" to choose the voice/singing style you want to convert to.
- Select Output Directory: Click "Select Output Directory" to specify where the final processed video will be saved.
- Waveform Interaction:
- Click on the waveform to set the playback cursor.
- Drag on the waveform to select a region.
- Speaker Diarization (Optional):
- Click "Detect Speakers". This process may take some time.
- Once complete, detected speakers will appear in the dropdown list.
- Select a speaker from the list.
- Click "Apply Selected Speaker's Markers" to add time markers for all segments spoken by that speaker.
- Manual Time Markers for Voice Conversion:
- Select a region on the source audio waveform.
- Click "Add Current Waveform Selection as Marker".
- These markers define the segments that will undergo voice conversion.
- Manage Markers: Use "Remove Selected Source Markers" or "Clear All Source Markers" as needed.
- Adjust parameters like "Diffusion Steps", "Length Adjustment", "Inference CFG Rate", "F0 Condition", and "Pitch Shift" to fine-tune the voice conversion process. It is recommended to change only the diffusion steps and leave the other parameters at their defaults.
- Once source audio markers are set and VC parameters are configured, click the "Swap Voice (Process Marked Segments)" button.
- The application will process only the marked segments from the source audio, convert their voice, and then stitch them back into the full-length audio track, leaving non-marked segments unchanged (see the sketch after this list).
- The resulting "swapped" audio will be loaded into the "Swapped Audio" waveform display (Right Pane).
- Waveform Interaction & Playback: Similar to the source audio pane, you can interact with and play the swapped audio.
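The stitching itself amounts to splicing arrays; a simplified sketch of how converted segments could be written back over the original track, assuming both signals share the same sample rate (the application's actual implementation may differ):

```python
# Simplified sketch of segment stitching: copy converted audio back into the
# original track only where markers were set. Assumes a shared sample rate;
# the application's real implementation may differ.
import numpy as np

def stitch_segments(original, converted_segments, markers, sr):
    """markers: list of (start_sec, end_sec); converted_segments: matching arrays."""
    output = original.copy()
    for (start_sec, end_sec), segment in zip(markers, converted_segments):
        start, end = int(start_sec * sr), int(end_sec * sr)
        length = min(end - start, len(segment))
        output[start:start + length] = segment[:length]
    return output
```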
- RIR Configuration Mode:
- Automatic (Speech2RIR):
- Check the "Automatically use RIR from Source Video (Speech2RIR)" box.
- If Speech2RIR models are correctly set up and source audio is loaded, the system will attempt to estimate an RIR from the original video's audio.
- This estimated RIR will be applied globally to the entire swapped audio track during final processing.
- Manual RIR marker controls below will be disabled.
- Manual/Simulated RIR Markers:
- Ensure the "Automatically use RIR from Source Video (Speech2RIR)" box is unchecked.
- (Optional) Estimate Source RT60: In the "Global Configuration" section, click "Calculate RT60 from Source Audio". The result can be used as a reference.
- Select a region on the swapped audio waveform where you want to apply specific reverberation.
- (Optional) Use Estimated RT60: Check "Use Estimated Source RT60 for RIR markers" if you want the RT60 value from the global calculation to be used for the new marker.
- Target RT60: If not using the estimated source RT60, manually enter a "Target RT60 (s)" value.
- Room Dimensions: The "Default Room Dimensions" from "Global Configuration" (influenced by the "Environment Preset") will be used to simulate the RIR for this segment (see the sketch after this list).
- Click "Add Current Swapped Waveform Selection as RIR Marker".
- Repeat for other segments if you want different RIR characteristics in different parts of the audio.
- Manage RIR Markers: Use "Remove Selected RIR Markers" or "Clear All RIR Markers" as needed for manual/simulated markers.
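The app's internal simulation code may differ, but the idea of turning room dimensions plus a target RT60 into an RIR can be sketched with pyroomacoustics (all values below are placeholders):

```python
# Illustrative RIR simulation from room dimensions and a target RT60 using
# pyroomacoustics; the app's own simulation may differ. Values are placeholders.
import pyroomacoustics as pra

room_dim = [6.0, 4.0, 3.0]   # width, depth, height in metres
rt60_target = 0.4            # seconds

# Sabine's formula gives an absorption coefficient and reflection order that
# approximately reproduce the requested RT60 in this room.
e_absorption, max_order = pra.inverse_sabine(rt60_target, room_dim)
room = pra.ShoeBox(
    room_dim, fs=16000,
    materials=pra.Material(e_absorption), max_order=max_order,
)
room.add_source([2.0, 1.5, 1.5])
room.add_microphone([4.0, 2.5, 1.5])
room.compute_rir()
rir = room.rir[0][0]  # impulse response from source 0 to microphone 0
```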
- Once you are satisfied with the voice swap and RIR configuration (either Speech2RIR mode or manual RIR markers), click the "Final Process" button.
- The application will:
- Apply the chosen RIRs to the swapped audio (either the global Speech2RIR estimate or the segmented simulated RIRs); see the convolution sketch after this list.
- Merge this final audio track with the original video frames.
- The path to the output video will be displayed in the "Output Video Path" field.
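Applying an RIR, whether estimated by Speech2RIR or simulated, boils down to convolving the audio with the impulse response; a minimal sketch with SciPy (the app may post-process differently):

```python
# Minimal sketch of applying an RIR: convolve the swapped audio with the
# impulse response and renormalise to avoid clipping.
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(audio, rir):
    wet = fftconvolve(audio, rir, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet
```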
- FFmpeg: Ensure FFmpeg is installed and accessible in your system's PATH for the final video creation step.
- CUDA Errors: If you encounter CUDA-related errors, ensure your NVIDIA drivers, CUDA toolkit, and PyTorch version are compatible. The provided `requirements-cu124.txt` targets CUDA 12.4.
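A quick way to confirm that PyTorch sees your GPU and which CUDA version it was built against:

```python
# Quick sanity check of the PyTorch / CUDA setup.
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built for CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```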