
VideoVoiceSwap

Note: This application is currently under development and needs improvements. Features and functionality may change.


Description

Swaps voices in source MP4 videos with a voice from a target audio file.

  • Perform zero-shot voice conversion, including singing voice conversion, using Seed-VC models.
  • Estimate Room Impulse Responses (RIRs) directly from the reverberant speech in the source video using Speech2RIR.
  • Detect scene cuts to place RIR markers automatically.
  • Calculate the RT60 (reverberation time) of audio using BlindRT60.
  • Automatically identify and segment different speakers in an audio track through speaker diarization (powered by pyannote.audio).
  • Manually define and apply simulated room acoustics to audio segments.
  • Mark audio segments for targeted voice conversion and RIR application.
  • Merge the processed audio back into the original video.

Features

  • Seed-VC Zero-Shot Voice Conversion: Swap voices (and other human sounds) in audio segments without needing pre-trained models for specific target speakers. Supports both speech and singing.
  • Speech2RIR Integration: Automatically estimate the acoustic characteristics (RIR) of the environment from the original video's audio.
  • BlindRT60 Estimation: Calculate the reverberation time (RT60) from an audio file.
  • PySceneDetect: Automatically set RIR RT60 markers based on detected scene changes.
  • ClearVoice: Remove background noise from the source and target voice audio.
  • pyannote Speaker Diarization 3.1: Identify different speakers and their speech segments in the source audio.
  • Segment-based Processing: Apply voice conversion and RIR effects to specific, user-defined time segments.
  • Manual RIR Simulation: Define custom room dimensions and target RT60 values to simulate and apply reverberation.
  • Parameter Control: Fine-tune voice conversion parameters (e.g., diffusion steps, pitch shift) and RIR settings.
  • Video Merging: Combine the final processed audio with the original video frames.
  • Mix with original background: Mix the final audio with the original background. Thanks to Asmirald for the code!
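BlindRT60 estimates reverberation time blindly, i.e. from reverberant speech alone. For intuition, the classical (non-blind) Schroeder method computes RT60 from a known impulse response; the sketch below is a minimal illustration of that idea, not the algorithm BlindRT60 actually uses (function name and details are assumptions):

```python
import math

def schroeder_rt60(h, sr):
    """Estimate RT60 from an impulse response via Schroeder backward
    integration: build the energy decay curve (EDC), find where it
    crosses -5 dB and -25 dB, and extrapolate that 20 dB span to 60 dB."""
    energy = [x * x for x in h]
    total = sum(energy)
    edc, running = [], total
    for e in energy:
        # running == sum of energy from this sample to the end
        edc.append(10 * math.log10(running / total))
        running -= e
    t5 = next(i for i, db in enumerate(edc) if db <= -5) / sr
    t25 = next(i for i, db in enumerate(edc) if db <= -25) / sr
    return 3.0 * (t25 - t5)  # 20 dB span scaled up to 60 dB
```

On a synthetic exponentially decaying impulse response the estimate recovers the designed RT60 to within a few percent.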

Installation

Prerequisites

  • Conda: It is recommended to use Conda for managing the environment. Ensure you have Anaconda or Miniconda installed.
  • CUDA (for GPU acceleration): The requirements-cu124.txt file is set up for CUDA 12.4. If you have a different CUDA version or want to run on CPU, you might need to adjust dependencies (particularly torch).

Steps

  1. Clone the repository (if you haven't already):

    git clone <your-repository-url>
    cd <repository-folder>
  2. Create a Conda environment:

    conda create -n VideoVoiceSwapper python=3.10 -y

    (Using Python 3.10 as a common compatible version, adjust if needed.)

  3. Activate the Conda environment:

    conda activate VideoVoiceSwapper
  4. Install dependencies: Make sure you have a requirements-cu124.txt file in the root of your project directory. This file should list all necessary Python packages.

    pip install -r requirements-cu124.txt

    Note: If ffmpeg is not found by the application, you might need to install it system-wide and ensure it's in your system's PATH, or install the ffmpeg-python package if it's not already in requirements.

  5. Run the application:

    python main_app.py

How to Use

1. Initial Setup & File Loading

  • Hugging Face Token (Optional): Enter a Hugging Face access token if you plan to use speaker diarization; the pyannote.audio models are gated on Hugging Face and require a token from an account that has accepted their terms.

  • Select Input MP4: Click "Select Input MP4" to load your source video. The audio will be automatically extracted and displayed in the "Source Audio" waveform.

  • Select Target WAV: Click "Select Target WAV" to choose the voice/singing style you want to convert to.

  • Select Output Directory: Click "Select Output Directory" to specify where the final processed video will be saved.
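The audio-extraction step can be reproduced outside the GUI with a plain ffmpeg call. The sketch below only builds the command; the file paths are placeholders, and ffmpeg must be on your PATH before you actually run it:

```python
import subprocess

def extract_audio_cmd(input_mp4: str, output_wav: str, sample_rate: int = 44100) -> list[str]:
    """Build an ffmpeg command that strips the video stream (-vn)
    and writes the audio as uncompressed 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # overwrite output without asking
        "-i", input_mp4,         # source video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", str(sample_rate), # resample to a fixed rate
        output_wav,
    ]

cmd = extract_audio_cmd("input.mp4", "source_audio.wav")
# subprocess.run(cmd, check=True)  # uncomment once ffmpeg is installed
```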

2. Source Audio Processing (Left Pane)

  • Waveform Interaction:

    • Click on the waveform to set the playback cursor.
    • Drag on the waveform to select a region.
  • Speaker Diarization (Optional):

    1. Click "Detect Speakers". This process may take some time.
    2. Once complete, detected speakers will appear in the dropdown list.
    3. Select a speaker from the list.
    4. Click "Apply Selected Speaker's Markers" to add time markers for all segments spoken by that speaker.
  • Manual Time Markers for Voice Conversion:

    1. Select a region on the source audio waveform.
    2. Click "Add Current Waveform Selection as Marker".
    • These markers define the segments that will undergo voice conversion.
  • Manage Markers: Use "Remove Selected Source Markers" or "Clear All Source Markers" as needed.
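Conceptually, "Apply Selected Speaker's Markers" turns the diarization output (a list of speaker turns) into time markers. This hypothetical helper sketches that step, merging turns of the chosen speaker that are separated by only a short gap (the function name and `gap` parameter are assumptions, not the app's actual API):

```python
def speaker_markers(turns, speaker, gap=0.2):
    """Collect (start, end, label) diarization turns for one speaker
    and merge segments separated by less than `gap` seconds."""
    segs = sorted((s, e) for s, e, lab in turns if lab == speaker)
    merged = []
    for start, end in segs:
        if merged and start - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous marker
        else:
            merged.append([start, end])
    return [(round(s, 3), round(e, 3)) for s, e in merged]

turns = [(0.0, 1.5, "SPEAKER_00"), (1.6, 3.0, "SPEAKER_00"), (3.5, 5.0, "SPEAKER_01")]
speaker_markers(turns, "SPEAKER_00")  # → [(0.0, 3.0)]
```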

3. Voice Conversion Parameters (Middle Pane)

  • Adjust parameters like "Diffusion Steps", "Length Adjustment", "Inference CFG Rate", "F0 Condition", and "Pitch Shift" to fine-tune the voice conversion process. It is recommended to change only the diffusion steps and leave the other parameters at their defaults.

4. Perform Voice Swap

  • Once source audio markers are set and VC parameters are configured, click the "Swap Voice (Process Marked Segments)" button.
  • The application will process only the marked segments from the source audio, convert their voices, and stitch them back into the full-length audio track, leaving the non-marked segments unchanged.
  • The resulting "swapped" audio will be loaded into the "Swapped Audio" waveform display (Right Pane).
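The stitching step can be pictured as overwriting sample ranges in the original track. This is a minimal sketch, assuming each converted segment has already been length-adjusted to match its marker (the function name is hypothetical):

```python
def stitch(original, converted_segments, sr=16000):
    """Copy the original audio, then overwrite each marked region
    with its converted samples. `converted_segments` maps a
    (start_seconds, end_seconds) marker to a converted sample list
    of exactly the marker's length."""
    out = list(original)
    for (start_s, end_s), conv in converted_segments.items():
        a, b = round(start_s * sr), round(end_s * sr)
        if len(conv) != b - a:
            raise ValueError("converted segment must match marker length")
        out[a:b] = conv
    return out
```

In the real pipeline the voice-converted output may differ in length ("Length Adjustment"), so an implementation would resample or pad before this replacement.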

5. Swapped Audio Processing & RIR Application (Right Pane)

  • Waveform Interaction & Playback: Similar to the source audio pane, you can interact with and play the swapped audio.
  • RIR Configuration Mode:
    • Automatic (Speech2RIR):
      1. Check the "Automatically use RIR from Source Video (Speech2RIR)" box.
      2. If Speech2RIR models are correctly set up and source audio is loaded, the system will attempt to estimate an RIR from the original video's audio.
      3. This estimated RIR will be applied globally to the entire swapped audio track during final processing.
      4. Manual RIR marker controls below will be disabled.
    • Manual/Simulated RIR Markers:
      1. Ensure the "Automatically use RIR from Source Video (Speech2RIR)" box is unchecked.
      2. (Optional) Estimate Source RT60: In the "Global Configuration" section, click "Calculate RT60 from Source Audio". The result can be used as a reference.
      3. Select a region on the swapped audio waveform where you want to apply specific reverberation.
      4. (Optional) Use Estimated RT60: Check "Use Estimated Source RT60 for RIR markers" if you want the RT60 value from the global calculation to be used for the new marker.
      5. Target RT60: If not using the estimated source RT60, manually enter a "Target RT60 (s)" value.
      6. Room Dimensions: The "Default Room Dimensions" from "Global Configuration" (influenced by the "Environment Preset") will be used for simulating the RIR for this segment.
      7. Click "Add Current Swapped Waveform Selection as RIR Marker".
      8. Repeat for other segments if you want different RIR characteristics in different parts of the audio.
  • Manage RIR Markers: Use "Remove Selected RIR Markers" or "Clear All RIR Markers" as needed for manual/simulated markers.
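A manual RIR marker pairs a time region with a target RT60. The simplest stand-in for a simulated RIR is exponentially decaying white noise whose amplitude falls by 60 dB at t = RT60; the app likely uses a proper room simulation from the room dimensions, so treat this purely as an illustrative sketch (function name and defaults are assumptions):

```python
import math, random

def synth_rir(rt60, sr=16000, length_s=None):
    """Exponentially decaying white noise: amplitude drops by 60 dB
    (a factor of 1000) at t = rt60 seconds."""
    length_s = length_s or rt60
    decay = math.log(1000) / rt60   # amplitude decay rate per second
    rng = random.Random(0)          # seeded for reproducibility
    return [rng.gauss(0, 1) * math.exp(-decay * n / sr)
            for n in range(int(length_s * sr))]

rir = synth_rir(rt60=0.4)
```

Applying the reverberation to a marked segment is then a convolution of that segment with the RIR.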

6. Final Processing

  • Once you are satisfied with the voice swap and RIR configuration (either Speech2RIR mode or manual RIR markers), click the "Final Process" button.
  • The application will:
    1. Apply the chosen RIRs to the swapped audio (either the global Speech2RIR or the segmented simulated RIRs).
    2. Merge this final audio track with the original video frames.
  • The path to the output video will be displayed in the "Output Video Path" field.
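The merge step corresponds to an ffmpeg mux that copies the original video stream untouched and swaps in the processed audio. As with the extraction sketch above, this only builds a plausible command; paths are placeholders and the exact flags the app uses may differ:

```python
def mux_cmd(video_mp4, audio_wav, output_mp4):
    """Build an ffmpeg command that keeps the original video stream
    (-c:v copy) and replaces the audio track with the processed one."""
    return [
        "ffmpeg", "-y",
        "-i", video_mp4,   # original video (input 0)
        "-i", audio_wav,   # processed audio (input 1)
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # no re-encode of video frames
        "-c:a", "aac",     # encode audio to AAC for MP4
        "-shortest",       # stop at the shorter stream
        output_mp4,
    ]
```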

Troubleshooting & Notes

  • FFmpeg: Ensure FFmpeg is installed and accessible in your system's PATH for the final video creation step.
  • CUDA Errors: If you encounter CUDA-related errors, ensure your NVIDIA drivers, CUDA toolkit, and PyTorch version are compatible. The provided requirements-cu124.txt targets CUDA 12.4.

