
WeaveWave: Towards Multimodal Music Generation

WeaveWave Logo


Overview

Artificial Intelligence Generated Content (AIGC), as a next-generation approach to content production, is reshaping the possibilities of artistic creation. This project focuses on the vertical domain of music generation, exploring models that generate music under multimodal conditions (text, images, and videos).

For humans, music creation can be abstracted into two stages: inspiration and implementation. The former originates from the fusion of diverse sensory experiences: stimulation from visual scenes, resonance with literary imagery, association with auditory fragments, and other cross-modal perceptions. The latter is the process of concretizing that inspiration through singing, instrumental performance, and so on.

For machines, can AI music creation mimic these two stages? We believe the task of multimodal music generation simulates precisely this pair: "inspiration" corresponds to the multimodal input data, and "implementation" to the music generation model.

Music creation: humans and machines

However, research on multimodal music generation has not yet garnered widespread attention; most existing work remains confined to music understanding and generation within a single modality, and therefore fails to capture the complex multimodal sources of inspiration in music creation.

To address this gap, we first implemented a text-bridging strategy, harnessing the capabilities of existing multimodal large language models (MLLMs) and text-to-music generation systems. Building on this foundation, we proposed two candidate end-to-end architectures and developed a scalable training pipeline based on MusicGen-Style. This methodological exploration culminated in WeaveWave, a unified music generation framework designed to integrate multimodal inputs through a cohesive generative process.

Text-Bridging

End-to-End 1, based on AudioLDM2

End-to-End 2, based on MusicGen
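To make the text-bridging strategy concrete, here is a minimal sketch of the two-stage pipeline under stated assumptions: `describe_inputs` is a hypothetical placeholder for the MLLM call (e.g., MiniCPM-o-2.6), while the generation step uses Meta's open-source audiocraft MusicGen API [1]. It is an illustration, not the repository's actual code.

```python
# Minimal text-bridging sketch (illustrative only).
# Stage 1 ("inspiration"): an MLLM turns multimodal inputs into a text prompt.
# Stage 2 ("implementation"): a text-to-music model renders audio from it.
from audiocraft.data.audio import audio_write
from audiocraft.models import MusicGen


def describe_inputs(image_path: str) -> str:
    """Hypothetical stand-in for an MLLM call (e.g., MiniCPM-o-2.6).

    A real implementation would prompt the MLLM to describe the mood,
    tempo, and instrumentation suggested by the input image or video.
    """
    return "melancholic lo-fi piano over soft vinyl crackle, slow tempo"


def text_bridge_generate(image_path: str, duration: float = 10.0) -> None:
    prompt = describe_inputs(image_path)         # stage 1: "inspiration"
    model = MusicGen.get_pretrained("facebook/musicgen-small")
    model.set_generation_params(duration=duration)
    wav = model.generate([prompt])               # stage 2: "implementation"
    # Write bridged_sample.wav with loudness normalization.
    audio_write("bridged_sample", wav[0].cpu(), model.sample_rate,
                strategy="loudness")
```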

Features

  • Multimodal Music Generation: Create music from text, music, images, or video inputs
  • Integration Framework: Leverages existing multimodal large language models (MLLMs) and text-to-music (T2M) models
  • (🏗️) Custom Training Pipeline: Specialized pipeline for multimodal music generation
  • (🏗️) Comprehensive Evaluation Tools: Metrics for assessing music quality and multimodal coherence

Project Status

🏗️ Work in Progress

  • ✅ Architecture design
  • ✅ Training pipeline proposal based on MusicGen-Style
  • 🔄 Dataset preparation and curation
  • 🔄 Model training and fine-tuning
  • 🔄 Evaluation metrics design
  • ✅ Web app development for demo and testing (using Gradio)

Installation

🔄 Will be updated soon

Demo

In this demo, we showcase the WeaveWave web app implementing our text-bridging framework. It integrates MiniCPM-o-2.6 for fusing multimodal conditions and Meta’s MusicGen for music generation.

WeaveWave: Web app built with Gradio
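For orientation, below is a minimal sketch of how such a Gradio front end could be assembled. The component layout and the `generate_music` callback are assumptions for illustration, not the app's actual implementation.

```python
# Sketch of a Gradio UI for multimodal music generation (assumed layout;
# the callback body is a placeholder, not the repository's code).
import gradio as gr


def generate_music(text, image, video):
    # Placeholder: an MLLM (e.g., MiniCPM-o-2.6) would fuse the text/image/
    # video conditions into one prompt, which a T2M model (e.g., MusicGen)
    # renders to audio; return the path of the generated waveform.
    raise NotImplementedError("plug in the text-bridging pipeline here")


demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Text condition"),
        gr.Image(label="Image condition", type="filepath"),
        gr.Video(label="Video condition"),
    ],
    outputs=gr.Audio(label="Generated music", type="filepath"),
    title="WeaveWave: Multimodal Music Generation",
)

if __name__ == "__main__":
    demo.launch()
```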

References

[1] Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., & Défossez, A. (2024). Simple and controllable music generation. arXiv:2306.05284.

[2] Rouard, S., Adi, Y., Copet, J., Roebel, A., & Défossez, A. (2024). Audio conditioning for music generation via discrete bottleneck features. arXiv:2407.12563.

[3] Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., & Plumbley, M. D. (2024). AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv:2308.05734.

[4] Rinaldi, I., Fanelli, N., Castellano, G., & Vessio, G. (2024). Art2Mus: Bridging visual arts and music through cross-modal generation. arXiv:2410.04906.

[5] MiniCPM-o Team. (2025). MiniCPM-o: A GPT-4o level MLLM for vision, speech, and multimodal live streaming on your phone. OpenBMB.
