
🎵 AutoSong – Lyric-Driven Autoregressive Composition

🧠 Purpose

AutoSong explores whether a GPT-style autoregressive Transformer can compose complete pieces of music directly from full-song lyrics and a genre tag.
The model receives the entire lyric script up front, allowing it to decide global form (verse–chorus order, bridge placement, etc.) while it generates audio continuously.

🎯 Prototype Goals

| Stage | What we demonstrate |
| --- | --- |
| 1. Text → Conditioning | Encode lyrics & genre with a frozen BERT-like encoder → dense matrix $M$. |
| 2. Audio → Continuous Latents | Compress 256 × 256-bin mel-spectrogram blocks with a convolutional autoencoder (AE). Encoder $E: (T, F) \to (C, 16, 16)$; no quantization, latent tensors remain real-valued. |
| 3. Transformer Autoregression | A causal decoder predicts the next latent patch $\hat{z}_{t+1}$ given past latents $z_{\le t}$ and cross-attention to $M$. |
| 4. Latent → Audio | The AE decoder reconstructs mel frames; Griffin–Lim inverts them to waveform. |

Success is measured by both objective reconstruction (L1 on mels) and subjective musicality of new lyrics-conditioned generations.
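
To make Stage 2 concrete, here is a minimal sketch of a convolutional mel autoencoder in PyTorch. The (C, 16, 16) latent shape comes from the table above; the layer widths, GELU activations, and the `latent_channels=4` default are illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn


class MelAutoencoder(nn.Module):
    """Compress a (1, 256, 256) mel block into a real-valued (C, 16, 16) latent."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # Four stride-2 convolutions halve resolution each time: 256 -> 128 -> 64 -> 32 -> 16.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(128, latent_channels, 4, stride=2, padding=1),
        )
        # Mirror image with transposed convolutions: 16 -> 32 -> 64 -> 128 -> 256.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, mel):
        z = self.encoder(mel)        # (B, C, 16, 16); stays continuous, no quantization
        recon = self.decoder(z)      # (B, 1, 256, 256)
        return z, recon


# Objective metric from above: L1 reconstruction on mels.
mel = torch.randn(2, 1, 256, 256)
z, recon = MelAutoencoder()(mel)
loss = torch.nn.functional.l1_loss(recon, mel)
```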

🧠 New Architecture Overview: Composer → Decoder

AutoSong now adopts a two-stage generation pipeline, enabling better separation of structure and sound:


🧱 Stage 1: Composer — Text-to-Block Generation

The Composer is a Transformer that operates autoregressively over sentence-level blocks, each corresponding to a segment of lyrics or silence.

  • Input: Full-song lyrics, tokenized as pinyin-tone syllables.

  • Lyrics are segmented into blocks, each annotated with:

    • Token sequence: lyric content (or empty).
    • Finish tag: whether this block concludes a phrase (e.g., end of sentence or pause).
    • Length class: coarse quantization of its expected duration.
    • Role tag: section function (e.g., verse, chorus, bridge).
  • These features are embedded, projected, and combined with temporal and positional embeddings.

  • The Composer autoregressively generates the next block embedding, conditioning on the full lyrics context.

🧠 The Composer models global structure — form, alignment, and phrasing — entirely in the text domain.
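
As an illustration of how the block features listed above might be embedded and composed, here is a minimal PyTorch sketch. The vocabulary sizes, embedding width, mean-pooling of lyric tokens, and the masked encoder stack are assumptions made for brevity, not the repository's exact design.

```python
import torch
import torch.nn as nn


class BlockEmbedder(nn.Module):
    """Turn one lyric block's features into a single vector."""

    def __init__(self, vocab: int = 2048, n_roles: int = 4, n_lengths: int = 8, d: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)         # pinyin-tone syllable tokens
        self.finish = nn.Embedding(2, d)          # does this block conclude a phrase?
        self.length = nn.Embedding(n_lengths, d)  # coarse duration class
        self.role = nn.Embedding(n_roles, d)      # verse / chorus / bridge / silence
        self.proj = nn.Linear(d, d)

    def forward(self, tokens, finish, length, role):
        content = self.tok(tokens).mean(dim=1)    # mean-pool the syllables: (B, d)
        return self.proj(content + self.finish(finish) + self.length(length) + self.role(role))


class Composer(nn.Module):
    """Causal Transformer over block embeddings; output t is the prediction for block t+1."""

    def __init__(self, d: int = 512, n_layers: int = 6, n_heads: int = 8, max_blocks: int = 1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.pos = nn.Embedding(max_blocks, d)    # block-index (positional) embedding

    def forward(self, blocks):                    # blocks: (B, T, d)
        T = blocks.size(1)
        x = blocks + self.pos(torch.arange(T, device=blocks.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(blocks.device)
        return self.core(x, mask=mask)


# One chorus block of eight syllables that ends a phrase, medium length class.
embedder, composer = BlockEmbedder(), Composer()
block = embedder(torch.randint(0, 2048, (1, 8)), torch.tensor([1]),
                 torch.tensor([3]), torch.tensor([1]))
next_block_pred = composer(block.unsqueeze(1))    # (1, 1, 512)
```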


🎼 Stage 2: Decoder — Latent-to-Audio Generation

The Decoder translates the Composer's block sequence into continuous audio, via autoregression in a compressed latent space:

  • A separately trained autoencoder compresses mel spectrograms into real-valued latent patches (e.g., 4×32×32), enabling high-fidelity and low-entropy representation.

    • The encoder is used to preprocess real audio into training latents.
    • The decoder is used to reconstruct mels from predicted latents.
  • The Decoder Transformer:

    • Operates over latent patches.
    • Cross-attends to the sequence of block embeddings generated by the Composer.
    • Predicts residual deltas between latent steps (rather than raw latents), improving temporal smoothness and learning efficiency.
  • Reconstructed mels are inverted to waveform via Griffin–Lim.
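
A minimal sketch of the Decoder Transformer described above, assuming PyTorch; the 4×32×32 patch size comes from the autoencoder description, while the MLP widths and layer counts are illustrative. The important detail is that the transformer output is interpreted as a delta added to the previous latents rather than as the raw next latent.

```python
import torch
import torch.nn as nn


class LatentDecoder(nn.Module):
    """Autoregress over flattened latent patches while cross-attending to Composer blocks."""

    def __init__(self, patch_dim: int = 4 * 32 * 32, d: int = 512,
                 n_layers: int = 6, n_heads: int = 8, max_steps: int = 4096):
        super().__init__()
        self.down = nn.Sequential(nn.Linear(patch_dim, d), nn.GELU(), nn.Linear(d, d))
        self.up = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, patch_dim))
        layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.core = nn.TransformerDecoder(layer, n_layers)
        self.pos = nn.Embedding(max_steps, d)

    def forward(self, latents, block_embs):
        # latents: (B, T, patch_dim) past patches; block_embs: (B, S, d) from the Composer.
        T = latents.size(1)
        x = self.down(latents) + self.pos(torch.arange(T, device=latents.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(latents.device)
        h = self.core(tgt=x, memory=block_embs, tgt_mask=mask)
        delta = self.up(h)           # the network only predicts the change between steps
        return latents + delta       # position t gives the prediction for patch t+1


# Shape check with random inputs: ten past patches, twenty-four planned blocks.
decoder = LatentDecoder()
past = torch.randn(1, 10, 4 * 32 * 32)
blocks = torch.randn(1, 24, 512)
predicted = decoder(past, blocks)    # (1, 10, 4096)
```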


🔄 Summary of the Flow

Lyrics (pinyin) ──▶ Composer ──▶ Sentence Block Embeddings ──▶ Decoder ──▶ Latent Patches ──▶ Audio
  • Level 1: Composer — learns high-level musical planning aligned to text.

  • Level 2: Decoder — realizes the low-level audio from abstract latent plans.

This two-level hierarchy allows the model to compose structured, expressive music directly from lyrics, while handling sound synthesis separately in a learned continuous domain.
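
Tying the two levels together, a hypothetical inference loop could look like the sketch below. The `composer`, `decoder`, and `ae` objects stand in for the modules sketched in the previous sections, their interfaces (including `ae.decode`) are assumptions, and the final inversion uses librosa's Griffin–Lim based mel inversion.

```python
import torch
import librosa


@torch.no_grad()
def generate_song(composer, decoder, ae, block_features, n_steps=128, sr=48000):
    """Lyrics blocks -> block embeddings -> latent patches -> mel -> waveform (sketch)."""
    block_embs = composer(block_features)            # Level 1: plan structure from text
    latents = torch.zeros(1, 1, 4 * 32 * 32)         # seed with a single empty patch
    for _ in range(n_steps):                         # Level 2: roll out audio latents
        pred = decoder(latents, block_embs)          # residual-updated latents
        latents = torch.cat([latents, pred[:, -1:]], dim=1)
    log_mel = ae.decode(latents)                     # assumed helper: latents -> (256, frames) log-mel
    mel_power = librosa.db_to_power(log_mel.squeeze().cpu().numpy())
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)  # Griffin-Lim inversion
```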

🗂️ Dataset Layout

dataset/
├── song_0001/
│   ├── lyrics.txt
│   └── audio.wav  (≥48 kHz mono)
├── song_0002/
│   ├── lyrics.txt
│   └── audio.wav
⋮

Lyrics are converted to pinyin-tone tokens; audio is resampled, converted to 256-bin mels, and cached.
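
A possible preprocessing pass over this layout, assuming `pypinyin` for the tone-annotated syllables and `librosa` for mel extraction; the cache file names (`mel.npy`, `tokens.txt`) and the default FFT settings are assumptions, while the 48 kHz mono audio and 256 mel bins come from the layout above.

```python
import pathlib

import librosa
import numpy as np
from pypinyin import Style, lazy_pinyin


def preprocess_song(song_dir: pathlib.Path, sr: int = 48000, n_mels: int = 256):
    """Cache pinyin-tone tokens and a 256-bin log-mel spectrogram for one song folder."""
    lyrics = (song_dir / "lyrics.txt").read_text(encoding="utf-8")
    # TONE3 appends the tone digit to each syllable, e.g. "ni3 hao3".
    tokens = lazy_pinyin(lyrics, style=Style.TONE3)

    audio, _ = librosa.load(song_dir / "audio.wav", sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    np.save(song_dir / "mel.npy", log_mel)
    (song_dir / "tokens.txt").write_text(" ".join(tokens), encoding="utf-8")
    return tokens, log_mel


for song in sorted(pathlib.Path("dataset").glob("song_*")):
    preprocess_song(song)
```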

Research Log: The Road to Autoregressive Music Generation

  1. I started with a pure end-to-end approach: using EnCodec + transformer autoregression.
    It didn't work at all. Looking back, I think it's because EnCodec's latent codes carry too much information entropy, and end-to-end music generation is simply too ambitious — the structure is too deep and subtle.

  2. So I tried scaling the problem down.
    I thought, "Okay, what if we use an autoencoder to compress the content further, and let the transformer handle high-level composition instead?"
    I attempted VQ-VAE on top of EnCodec, but that didn’t go well either. Discretizing EnCodec's output just wasn’t stable.

  3. Then I realized: maybe we shouldn't compress so aggressively.
    I switched to mel spectrograms, aiming to model continuous sound dynamics instead. This was much more promising: the autoencoder worked.
    But VQ-VAE still didn’t — not sure exactly why. I eventually gave up on quantization and stuck with a continuous latent space.
    The audio was now represented as 2D latent patches. I used a GAN to sharpen the output and avoid blurry reconstructions from L1/L2 losses.
    Visualizing the latents, I noticed they looked like downsampled mels — but with 4 channels showing visibly different structures. That felt like real, meaningful representation.

  4. Then came autoregression again.
    Previously, I was focused on predicting token distributions — but that didn’t transfer to continuous data. So I redesigned the logic:

    • Embeddings became biases added to the latent patches.
    • I downsampled them via a simple MLP.
    • Ran them through a transformer stack.
    • And then upsampled back to patches.

    Most importantly, I shifted to residual prediction, which offloads much of the prediction burden and makes learning much more efficient (see the sketch after this log).

Now, things finally look reasonable and actually work.
We can consider scaling the network and training on more diverse tracks.
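
To make the residual-prediction idea concrete, a training step might compute the loss against the true next patches while the network only has to output the change from the previous patch. This is a sketch under the decoder interface assumed earlier in this README; the L1 objective and shapes are illustrative assumptions, not the repository's training code.

```python
import torch
import torch.nn.functional as F


def decoder_training_step(decoder, latents, block_embs):
    """latents: (B, T, D) ground-truth latent patches from the frozen autoencoder."""
    inputs, targets = latents[:, :-1], latents[:, 1:]
    # The transformer predicts a delta; its output is inputs + delta,
    # so learning only has to account for frame-to-frame change.
    predicted_next = decoder(inputs, block_embs)
    return F.l1_loss(predicted_next, targets)
```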


Reflection

Machine learning and generative modeling are no joke.
Everything feels obvious in hindsight, but incredibly hard when you're stuck.
This process — of hitting walls, debugging ideas, and slowly forming intuition — is what makes research difficult and meaningful.
