Would you please share some training details for stabilityai/stable-audio-open-small? #199

@yynil

I'm trying to replicate the diffusion training process using benjamin-paine/freesound-laion-640k and benjamin-paine/free-music-archive-large.

I'm using the same small network defined by stabilityai/stable-audio-open-small, which has 16 layers and a hidden size of 1024. I have trained for 3 epochs so far and training is still running, but the demo reconstructions are not good.
The diffusion MSE loss fluctuates around 0.85.

Image
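
For readers comparing loss values, here is a minimal sketch of what an MSE loss under a rectified-flow objective typically measures. It assumes the common convention where the noised input is a linear interpolation between data and noise and the target is the velocity `noise - x0`; the exact convention in stable-audio-tools may differ. The names `rectified_flow_loss` and `predict` are hypothetical, and the random vectors are stand-ins for real latents.

```python
import random

def rectified_flow_loss(x0, noise, t, predict):
    """MSE between a predicted velocity and the rectified-flow target (assumed convention)."""
    # Linear interpolation between data (t=0) and noise (t=1).
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    # Velocity target: the direction from data toward noise.
    target = [n - a for a, n in zip(x0, noise)]
    pred = predict(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4096)]     # stand-in for encoded audio latents
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # Gaussian noise sample

# A "model" that always predicts zero: the loss is the mean of (noise - x0)^2.
zero_loss = rectified_flow_loss(x0, noise, 0.5, lambda x_t, t: [0.0] * len(x_t))

# An oracle that sees both x0 and noise; a real model only sees x_t and t.
oracle_loss = rectified_flow_loss(
    x0, noise, 0.5, lambda x_t, t: [n - a for a, n in zip(x0, noise)]
)

print(f"zero predictor: {zero_loss:.3f}, oracle: {oracle_loss:.3f}")
```

Because a real model sees only `x_t` and `t`, it cannot match the oracle, so under this convention a loss that plateaus above zero is expected and is not by itself a sign of failed training.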

The mel spectrogram of the conditioned audio generated with a CFG scale of 7 looks like this:

Image

Although the model seems to have learned to generate low-frequency content, the high-frequency content is noisy.

Could you share some details about the training data, the training time, and the loss trends? I'm not sure whether I should stop the training and add more data.

I copied almost all of the training details from stabilityai/stable-audio-open-small:

"diffusion": {
            "cross_attention_cond_ids": ["prompt", "seconds_total"],
            "global_cond_ids": ["seconds_total"],
            "diffusion_objective": "rectified_flow",
            "distribution_shift_options": {
                "min_length": 256,
                "max_length": 4096
            },
            "type": "dit",
            "config": {
                "io_channels": 64,
                "embed_dim": 1024,
                "depth": 16,
                "num_heads": 8,
                "cond_token_dim": 768,
                "global_cond_dim": 768,
                "transformer_type": "continuous_transformer",
                "attn_kwargs": {
                    "qk_norm": "ln"
                }
            }
        },
        "io_channels": 64
    },
    "training": {
        "use_ema": true,
        "log_loss_info": false,
        "pre_encoded": false,
        "timestep_sampler": "trunc_logit_normal",
        "optimizer_configs": {
            "diffusion": {
                "optimizer": {
                    "type": "AdamW",
                    "config": {
                        "lr": 2e-4,
                        "betas": [0.9, 0.95],
                        "eps": 1e-8,
                        "weight_decay": 0.01,
                        "foreach": true
                    }
                },
                "scheduler": {
                    "type": "InverseLR",
                    "config": {
                        "inv_gamma": 1000000,
                        "power": 0.5,
                        "warmup": 0.995
                    }
                }
            }
        },
        "demo": {
            "demo_every": 2000,
            "demo_steps": 50,
            "num_demos": 8,
            "demo_cond": [
                {"prompt": "Amen break 174 BPM", "seconds_total": 6},
                {"prompt": "People talking in a crowded cafe", "seconds_total": 10},
                {"prompt": "A short, beautiful piano riff in C minor", "seconds_total": 6},
                {"prompt": "Tight Snare Drum", "seconds_total": 1},
                {"prompt": "A dog barking next to a waterfall", "seconds_total": 6},
                {"prompt": "Glitchy bass design, I used Serum for this", "seconds_total": 4},
                {"prompt": "Synth pluck arp with reverb and delay, 128 BPM", "seconds_total": 6},
                {"prompt": "Birds singing in the forest", "seconds_total": 10}
            ],
            "demo_cfg_scales": [1, 4, 7]
        }
    }
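
To see what learning rate the scheduler settings above produce over time, here is a minimal sketch. It assumes `InverseLR` follows the k-diffusion style formula (an exponential warmup factor multiplied by an inverse power decay); the function name `inverse_lr` and the `final_lr` default of 0 are assumptions for illustration, not taken from the config.

```python
def inverse_lr(step, base_lr=2e-4, inv_gamma=1_000_000, power=0.5,
               warmup=0.995, final_lr=0.0):
    """Learning rate at a given optimizer step under the assumed InverseLR formula."""
    # Exponential warmup: starts near zero and approaches 1 as steps accumulate.
    warmup_factor = 1.0 - warmup ** (step + 1)
    # Inverse decay: (1 + step / inv_gamma) ** -power.
    decay = (1.0 + step / inv_gamma) ** -power
    return warmup_factor * max(final_lr, base_lr * decay)

for step in (0, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"step {step:>9}: lr = {inverse_lr(step):.3e}")
```

Under this assumption, warmup is essentially complete after roughly a thousand steps, and with `inv_gamma` at 1000000 the rate stays within about 5% of `2e-4` through the first 100k steps, so the schedule alone is unlikely to explain a change in loss early in training.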
