Description
I'm trying to replicate the diffusion training process using benjamin-paine/freesound-laion-640k and benjamin-paine/free-music-archive-large.
I am using the same small network defined by stabilityai/stable-audio-open-small, which has 16 layers and a hidden size of 1024. I have trained for 3 epochs so far and training is still running, but the demo outputs are not very good.
The diffusion MSE loss fluctuates around 0.85.
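
For context on what a value like 0.85 might mean, here is a minimal sketch of the rectified-flow regression target named in the config (`"diffusion_objective": "rectified_flow"`). It assumes the common convention `x_t = (1 - t) * x0 + t * noise` with target `noise - x0`; the helper names and the Gaussian stand-in for the latents are hypothetical, not taken from the training code. It shows that a model that always predicts zero scores about `1 + latent_std**2`, so the absolute loss depends on the scale of the latents.

```python
import random


def rectified_flow_pair(x0, noise, t):
    """Return the noised input x_t and the velocity target for a timestep t in [0, 1].

    Assumed convention: x_t = (1 - t) * x0 + t * noise, target = noise - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    target = [b - a for a, b in zip(x0, noise)]
    return x_t, target


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def zero_predictor_loss(latent_std=1.0, n=20000, seed=0):
    """Loss of a model that always outputs 0, using Gaussian stand-in latents.

    The Gaussian latents are an assumption for illustration only; real
    autoencoder latents have structure the model can exploit.
    """
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, latent_std) for _ in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    _, target = rectified_flow_pair(x0, noise, t=0.5)
    return mse([0.0] * n, target)


if __name__ == "__main__":
    for std in (0.5, 1.0):
        # Expected value is 1 + std**2, up to sampling noise.
        print(f"latent_std={std}: zero-predictor loss ~ {zero_predictor_loss(std):.3f}")
```

Comparing the logged loss against this kind of baseline, computed with the actual latent statistics, may say more than the raw number alone.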

The mel spectrogram of the audio generated with a CFG scale of 7 looks like this:

[Image: mel spectrogram of the cfg-7 conditioned audio]
Although the model seems to have learned to generate low-frequency content, the high-frequency content is noisy.
Would you please share some details about the training data, the training time, and the loss trends? I'm not sure whether I should stop the training and add more data.
I copied almost all of the training details from stabilityai/stable-audio-open-small:
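
For reference, "cfg-7" refers to classifier-free guidance with a scale of 7. A minimal sketch of the usual combination rule follows; `apply_cfg` is a hypothetical helper, and plain lists stand in for the model's conditional and unconditional outputs.

```python
def apply_cfg(cond_output, uncond_output, cfg_scale):
    """Classifier-free guidance: move from the unconditional output toward the
    conditional one, scaled by cfg_scale. A scale of 1 returns the conditional
    output unchanged."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_output, uncond_output)]


if __name__ == "__main__":
    cond = [0.5, -1.0]
    uncond = [0.25, -0.5]
    for scale in (1, 4, 7):  # the demo_cfg_scales used in this config
        print(scale, apply_cfg(cond, uncond, scale))
```

Because a scale of 7 amplifies the difference between the two outputs sevenfold, comparing the cfg-1 and cfg-4 demos with the cfg-7 one may help tell whether the high-frequency noise comes from the model itself or is being exaggerated by guidance.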
    "diffusion": {
      "cross_attention_cond_ids": ["prompt", "seconds_total"],
      "global_cond_ids": ["seconds_total"],
      "diffusion_objective": "rectified_flow",
      "distribution_shift_options": {
        "min_length": 256,
        "max_length": 4096
      },
      "type": "dit",
      "config": {
        "io_channels": 64,
        "embed_dim": 1024,
        "depth": 16,
        "num_heads": 8,
        "cond_token_dim": 768,
        "global_cond_dim": 768,
        "transformer_type": "continuous_transformer",
        "attn_kwargs": {
          "qk_norm": "ln"
        }
      }
    },
    "io_channels": 64
  },
  "training": {
    "use_ema": true,
    "log_loss_info": false,
    "pre_encoded": false,
    "timestep_sampler": "trunc_logit_normal",
    "optimizer_configs": {
      "diffusion": {
        "optimizer": {
          "type": "AdamW",
          "config": {
            "lr": 2e-4,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 0.01,
            "foreach": true
          }
        },
        "scheduler": {
          "type": "InverseLR",
          "config": {
            "inv_gamma": 1000000,
            "power": 0.5,
            "warmup": 0.995
          }
        }
      }
    },
    "demo": {
      "demo_every": 2000,
      "demo_steps": 50,
      "num_demos": 8,
      "demo_cond": [
        {"prompt": "Amen break 174 BPM", "seconds_total": 6},
        {"prompt": "People talking in a crowded cafe", "seconds_total": 10},
        {"prompt": "A short, beautiful piano riff in C minor", "seconds_total": 6},
        {"prompt": "Tight Snare Drum", "seconds_total": 1},
        {"prompt": "A dog barking next to a waterfall", "seconds_total": 6},
        {"prompt": "Glitchy bass design, I used Serum for this", "seconds_total": 4},
        {"prompt": "Synth pluck arp with reverb and delay, 128 BPM", "seconds_total": 6},
        {"prompt": "Birds singing in the forest", "seconds_total": 10}
      ],
      "demo_cfg_scales": [1, 4, 7]
    }
  }
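
To see what learning rate this scheduler config produces over time, here is a rough sketch of an inverse-decay schedule with exponential warmup. The formula is an assumption based on my understanding of k-diffusion-style `InverseLR` schedulers, and `inverse_lr` is a hypothetical standalone function, not the library's API; the defaults mirror the values in the config above.

```python
def inverse_lr(step, base_lr=2e-4, inv_gamma=1_000_000, power=0.5,
               warmup=0.995, final_lr=0.0):
    """Learning rate at a given optimizer step (assumed formula).

    warmup_mult ramps from (1 - warmup) toward 1 exponentially;
    decay_mult falls off as (1 + step / inv_gamma) ** -power.
    """
    warmup_mult = 1.0 - warmup ** (step + 1)
    decay_mult = (1.0 + step / inv_gamma) ** -power
    return warmup_mult * max(final_lr, base_lr * decay_mult)


if __name__ == "__main__":
    for step in (0, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"step {step:>9}: lr = {inverse_lr(step):.3e}")
```

Under these assumptions the warmup is essentially over after a couple of thousand steps, and with `inv_gamma` at 1,000,000 the learning rate stays close to 2e-4 for the first few hundred thousand steps, so the schedule by itself is unlikely to change the loss trend early in training.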