This repository implements an end-to-end 1D convolutional + self-attention architecture for audio classification on raw waveforms (4 s @ 16 kHz), alongside fine-tuning of wav2vec2 and AST models.
Below is a high-level overview, followed by detailed instructions for setup, training, and inference. The accompanying architecture diagram shows the full data flow and dimensionality annotations.
```
├── data/               # (Optional) Placeholder for dataset scripts or links
├── models/             # Model definitions
├── scripts/            # Training & evaluation scripts (train.py, inference.py)
├── utils/              # Utility functions (audio_preprocessing.py, augmentations.py)
├── model_diagram.png   # Architecture diagram
├── requirements.txt    # Python dependencies
└── README.md           # This file
```
Install dependencies via:

```bash
pip install -r requirements.txt
```

- Clone this repository:

  ```bash
  git clone https://github.com/Toprak2/SANet.git
  cd SANet
  ```

- (Optional) Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install requirements:

  ```bash
  pip install -r requirements.txt
  ```
- The model expects 4-second audio clips at 16 kHz (i.e., 64,000 samples).
- Use `utils/audio_preprocessing.py` to load and preprocess waveforms:
  - Resample to 16 kHz
  - Convert stereo to mono
  - Crop or pad to exactly 64,000 samples
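For reference, a minimal sketch of that pipeline with `torchaudio`; the helper name here is hypothetical, and the exact behavior of `utils/audio_preprocessing.py` may differ:

```python
import torch
import torchaudio

TARGET_SR = 16_000
TARGET_LEN = 4 * TARGET_SR  # 4 s at 16 kHz = 64,000 samples

def load_waveform(path: str) -> torch.Tensor:
    """Load a clip and normalize it to 4 s of 16 kHz mono audio (hypothetical helper)."""
    wav, sr = torchaudio.load(path)                                # (channels, time)
    if sr != TARGET_SR:                                            # resample to 16 kHz
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    if wav.size(0) > 1:                                            # stereo -> mono
        wav = wav.mean(dim=0, keepdim=True)
    if wav.size(1) > TARGET_LEN:                                   # crop ...
        wav = wav[:, :TARGET_LEN]
    else:                                                          # ... or zero-pad
        wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.size(1)))
    return wav                                                     # (1, 64000)
```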
Will be added later
- For training, organize data into directories by class, e.g.:
```
data/
├── denoised_clips/
│   ├── class1/
│   │   ├── clip1.wav
│   │   ├── clip2.wav
│   │   └── ...
│   ├── class2/
│   │   ├── clip1.wav
│   │   ├── clip2.wav
│   │   └── ...
│   └── ...
└── speech_segments/   # For training with dynamic data creation
    ├── class1/
    │   ├── clip1.wav
    │   ├── clip2.wav
    │   └── ...
    ├── class2/
    │   ├── clip1.wav
    │   ├── clip2.wav
    │   └── ...
    └── ...
```
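As an illustration, one way to enumerate such a layout into `(path, label)` pairs; the `data/denoised_clips` root comes from the example above, and this helper is not part of the repository:

```python
from pathlib import Path

def list_clips(root: str = "data/denoised_clips") -> list[tuple[Path, int]]:
    """Map each class subdirectory to an integer label and collect its clips."""
    base = Path(root)
    classes = sorted(d.name for d in base.iterdir() if d.is_dir())
    label_of = {name: i for i, name in enumerate(classes)}
    return [(wav, label_of[wav.parent.name])
            for wav in sorted(base.glob("*/*.wav"))]
```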
Refer to `models/SqueezeAttendNet.py` for the full implementation. Key components:

- `ResBlock1D`: 1D residual block with optional Squeeze-and-Excitation (`SE1D`).
- Squeeze-and-Excitation (`SE1D`; a minimal sketch follows this list):
  - Global average pooling to obtain channel-wise statistics.
  - A 1×1-convolution bottleneck that computes attention weights.
  - Scale the input features by these weights.
- `DownsampledAttentionBlockDecoupled` (sketched after this list):
  - Temporal downsampling via `Conv1d(stride=s)`
  - Channel reduction via `Conv1d(1×1)`
  - Addition of learnable positional embeddings
  - Multi-head attention on the reduced sequence
  - Temporal upsampling via `ConvTranspose1d`
  - Channel expansion via `Conv1d(1×1)`
- `AttentivePool`: computes a self-attentive weighted mean and standard deviation over time (sketched after this list).
- `AttentionEncoder`:
  - Stacked `ResBlock1D` stages to shrink 64,000 → 2,000 frames.
  - First `DownsampledAttentionBlock` on 2,000 tokens (down→500→up).
  - Additional `ResBlock1D` stages to shrink to 1,000 frames.
  - Second `DownsampledAttentionBlock` on 1,000 tokens (down→250→up).
  - `AttentivePool` to obtain a 1×1,024 embedding.
  - Classifier head to produce logits for `N` classes.
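A minimal sketch of the `SE1D` recipe above; the reduction ratio `r` and exact layer choices here are assumptions, not the repository's code:

```python
import torch
import torch.nn as nn

class SE1D(nn.Module):
    """Squeeze-and-Excitation over the channels of a (batch, channels, time) tensor."""

    def __init__(self, channels: int, r: int = 8):  # r: bottleneck reduction ratio (assumed)
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool1d(1)      # global average pool -> (B, C, 1)
        self.excite = nn.Sequential(                # 1x1-conv bottleneck producing weights
            nn.Conv1d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),                           # per-channel attention weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.excite(self.squeeze(x))            # (B, C, 1) channel statistics -> weights
        return x * w                                # scale input features by the weights
```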
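Likewise, a sketch of the `DownsampledAttentionBlockDecoupled` steps; the stride, reduced width, and the final residual connection are assumptions:

```python
import torch
import torch.nn as nn

class DownsampledAttentionBlock(nn.Module):
    """Run multi-head self-attention on a shorter, thinner copy of the sequence."""

    def __init__(self, channels: int, reduced: int, seq_len: int,
                 stride: int = 4, heads: int = 4):
        super().__init__()
        # temporal downsampling: T -> T/stride
        self.down = nn.Conv1d(channels, channels, kernel_size=stride, stride=stride)
        # channel reduction via 1x1 conv: C -> reduced
        self.reduce = nn.Conv1d(channels, reduced, kernel_size=1)
        # learnable positional embeddings for the reduced token sequence
        self.pos = nn.Parameter(torch.zeros(1, seq_len // stride, reduced))
        self.attn = nn.MultiheadAttention(reduced, heads, batch_first=True)
        # temporal upsampling back to T, then channel expansion back to C
        self.up = nn.ConvTranspose1d(reduced, reduced, kernel_size=stride, stride=stride)
        self.expand = nn.Conv1d(reduced, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, T)
        y = self.reduce(self.down(x))                     # (B, reduced, T/stride)
        y = y.transpose(1, 2) + self.pos                  # tokens: (B, T/stride, reduced)
        y, _ = self.attn(y, y, y)                         # attention on the short sequence
        y = self.expand(self.up(y.transpose(1, 2)))       # back to (B, C, T)
        return x + y                                      # residual connection (assumed)
```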
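And a sketch of `AttentivePool` in the style of attentive statistics pooling; the one-score-per-frame scoring convolution is an assumption:

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Self-attentive weighted mean and std over time: (B, C, T) -> (B, 2*C)."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, kernel_size=1)  # one attention score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.score(x), dim=-1)            # (B, 1, T), sums to 1 over time
        mean = (x * w).sum(dim=-1)                          # attention-weighted mean, (B, C)
        var = (x.pow(2) * w).sum(dim=-1) - mean.pow(2)      # weighted variance
        std = var.clamp_min(1e-8).sqrt()                    # guard against negative round-off
        return torch.cat([mean, std], dim=1)                # (B, 2*C)
```

With 512-channel encoder features this yields the 1×1,024 embedding mentioned above.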
Maximum intermediate sequence lengths and channel dimensions are annotated in the diagram.
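To make those annotations concrete, a small smoke test of the sketches above using the sequence lengths from the encoder description (the 512-channel width and `N = 10` classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Reuses DownsampledAttentionBlock and AttentivePool from the sketches above.
x = torch.randn(2, 512, 2000)            # features after the ResBlock1D stages
dab = DownsampledAttentionBlock(channels=512, reduced=128, seq_len=2000, stride=4)
pool = AttentivePool(512)

y = dab(x)                               # attention runs on 2000/4 = 500 tokens
emb = pool(y)                            # (2, 1024) utterance embedding
logits = nn.Linear(1024, 10)(emb)        # classifier head for N = 10 classes
print(y.shape, emb.shape, logits.shape)  # (2, 512, 2000), (2, 1024), (2, 10)
```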
Use `scripts/train.py` to train from scratch. Example usage:
Will be added later
Use `scripts/inference.py` to perform inference on a list of audio files. Example:
Will be added later