Chameleon: A MatMul-Free TCN Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
Chameleon is the first chip to carry out end-to-end few-shot learning (FSL) and continual learning (CL) on temporal data (no off-chip embedder, all weights on-chip), with three key innovations, all validated in silicon:
- Chameleon builds on a reformulation of Prototypical Networks that integrates FSL/CL directly into the inference process, at only 0.5% logic area and negligible latency/power overhead. Simple yet powerful: we show the first 250-class demonstration of CL on Omniglot, while also exceeding the accuracy of previous FSL-only demonstrations (see the conceptual sketch after this list).
- Chameleon uses temporal convolutional networks (TCNs) as an efficient embedder for temporal data. This choice simultaneously enables (i) accuracy/power performance on par with or exceeding state-of-the-art inference-only accelerators for keyword spotting (93.3% at 3.1 µW real-time power on Google Speech Commands), and (ii) efficient handling of context lengths beyond 10^4 samples, with competitive performance also when directly processing 16-kHz raw audio (demonstrated end-to-end on-chip for the first time, to the best of our knowledge).
- Small or large on-chip TCN? Learning- or inference-focused execution? Emphasis on throughput or on low static power? Chameleon efficiently supports all of these application-dependent choices thanks to a reconfigurable PE array.
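For intuition, the prototype-based FSL/CL scheme boils down to averaging the embeddings of the k support samples per class and classifying queries against the resulting prototypes. The sketch below is a plain NumPy illustration of that idea, not the on-chip integer datapath; the function names are ours, and only the L2-distance/dot-product choice mirrors an actual chip option (the `use_l2_for_few_shot` register described further down).

```python
import numpy as np

def build_prototypes(support_embeddings, support_labels, num_classes):
    """Average the k-shot support embeddings per class (the 'prototypes')."""
    dim = support_embeddings.shape[1]
    prototypes = np.zeros((num_classes, dim))
    for c in range(num_classes):
        prototypes[c] = support_embeddings[support_labels == c].mean(axis=0)
    return prototypes

def classify(query_embedding, prototypes, use_l2=True):
    """Nearest-prototype classification, using either L2 distance or dot product."""
    if use_l2:
        scores = -np.sum((prototypes - query_embedding) ** 2, axis=1)
    else:
        scores = prototypes @ query_embedding
    return int(np.argmax(scores))

# Example: 5-way, 1-shot with 8-dimensional embeddings
emb = np.random.randn(5, 8)
protos = build_prototypes(emb, np.arange(5), num_classes=5)
print(classify(np.random.randn(8), protos, use_l2=True))
```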
If you use the Chameleon source code for academic or commercial purposes, we would appreciate it if you let us know; feedback is welcome.
Note: the documentation of Chameleon is currently a work in progress! Feel free to open an issue if something is unclear.
- `src/`: RTL (Verilog) source code
- `scripts/`: Scripts for Verilog formatting and configs for various waveform viewers
- `chameleon/`
  - `core/`: Core Python code to load and run trained networks and to communicate with the ASIC
  - `fpga_bridge/`: Python code that runs on the FPGA to interface with the ASIC
  - `sim/`: Simulation code for the ASIC
- `nets/`: Pre-trained networks in quantized state-dict format, used in the experiments
Run `bender update` to install the necessary Verilog dependencies. Always run `make` to generate all required Verilog code before synthesizing or simulating the hardware.
To use only the core Python code from Chameleon, simply run `pip install .` from the project root. If you also want to use the FPGA bridge code, run `pip install ".[fpga_bridge]"`. Finally, if you want to run the simulation code, run `pip install ".[sim]"`.
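After installation, a quick sanity check is to import one of the core functions that is also used later in this README:

```python
# Verify that the core package is importable after `pip install .`
from chameleon.core.net_transfer_utils import get_quant_state_dict_and_layers
print(get_quant_state_dict_and_layers)
```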
Chameleon is tested using the `cocotb` framework. To run all full-system tests, go into the `test/` directory and run:

```bash
python test_chameleon.py
```

To run the tests for individual components, run:

```bash
python test_pe.py
python test_pe_array.py
python test_argmax_tree.py
```
Install the required `tcn_lib` package for TCN networks from here.
```python
from tcn_lib import TCN

# Model for the sequential MNIST task (1 input, 10 outputs, 8 layers of 32 channels, kernel size 7)
model = TCN(1, 10, [32] * 8, 7)

# Perform training in any way you want
dataset = ...

for epoch in range(10):
    for batch in dataset:
        ...
        loss = ...
        loss.backward()
        ...
        optimizer.step()
        ...
```
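For reference, a fleshed-out version of this skeleton could look like the following. The optimizer, loss, and random placeholder data are illustrative, and the assumption that `TCN` returns one class-score vector per sample should be checked against `tcn_lib`'s documentation.

```python
import torch
from torch import nn, optim
from tcn_lib import TCN

model = TCN(1, 10, [32] * 8, 7)  # sequential MNIST: 1 input channel, 10 classes
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Placeholder batches; replace with a real sequential-MNIST DataLoader
# (each 28x28 image flattened to a 784-step sequence with 1 channel).
dataset = [(torch.randn(64, 1, 784), torch.randint(0, 10, (64,))) for _ in range(100)]

for epoch in range(10):
    for inputs, targets in dataset:
        optimizer.zero_grad()
        logits = model(inputs)  # assumed shape: (batch, num_classes)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
```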
All experiments described in our paper were conducted using our fully open-source AutoLightning framework, which builds on PyTorch Lightning while adding support for detailed hyperparameter tracking, sweeping, and specialized training recipes, such as prototypical learning and quantization-aware training. However, you can of course use any framework you like to train your TCN model.
Install the required `brevitas_utils` package for configurable quantization-aware training from here.
```python
from brevitas_utils import create_qat_ready_model, get_quant_state_dict, save_quant_state_dict, QuantConfig

WEIGHT_BIT_WIDTH = 4
ACT_BIT_WIDTH = 4
BIAS_BIT_WIDTH = 15
ACCUMULATION_BIT_WIDTH = 18

# Define quantization configurations
# (see https://xilinx.github.io/brevitas/tutorials/tvmcon2021.html#Inheriting-from-a-quantizer for details)
weight_quant_cfg = QuantConfig(
    base_classes=["Int8WeightPerTensorPowerOfTwo"],
    kwargs={"bit_width": WEIGHT_BIT_WIDTH, "narrow_range": False}
)
act_quant_cfg = QuantConfig(
    base_classes=["ShiftedParamFromPercentileUintQuant"],
    kwargs={"bit_width": ACT_BIT_WIDTH, "collect_stats_steps": 1500}
)
bias_quant_cfg = QuantConfig(
    base_classes=["Int16Bias"],
    kwargs={"bit_width": BIAS_BIT_WIDTH}
)
# Make sure to define output quantization in case of prototypical training
output_quant_cfg = None

load_float_weights_into_model = True  # Reuse weights from the floating-point model
calibration_setup = None  # Do not calibrate the model (see
# https://xilinx.github.io/brevitas/tutorials/tvmcon2021.html#Calibration-based-post-training-quantization).
# Optionally configure calibration for better model performance and/or faster convergence.
skip_modules = []  # Quantize all modules

# Create a QAT-ready model
qat_ready_model = create_qat_ready_model(model,
                                         weight_quant_cfg,
                                         act_quant_cfg,
                                         bias_quant_cfg,
                                         load_float_weights_into_model=load_float_weights_into_model,
                                         calibration_setup=calibration_setup,
                                         skip_modules=skip_modules)

# Retrain for a few more epochs with a smaller learning rate to recover accuracy

# Get quantized weights and biases
quant_state_dict = get_quant_state_dict(qat_ready_model, (1, 1, 100))

# Export model
save_quant_state_dict(quant_state_dict, "quant_model.pkl")
```
We use NumPy primitives to simulate the network instead of PyTorch, as the Python code of this project also has to run on the ARM core of the FPGA, where PyTorch would likely be too heavy.
```python
from chameleon.core.net_transfer_utils import get_quant_state_dict_and_layers

path = "quant_model.pkl"  # Path to the exported quantized model
in_quant, quant_layers = get_quant_state_dict_and_layers(path)

dataset = ...  # Load your dataset here

accuracy, preds_and_targets = infer(path, dataset)
print(f"Accuracy: {accuracy}")
```
Note that when simulating the network like this, no limits are enforced on the number of layers or on the layer widths; only the accumulation width is enforced.
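As an illustration of what such a NumPy-level simulation boils down to, here is a self-contained sketch of one integer layer step with the accumulation width enforced. The function and its arguments are hypothetical and are not part of the `chameleon` package; the bit widths simply reuse the values from the QAT example above.

```python
import numpy as np

ACC_BITS = 18  # matches ACCUMULATION_BIT_WIDTH in the QAT example above

def simulate_quant_linear(x_int, w_int, b_int, out_scale_shift):
    """Hypothetical single integer layer: multiply-accumulate, clamp the
    accumulator to ACC_BITS, apply a power-of-two output scale, then ReLU."""
    acc = x_int.astype(np.int64) @ w_int.astype(np.int64).T + b_int
    acc = np.clip(acc, -(1 << (ACC_BITS - 1)), (1 << (ACC_BITS - 1)) - 1)
    out = acc >> out_scale_shift  # power-of-two rescaling
    return np.maximum(out, 0)     # ReLU

# Toy usage with random 4-bit activations/weights and 15-bit biases
x = np.random.randint(0, 16, size=(1, 32))         # unsigned 4-bit activations
w = np.random.randint(-8, 8, size=(16, 32))        # signed 4-bit weights
b = np.random.randint(-2**14, 2**14, size=(16,))   # 15-bit biases
y = simulate_quant_linear(x, w, b, out_scale_shift=6)
print(y)
```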
# TO DO!
To configure the power supply and the current measurement unit correctly, run the following commands after booting the FPGA:

```bash
sudo ip addr add xxx.yyy.z.p/qq dev eth1
ifconfig eth1 up
```

Then navigate to the location of this repository on the FPGA and run:

```bash
python -m asyncio
```

to start a Python session in which you can interact with Chameleon.
All memories are accessible for both read and write operations via the system’s 32-bit SPI bus (1 R/W-bit, 4-bit memory index, 16-bit start address, 11-bit transaction count).
This same SPI bus is also used to program the SoC’s configuration registers, such as the number of layers and kernel size per layer. These configuration values serve, among other things, as inputs to the network address generator, which generates the necessary control signals and memory addresses during processing.
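For illustration, a host-side helper could pack such an SPI transaction header as shown below. The field ordering (R/W bit in the MSB down to the transaction count in the LSBs) is an assumption for this sketch and should be checked against the RTL; `pack_spi_command` is not part of the `chameleon` package.

```python
# Hypothetical helper: pack the 32-bit SPI transaction header
# (1 R/W bit, 4-bit memory index, 16-bit start address, 11-bit transaction count).
# The field ordering is assumed (MSB-first), not taken from the RTL.
def pack_spi_command(read: bool, mem_index: int, start_addr: int, num_words: int) -> int:
    assert 0 <= mem_index < 2**4
    assert 0 <= start_addr < 2**16
    assert 0 <= num_words < 2**11
    return (int(read) << 31) | (mem_index << 27) | (start_addr << 11) | num_words

# Example: read 64 words from memory 3, starting at address 0x0100
print(f"0x{pack_spi_command(True, 3, 0x0100, 64):08X}")
```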
A full list of configuration registers is provided below. If you want to change the configuration of the chip, edit the `config_memory.json` file in the source code directory and rerun `make` to regenerate the associated Verilog code.
| Name | Width | Vector Size | Async Reset | Description |
|---|---|---|---|---|
| `continuous_processing` | 1 | – | ✅ | Enables processing (classification/regression) every few inputs for continuously streaming inputs. |
| `classification` | 1 | – | ✅ | Enables classification mode. |
| `power_down_memories_while_running` | 1 | – | ✅ | If the input is very high-dimensional or data is transferred into Chameleon very slowly, this flag powers down all memories except the input memory while input data is being received. As soon as the input data needed for a new set of computations has arrived, all memories are powered up again and the computation starts. |
| `enable_clock_divider` | 1 | – | ✅ | Enables clock division (off by default to save power). |
| `ring_oscillator_stage_selection` | 6 | – | ❌ | Selects the number of stages in the ring oscillator. |
| `max_weight_address` | `WEIGHT_ADDRESS_WIDTH` | – | ❌ | Maximum weight index used by the DNN during prototypical-network (PN) execution. |
| `max_bias_address` | `BIAS_ADDRESS_WIDTH` | – | ✅ | Sets the maximum addressable bias index. |
| `shots` | `SHOT_BIT_WIDTH` | – | ✅ | Number of shots for FSL. |
| `few_shot_scale` | `FEW_SHOT_SCALE_WIDTH` | – | ✅ | Scaling factor to bring summed embeddings into the range of the weights during FSL. |
| `k_shot_division_scale` | `K_SHOT_SCALE_WIDTH` | – | ✅ | Scale factor used for the division by k in FSL. |
| `use_l2_for_few_shot` | 1 | – | ✅ | Selects L2 distance for FSL; otherwise the dot product is used. |
| `continued_learning` | 1 | – | ✅ | Enables continued-learning mode, enabling FSCIL. |
| `fill_input_memory` | 1 | – | ✅ | Forces Chameleon to always fill the input memory before proceeding with processing. |
| `force_downsample` | 1 | – | ✅ | Forces a downsampling layer in the residual path of a TCN residual block instead of an identity operation (where possible). |
| `power_down_small_bias` | 1 | – | ✅ | Powers down small biases to save energy. |
| `power_down_srams_in_standby` | 1 | – | ✅ | Powers down the SRAMs in standby mode (idle state). |
| `wake_up_delay` | `WAIT_CYCLES_WIDTH` | – | ✅ | Delay in cycles before waking up the SRAMs from power-down. |
| `power_up_delay` | `WAIT_CYCLES_WIDTH` | – | ✅ | Delay in cycles to power up the SRAMs. |
| `in_4x4_mode` | 1 | – | ✅ | Enables 4×4 mode operation. |
| `require_single_chunk` | 1 | – | ✅ | Requires a single input chunk (fewer input channels than the PE array size). |
| `send_all_argmax_chunks` | 1 | – | ✅ | Sends all argmax bits instead of only the first `HIGH_SPEED_OUT_PINS` bits. |
| `load_inputs_from_activation_memory` | 1 | – | ✅ | Loads inputs directly from the activation memory instead of from the input memory. |
| `use_subchunks` | 1 | – | ✅ | Enables the use of subchunks (the chip can then receive `HIGH_SPEED_OUT_PINS` bits of outputs from another Chameleon chip). |
| `input_blocks_times_kernel_size` | `BLOCKS_KERNEL_WIDTH` | – | ❌ | Product of the number of input blocks (input channels / PE array size) and the kernel size. |
| `num_conv_layers` | `LAYER_WIDTH` | – | ❌ | Number of convolutional layers. |
| `num_conv_and_linear_layers` | `LAYER_WIDTH` | – | ❌ | Number of convolutional + linear layers. |
| `kernel_size_per_layer` | `KERNEL_WIDTH` | 33 | ❌ | Kernel size per layer (vector). |
| `blocks_per_layer` | `BLOCKS_WIDTH` | 33 | ❌ | Number of blocks (input channels of the layer / PE array size) per layer (vector). |
| `blocks_per_layer_times_kernel_size_cumsum` | `CUMSUM_WIDTH` | 33 | ❌ | Cumulative sum of blocks × kernel size per layer (vector). |
| `scale_and_residual_scale_per_layer` | `2*SCALE_BIT_WIDTH` | 32 | ❌ | Vector of {scale (for linear and convolutional layers), residual_scale (for convolutional layers only, 0 otherwise)} pairs per layer. |
If you use the Chameleon source code, please cite the associated paper:
```bibtex
@misc{blanken2025chameleon,
  title={Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data},
  author={Douwe den Blanken and Charlotte Frenkel},
  year={2025},
  eprint={2505.24852},
  archivePrefix={arXiv},
  primaryClass={cs.AR}
}
```
The HDL source code of Chameleon in the `src` directory is licensed under the Solderpad Hardware License v2.1 (see the `LICENSE-HW` file or https://solderpad.org/licenses/SHL-2.1/).
All other source code in this repository is licensed under the Apache License, Version 2.0 (see the `LICENSE` file or http://apache.org/licenses/LICENSE-2.0).
© (2025) Douwe den Blanken, Delft, The Netherlands