The official repository of "Flashbacks to harmonize stability and plasticity in continual learning," published in Neural Networks, 2025.
Official paper link: [NN](https://doi.org/10.1016/j.neunet.2025.107616)
Read the paper on [arXiv](https://arxiv.org/abs/2506.00477)
Figure: Flashback Learning overview. At task t, Phase 1 updates the old model on new data to obtain the primary model f(·; θₚ), then extracts the new knowledge and stores it in the plastic knowledge base (PKB). Phase 2 flashes back to the old model f(·; θ*ₜ₋₁) and trains it under bidirectional regularization from both stable and plastic knowledge, yielding f(·; θ*ₜ).
We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge.
FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates.
Theoretically, we analyze how the FL mechanism enhances the stability–plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.
To reproduce the experiments, you need to install the required Python packages. We recommend creating a virtual environment:
```bash
conda create -n flashback-env python=3.9
conda activate flashback-env
pip install -r requirements.txt
```
We have used the following common benchmarks in this project:
- Split-CIFAR10: A standard benchmark where CIFAR-10 is divided into multiple tasks, typically with 2 classes per task.
- Split-CIFAR100: A more challenging version using CIFAR-100, often divided into 10 tasks with 10 classes each.
- Split-Tiny-ImageNet: A subset of the ImageNet dataset, split into multiple tasks. It contains 200 classes of low-resolution (64×64) images, making it suitable for scalable continual learning evaluation.
Datasets will be automatically downloaded into the `./data` directory in the root of this project.

Note 1: You can change the data path in the `base_path_dataset()` function located in `utils/conf.py`.
Note 2: The `./data` folder will be created automatically if it does not exist.
Note 3: For cleanliness and to avoid large file tracking, the `data` folder should not be tracked by Git.
You can customize the structure of each benchmark by modifying the constants used in the corresponding dataset class under the `./datasets/` directory. These constants are:
- `N_CLASSES_TASK_ZERO`: Specifies the number of classes in the first (base) task.
- `N_CLASSES_PER_TASK`: Controls how many classes are introduced per incremental task.
- `N_TASKS`: Determines the total number of tasks to create from the dataset.
For example, in `SequentialCIFAR100`, defined in `./datasets/seq_cifar100.py`, the following configuration:

```python
N_CLASSES_TASK_ZERO = 10
N_CLASSES_PER_TASK = 10
N_TASKS = 10
```

defines a benchmark with 10 tasks, each containing 10 classes (standard Split-CIFAR100).
To customize it to a CIFAR-100-B50-10 setup (50 base classes, then 10 classes per task over 5 incremental tasks, i.e., 6 tasks and 50 + 10 × 5 = 100 classes in total):

```python
N_CLASSES_TASK_ZERO = 50
N_CLASSES_PER_TASK = 10
N_TASKS = 6
```
Continual Learning Baselines with Flashback Learning
We develop and evaluate Flashback Learning across four major categories of CL methods:
- Knowledge distillation methods
- Replay memory methods
- Parameter regularization methods
- Architecture expansion methods
Our experiments demonstrate that FL consistently yields tangible improvements across all these categories, confirming its broad applicability and effectiveness as a plug-in module for continual learning. We show how FL is integrated into each CL category below:
Knowledge Distillation Methods
In distillation-based continual learning, a copy of the old model is retained. When a new task begins, the model is updated on new data while a distillation loss encourages it to maintain similarity to the old model’s representations, preserving stability.
When Flashback Learning is integrated into this setup:
- Phase 1: A primary model is trained on the new task.
- Phase 2: The model is reinitialized to the old model and trained with a bidirectional distillation loss, drawing on both the old and primary models, which guides the representation toward a balance between past and new knowledge and improves both stability and plasticity (see the sketch below).
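To make the two phases concrete, here is a minimal PyTorch sketch of what a bidirectional feature-level distillation term in Phase 2 could look like. This is an illustration only, not the exact loss used in the paper or this codebase: the function and tensor names are hypothetical, and plain MSE stands in for whatever distillation metric the host method uses (LUCIR, for instance, constrains feature embeddings with a cosine-based loss).

```python
import torch.nn.functional as F

def bidirectional_distillation_loss(feats, feats_old, feats_primary, alpha_p=1.0):
    # Stability term: keep the current features close to the old model's.
    stability = F.mse_loss(feats, feats_old.detach())
    # Plasticity term: keep them close to the Phase-1 primary model's features.
    plasticity = F.mse_loss(feats, feats_primary.detach())
    # alpha_p (cf. the --alpha_p flag below) scales the plasticity component.
    return stability + alpha_p * plasticity
```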
We integrate Flashback Learning into two representative distillation-based continual learning methods:

- Learning a Unified Classifier Incrementally via Rebalancing (LUCIR) is known for applying distillation constraints at the feature embedding level (before the final logits) to preserve learned representations. Although it allows replay of a selected exemplar set, its core mechanism relies on distillation (LUCIR – CVPR 2019). The FL-integrated version of LUCIR is implemented in `./models/fl-lucir.py`.
- Learning without Forgetting (LwF.MC) uses distillation at the logit level, transferring knowledge from the old model to the current one without replay. We specifically use LwF.MC, a multi-class variant adapted from iCaRL (iCaRL – CVPR 2017). The FL-integrated version of LwF.MC is implemented in `./models/fl-lwf_mc.py`.
Memory Replay Methods
In memory replay methods, a limited-capacity memory buffer of previous tasks' samples, together with their corresponding feature embeddings or logits, is selected and carried over to the new task. When a new task begins, the model is updated on a joint distribution of the new data and the retained samples from the past.
When Flashback Learning is integrated into this setup:
- Phase 1: A primary model is trained on the new task, and new primary feature embeddings or logits are generated for the memory samples by the primary model.
- Phase 2: The model is reinitialized to the old model and trained under a bidirectional replay, using both the old and primary logits or feature embeddings, which guides the model's responses toward a balance between past and new knowledge and improves both stability and plasticity (see the sketch below).
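A rough PyTorch sketch of this replay variant follows. The helper names (`refresh_buffer_logits`, `bidirectional_replay_loss`) and the use of MSE on logits are our own assumptions for illustration, not functions from this repository:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def refresh_buffer_logits(primary_model, buffer_loader):
    # Phase 1: record the primary model's logits for every memory sample;
    # these become the "plastic" replay targets used in Phase 2.
    return torch.cat([primary_model(x) for x, _ in buffer_loader])

def bidirectional_replay_loss(cur_logits, old_logits, primary_logits, alpha_p=1.0):
    # Phase 2: pull the model's responses on buffer samples toward the old
    # logits (stability) and the primary logits (plasticity) simultaneously.
    return F.mse_loss(cur_logits, old_logits) + alpha_p * F.mse_loss(cur_logits, primary_logits)
```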
We integrate Flashback Learning into two representative memory replay methods:
- Incremental Classifier and Representation Learning (iCaRL) is a replay method that uses a herding strategy to select the samples closest to each class prototype, replaying them while distilling knowledge from the old model at the logit level (iCaRL – CVPR 2017). The FL-integrated version of iCaRL is implemented in `./models/fl-icarl.py`.
- eXtended Dark Experience Replay (X-DER) is an extension of vanilla DER (DER – NeurIPS 2020), from the replay category, which keeps old logits in the memory buffer for distillation during rehearsal. We selected X-DER because it performs better than the other DER variants (X-DER – TPAMI 2022). The FL-integrated version of X-DER is implemented in `./models/fl-xder.py`.
Parameter Regularization Methods
In parameter regularization methods, the parameters and importance matrix (e.g., the Fisher information matrix) of the old model are stored and carried over to the new task. When a new task starts, the model is updated under a unidirectional regularization that learns the new task while keeping important old parameters close to their previous values.
When Flashback Learning is integrated into this setup:
- Phase 1: A primary model is trained on the new task, and its parameters and importance matrix are stored.
- Phase 2: The model is reinitialized to the old model and trained under a bidirectional regularization, anchored to both the old and primary parameters, which guides the model parameters toward an interpolation between the old and primary new parameters, improving stability and plasticity (see the sketch below).
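As an illustration (not the repository's actual implementation), a bidirectional EWC-style penalty could be sketched as follows; `old_params`, `old_fisher`, `primary_params`, and `primary_fisher` are hypothetical dictionaries of tensors keyed by parameter name:

```python
def bidirectional_param_penalty(model, old_params, old_fisher,
                                primary_params, primary_fisher, alpha_p=1.0):
    # PyTorch sketch: quadratic pull toward the old parameters (stability)
    # and toward the Phase-1 primary parameters (plasticity), each weighted
    # by its own importance estimate; the minimizer interpolates the two.
    penalty = 0.0
    for name, p in model.named_parameters():
        stability = (old_fisher[name] * (p - old_params[name]).pow(2)).sum()
        plasticity = (primary_fisher[name] * (p - primary_params[name]).pow(2)).sum()
        penalty = penalty + stability + alpha_p * plasticity
    return penalty
```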
We integrate Flashback Learning into one representative parameter regularization method:
- online Elastic Weight Consolidation (oEWC) is a baseline from the parameter-regularization category, which computes the importance of old parameters recursively and then applies it as a weighted regularization on new parameter updates (Progress & Compress – ICML 2018). The FL-integrated version of oEWC is implemented in `./models/fl-ewc_on.py`.
Run Scripts
To run any continual learning baseline with or without Flashback Learning integration, use the following command pattern:
```bash
python utils/main.py \
    --run [original | flashback] \
    --model <model_name> \
    --alpha_p <plasticity_loss_scaler> \
    --cl_arguments <arguments_specific_to_cl_baseline> \
    --epoch_base <epochs_for_task0> \
    --sch0 <use_scheduler_for_task0: 0|1> \
    --epoch_cl <phase1_epochs_for_tasks> \
    --sch <use_scheduler_for_phase1: 0|1> \
    --epoch_fl <phase2_epochs_for_tasks> \
    --schf <use_scheduler_for_phase2: 0|1> \
    --dataset <dataset_name> \
    --batch_size <batch_size> \
    --lr <learning_rate> \
    --optim_mom <optimizer_momentum> \
    --optim_wd <optimizer_weight_decay>
```
- `--run`: Set to `original` to run the baseline only, or `flashback` to activate Flashback Learning (FL).
- `--model`: Name of the model to run. It must be one of the models that support FL, defined in `models/fl-model.py`.
- `--alpha_p`: Scaling factor for the plasticity loss component used in FL Phase 2.
- `--cl_arguments`: Baseline-specific arguments required by the continual learning method (e.g., `--e_lambda` for EWC).
- `--epoch_base`: Number of training epochs for the base task (task 0).
- `--sch0`: Whether to apply a learning rate scheduler during training on the base task. Use `1` to enable or `0` to disable.
- `--epoch_cl`: Number of epochs for Phase 1 (task > 0), which trains the primary model on the new task.
- `--sch`: Whether to use a scheduler in Phase 1. Set to `1` to enable or `0` to disable.
- `--epoch_fl`: Number of epochs for Phase 2 (FL), which flashes back from the primary model to the old model.
- `--schf`: Whether to apply a scheduler during Phase 2. Set to `1` or `0`.
- `--dataset`: Name of the dataset. Examples include `seq-cifar10`, `seq-cifar100`, or `seq-tinyimagenet`.
- `--batch_size`: Mini-batch size for training.
- `--lr`: Learning rate.
- `--optim_mom`: Momentum parameter for the optimizer (e.g., SGD).
- `--optim_wd`: Weight decay (L2 regularization) used by the optimizer.
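For instance, a command along these lines would run the FL-integrated oEWC model on Split-CIFAR100. The flag values (and the exact model string) below are illustrative placeholders, not the tuned hyperparameters from the paper; see the ready-made scripts noted below for the exact settings:

```bash
python utils/main.py \
    --run flashback \
    --model fl-ewc_on \
    --alpha_p 0.5 \
    --e_lambda 10 \
    --epoch_base 50 --sch0 1 \
    --epoch_cl 30 --sch 1 \
    --epoch_fl 10 --schf 1 \
    --dataset seq-cifar100 \
    --batch_size 32 \
    --lr 0.1 \
    --optim_mom 0.9 \
    --optim_wd 0.0005
```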
Note: You can find ready-to-run scripts for all models supported by Flashback Learning (defined under `models/`) across all datasets in the `scripts/` folder.
We gratefully acknowledge the contributions of the following repositories, which served as the foundation or inspiration for parts of this work:
We thank the authors of these projects for making their code publicly available.