Code for our SIGKDD'25 paper: "BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models".
The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.
> [!IMPORTANT]
> This repository contains the code to generate the BLAST corpus.
> If you are interested in using the BLAST corpus to train universal forecasting models, please refer to the BasicTS repository. It provides native support for BLAST and can be used to train models such as TimeMoE (decoder-only architecture) and ChronosBolt (encoder-decoder architecture).
The folders `raw_data_construction`, `metrics_calculation`, `feature_construction`, `dimension_reduction`, and `sampling` correspond to the sections shown in the diagram above. The raw BLAST data contains 321 billion observations and is approximately 3.4 TB in size. After sampling, BLAST includes 3 million time series, each with a maximum length of 4096, totaling approximately 227 GB. These data will be open-sourced on HuggingFace after the review process. `TimeMoE_BLAST` contains the code to train the TimeMoE model on the BLAST corpus.
This code is built on Python 3.11. The required packages can be installed with:

```bash
pip install -r requirements.txt
```
Download the data from Chronos, MOIRAI, and MOMENT, and place it in the `datasets/raw_datasets/` folder.
The directory becomes:

```text
datasets/raw_datasets/
├── chronos_datasets
│   ├── dominick
│   ├── electricity_15min
│   └── ...
├── lotsa_datasets
│   ├── alibaba_cluster_trace_2018
│   ├── australian_electricity_demand
│   └── ...
└── Timeseries-PILE
    ├── anomaly_detection
    ├── classification
    └── forecasting
```
Read the raw data from the various sources and save it in a unified format:

```bash
python raw_data_construction/Chronos_read_and_save.py
python raw_data_construction/LOTSA_read_and_save.py
python raw_data_construction/Monash_read_and_save.py
python raw_data_construction/UCR_read_and_save.py
python raw_data_construction/UAD_read_and_save.py
```
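The unified on-disk format these scripts produce is not documented here; as a rough illustration of the idea only, the sketch below stores variable-length series in a single compressed `.npz` file (the file layout and helper names are assumptions, not the repository's actual format):

```python
import numpy as np

def save_unified(series_list, out_path):
    """Save variable-length series as float32 arrays in one .npz file.

    NOTE: the .npz layout is an illustrative assumption; the repository's
    scripts may use a different unified format.
    """
    arrays = {f"series_{i}": np.asarray(s, dtype=np.float32)
              for i, s in enumerate(series_list)}
    np.savez_compressed(out_path, **arrays)

def load_unified(path):
    """Load the series back in the order they were saved."""
    with np.load(path) as data:
        return [data[k] for k in sorted(data.files)]
```

The point is only that every source ends up in one loadable representation, regardless of its original schema.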
The processed data will be saved in the `datasets/processed_datasets/` folder.
```bash
python metrics_calculation/main.py
```

The results will be saved in the `metrics_calculation/output/` folder.
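`metrics_calculation/main.py` computes the statistical metrics used to characterize time series patterns. As a simplified stand-in (the two metrics below, lag-1 autocorrelation and linear-trend strength, are illustrative assumptions, not the paper's exact suite), per-series metrics might look like:

```python
import numpy as np

def series_metrics(x):
    """Compute a few simple pattern metrics for one series (illustrative only)."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    # Lag-1 autocorrelation: how strongly adjacent observations co-vary.
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    acf1 = float((xc[:-1] * xc[1:]).sum() / denom) if denom > 0 else 0.0
    # Trend strength: R^2 of a linear fit against time.
    t = np.arange(n)
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    r2 = float(1.0 - resid.var() / x.var()) if x.var() > 0 else 0.0
    return {"mean": float(x.mean()), "std": float(x.std()),
            "acf1": acf1, "trend_r2": r2}
```

Each series is thus summarized by a small vector of pattern descriptors, which later stages cluster and sample over.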
```bash
python feature_construction/main.py
```

The results will be saved in the `feature_construction/output/` folder.
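Before dimension reduction, the heterogeneous metric values have to be put on a comparable scale. One plausible way to do this (an assumption for illustration, not necessarily the repository's method) is to replace each metric column with its empirical quantile rank:

```python
import numpy as np

def quantile_rank_features(metric_matrix):
    """Map each metric column to its empirical quantile rank in [0, 1].

    NOTE: quantile ranking is an illustrative assumption; the actual
    feature construction may differ.
    """
    m = np.asarray(metric_matrix, dtype=np.float64)
    ranks = np.empty_like(m)
    n = m.shape[0]
    for j in range(m.shape[1]):
        # argsort of argsort gives each value's rank within its column.
        order = np.argsort(np.argsort(m[:, j]))
        ranks[:, j] = order / max(n - 1, 1)
    return ranks
```

Rank-based scaling is robust to outliers, which matters when metrics come from billions of observations with very different ranges.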
First, run `dimension_reduction_train.py` to train the UMAP model. Then, run `dimension_reduction.py` to reduce the dimensionality of the features.
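The two-script split is a fit/transform workflow: train a reducer once, then apply it to all feature vectors. The repository trains a UMAP model; the sketch below substitutes a plain PCA-via-SVD reducer purely to keep the illustration dependency-free, but the fit/transform structure is the same:

```python
import numpy as np

class PCAReducer:
    """Train-then-transform reducer mirroring the two-script workflow.

    NOTE: the repository uses UMAP; PCA via SVD is only a stand-in here
    to illustrate the fit/transform split without extra dependencies.
    """
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        self.mean_ = X.mean(axis=0)
        # Principal axes come from the SVD of the centered data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=np.float64)
        return (X - self.mean_) @ self.components_.T
```

With umap-learn the calls would be analogous: fit a `umap.UMAP(n_components=2)` model in the training script, then call its `transform` in the second script.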
- Run `sampling/sampling.py`.
- Run `sampling/split.py`.
- Run `sampling/clean_data.ipynb`.

The sampled data will be saved in the `sampling/train` and `sampling/valid` folders.
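The sampling stage implements the balanced, grid-based sampling described above: partition the reduced 2D feature space into cells and draw series evenly across non-empty cells so that dense pattern clusters do not dominate. The sketch below shows the idea only; the parameter names and the round-robin rule are assumptions, not the paper's exact procedure:

```python
import numpy as np

def grid_balanced_sample(embeddings, n_samples, grid_size=10, seed=0):
    """Balanced sampling over a 2D grid (illustrative sketch).

    NOTE: this is a simplified assumption about the grid-sampling step,
    not the repository's exact algorithm (which also uses grid mixup).
    """
    rng = np.random.default_rng(seed)
    emb = np.asarray(embeddings, dtype=np.float64)
    lo, hi = emb.min(axis=0), emb.max(axis=0)
    # Map each point to a grid cell index.
    scaled = (emb - lo) / np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((scaled * grid_size).astype(int), grid_size - 1)
    cell_ids = cells[:, 0] * grid_size + cells[:, 1]
    buckets = {}
    for i, c in enumerate(cell_ids):
        buckets.setdefault(int(c), []).append(i)
    for idxs in buckets.values():
        rng.shuffle(idxs)
    # Round-robin: take one index from each non-empty cell per pass,
    # so every occupied cell contributes before any cell repeats.
    chosen = []
    while len(chosen) < n_samples and buckets:
        for c in list(buckets):
            if buckets[c]:
                chosen.append(buckets[c].pop())
                if len(chosen) == n_samples:
                    break
            else:
                del buckets[c]
    return chosen
```

Under this scheme a cell holding millions of near-duplicate series contributes no more per pass than a sparse cell, which is the balancing effect the corpus is built around.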
- Clean up the code and add comments.