Code for our SIGKDD'25 paper: "BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models".
The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.
> [!IMPORTANT]
> This repository contains the code to generate the BLAST corpus.
> If you are interested in using the BLAST corpus to train universal forecasting models, please refer to the BasicTS repository. It provides native support for BLAST and can be used to train models such as TimeMoE (decoder-only architecture) and ChronosBolt (encoder-decoder architecture).
The folders `raw_data_construction`, `metrics_calculation`, `feature_construction`, `dimension_reduction`, and `sampling` correspond to the sections shown in the diagram above. The raw BLAST data contains 321 billion observations and is approximately 3.4 TB in size. After sampling, BLAST includes 3 million time series, each with a maximum length of 4096, totaling approximately 227 GB. These data will be open-sourced on HuggingFace after the review process. `TimeMoE_BLAST` contains the code to train the TimeMoE model on the BLAST corpus.
This code is built on Python 3.11. The required packages can be installed with:

```bash
pip install -r requirements.txt
```
Download the data from Chronos, MOIRAI, and MOMENT, and place it in the `datasets/raw_datasets/` folder.
The directory becomes:

```text
datasets/raw_datasets/
├── chronos_datasets
│   ├── dominick
│   ├── electricity_15min
│   └── ...
├── lotsa_datasets
│   ├── alibaba_cluster_trace_2018
│   ├── australian_electricity_demand
│   └── ...
└── Timeseries-PILE
    ├── anomaly_detection
    ├── classification
    └── forecasting
```
Read the raw data from the various sources and save it in a unified format:

```bash
python raw_data_construction/Chronos_read_and_save.py
python raw_data_construction/LOTSA_read_and_save.py
python raw_data_construction/Monash_read_and_save.py
python raw_data_construction/UCR_read_and_save.py
python raw_data_construction/UAD_read_and_save.py
```
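The unified on-disk format these scripts produce is not documented here; as a rough illustration of the idea only, the sketch below stores variable-length series in a single compressed `.npz` file (the file layout and helper names are assumptions, not the repository's actual format):

```python
import numpy as np

def save_unified(series_list, out_path):
    """Save variable-length series as float32 arrays in one .npz file.

    NOTE: the .npz layout is an illustrative assumption; the repository's
    scripts may use a different unified format.
    """
    arrays = {f"series_{i}": np.asarray(s, dtype=np.float32)
              for i, s in enumerate(series_list)}
    np.savez_compressed(out_path, **arrays)

def load_unified(path):
    """Load the series back in the order they were saved."""
    with np.load(path) as data:
        return [data[k] for k in sorted(data.files)]
```

The point is only that every source ends up in one loadable representation, regardless of its original schema.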
The processed data will be saved in the `datasets/processed_datasets/` folder.
```bash
python metrics_calculation/main.py
```

The results will be saved in the `metrics_calculation/output/` folder.
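`metrics_calculation/main.py` computes the statistical metrics used to characterize time series patterns. As a simplified stand-in (the two metrics below, lag-1 autocorrelation and linear-trend strength, are illustrative assumptions, not the paper's exact suite), per-series metrics might look like:

```python
import numpy as np

def series_metrics(x):
    """Compute a few simple pattern metrics for one series (illustrative only)."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    # Lag-1 autocorrelation: how strongly adjacent observations co-vary.
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    acf1 = float((xc[:-1] * xc[1:]).sum() / denom) if denom > 0 else 0.0
    # Trend strength: R^2 of a linear fit against time.
    t = np.arange(n)
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    r2 = float(1.0 - resid.var() / x.var()) if x.var() > 0 else 0.0
    return {"mean": float(x.mean()), "std": float(x.std()),
            "acf1": acf1, "trend_r2": r2}
```

Each series is thus summarized by a small vector of pattern descriptors, which later stages cluster and sample over.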
```bash
python feature_construction/main.py
```

The results will be saved in the `feature_construction/output/` folder.
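Before dimension reduction, the heterogeneous metric values have to be put on a comparable scale. One plausible way to do this (an assumption for illustration, not necessarily the repository's method) is to replace each metric column with its empirical quantile rank:

```python
import numpy as np

def quantile_rank_features(metric_matrix):
    """Map each metric column to its empirical quantile rank in [0, 1].

    NOTE: quantile ranking is an illustrative assumption; the actual
    feature construction may differ.
    """
    m = np.asarray(metric_matrix, dtype=np.float64)
    ranks = np.empty_like(m)
    n = m.shape[0]
    for j in range(m.shape[1]):
        # argsort of argsort gives each value's rank within its column.
        order = np.argsort(np.argsort(m[:, j]))
        ranks[:, j] = order / max(n - 1, 1)
    return ranks
```

Rank-based scaling is robust to outliers, which matters when metrics come from billions of observations with very different ranges.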
First, run `dimension_reduction_train.py` to train the UMAP model. Then, run `dimension_reduction.py` to reduce the dimensionality of the features.
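The two-script split is a fit/transform workflow: train a reducer once, then apply it to all feature vectors. The repository trains a UMAP model; the sketch below substitutes a plain PCA-via-SVD reducer purely to keep the illustration dependency-free, but the fit/transform structure is the same:

```python
import numpy as np

class PCAReducer:
    """Train-then-transform reducer mirroring the two-script workflow.

    NOTE: the repository uses UMAP; PCA via SVD is only a stand-in here
    to illustrate the fit/transform split without extra dependencies.
    """
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        self.mean_ = X.mean(axis=0)
        # Principal axes come from the SVD of the centered data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=np.float64)
        return (X - self.mean_) @ self.components_.T
```

With umap-learn the calls would be analogous: fit a `umap.UMAP(n_components=2)` model in the training script, then call its `transform` in the second script.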
- Run `sampling/sampling.py`.
- Run `sampling/split.py`.
- Run `sampling/clean_data.ipynb`.

The sampled data will be saved in the `sampling/train` and `sampling/valid` folders.
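The sampling stage implements the balanced, grid-based sampling described above: partition the reduced 2D feature space into cells and draw series evenly across non-empty cells so that dense pattern clusters do not dominate. The sketch below shows the idea only; the parameter names and the round-robin rule are assumptions, not the paper's exact procedure:

```python
import numpy as np

def grid_balanced_sample(embeddings, n_samples, grid_size=10, seed=0):
    """Balanced sampling over a 2D grid (illustrative sketch).

    NOTE: this is a simplified assumption about the grid-sampling step,
    not the repository's exact algorithm (which also uses grid mixup).
    """
    rng = np.random.default_rng(seed)
    emb = np.asarray(embeddings, dtype=np.float64)
    lo, hi = emb.min(axis=0), emb.max(axis=0)
    # Map each point to a grid cell index.
    scaled = (emb - lo) / np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((scaled * grid_size).astype(int), grid_size - 1)
    cell_ids = cells[:, 0] * grid_size + cells[:, 1]
    buckets = {}
    for i, c in enumerate(cell_ids):
        buckets.setdefault(int(c), []).append(i)
    for idxs in buckets.values():
        rng.shuffle(idxs)
    # Round-robin: take one index from each non-empty cell per pass,
    # so every occupied cell contributes before any cell repeats.
    chosen = []
    while len(chosen) < n_samples and buckets:
        for c in list(buckets):
            if buckets[c]:
                chosen.append(buckets[c].pop())
                if len(chosen) == n_samples:
                    break
            else:
                del buckets[c]
    return chosen
```

Under this scheme a cell holding millions of near-duplicate series contributes no more per pass than a sparse cell, which is the balancing effect the corpus is built around.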
- Clean up the code and add comments.