llmabba is a software framework for time series analysis using Large Language Models (LLMs) based on symbolic representation, as introduced in the paper LLM-ABBA: Understanding Time Series via Symbolic Approximation.
Time series analysis involves identifying patterns, trends, and structures within data sequences. Traditional methods like discrete wavelet transforms or symbolic aggregate approximation (SAX) convert continuous time series into symbolic representations for better analysis and compression. However, these methods often struggle with complex patterns.
llmabba enhances these techniques by leveraging LLMs, which excel in pattern recognition and sequence prediction. By applying LLMs to symbolic time series, llmabba discovers rich, meaningful representations, offering:
- Higher accuracy and compression: Better symbolic representations via LLMs, improving data compression and pattern accuracy.
- Adaptability: Robust performance across domains like finance, healthcare, and environmental science.
- Scalability: Efficient handling of large-scale time series datasets.
- Automatic feature discovery: Uncovers novel patterns that traditional methods may miss.
Key features:
- Symbolic Time Series Approximation: Converts time series into symbolic representations.
- LLM-Powered Encoding: Enhances compression and pattern discovery.
- Efficient and Scalable: Suitable for large-scale datasets.
- Flexible Integration: Compatible with machine learning and statistical workflows.
To set up a virtual environment, use one of these methods:
Using venv:
mkdir ~/.myenv
python -m venv ~/.myenv
source ~/.myenv/bin/activate
Using conda:
conda create -n myenv
conda activate myenv
Then, install llmabba via pip:
pip install llmabba
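To verify the installation, you can try importing the two main classes used throughout this guide (a minimal sanity check; it only confirms that the package and its submodule import cleanly):

# Quick import check after installation
from llmabba import ABBA
from llmabba.llmabba import LLMABBA
print("llmabba imported successfully")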
The ABBA class provides quantized symbolic approximation using Fixed-point Adaptive Piecewise Linear Continuous Approximation (FAPCA). It converts numerical time series into symbolic strings (e.g., letters like 'A', 'B' representing trends such as increasing or decreasing segments) and supports reconstruction back to numerical form.
Key Parameters:
- tol: Tolerance for approximation error (smaller values increase segmentation granularity).
- alpha: Quantization parameter (controls symbol assignment to segments).
- init: Initialization method for FAPCA (e.g., 'agg' for aggregation-based).
- scl: Scaling factor for approximation.
Input Format:
- A list of lists, where each inner list is a 1D time series of floating-point numbers (e.g., multiple time series samples).
Output:
- encode: Returns a list of symbolic strings (e.g., ['A B C'] for each time series).
- decode: Reconstructs the time series as a list of lists of floats, approximating the input.
Example:
from llmabba import ABBA
# Sample time series data
ts = [[1.2, 1.4, 1.3, 1.8, 2.2, 2.4, 2.1], [1.2, 1.3, 1.2, 2.2, 1.4, 2.4, 2.1]]
abba = ABBA(tol=0.1, alpha=0.1, init='agg', scl=3)
symbolic_representation = abba.encode(ts)
print("Symbolic Representation:", symbolic_representation)
reconstruction = abba.decode(symbolic_representation)
print("Reconstruction:", reconstruction)
This example encodes two time series into symbolic strings and reconstructs them, with minor deviations based on tol. The output can be used for pattern matching, compression, or visualization.
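For instance, a quick way to inspect the approximation quality is to overlay each reconstruction on its original series (a minimal sketch using matplotlib, reusing the ts and reconstruction variables from the example above):

import matplotlib.pyplot as plt

# Overlay each original series with its ABBA reconstruction
for i, (original, rec) in enumerate(zip(ts, reconstruction)):
    plt.plot(original, label=f"original {i}")
    plt.plot(rec, linestyle="--", label=f"reconstruction {i}")
plt.legend()
plt.title("ABBA symbolic approximation and reconstruction")
plt.show()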
The LLMABBA class integrates ABBA with LLMs (e.g., Mistral-7B) for advanced tasks such as classification, regression, and forecasting. It processes time series into symbolic representations, tokenizes them for LLM input, fine-tunes the model, and performs inference. The class supports 4-bit quantization (via BitsAndBytesConfig) and LoRA (Low-Rank Adaptation) for efficient fine-tuning, along with FSDP for distributed training. This guide provides an overview of LLM-ABBA's functionality, including installation, symbolic approximation with the ABBA class, and advanced tasks (classification, regression, forecasting) using the LLMABBA class. See the GitHub repository (https://github.com/inEXASCALE/llm-abba) and its examples folder for additional details and parameter tuning. For more details on the use of fABBA and ABBA, see the fABBA repository (https://github.com/nla-group/fABBA).
Key Steps:
- Step 1. Data Preparation: Load numerical time series and labels, scale data (e.g., z-score or min-max), and split into training/validation sets.
- Step 2. Processing: Use ABBA to symbolize time series, then tokenize with the LLM's tokenizer.
- Step 3. Training: Fine-tune the LLM on the tokenized dataset using LoRA and FSDP.
- Step 4. Inference: Generate predictions on new data, leveraging the fine-tuned model.
Input Format:
- Training (process method):
  - data: Dictionary with:
    - 'X_data': 2D NumPy array (n_samples × time_series_length, floats).
    - 'Y_data': 1D NumPy array (n_samples; integers/strings for classification, floats for regression).
  - For forecasting, data is a 2D NumPy array, and the method creates input/output patches of lengths seq_len_pre and seq_len_post.
- Inference:
  - data: 2D NumPy array (n_test_samples × time_series_length, floats; can be (1, length) for a single sample).
Output:
- Classification/Regression: A string with the predicted class or value (e.g., "0" for abnormal, "Normal", or a numerical value).
- Forecasting: A 2D NumPy array (seq_len_post × n_features, floats) with the predicted time series continuation.
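For illustration, the snippet below builds inputs in these shapes with synthetic data (a hedged sketch; the array sizes are placeholders, and the variables are fed to process() and inference() as shown in the full example further down):

import numpy as np

# Classification/regression training input: a dictionary with features and labels
X_data = np.random.randn(100, 187)            # 100 samples, 187 time points each
Y_data = np.random.randint(0, 2, size=100)    # integer class labels
train_data_split = {'X_data': X_data, 'Y_data': Y_data}

# Forecasting training input: a single 2D array; process() cuts it into
# patches of lengths seq_len_pre (input) and seq_len_post (output)
forecast_data = np.random.randn(1000, 3)      # 1000 time points, 3 features

# Inference input: 2D array with one row per test sample
test_sample = np.random.randn(1, 187)         # a single sample of length 187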
Key Parameters:
- Initialization (see the sketch after this list):
  - abba_tol, abba_alpha, abba_init, abba_scl: ABBA parameters for symbolic approximation.
  - lora_r, lora_alpha, lora_dropout: LoRA parameters for fine-tuning.
  - quant_process: Enable 4-bit quantization (default: True).
- Processing:
  - project_name: Identifier for saving models/scalers.
  - task: 'classification', 'regression', or 'forecasting'.
  - prompt: Task-specific instruction for the LLM.
  - scalar: Scaling method ('z-score' or 'min-max').
  - alphabet_set: Custom symbols or None (uses the LLM tokenizer vocabulary).
  - seq_len_pre / seq_len_post: Sequence lengths for forecasting input/output.
- Training:
  - num_epochs: Number of training epochs.
  - output_dir: Directory for saving checkpoints.
- Inference:
  - llm_max_new_tokens: Maximum tokens for LLM output.
  - llm_temperature: Controls randomness (0.0 for deterministic).
  - llm_repetition_penalty: Penalizes repeated tokens.
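As an illustration, the initialization settings above would be supplied when constructing LLMABBA, while the Processing, Training, and Inference keywords appear in the full example below (a hedged sketch; the keyword names follow the list above and the numeric values are placeholders, not recommended defaults):

from llmabba.llmabba import LLMABBA

# Hedged sketch: construct LLMABBA with explicit ABBA and LoRA settings
llmabba_model = LLMABBA(
    abba_tol=0.01,        # ABBA tolerance for symbolic approximation
    abba_alpha=0.01,      # ABBA quantization parameter
    abba_init='agg',      # ABBA initialization method
    abba_scl=3,           # ABBA scaling factor
    lora_r=16,            # LoRA rank
    lora_alpha=32,        # LoRA scaling
    lora_dropout=0.05,    # LoRA dropout
    quant_process=True    # 4-bit quantization via BitsAndBytesConfig
)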
Example: ECG Classification with PTBDB
This example demonstrates binary classification (abnormal vs. normal ECG signals) using the PTB Diagnostic ECG Database (~187 time points per sample). To run the example, you will need a Hugging Face account with access to Mistral-7B-Instruct-v0.1:
import pandas as pd
import numpy as np
from llmabba.llmabba import LLMABBA
from sklearn.model_selection import train_test_split
project_name = "PTBDB"
task_type = "classification"
model_name = 'mistralai/Mistral-7B-Instruct-v0.1'
prompt_input = """This is a classification task. Identify the 'ECG Abnormality' according to the given 'Symbolic Series'."""
# Load and process data
abnormal_df = pd.read_csv('../test_data/ptbdb_abnormal.csv', header=None)
normal_df = pd.read_csv('../test_data/ptbdb_normal.csv', header=None)
Y_data = np.concatenate((np.zeros([abnormal_df.shape[0]], dtype=int), np.ones([normal_df.shape[0]], dtype=int)))
X_data = pd.concat([abnormal_df, normal_df]).to_numpy()
arranged_seq = np.random.permutation(len(Y_data))
train_data, test_data, train_target, test_target = train_test_split(X_data[arranged_seq], Y_data[arranged_seq], test_size=0.2)
train_data_split = {'X_data': train_data[:500], 'Y_data': train_target[:500]}
# Train with LLM-ABBA
LLMABBA_classification = LLMABBA()
model_input, model_tokenizer = LLMABBA_classification.model(model_name=model_name, max_len=2048)
tokenized_train_dataset, tokenized_val_dataset = LLMABBA_classification.process(
project_name=project_name,
data=train_data_split,
task=task_type,
prompt=prompt_input,
alphabet_set=-1,
model_tokenizer=model_tokenizer,
scalar="z-score"
)
LLMABBA_classification.train(
model_input=model_input,
num_epochs=1,
output_dir='../save/',
train_dataset=tokenized_train_dataset,
val_dataset=tokenized_val_dataset
)
# Inference on a single test sample (kept 2D with shape (1, time_series_length))
test_data = np.expand_dims(test_data[1], axis=0)
peft_model_input, model_tokenizer = LLMABBA_classification.model(
    peft_file='../save/checkpoint-25/',  # LoRA checkpoint written by train() above
    model_name=model_name,
    max_len=2048
)
out_text = LLMABBA_classification.inference(
project_name=project_name,
data=test_data,
task=task_type,
prompt=prompt_input,
ft_model=peft_model_input,
model_tokenizer=model_tokenizer,
scalar="z-score",
llm_max_length=256,
llm_repetition_penalty=1.9,
llm_temperature=0.0,
llm_max_new_tokens=2
)
print(out_text)
LLM-ABBA automatically saves the scaler and ABBA model during processing to ../save/[project_name]/[task]_Scaler_save.pkl and ../save/[project_name]/[task]_ABBA_save.pkl. These are loaded during inference to ensure consistent scaling and symbolization.
Example:
from llmabba.llmabba import save_abba
save_abba(LLMABBA_classification.xabba, "../save/PTBDB/classification_ABBA_save.pkl")
# Loading
from llmabba.llmabba import load_abba
xabba = load_abba("../save/PTBDB/classification_ABBA_save.pkl")
Currently under development.
1. Why are some LLM-generated tokens not recognized by ABBA?
LLM hallucination may generate tokens outside ABBA’s alphabet. Solutions include:
- Method A: Skip or replace unrecognized tokens, using repetition_penalty=1.3 and temperature=0.0 (see the sketch after this list).
- Method B: Use as many LLM tokens as possible (e.g., 32,000 for Mistral) to avoid this issue.
- Method C: Use ASCII symbols instead of LLM tokens, specifying them in the prompt (under testing).
- Method D: Fine-tune the LLM with add_tokens (not recommended).
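A minimal sketch of Method A, dropping generated symbols that are not in the fitted ABBA alphabet before decoding (the filter_to_alphabet helper and the way the alphabet is obtained are hypothetical; adapt them to the ABBA object saved during processing):

def filter_to_alphabet(generated_symbols, abba_alphabet):
    # Keep only symbols the fitted ABBA model knows; drop hallucinated tokens
    return [s for s in generated_symbols if s in abba_alphabet]

# Hypothetical usage with a toy alphabet
abba_alphabet = set("ABCDE")
print(filter_to_alphabet(list("ABXCZ"), abba_alphabet))  # ['A', 'B', 'C']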
2. Is inhibition level necessary?
Adjusting the inhibition level helps LLMs adapt to downstream tasks, especially with special symbols (Method C).
We welcome contributions in any form! Assistance with documentation is always welcome. To contribute, feel free to open an issue, or fork the project, make your changes, and submit a pull request. We will do our best to work through any issues and requests.
Released under the MIT License.
For questions, use GitHub issues or contact the paper’s authors.
Users of the LLM-ABBA and ABBA methods, as well as this software, can cite:
@misc{carson2024,
  author        = {Erin Carson and Xinye Chen and Cheng Kang},
  title         = {{LLM-ABBA}: Understanding time series via symbolic approximation},
  year          = {2024},
  eprint        = {2411.18506},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2411.18506},
}

@article{10.1145/3532622,
  author    = {Chen, Xinye and G\"{u}ttel, Stefan},
  title     = {An Efficient Aggregation Method for the Symbolic Representation of Temporal Data},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  volume    = {17},
  number    = {1},
  doi       = {10.1145/3532622},
  journal   = {ACM Trans. Knowl. Discov. Data},
  numpages  = {22},
}

@article{Chen2024,
  author    = {Chen, Xinye and G\"{u}ttel, Stefan},
  title     = {{fABBA}: A Python library for the fast symbolic approximation of time series},
  doi       = {10.21105/joss.06294},
  year      = {2024},
  publisher = {The Open Journal},
  volume    = {9},
  number    = {95},
  pages     = {6294},
  journal   = {Journal of Open Source Software},
}