cFlow-Bench is a benchmark dataset for anomaly detection techniques in computational workflows. It contains workflow execution traces, executed on distributed infrastructure, with systematically injected and labeled anomalies, and offers both the raw execution logs and a more compact parsed version. In this GitHub repository, apart from the logs and traces, you will find sample code to load and process the parsed data using PyTorch, as well as the code used to parse the raw logs and events.
The dataset contains 6,352 DAG executions from 11 computational science workflows and 1 ML data science workflow, under normal and anomalous conditions. These workflows were executed using Pegasus WMS (Panorama). Synthetic anomalies were injected using Docker's runtime options to limit and shape performance (a sketch follows the list below). The data have been labeled using 6 tags (normal, cpu_2, cpu_3, cpu_4, hdd_5, and hdd_10).
- normal: No anomaly is introduced (normal conditions).
- CPU K: M cores are advertised on the executor nodes, but on some nodes K of them are not allowed to be used (K = 2, 3, 4; M = 4, 8; K < M).
- HDD K: On some executor nodes, the average write speed to the disk is capped at K MB/s and the read speed at 2×K MB/s (K = 5, 10).
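For illustration, here is a minimal sketch of how such anomalies can be injected through Docker's runtime options, using the `docker` Python SDK (the SDK usage, image name, and device path are assumptions for this example, not the exact injection setup used to produce the dataset):

```python
import docker

client = docker.from_env()

# cpu_2-style anomaly: 4 cores are advertised, but the worker container
# is pinned to only 2 of them.
client.containers.run(
    "pegasus-worker:latest",  # hypothetical worker image
    cpuset_cpus="0-1",
    detach=True,
)

# hdd_5-style anomaly: write speed capped at 5 MB/s, read speed at 10 MB/s.
client.containers.run(
    "pegasus-worker:latest",
    device_write_bps=[{"Path": "/dev/sda", "Rate": 5 * 1024 * 1024}],
    device_read_bps=[{"Path": "/dev/sda", "Rate": 10 * 1024 * 1024}],
    detach=True,
)
```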
Detailed description and statistics of the dataset can be found in `./adjacency_list_dags/README.md`.
- Install the required packages by using:

  ```bash
  bash setup.sh
  ```
- Load data as graphs in `pytorch_geometric` format:

  ```python
  from flowbench.dataset import FlowDataset

  dataset = FlowDataset(root="./", name="montage")
  data = dataset[0]
  ```

  The `data` contains the structural information, accessible via `data.edge_index`, and the node feature information via `data.x`.
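  For illustration, these two tensors plug directly into a `pytorch_geometric` layer (a minimal sketch; the layer choice and hidden size are assumptions, not part of the benchmark):

  ```python
  from torch_geometric.nn import GCNConv

  # one graph-convolution layer over the workflow DAG
  conv = GCNConv(data.num_features, 16)
  h = conv(data.x, data.edge_index)  # node embeddings, shape [num_nodes, 16]
  ```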
- Load data as tabular data in `pytorch` format:

  ```python
  from flowbench.dataset import FlowDataset

  dataset = FlowDataset(root="./", name="montage")
  data = dataset[0]
  Xs = data.x
  ys = data.y
  ```

  Unlike the graph data, the `data` here only contains the node features.
- Load data as tabular data in `numpy` format:

  ```python
  from flowbench.dataset import FlowDataset

  dataset = FlowDataset(root="./", name="montage")
  data = dataset[0]
  Xs = data.x.numpy()
  ys = data.y.numpy()
  ```

  This is the same as the previous one, but the data is in `numpy` format, which is typically used in the models from `sklearn` and `xgboost`.
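  As an illustration, these arrays can be fed straight into a standard `sklearn` estimator (a minimal sketch; the model choice and split are assumptions, not the benchmarked configuration):

  ```python
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # hold out 20% of the samples for evaluation
  X_train, X_test, y_train, y_test = train_test_split(Xs, ys, test_size=0.2)
  clf = RandomForestClassifier().fit(X_train, y_train)
  print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
  ```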
- Load text data with the `huggingface` interface. We have uploaded our parsed text data as a `huggingface` dataset. You can load the data with the following code:

  ```python
  from datasets import load_dataset

  dataset = load_dataset("cshjin/poseidon", "1000genome")
  ```

  The dataset is in the format of a `dict` with the keys `train`, `test`, and `validation`.
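  The splits behave like standard `datasets` splits (a minimal usage sketch; the fields inside each record are not shown here):

  ```python
  sample = dataset["train"][0]      # a single parsed-text record as a dict
  print(dataset["train"].num_rows)  # size of the training split
  ```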
- unsupervised: We provide benchmarks for anomaly detection based on PyGOD and PyOD, for graph data and tabular data, respectively.
  - Check out the scripts under `./example/demo_pygod.py` and `./example/demo_pyod.py` for more details.
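  For example, a PyOD detector can run directly on the tabular features loaded above (a minimal sketch; the detector choice is an assumption, not the benchmarked configuration):

  ```python
  from pyod.models.iforest import IForest

  # unsupervised anomaly detection on the tabular node features
  clf = IForest()
  clf.fit(Xs)                    # Xs from the numpy loading example above
  labels = clf.labels_           # 0 = inlier, 1 = outlier
  scores = clf.decision_scores_  # higher means more anomalous
  ```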
- supervised: We provide supervised methods for both tabular and graph data.
  - Check out the scripts under `./example/demo_mlp.py` and `./example/demo_gnn.py` for more details.
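  For example, a small `pytorch` MLP over the tabular features (a minimal sketch; the architecture and training loop are assumptions, see `./example/demo_mlp.py` for the benchmarked version):

  ```python
  import torch
  import torch.nn as nn

  # Xs, ys from the pytorch loading example above
  model = nn.Sequential(
      nn.Linear(Xs.shape[1], 64),
      nn.ReLU(),
      nn.Linear(64, int(ys.max()) + 1),  # one logit per label
  )
  opt = torch.optim.Adam(model.parameters(), lr=1e-3)
  loss_fn = nn.CrossEntropyLoss()

  for _ in range(10):  # a few full-batch epochs, for brevity
      opt.zero_grad()
      loss = loss_fn(model(Xs.float()), ys.long())
      loss.backward()
      opt.step()
  ```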
- supervised fine-tuned (SFT) LLMs: We provide text-based anomaly detection using fine-tuned language models, with LoRA for efficient training.
  - Check out the script under `./example/demo_sft_lora.py` for more details.
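  For example, LoRA adapters can be attached to a pretrained model with the `peft` library (a minimal sketch; the base model, target modules, and hyperparameters are assumptions, see `./example/demo_sft_lora.py` for the benchmarked setup):

  ```python
  from peft import LoraConfig, get_peft_model
  from transformers import AutoModelForSequenceClassification

  # small encoder as a stand-in base model (assumption)
  base = AutoModelForSequenceClassification.from_pretrained(
      "distilbert-base-uncased", num_labels=2
  )
  config = LoraConfig(
      r=8,                                # low-rank adapter dimension
      lora_alpha=16,
      target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
      task_type="SEQ_CLS",
  )
  model = get_peft_model(base, config)
  model.print_trainable_parameters()      # only the adapters are trainable
  ```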
The repository is structured as follows:
- adjacency_list_dags: Contains JSON files of the executable workflow DAGs in adjacency-list representation.
- images: Contains diagrams of the abstract workflow DAGs and of the processes used to generate the data.
- parsed: Contains the parsed version of the data. The folder is organized into subfolders, one per anomaly label.
- py_script: Contains scripts to load the dataset and run the benchmark.
- raw: Contains the raw logs and scripts to parse them.
```
.
├── adjacency_list_dags
├── benchmark
├── data
│   ├── xxx.zip
├── examples
│   └── demo_xxx.py
├── flowbench
│   ├── nlp
│   ├── llm.py
│   ├── supervised
│   │   ├── mlp.py
│   │   ├── gnn.py
│   │   └── xxx.py
│   └── unsupervised
│       ├── gmm.py
│       ├── pca.py
│       └── xxx.py
├── hps/
├── tests/
├── LICENSE
├── README.md
├── requirements.txt
└── setup.py
```
The dataset is licensed under the Creative Commons Attribution 4.0 International License. The code is licensed under the MIT License.