COnditioned process Simulation MOdels (CoSMo)

Codebase for the CoSMo paper accepted at the BPM'24 conference.


Overview

In process mining, the most popular process simulation models are fully data-driven, but such learning-based solutions do not offer sufficient flexibility for real-world applications. This project develops a constrained learning approach that enables deep neural networks to satisfy user-defined constraints at simulation time, addressing the flexibility gap in existing solutions.

I have made significant efforts to ensure this repository is reproducible and understandable. If you face any issues, please open an issue or contact me via email or social media.

Datasets

NOTE: these statistics were computed after the preprocessing step (cached logs), so the numbers of activities and events may differ from those of the original logs in the 4TU repository.

| Event log | # activities | # events (10^3) | # cases (10^3) | Split |
|-----------|--------------|-----------------|----------------|-----------|
| BPI12     | 23           | 104.82          | 11.05 ± 9.64   | Unbiased¹ |
| BPI13     | 6            | 4.89            | 3.69 ± 4.09    | Unbiased¹ |
| BPI17     | 26           | 1210.81         | 38.44 ± 17.96  | Unbiased¹ |
| BPI20     | 50           | 82.22           | 12.01 ± 5.46   | Unbiased¹ |
| SEPSIS    | 8            | 9.87            | 10.37 ± 3.9    | pm4py²    |

¹ Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring

² pm4py library

Installation

Instructions for setting up the environment and installing the dependencies for this repository. Make sure to use Python 3.10.

Note: Tested only on Ubuntu.

# Clone this repository
git clone https://github.com/raseidi/cosmo.git

# Navigate to the project directory
cd cosmo

# Create and activate a virtual environment (tested with conda 23.11.0 only)
conda create --name cosmo python=3.10
conda activate cosmo

# or, using the built-in venv module
python3.10 -m venv cosmoenv
source cosmoenv/bin/activate  # On Windows use `cosmoenv\Scripts\activate`

# Install required packages
pip install -r requirements.txt
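
As a quick sanity check of the environment, you can verify that PyTorch sees your GPU. This is a minimal sketch assuming PyTorch is among the pinned requirements, which the .pt/.pth artifacts and the --device cuda training flag suggest:

# Quick environment check (assumes PyTorch is installed via requirements.txt)
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())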

Usage

Download all the cached data. For debugging purposes, we suggest starting with the sepsis event log, as its preprocessing and training times are relatively short, especially if a GPU is available.

Extract the cached data and make sure the repository looks like:

cosmo/
├── data/
│   ├── bpi12/
│   ├── bpi13_problems/
│   ├── bpi17/
│   ├── bpi20_permit/
│   ├── sepsis/
│   └── simulation/
└── other files...
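
A minimal check that the extracted cache matches this layout (the directory names are taken from the tree above; adjust if your archive differs):

# Verify that the extracted cached data matches the layout described above
from pathlib import Path

expected = ["bpi12", "bpi13_problems", "bpi17", "bpi20_permit", "sepsis", "simulation"]
missing = [name for name in expected if not (Path("data") / name).is_dir()]
print("Missing cached folders:", missing or "none")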

In general, to reproduce the results of this repository, follow these steps:

  1. Extract the declare rules from event logs.
  2. Train the model(s).
  3. Simulate processes based on the trained model.

Discovering declare rules

You can either extract rules for a single dataset:

python preprocess_log.py --log-name sepsis

or run the following script to extract rules from all event logs:

chmod +x scripts/extract_declare.sh
./scripts/extract_declare.sh
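
To inspect the extracted rules, here is a minimal sketch assuming the path from the I/O section below (data/<log>/declare/constraints.pkl); the structure of the pickled object is not documented here, so this only prints its type:

# Inspect the extracted declare rules for a given log (path follows the
# I/O layout described below); we only print the type of the loaded object.
import pickle

with open("data/sepsis/declare/constraints.pkl", "rb") as f:
    constraints = pickle.load(f)

print(type(constraints))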

Training

You can either train a single model instance with a custom configuration:

python train.py \
    --dataset sepsis \
    --template choice \
    --backbone crnn \
    --lr 0.0005 \
    --batch-size 64 \
    --hidden-size 256 \
    --input-size 32 \
    --n-layers 1 \
    --epochs 50

or reproduce the whole paper by running the following bash script (it might take a few hours):

chmod +x scripts/reproduce_paper.sh

./scripts/reproduce_paper.sh

If you want to optimize different hyperparameters, edit the scripts/train.sh script and run it in the same way as the script above; a minimal Python sketch for launching such a sweep is also shown after the argument table below. Unfortunately, though, the current simulation scripts only support the default models trained with the scripts/reproduce_paper.sh script.

Find below the available arguments for the train.py script:

| Argument | Type | Default | Choices | Description |
|----------|------|---------|---------|-------------|
| --dataset | str | sepsis | sepsis, bpi12, bpi13_problems, bpi17, bpi20_permit | Event log to be used |
| --lr | float | 5e-4 | | Learning rate |
| --batch-size | int | 32 | | Batch size |
| --weight-decay | float | 1e-5 | | Weight decay for training regularization |
| --epochs | int | 100 | | Number of epochs |
| --device | str | cuda | cuda, cpu | Whether to run on GPU or CPU |
| --hidden-size | int | 32 | | Number of hidden units |
| --input-size | int | 8 | | Embedding size (input to the hidden layer) |
| --project-name | str | "cosmo-bpm-sim" | | Project name, used only if wandb is enabled |
| --n-layers | int | 1 | | Number of (constrained) recurrent layers |
| --wandb | str | False | True if passed, False otherwise | Enable or disable wandb logging |
| --template | str | "choice" | existence, choice, positive relations, all | Declare template to be trained |
| --backbone | str | "crnn" | crnn, vanilla | Backbone (the constrained RNN proposed in this work, or a vanilla RNN) |

NOTE: wandb is not included in the requirements, but the train.py script accepts the --wandb flag to enable it if you decide to install it.
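
If you prefer launching a sweep from Python rather than editing the bash script, a minimal sketch using only the flags documented above could look like this (the value grids are illustrative, not the ones used in the paper; the paper's runs come from scripts/reproduce_paper.sh):

# Illustrative sweep over a few of train.py's documented flags
import itertools
import subprocess

datasets = ["sepsis"]
templates = ["existence", "choice", "positive relations", "all"]
learning_rates = [5e-4, 1e-3]

for dataset, template, lr in itertools.product(datasets, templates, learning_rates):
    subprocess.run(
        [
            "python", "train.py",
            "--dataset", dataset,
            "--template", template,
            "--backbone", "crnn",
            "--lr", str(lr),
        ],
        check=True,
    )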

Simulating

As specified in the previous table, you can use either the constrained or the vanilla RNN for training. Accordingly, all the simulations can be reproduced by running the following script after training:

chmod +x scripts/simulation_<backbone>.sh
./scripts/simulation_<backbone>.sh

where <backbone> is either crnn or vanilla.

Currently, running a simulation for a custom model trained with different hyperparameters is not supported; that would require a bit of refactoring of the codebase (sorry about that).

I/O files

Event logs

How the data/ directory is structured:

# Example structure
data/
└── bpi12/
    ├── cached_train_test/      # datasets with their respective declare rules
    │   ├── dataset_choice_test.pt
    │   └── dataset_choice_train.pt
    ├── declare/                # declare rules extracted from bpi12
    │   └── constraints.pkl
    ├── train_test/             # raw event logs from the unbiased split paper
    │   ├── train.csv
    │   └── test.csv
    ├── cached_log.pkl          # preprocessed event log
    └── log.xes                 # original log downloaded from the 4TU repository
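
To peek at a preprocessed log, here is a minimal sketch assuming cached_log.pkl holds a pickled pandas object (the extension alone does not guarantee this; pd.read_pickle returns whatever object was stored):

# Load the preprocessed event log and report what kind of object it is
import pandas as pd

log = pd.read_pickle("data/sepsis/cached_log.pkl")
print(type(log))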

Trained models

The models are persisted in the models/ directory. Here is an example of what it looks like:

models/
    └── sepsis/
        └── backbone=crnn-templates=choice-lr=0.0005-bs=64-hidden=256-input=32.pth
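
To inspect a checkpoint, a minimal sketch assuming the .pth file is a standard PyTorch artifact; whether it stores a state dict or a full model is not documented here, so this only prints what was saved:

# Inspect a persisted model file; we only print the type of the saved object
import torch

ckpt = torch.load(
    "models/sepsis/backbone=crnn-templates=choice-lr=0.0005-bs=64-hidden=256-input=32.pth",
    map_location="cpu",
)
print(type(ckpt))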

Simulated logs

The simulated logs will be persisted under the data/simulation/ directory, which is organized into two subfolders, crnn and vanilla. Example:

data/simulation/
├── crnn/
│   └── dataset=sepsis-template=positive relations-sim_strat=original-sampling_strat=multinomial.pkl
└── vanilla/
    └── dataset=sepsis-template=all-sim_strat=original.pkl
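
The file names encode the run configuration as dash-separated key=value pairs. A purely illustrative helper to recover that metadata (the repository itself may parse these names differently):

# Parse the key=value metadata encoded in a simulated-log file name
from pathlib import Path

def parse_run_name(path: str) -> dict:
    stem = Path(path).stem  # drop the .pkl extension
    return dict(pair.split("=", 1) for pair in stem.split("-"))

meta = parse_run_name(
    "data/simulation/vanilla/dataset=sepsis-template=all-sim_strat=original.pkl"
)
print(meta)  # {'dataset': 'sepsis', 'template': 'all', 'sim_strat': 'original'}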

Citation

@InProceedings{Oyamada2023cosmo,
  author="Oyamada, Rafael Seidi
  and Marques Tavares, Gabriel
  and Barbon Junior, Sylvio
  and Ceravolo, Paolo",
  editor="Marrella, Andrea
  and Resinas, Manuel
  and Jans, Mieke
  and Rosemann, Michael",
  title="CoSMo: A Framework to Instantiate Conditioned Process Simulation Models",
  booktitle="Business Process Management",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="328--344",
  isbn="978-3-031-70396-6"
}
