The `BenLOC` class is designed to facilitate the training and evaluation of machine learning models for optimization problems. It provides methods for data preprocessing, model selection, training, and evaluation. Below is a step-by-step guide on how to use the class:
See `BenLOC/data/dataset.md` for the introduction.
The parameters for the `BenLOC` class can be set during initialization. You can either pass a params object (e.g., `TabParams`) or use the default parameters.
from BenLOC import BenLOC, TabParams  # assuming BenLOC is exported from the package root
params = TabParams(default="default", label_type="log_scaled", shift_scale=10)
model = BenLOC(params)
- `label_type`: Specifies whether the labels are scaled (`log_scaled`) or in their original form (`original`).
- `default`: The column name representing the default time in the dataset.
- `shift_scale`: The shift scale used for the geometric mean during the baseline calculation (see the sketch below).
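For reference, the shifted geometric mean commonly used in solver benchmarking is sketched below; the function name is illustrative, and whether BenLOC's baseline uses exactly this formula is an assumption.
import numpy as np

def shifted_geometric_mean(times, shift=10):
    # Shifted geometric mean: exp(mean(log(t + shift))) - shift.
    # The shift damps the influence of very small running times.
    # (Sketch only -- BenLOC's exact baseline computation may differ.)
    times = np.asarray(times, dtype=float)
    return float(np.exp(np.mean(np.log(times + shift))) - shift)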
You can set a model (e.g., `RandomForestRegressor`) as the trainer for the `BenLOC` class using the `set_trainner` method.
from sklearn.ensemble import RandomForestRegressor
model.set_trainner(RandomForestRegressor(verbose=1))
- `RandomForestRegressor`: A random forest regressor is used as the default estimator for the training process. You can also use other sklearn estimators, such as `LinearRegression`, `SVR`, etc.
- If you want to use a customized estimator, please provide `model.fit()` and `model.predict()` APIs just like sklearn's, as in the sketch after this list.
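As a minimal sketch of such a custom estimator (the class below is illustrative, not part of BenLOC):
import numpy as np

class MeanRegressor:
    # Illustrative custom estimator: always predicts the mean training label.
    # Any object exposing sklearn-style fit()/predict() should work.
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

model.set_trainner(MeanRegressor())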
You can use our datasets or provide your own.
You can load a dataset using the `load_dataset` method. The dataset includes both features and labels and can optionally be preprocessed.
model.load_dataset("setcover", processed=True, verbose=True)
- `dataset`: The dataset to be loaded (e.g., `setcover`, `indset`, `nn_verification`, etc.).
- `processed`: If set to `True`, loads the processed version of the dataset.
- `verbose`: If set to `True`, prints information about the dataset.
After loading the dataset, the `feat` (features) and `label` (labels) are stored, and preprocessing can be applied as needed.
model.set_train_data(Feature, Label)
# model.set_test_data(Test_Feature, Test_Label)  # optional
Both `Feature` and `Label` should be of type `pandas.DataFrame`. Please follow our formats!
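For illustration, a call might look like this; the column names below are hypothetical, and the real format is described in `BenLOC/data/dataset.md`:
import pandas as pd

# Hypothetical columns -- follow the format in BenLOC/data/dataset.md.
Feature = pd.DataFrame({"num_vars": [100, 250], "num_cons": [80, 200]})
Label = pd.DataFrame({"log_scaled": [0.3, -0.1]})

model.set_train_data(Feature, Label)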
If you want to use `model.evaluate`, you can either set a test dataset or set only a training dataset and use a train-test split.
If you have already set a test dataset, please skip this step. To split the dataset into training and testing sets, use the `train_test_split` method.
model.train_test_split(test_size)
- `test_size`: Fraction of the data to be used as the test set (e.g., `0.2` means 20% for testing, 80% for training).
This will divide the features and labels into training and testing sets and preprocess the data accordingly.
Alternatively, you can split the train and test datasets by partition, passing lists of subset names:
model.train_test_split_by_name(train_subset_names, test_subset_names)
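For example, with hypothetical partition names:
# Hypothetical partition names -- use the names of your dataset's subsets.
train_subset_names = ["setcover_train_0", "setcover_train_1"]
test_subset_names = ["setcover_test_0"]
model.train_test_split_by_name(train_subset_names, test_subset_names)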
Once the dataset is loaded and the model is set, you can fit the model using the `fit` method.
model.fit()
The `fit` method will train the model using the processed features (`X`) and labels (`Y`).
- The model is trained on the processed training data using the `trainner` model set earlier.
After training, you can evaluate the model using the `evaluate` method. This method will make predictions on both the training and testing datasets and return a DataFrame with evaluation results.
evaluation_results = model.evaluate()
- `evaluation_results`: A DataFrame containing the evaluation results, including the baseline, default time, model prediction time, and oracle performance for both training and testing datasets.
The evaluation method returns the following columns:
- `baseline_name`: Name of the baseline method used.
- `default_time`: Running time of the default configuration.
- `baseline_time`: Time predicted by the baseline method.
- `rf_time`: Time predicted by the random forest model.
- `ipv_default`: The improvement in performance relative to the default time.
- `ipv_baseline`: The improvement in performance relative to the baseline.
- `oracle`: The optimal oracle time.
- `ipv_oracle`: The improvement relative to the oracle.
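For example, to inspect the key columns of the returned DataFrame:
# Inspect the key evaluation columns (illustrative selection).
cols = ["baseline_name", "default_time", "baseline_time", "rf_time",
        "ipv_default", "ipv_baseline", "oracle", "ipv_oracle"]
print(evaluation_results[cols])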
We support using sklearn's model-selection methods to select hyperparameters.
model.cross_validation(CVer, parameters_space, log_name)
If you use `RandomForest`, `LightGBM`, or `GBDT`, we provide our parameter spaces as defaults.
Use the following commands to check our parameter spaces:
model.rfr_parameter_space
model.lgbm_parameter_space
model.gbdt_parameter_space
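For example, assuming sklearn's `GridSearchCV` is passed as the `CVer` (the log name is illustrative):
from sklearn.model_selection import GridSearchCV

# Use the built-in random-forest parameter space; "rf_cv" is an
# illustrative log name.
model.cross_validation(GridSearchCV, model.rfr_parameter_space, "rf_cv")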
# Step 1: Initialize the model with parameters
from BenLOC import BenLOC, TabParams  # assuming package-root imports, as in Step 1 above
params = TabParams(default="default", label_type="log_scaled", shift_scale=10)
model = BenLOC(params)
# Step 2: Set the machine learning model (e.g., Random Forest)
from sklearn.ensemble import RandomForestRegressor
model.set_trainner(RandomForestRegressor(verbose=1))
# Step 3: Load the dataset and preprocess
model.load_dataset("setcover", processed=True, verbose=True)
# Step 4: Split the dataset into training and test sets
model.train_test_split(test_size=0.2)
# Step 5: Train the model
model.fit()
# Step 6: Evaluate the model and get results
evaluation_results = model.evaluate()
# Output evaluation results
print(evaluation_results)
Before running the script, ensure the following:
- Required packages (e.g., PyTorch, wandb, etc.) are installed.
- Your dataset is prepared and accessible.
- Directories for saving models and results are set up.
Create a `params` object to store the configurations required for training. These include paths, training hyperparameters, and model settings.
params = {
    "use_wandb": True,  # Enable wandb logging
    "modeGNN": "GAT",  # Type of GNN to use (e.g., "GAT", "GCN")
    "fold": 2,  # Fold for cross-validation
    "reproData": False,  # Reproduce (regenerate) the data if True
    "default_index": 7,  # Default problem index
    "report_root_path": "repo/reports",  # Path for reports
    "problem_root_path": "./data",  # Path to problem data
    "batchsize": 32,  # Batch size for training
    "epoch": 100,  # Number of training epochs
    "lr": 0.001,  # Learning rate
    "stepsize": 10,  # Step size for LR scheduler
    "gamma": 0.9,  # Gamma for LR scheduler
    "save_model_path": "./models",  # Directory to save models
    "result_root_path": "./results",  # Directory to save results
    "nn_config": {  # Model-specific configurations
        "hidden_dim": 64,
        "num_heads": 8,
    },
    "seed": 42,  # Random seed for reproducibility
}
Initialize the `GNNTrainer` with the defined parameters.
trainer = GNNTrainer(params)
If you want to customize the model, you can set your own model with:
trainer.set_model(YOUR_MODEL)
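As a minimal sketch, assuming PyTorch Geometric and the `nn_config` dimensions above (the input/output interface expected by `GNNTrainer` is an assumption):
import torch
from torch_geometric.nn import GATConv

class MyGAT(torch.nn.Module):
    # Two-layer GAT sketch; the shapes must match what GNNTrainer
    # actually feeds the model (an assumption here).
    def __init__(self, in_dim, hidden_dim=64, num_heads=8, out_dim=1):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=num_heads)
        self.conv2 = GATConv(hidden_dim * num_heads, out_dim, heads=1)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

trainer.set_model(MyGAT(in_dim=16))  # in_dim=16 is illustrative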
Configure seeds and initialize wandb if enabled. Run this step to ensure consistent results across runs.
trainer.setup_environment()
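Roughly, this corresponds to seeding the usual RNGs and initializing wandb; a sketch of the equivalent manual setup (not the trainer's actual code):
import random
import numpy as np
import torch
import wandb

random.seed(params["seed"])
np.random.seed(params["seed"])
torch.manual_seed(params["seed"])
if params["use_wandb"]:
    wandb.init(project="BenLOC", config=params)  # project name is illustrative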
Load your dataset and split it into training, validation, and test sets. Dataloaders are prepared internally.
trainer.load_data()
Initialize the GNN model, optimizer, and scheduler. Ensure the model is set to the appropriate device (CPU/GPU).
trainer.build_model()
Train the model for the specified number of epochs and validate performance at the end of each epoch. The best model is saved automatically.
trainer.train_and_evaluate()
Evaluate the final model on the test dataset and print or save the results.
test_results = trainer.test_model()
print("Test Results:", test_results)
Save the results of training and testing to a specified directory for further analysis.
trainer.save_results()
For convenience, you can call the `run` method, which executes all of the above steps sequentially.
trainer.run()
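Per the description above, `run` is equivalent to executing the steps in order:
# run() executes the pipeline steps sequentially (as described above).
trainer.setup_environment()
trainer.load_data()
trainer.build_model()
trainer.train_and_evaluate()
trainer.test_model()
trainer.save_results()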
# Import necessary libraries
from trainer import GNNTrainer # Assuming your trainer is saved as trainer.py
# Define parameters
params = {
    "use_wandb": True,
    "modeGNN": "GAT",
    "fold": 2,
    "reproData": False,
    "default_index": 7,
    "report_root_path": "repo/reports",
    "problem_root_path": "./data",
    "batchsize": 32,
    "epoch": 100,
    "lr": 0.001,
    "stepsize": 10,
    "gamma": 0.9,
    "save_model_path": "./models",
    "result_root_path": "./results",
    "nn_config": {
        "hidden_dim": 64,
        "num_heads": 8,
    },
    "seed": 42,
}
# Initialize and run the trainer
trainer = GNNTrainer(params)
trainer.run()