The repository shares code for the paper Towards AI Analyst: Querying Costly Features for Fraud and Money Laundering Detection.
Source code lives in the src
folder. All scripts are included in the scripts
folder.
Install the requirements from the requirements.txt
file. The code has been run on Python/3.12.
Public data and models live in the local_caches
folder. You can find all the necessary data on Google Drive here: towards-ai-analyst-data. The structure is divided into data
folder and results
folder.
The original Amaretto dataset can be found at https://github.com/necst/amaretto_dataset. Download from the data
subfolder on the Google Drive.
amaretto_dataset_anon.csv.zip
: file that is just the original Amaretto dataset processed into on .csv fileamaretto.pq
: full dataframe with extracted features via feature engineering (as inscripts/data_prep/feature_engineering.py
)amaretto.tar.gz
that unpacks to aamaretto
folder withprior.pq
,val_prior.pq
, andtest_prior.pq
that are time split of prior featurescostly_features.pq
,val_costly_features.pq
, andtest_costly_features.pq
that are time split of costly features
Results are to be unpacked in local_caches/results
folder from the file amaretto_results.tar.gz
. The results include calculated scores and other data from the best trained models in all settings. If you'd like to get the trained models themselves, don't hesitate to write to us and we can provide them.
❯ tar -tzf amaretto_results.tar.gz
amaretto/
amaretto/dime/
amaretto/nn_classifier_prior_probabilities.pkl
amaretto/nn_classifier_full_probabilities.pkl
amaretto/nn_classifier_2f_probabilities.pkl
amaretto/dime/test.npz
amaretto/dime/val.npz
You can load the data and the model results and try for yourself using the notebook notebooks/costly_features_results.ipynb
.
To cite the paper, please use
[tbd]
Gadgil, S., Covert, I.C. and Lee, S.I., Estimating Conditional Mutual Information for Dynamic Feature Selection. In The Twelfth International Conference on Learning Representations.