A program designed to find the best Minimally Complex Model (MCM) for a given discrete dataset. As MCMs are spin models containing spin operators of arbitrary interaction order characterized by a low model complexity, this program can detect relevant higher-order interactions within the dataset. The program can handle datasets with up to 128 variables.
The search process is divided into two steps. The first step is to determine the best set of variables that can be used as the basis representation for the data.
For a system of size
The exhaustive search involves generating all possible partitions and calculating their log evidence.
As this algorithm goes through all possible partitions, it is guaranteed to find the best one.
For a system with
The greedy search method consists of an iterative merging procedure.
We start with the partition where every variable forms a component, i.e. an MCM with
Because it does not go through all possible partitions as in the exhaustive search, there is no guarantee of finding the best MCM. However, by continuously selecting the merge that increases the log evidence the most, the final partition will be a good estimation of the best MCM.
In the divide and conquer method, we start from the complete model, i.e. an MCM with 1 component of size
- The code uses C++ 11.
- CMake v3.26.4+ (Optional)
Run the script build_and_test.sh
to compile the program and execute some tests to ensure that the code works correctly:
sh build_and_test.sh
mkdir build
cd build
g++ -std=c++11 -O3 ../src/main.cpp ../src/model/*.cpp ../src/search_algorithms/*.cpp -o ./mcm_discrete.exe
After running the script build_and_test.sh
, a build folder is created containing the program mcm_discrete
, which can be executed from the command line:
cd build
./mcm_discrete -f filename -q val_of_q -n n_var -search_method
The result of the search is written to a file in the output
folder.
-
-f filename
: path to the file containing the data relative to theinput
folder (without the.dat
). -
-q val_of_q
: integer that specifies the number of values each variable can take. -
-n n_var
: number of variables in the system. -
-search_method
: the chosen search algorithm. Options are-es
for an exhaustive search,-gs
for a greedy search and-dc
for the divide and conquer approach. Multiple options are possible. -
-gt
: (Optional) Indicates if a transformation to the best basis should be done before one of the search algorithms. Without this option, the program finds the best partition using the original$n$ variables -
-l
: (Optional) Indicates if the intermediate steps of the search algorithm should be written to a separate file in theoutput
folder. Only in the case of the greedy search and divide and conquer method.
The input folder contains the US Supreme Court voting dataset [1]. Running the program for this dataset using a basis transformation and all three search algorithms:
cd build
./mcm_discrete -f US_SupremeCourt_n9_N895 -q 2 -n 9 -gt -es -gs -dc
The result of this calculating is written to the file US_SupremeCourt_n9_N895_output.dat
.
The first part contains the spin operators that form the best basis written as bitstrings:
Best basis:
000100100
000001010
000000011
100010000
100000100
101000000
010000010
001000001
000000100
For example, the first spin operator, 000100100
, is a pairwise interaction between variable 4 and 7.
The spin operators are ordered from low to high entropy.
The second part of the output file is the result of every search algorithm:
#####################
# Exhaustive search #
#####################
Number of equivalent best MCMs found : 1
Best MCM(s):
Component 0 : 100000000
Component 1 : 011000100
Component 2 : 000111011
For the exhaustive search, the best MCM has three components. The first component contains only the first variable. Variables 2,3 and 7 form the second component and the third component consists of variables 4,5,6,8 and 9. In the rare case that multiple MCMs have the same lowest log evidence, all of them will be found (only for the exhaustive search).
[1] H.J. Spaeth, L. Epstein, T.W. Ruger, K. Whittington, J.A. Segal, A.D. Martin:SupremeCourtdatabase. Version 2011 Release 3 (2011). http://scdb.wustl.edu/index.php.