Interpreting supervised machine learning inferences in population genomics using haplotype matrix permutations
Supervised machine learning methods, such as convolutional neural networks (CNNs), that use haplotype matrices as input data have become powerful tools for population genomics inference. However, these methods often lack interpretability, making it difficult to understand which population genetic features drive their predictions—a critical limitation for method development and biological interpretation. Here we introduce a systematic permutation approach that progressively disrupts population genetics features within input test haplotype matrices, including linkage disequilibrium, haplotype structure, and allele frequencies. By measuring performance degradation after each permutation, the importance of each feature can be assessed. We applied our approach to three published CNNs for positive selection and demographic history inference.
In this repository, we use the term "ConfuseNN" to refer to our permutation approach, since we are attempting to "confuse" the networks by testing them on disrupted data.
To reproduce the results for each of the three CNNs evaluated in this work, refer to the corresponding subdirectory; each has its own specifications.
For ease of adoption, we provide code to perform all permutations described in our paper, with visualization, in minimal_example.ipynb.
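To give a rough picture of what such permutations look like, here is a minimal NumPy sketch. It is not the code from minimal_example.ipynb, and the function names are hypothetical. It assumes a binary haplotype matrix with haplotypes as rows and sites as columns; if your matrices use the opposite orientation, swap the axes.

```python
import numpy as np


def shuffle_site_order(hap, rng):
    """Reorder whole columns (sites); the contents of each column are unchanged."""
    return rng.permutation(hap, axis=1)


def shuffle_within_sites(hap, rng):
    """Shuffle alleles among haplotypes independently at each site (column)."""
    return rng.permuted(hap, axis=0)


def shuffle_within_haplotypes(hap, rng):
    """Shuffle alleles along each haplotype (row) independently."""
    return rng.permuted(hap, axis=1)


rng = np.random.default_rng(0)
# Toy input (assumption): 8 haplotypes x 12 sites, 0 = ancestral, 1 = derived.
hap = rng.integers(0, 2, size=(8, 12))

site_order_shuffled = shuffle_site_order(hap, rng)
within_site_shuffled = shuffle_within_sites(hap, rng)
within_hap_shuffled = shuffle_within_haplotypes(hap, rng)
```

Under these assumptions, shuffling within sites keeps every per-site allele count but breaks associations between sites, shuffling site order keeps each column intact but scrambles its position, and shuffling within haplotypes keeps the number of derived alleles per haplotype while generally changing per-site allele frequencies.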
How these permutations are applied in practice likely varies depending on the simulation and training procedure.
For examples of how we customized our permutation approach to each CNN, refer to the corresponding subdirectory, which contains further descriptions.
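Whatever the customization, the evaluation step has the same general shape: score the trained network on the original test set, score it again on the permuted test set, and compare. The sketch below illustrates that comparison under stated assumptions. `toy_predict` is a hypothetical stand-in for a trained CNN, the test data are synthetic, and accuracy is used only as an example metric; the metric appropriate for each network may differ.

```python
import numpy as np


def accuracy(predict, X, y):
    """Fraction of test matrices whose predicted label matches the true label."""
    preds = np.array([predict(x) for x in X])
    return float(np.mean(preds == y))


def accuracy_drop(predict, X, y, permute, rng):
    """Return (baseline, permuted, drop) accuracy for one permutation."""
    baseline = accuracy(predict, X, y)
    X_perm = np.stack([permute(x, rng) for x in X])
    permuted = accuracy(predict, X_perm, y)
    return baseline, permuted, baseline - permuted


def toy_predict(hap):
    """Hypothetical classifier: class 1 if derived alleles concentrate on the left."""
    half = hap.shape[1] // 2
    return int(hap[:, :half].mean() > hap[:, half:].mean())


def shuffle_site_order(hap, rng):
    """Reorder whole columns (sites) of one haplotype matrix."""
    return rng.permutation(hap, axis=1)


rng = np.random.default_rng(0)
# Synthetic test set (assumption): 200 matrices of 8 haplotypes x 12 sites.
n, rows, cols = 200, 8, 12
y = rng.integers(0, 2, size=n)
X = np.zeros((n, rows, cols), dtype=int)
X[y == 1, :, : cols // 2] = 1  # class 1: derived alleles in the left half
X[y == 0, :, cols // 2 :] = 1  # class 0: derived alleles in the right half

baseline, permuted, drop = accuracy_drop(toy_predict, X, y, shuffle_site_order, rng)
print(f"baseline={baseline:.2f} permuted={permuted:.2f} drop={drop:.2f}")
```

A large drop suggests the classifier relies on the feature that the permutation disrupts; a drop near zero suggests it does not. With a real network, `predict` would wrap the model's inference call, and the permutation would be applied in a way that matches that network's input preprocessing.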