
Commit 76ec84f

gwaybio and cgreene authored
Adding Simulation Experiment - Importance of Compression Feature Number (#188)
* add simulation command to full analysis pipeline
* add notebook to generate simulated data
* add simulated data
* add notebook to perform biobombe approach on simulated data
* add results of simulated biobombe
* add notebook to visualize results of the simulation
* add figure
* add README describing experiment
* add analysis script
* run analysis pipeline and add converted scripts
* output covariance structure data
* rerun pipeline after adding covariance data output
* update readme for updated pipeline run of new figure
* Update 11.simulation-feature-number/README.md (Co-Authored-By: Casey Greene <cgreene@users.noreply.github.com>)
* Update 11.simulation-feature-number/README.md (Co-Authored-By: Casey Greene <cgreene@users.noreply.github.com>)
* Update 11.simulation-feature-number/README.md (Co-Authored-By: Casey Greene <cgreene@users.noreply.github.com>)
* add summary figure; also make the feature number index at 1 instead of 0
* add updated figures and rerun
* visualize only k = 6 row
* rerun command and add regenerated figures
* rename y axis in summary figure

Co-authored-by: Casey Greene <cgreene@users.noreply.github.com>
1 parent 843923e · commit 76ec84f

19 files changed: +54343 −0 lines changed

11.simulation-feature-number/0.generate-simulated-data.ipynb

Lines changed: 452 additions & 0 deletions

11.simulation-feature-number/1.compression-simulation.ipynb

Lines changed: 614 additions & 0 deletions

11.simulation-feature-number/2.visualize-feature-importance.ipynb

Lines changed: 442 additions & 0 deletions
Lines changed: 45 additions & 0 deletions
# Simulate Compressed Feature Extraction

## Simulated Data

In the following analysis module, we generate data with two groups of artificial signals.
We generate a 10,000 x 10 matrix in which two groups of three features each are highly correlated.
The remaining features are sampled as random Gaussian noise.

The two groups of features have different correlation structures.
The first group (features 1, 2, and 3) is simulated to be highly correlated, with Pearson correlations ranging from 0.90 to 0.95.
The second group (features 5, 6, and 7) is simulated to be correlated to a slightly lesser degree (range: 0.85 - 0.90).

| Group | Correlated Features | Correlation Range |
| :---- | :------------------ | :---------------- |
| 1     | 1, 2, 3             | 0.90 - 0.95       |
| 2     | 5, 6, 7             | 0.85 - 0.90       |
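This setup can be sketched as follows. The sketch is a hypothetical reconstruction, not the notebook's exact code: the seed is arbitrary, and the off-diagonal values are taken from the covariance structure file included in this commit.

```python
import numpy as np

# Two groups of correlated features embedded in a 10,000 x 10 Gaussian matrix.
# Pairwise values mirror the covariance matrix shipped with this commit.
n_samples, n_features = 10000, 10

cov = np.eye(n_features)
# Group 1: features 1, 2, 3 (0-indexed 0, 1, 2), correlations 0.90 - 0.95
cov[0, 1] = cov[1, 0] = 0.95
cov[0, 2] = cov[2, 0] = 0.93
cov[1, 2] = cov[2, 1] = 0.90
# Group 2: features 5, 6, 7 (0-indexed 4, 5, 6), correlations 0.85 - 0.90
cov[4, 5] = cov[5, 4] = 0.90
cov[4, 6] = cov[6, 4] = 0.88
cov[5, 6] = cov[6, 5] = 0.85

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
X = rng.multivariate_normal(mean=np.zeros(n_features), cov=cov, size=n_samples)
```

With unit variances on the diagonal, the covariance matrix doubles as the correlation matrix, so the empirical correlations of `X` land close to the targets above.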
## Experimental Design

For this experiment, we apply the BioBombe suite of algorithms (PCA, ICA, NMF, DAE, VAE) across six latent dimensionalities (k = 1, 2, 3, 4, 5, 6).
We fit each model and extract the resulting weight matrix.
We then observe the contributions (weights, or importance scores) of each raw input feature to each compressed feature.

Our goal is to determine the **number** of the compressed feature that best represents each simulated feature in both groups.
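A minimal sketch of this fitting loop, using scikit-learn analogues of the linear models (the DAE and VAE require a neural network library and are omitted). The stand-in data, the min-max scaling, and the loop structure are illustrative assumptions, not the notebooks' exact code.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the simulated matrix described above
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X_scaled = MinMaxScaler().fit_transform(X)  # NMF requires nonnegative input

top_feature = {}
for k in range(1, 7):
    models = {
        "pca": PCA(n_components=k),
        "ica": FastICA(n_components=k, max_iter=1000),
        "nmf": NMF(n_components=k, init="nndsvda", max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_scaled)
        weights = model.components_  # weight matrix, shape (k, 10)
        # For each raw input feature, record which compressed feature
        # number carries the largest absolute weight
        top_feature[(name, k)] = np.abs(weights).argmax(axis=0)
```

The question the experiment asks is then: for a given algorithm and k, which compressed feature number shows up for the group-1 features (1, 2, 3) and which for the group-2 features (5, 6, 7)?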
## Results

![summary](figures/simulated_feature_number.png)

We observe that most algorithms (for k = 2 and larger) aggregate both groups of artificial signals.
As expected, the PCA models identify the same solution across latent dimensionalities, and the **compressed feature number** associated with each signal does not change.
Furthermore, the top compressed feature (`pca_0`) isolates signal from the more highly correlated group, while `pca_1` represents the second, less correlated group.
We see a similar pattern (although flipped by artificial signal strength) in the NMF models.

We observe that most ICA, DAE, and VAE models successfully isolate both artificial signals.
DAE with k = 5 and k = 6, and VAE with k = 4 and k = 6, are the only suboptimal models.
In these algorithms, the **top compressed feature number** is also not consistent.
For example, for VAE with k = 5, the top feature isolating signal group 1 is VAE compressed feature 2, while VAE compressed feature 1 represents signal group 2.
We see a similarly randomized pattern for other initializations of these models.

## Conclusion

For algorithms other than PCA (i.e., algorithms that rely on random initialization before training), the compressed feature number is unrelated to the variance explained by the model.
In these models, the compressed feature number does not contain information about the importance of the extracted representation.
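The contrast drawn in this conclusion can be illustrated with a small sketch (hypothetical stand-in data; `component_variance` is our own illustrative construction, not a BioBombe quantity): PCA orders its components by explained variance, so the compressed feature number encodes importance, while ICA components carry no comparable ordering.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Stand-in correlated data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))

pca = PCA(n_components=5).fit(X)
# PCA sorts components by explained variance, so pca_0 is always
# the most important compressed feature
assert np.all(np.diff(pca.explained_variance_ratio_) <= 0)

# For ICA, approximate the variance each component contributes via the
# squared norm of its mixing-matrix column; this quantity is generally
# NOT sorted by component number
ica = FastICA(n_components=5, max_iter=1000, random_state=1)
ica.fit(X)
component_variance = (ica.mixing_ ** 2).sum(axis=0)
```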
Lines changed: 11 additions & 0 deletions

feature_num  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8  feature_9  feature_10
feature_1    1          0.95       0.93       0          0          0          0          0          0          0
feature_2    0.95       1          0.9        0          0          0          0          0          0          0
feature_3    0.93       0.9        1          0          0          0          0          0          0          0
feature_4    0          0          0          1          0          0          0          0          0          0
feature_5    0          0          0          0          1          0.9        0.88       0          0          0
feature_6    0          0          0          0          0.9        1          0.85       0          0          0
feature_7    0          0          0          0          0.88       0.85       1          0          0          0
feature_8    0          0          0          0          0          0          0          1          0          0
feature_9    0          0          0          0          0          0          0          0          1          0
feature_10   0          0          0          0          0          0          0          0          0          1

11.simulation-feature-number/data/simulated_signal_n1000_p10.tsv

Lines changed: 10001 additions & 0 deletions

11.simulation-feature-number/results/compression_simulation_results.tsv

Lines changed: 106 additions & 0 deletions
