# Simulate Compressed Feature Extraction

## Simulated Data

In the following analysis module, we generate data containing two groups of artificial signals.
We simulate a 10,000 x 10 matrix in which two groups of three features each are highly correlated.
The remaining features are filled with random Gaussian noise.

The two groups of features have different correlation structures.
The first group (features 1, 2, and 3) is simulated to be highly correlated, with pairwise Pearson correlations ranging from 0.90 to 0.95.
The second group (features 5, 6, and 7) is simulated to be correlated to a slightly lesser degree (range: 0.85 - 0.90).
The table below summarizes both groups, and a simulation sketch follows.

| Group | Correlated Features | Correlation Range |
| :---- | :------------------ | :---------------- |
| 1     | 1, 2, 3             | 0.90 - 0.95       |
| 2     | 5, 6, 7             | 0.85 - 0.90       |
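
The repository's exact simulation code is not shown here; the following is a minimal sketch, assuming NumPy, an arbitrary random seed, and a single representative correlation value per group (0.92 and 0.87), of how such a matrix could be generated.

```python
import numpy as np

# Minimal sketch of the simulated data (not the repository's exact code).
# Assumption: one representative correlation per group rather than a
# sampled range; the seed is arbitrary.
rng = np.random.default_rng(123)
n_samples, n_features = 10_000, 10

# Start from an identity correlation matrix (independent Gaussian noise).
corr = np.eye(n_features)

# Group 1: features 1, 2, and 3 (0-indexed 0, 1, 2), correlation ~0.90 - 0.95.
g1 = np.array([0, 1, 2])
corr[np.ix_(g1, g1)] = 0.92
corr[g1, g1] = 1.0  # restore unit variance on the diagonal

# Group 2: features 5, 6, and 7 (0-indexed 4, 5, 6), correlation ~0.85 - 0.90.
g2 = np.array([4, 5, 6])
corr[np.ix_(g2, g2)] = 0.87
corr[g2, g2] = 1.0

# Draw the 10,000 x 10 matrix; features 4, 8, 9, and 10 remain pure noise.
X = rng.multivariate_normal(mean=np.zeros(n_features), cov=corr, size=n_samples)
print(np.round(np.corrcoef(X, rowvar=False), 2))  # check the block structure
```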
## Experimental Design

For this experiment we apply the BioBombe suite of algorithms (PCA, ICA, NMF, DAE, and VAE) across six different latent dimensionalities (k = 1, 2, 3, 4, 5, 6).
We fit each model and extract the resulting weight matrix.
We then observe the contributions (weights or importance scores) of each raw input feature to each compressed feature, as sketched below.

Our goal is to determine the **number** of the compressed feature that best represents each simulated feature in both groups.
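
To make the design concrete, here is a simplified sketch of the sweep using scikit-learn for the three linear methods; the DAE and VAE models are omitted because they require separate neural-network training code. The variable names (`weight_matrices`, `X_nonneg`) are illustrative, and `X` is the simulated matrix from the sketch above.

```python
from sklearn.decomposition import NMF, PCA, FastICA

# Simplified sketch of the model sweep (linear methods only).
# NMF requires non-negative input, so the data are min-shifted first.
X_nonneg = X - X.min()

weight_matrices = {}
for k in range(1, 7):  # latent dimensionalities k = 1 through 6
    for name, model in [
        ("pca", PCA(n_components=k)),
        ("ica", FastICA(n_components=k, max_iter=2000)),
        ("nmf", NMF(n_components=k, init="nndsvda", max_iter=2000)),
    ]:
        model.fit(X_nonneg if name == "nmf" else X)
        # components_ is the (k x 10) weight matrix holding the contribution
        # of each raw input feature to each compressed feature.
        weight_matrices[(name, k)] = model.components_
```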
## Results



We observe that most algorithms (for k = 2 and larger) capture both groups of artificial signals.
As expected, PCA models identify the same solution across latent dimensionalities, and the **compressed feature number** associated with each signal does not change.
Furthermore, the top compressed feature (`pca_0`) isolates signal from the more highly correlated group, while `pca_1` represents the second, less correlated group.
We see a similar pattern (although flipped by artificial signal strength) in NMF models.

We observe that most ICA, DAE, and VAE models successfully isolate both artificial signals.
DAE with k = 5 and k = 6, and VAE with k = 4 and k = 6, are the only suboptimal models.
Across these three algorithms (ICA, DAE, and VAE), the **top compressed feature number** is also not consistent.
For example, for VAE k = 5, the top feature isolating signal group 1 is VAE compressed feature 2, while VAE compressed feature 1 represents signal group 2.
We see a similarly randomized pattern for other initializations of these models; the snippet below shows one way to tabulate these top-feature assignments.
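
As a hypothetical illustration of how these assignments could be tabulated, the helper below takes a weight matrix from the sweep sketched earlier and reports which compressed feature number loads most heavily on each signal group (compressed features are 0-indexed, matching the `pca_0` / `pca_1` naming).

```python
import numpy as np

def top_compressed_feature(weights, group):
    """Return the compressed feature number with the largest total
    absolute weight across the given group of raw input features."""
    return int(np.argmax(np.abs(weights[:, group]).sum(axis=1)))

for (name, k), w in sorted(weight_matrices.items()):
    g1 = top_compressed_feature(w, [0, 1, 2])  # signal group 1
    g2 = top_compressed_feature(w, [4, 5, 6])  # signal group 2
    print(f"{name} k={k}: group 1 -> {name}_{g1}, group 2 -> {name}_{g2}")
```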
## Conclusion

For algorithms other than PCA (that is, algorithms that rely on random initialization before training), the compressed feature number is unrelated to the variance explained by the model.
In these models, the compressed feature number does not contain information about the importance of the extracted representation.