docs/final_presentation/final_report.Rmd
6 additions & 3 deletions
@@ -45,12 +45,15 @@ Answering these questions has provided some insights that may inform the develop

 ### Unsupervised Learning

-Unsupervised learning is a realm of data science methods used to find previously unknown patterns in a data set without explicit labels. By learning features from the data, these types of algorithms allows us to group similar data points together based on similarity. Unsupervised learning requires a set of numeric values, therefore we must first convert our graph data into a more appropriate format. In the same way documents are built up of sentences and words, graphs are made up of subgraphs and nodes. Knowing this, we took inspiration from Natural Language Processing’s (NLP) concept of word embeddings. After some research, we settled on a Doc2Vec based algorithm called Graph2Vec^[4]^. In the same way that Doc2Vec can take a collection of documents and generate a vector of word embeddings, Graph2Vec can take a collection of graphs and generate a vector of graph embeddings. The generation of this embeddings means that we transform our graphs into a more contextual representation of their structure.
+Unsupervised learning is a family of data science methods used to find previously unknown patterns in a data set without explicit labels. By learning features from the data, these algorithms allow us to group similar data points together. GIVE SIMPLE EXAMPLE HERE OF AN APPLICATION TO YOUR DATA (E.G., GROUPING SIMILAR GIT COMMIT GRAPHS TOGETHER).
+
+
+The most common and widely used unsupervised learning methods require a matrix of numeric values, so we must first convert our graph data into a more appropriate format. In the same way that documents are built up of sentences and words, graphs are made up of subgraphs and nodes. Knowing this, we took inspiration from the concept of word embeddings in Natural Language Processing (NLP). After some research, we settled on a Doc2Vec-based algorithm called Graph2Vec^[4]^. In the same way that Doc2Vec can take a collection of documents and generate an embedding vector for each document, Graph2Vec can take a collection of graphs and generate an embedding vector for each graph. Generating these embeddings transforms our graphs into a matrix of numeric values, a more useful format for the unsupervised learning methods we intended to use.


 ### Cluster Analysis

-After generating embeddings for each graph in our sample, we used K-Means Clustering and various metrics such as the AIC, BIC and gap statistic to choose an optimal number of clusters. We ultimately settled on 19. The results of this clustering is shown below in a dimensionality reduced T-SNE plot.
+K-Means clustering is an unsupervised learning method that works by... (explain in simple terms and justify your choice of it). After generating embeddings for each graph in our sample, we used K-Means clustering and metrics such as the AIC, BIC and gap statistic to choose an optimal number of clusters (Supplemental Table 1). We ultimately settled on 19. The results of this clustering are shown below in a dimensionality-reduced t-SNE plot.



@@ -62,7 +65,7 @@ The above shows very clear clusters indicating that there are clear groups and d

 *Fig 3: Radial Language Plots of Global Clusters*

-We hypothesized that different types of git users such as Software Developers and Data Scientists would have fundamentally different usage patterns. Although the homogenous view of the languages above disproves that theory. At this point we took a step back and examined the number of commits within our sampled projects.
+We hypothesized that different types of git users, such as Software Developers and Data Scientists, would have fundamentally different usage patterns. However, the fairly homogeneous spread of programming languages across the clusters indicates that this is not the case. At this point we took a step back and examined the number of commits within our sampled projects.
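
The clustering step described under Cluster Analysis can be illustrated with a minimal sketch. It is not the report's code. The 2-D toy points stand in for the Graph2Vec embeddings, the names `kmeans`, `farthest_point_init` and `make_toy_embeddings` are hypothetical, the farthest-point initialisation is an assumption chosen to keep the run deterministic, and the within-cluster sum of squares is printed only as a simple stand-in for the AIC, BIC and gap statistic mentioned in the report.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start (an assumption of this sketch): begin with the
    first point, then repeatedly add the point farthest from all centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels, inertia), where
    inertia is the within-cluster sum of squared distances."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return centroids, labels, inertia


def make_toy_embeddings(seed=0):
    """Three well-separated 2-D blobs of 20 points each (toy data only)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
    return [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centers
        for _ in range(20)
    ]


embeddings = make_toy_embeddings()
# Compare candidate cluster counts by their within-cluster sum of squares.
inertia_by_k = {k: kmeans(embeddings, k)[2] for k in range(1, 7)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On this toy data the printed value should fall steeply until k reaches the number of blobs and change little afterwards, which is the kind of signal a cluster-count criterion is meant to capture. Real embeddings are rarely this clean, so more formal criteria such as those named in the report are usually preferable.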