Commit 19077d0

Suggested edits and comments on question 1
1 parent a78aacb commit 19077d0

1 file changed: +3 -5 lines changed

docs/final_presentation/final_report.Rmd

Lines changed: 3 additions & 5 deletions
@@ -72,19 +72,17 @@ We hypothesized that different types of git users such as Software Developers an

 *Fig 4: Number of Commits of Global Clusters*

-Looking at the above figure, we can see that only a very small portion of projects reach a significant number of commits. We also notice that large projects tend to fall within the top right corner of the T-SNE embedding space.
+Looking at the above figure, we can see that only a very small portion of projects have > 100 commits. We also notice that large projects tend to fall within the top right corner of the T-SNE embedding space. WHAT DOES THAT MEAN? EXPLAIN HERE. ALSO, ARE THEY ALL IN ONE CLUSTER (AS DEFINED BY KMEANS) OR ACROSS A FEW (AND IF SO, WHICH FEW)?

 ### Clustering of > 100 commits

-Using the insights from our analysis that the majority of projects had less than 100 commits, we decided to focus our efforts on larger projects. Clustering again on projects with more than 100 commits results in the following clustering process represented in the T-SNE plot below.
+Using the insights from our analysis that the majority of projects had less than 100 commits, we decided to focus our efforts on larger projects. WHY? JUSTIFY (I.E. MORE COMMITS LIKELY MEANS MORE INTERESTING? REAL PROJECT?). Clustering again on projects with more than 100 commits results in the following clustering process represented in the T-SNE plot below.

 ![](imgs/blob.png)

 *Fig 5: Clustering of Projects > 100 Commits*

-The above plot does not show clear cluster boundaries in projects with greater than 100 commits. This is indicative of high variability in the workflows of large projects. A possible interpretation of this result is that every workflow is represented and that there are no identifiable workflows in large projects using our current method. This leads us to conclude in this question that there are identifiable workflow patterns although they are not that useful. At the global level we can see that the majority of projects fall into short, single chain projects. This tells us that the main workflow on Github is to just commit to master and work alone. This is backed up by the fact that out of 36.4 million projects, 48% have only a single commit, and 85% have only a single author.
-
-For the rest of our analysis, we just looked at projects with >100 commits.
+The above plot does not show clear cluster boundaries in projects with greater than 100 commits. This is indicative of an absence of distinct subgroups of ~~high variability~~ workflows within large projects. COMMENT ON WHETHER THIS IS DUE TO ONE LARGE SPREADOUT CLUSTER OR ONE TIGHT NARROW CLUSTER (*Total within sum of square and between sum of squares should answer this*). A possible interpretation of this result is that every workflow is represented (and represented with roughly equal frequency) and that there are no identifiable distinct sub-types of workflows in large projects that could be identified using our current method. This leads us to conclude in this question that there are identifiable workflow patterns although they are not that useful (NOT USEFUL FOR WHAT?). At the global level we can see that the majority of projects fall into short, single chain projects (WHERE IS THE DATA DEMONSTRATING THESE ARE SINGLE CHAINS?). This tells us that the main workflow on Github is to just commit to master and work alone. This is backed up by the fact that out of 36.4 million projects we sampled, 48% have only a single commit, and 85% have only a single author. For the rest of our analysis, we just looked at projects with >100 commits and considered them one homogenous group.

 ## What are common subgraphs that account for a large fraction of everyday use?

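The comment on the added line 85 asks for the total within-cluster and between-cluster sums of squares. As a rough illustration of what those quantities are, here is a minimal sketch in plain Python. It is not the report's actual analysis code; the function name `sum_of_squares` and the toy points are hypothetical, and it assumes each project is a numeric feature vector with a k-means label already assigned.

```python
from collections import defaultdict


def sum_of_squares(points, labels):
    """Return (total_within_ss, between_ss, total_ss) for labelled points.

    points: list of equal-length numeric tuples (hypothetical feature vectors)
    labels: list of cluster labels, one per point
    """
    dim = len(points[0])
    n = len(points)
    grand_mean = [sum(p[d] for p in points) / n for d in range(dim)]

    # Group points by their assigned cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    within = 0.0
    between = 0.0
    for members in clusters.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        # Squared distance of each member to its own centroid.
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
        # Squared distance of the centroid to the grand mean, weighted by size.
        between += size * sum((centroid[d] - grand_mean[d]) ** 2 for d in range(dim))

    total = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in range(dim))
    return within, between, total


# Toy data: two tight groups far apart (illustrative values only).
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
within, between, total = sum_of_squares(toy_points, toy_labels)
print("between / total =", between / total)
```

A between/total ratio close to 1 would suggest compact, well-separated clusters. A ratio close to 0 would suggest that k-means is partitioning one diffuse cloud. The magnitude of the total sum of squares gives a sense of how spread out that cloud is. These values are probably more meaningful when computed on the features that were clustered rather than on the T-SNE coordinates, since T-SNE does not preserve global distances.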
