-For this study we had the hypothesis that there were common Git workflows that account for a large fraction of everyday use. Our project aims to identify these workflows, with the end goal of using our understanding of these workflows to provide recommendations for features that should or should not be included in an easy-to-use Git alternative.
+Our project aims to identify common Git workflows, with the end goal of using our understanding of these workflows to recommend features that should or should not be included in an easy-to-use Git alternative. Our working hypothesis was that a set of common Git workflows accounts for a large fraction of everyday use.
# Introduction
Git is a version control system used to record how files change over time^[1]^. Many people use Git for tracking individual work and as a tool for collaboration. However, users ranging from novices to experts have argued that the tool is not user friendly and needs to be improved.
RStudio is interested in developing a new tool for Git users that improves and consolidates common Git workflows. Our partner, Dr. Greg Wilson from RStudio, suggested that to understand what should be included in the alternative tool, data analysis should be performed first on what is currently being done by Git users. This is where our project comes in.
-We get our data from GitHub Torrent, which mines the GitHub API to track all public GitHub repositories and makes it available as a database. To build the data structure of repositories, we get the commit history and use a Python package called NetworkX^[2]^ to transform the data into a Directed Acyclic Graph (DAG), where:
+Our data was sourced from GitHub Torrent, which mines the GitHub API to track all public GitHub repositories and makes the data available as a database. From this database we created a data set in which the observational unit was a GitHub repository. To do this, we retrieved the commit history for XXX GitHub repositories and, for each repository, used the Python NetworkX^[2]^ package to transform the data into a Directed Acyclic Graph (DAG), where:
- Each graph represents one repository;
- Each node in the graph is one commit;
- Each directed edge in the graph is a connection from one commit to the next (in chronological order).
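
The graph construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `commits` records, their `(sha, parents)` layout, and the helper name `build_commit_dag` are assumptions made for the example, and real commit histories would come from the GitHub Torrent tables.

```python
import networkx as nx

# Hypothetical commit records for one repository: (commit_sha, [parent_shas]).
# The shas and the record layout are illustrative assumptions.
commits = [
    ("a1", []),            # root commit
    ("b2", ["a1"]),        # commit on one branch
    ("c3", ["a1"]),        # commit on a second branch
    ("d4", ["b2", "c3"]),  # merge commit with two parents
]


def build_commit_dag(commit_records):
    """Build one directed graph per repository: nodes are commits,
    edges point from a parent commit to its child (chronological order)."""
    graph = nx.DiGraph()
    for sha, parents in commit_records:
        graph.add_node(sha)
        for parent in parents:
            graph.add_edge(parent, sha)
    return graph


dag = build_commit_dag(commits)

# A commit history should never contain a cycle.
assert nx.is_directed_acyclic_graph(dag)
```

Pointing edges from parent to child matches the chronological direction stated in the list; Git itself stores the reverse link (child to parent), so the direction is a modelling choice.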
-We also query other data tables from GitHub Torrent for important features such as authors, programming language, code reviews, etc., to support deeper analysis.
+We also queried other data tables from GitHub Torrent for important features, such as authors, programming language, and code reviews, to support deeper analysis.

*Fig 1: Data Transformation*
-With this project, we aim to answer two fundamental questions that can enable the development of the new tool. By studying the Git repositories as graphs, along with the features for each repository, we try to identify common patterns in the graphs for specific user groups.
+With this project, we aimed to answer two fundamental questions that could inform the development of the new tool:
-- The first question we aim to answer is **"Are there identifiable workflow patterns in the way people use Git?"**. This question will enable us to understand how different workflows are used in different contexts. To answer this question, we identify the patterns by analyzing the complete graphs of each repo.
-- The second question we aim to answer is **"What are common subgraphs that account for a large fraction of everyday use?"**. With this question we want to see if we can confirm that users follow workflows such as the Gitflow or if they follow other common workflows that are more intuitive for them. We extract subgraphs of certain lengths to find out if there are certain sub-patterns appear to be common among users.
+1. The first question we aimed to answer was **"Are there identifiable workflow patterns in the way people use Git?"** We anticipated that answering this question would help us understand how different workflows are used in different contexts. To answer it, we worked to identify distinct subgroups within our sample of GitHub repositories based on the complete graph of each repository.
-By answering these questions we will gain insights that will enable the development of a new tool that improves and consolidates workflows for users of Version Control Systems.
+2. The second question we aimed to answer was **"What are common subgraphs that account for a large fraction of everyday use?"** With this question we wanted to see whether distinct subgroups of users follow established workflows such as Gitflow, or other common workflows that are more intuitive to them. To answer this question, we extracted subgraphs of defined lengths and studied whether certain sub-patterns appear to be distinct and common among users.
+Answering these questions provided insights that may inform the development of a new tool that improves and consolidates workflows for users of version control systems. It also led us to specific recommendations for additional studies that would improve our understanding of how people use Git.
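
One way the subgraph extraction mentioned in the second question might be approached is sketched below. The report does not specify how subgraphs were extracted or compared, so everything here is an assumption for illustration: each subgraph is taken as the commits reachable within `radius` steps of a starting commit, and subgraphs are grouped by a sorted (in-degree, out-degree) signature, which is a coarse stand-in for a true isomorphism check. The function names and the toy graph are hypothetical.

```python
from collections import Counter

import networkx as nx


def degree_signature(subgraph):
    """Coarse shape fingerprint: sorted (in-degree, out-degree) pairs.
    Different shapes can share a signature, so this is only a rough proxy."""
    return tuple(
        sorted((subgraph.in_degree(n), subgraph.out_degree(n)) for n in subgraph)
    )


def count_subgraph_patterns(graph, radius):
    """Count the signatures of the subgraphs reachable within `radius`
    steps (following edge direction) from every commit in the graph."""
    counts = Counter()
    for node in graph:
        sub = nx.ego_graph(graph, node, radius=radius)
        counts[degree_signature(sub)] += 1
    return counts


# Toy history (assumed data): a branch, a merge, then one more commit.
toy = nx.DiGraph(
    [("a1", "b2"), ("a1", "c3"), ("b2", "d4"), ("c3", "d4"), ("d4", "e5")]
)
counts = count_subgraph_patterns(toy, radius=2)

# Linear three-commit chains start at both b2 and c3.
linear_chain = ((0, 1), (1, 0), (1, 1))
assert counts[linear_chain] == 2
```

A stricter comparison could replace the degree signature with a graph-isomorphism test, at a higher computational cost per repository.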
# Data Science Methods
@@ -169,4 +169,4 @@ The project had the objective of understanding if there were identifiable workfl