Commit fa152de

authored
Merge pull request #9 from UBC-MDS/presentation_structure
Proposal Presentation
2 parents 237cf4b + e9beec8 commit fa152de

22 files changed

+894
-0
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 *DS_Store*
 *Rhistory*
 *.json
+.Rproj.user

RStudio-GitHub-Analysis.Rproj

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 4
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
File renamed without changes.

docs/imgs/branch_test1.png

25.6 KB

docs/imgs/degree_distribution.png

5.77 KB

docs/imgs/dna_encoding.png

16.3 KB

docs/imgs/dna_image.png

99.9 KB

docs/imgs/dna_matrix.png

37.6 KB

docs/imgs/g2v_clusters.png

95.2 KB

docs/imgs/g2v_flow.png

102 KB

docs/imgs/g2v_flow2.png

26.9 KB

docs/imgs/g2v_paper.png

116 KB

docs/imgs/g2v_repo.png

97.5 KB

docs/imgs/graph_1.png

8.68 KB

docs/imgs/graph_2.png

8.98 KB

docs/imgs/sub2vec.png

287 KB

docs/imgs/wl_kernel.png

250 KB
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
+@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
+@import url(https://fonts.googleapis.com/css?family=Source+Code+Pro:400,700);
+
+body { font-family: 'Droid Serif', 'Palatino Linotype', 'Book Antiqua', Palatino, 'Microsoft YaHei', 'Songti SC', serif; }
+h1, h2, h3 {
+  font-family: 'Yanone Kaffeesatz';
+  font-weight: normal;
+}
+.remark-code, .remark-inline-code { font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; }

docs/libs/remark-css/default.css

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+a, a > code {
+  color: rgb(249, 38, 114);
+  text-decoration: none;
+}
+.footnote {
+  position: absolute;
+  bottom: 3em;
+  padding-right: 4em;
+  font-size: 90%;
+}
+.remark-code-line-highlighted { background-color: #ffff88; }
+
+.inverse {
+  background-color: #272822;
+  color: #d6d6d6;
+  text-shadow: 0 0 20px #333;
+}
+.inverse h1, .inverse h2, .inverse h3 {
+  color: #f3f3f3;
+}
+/* Two-column layout */
+.left-column {
+  color: #777;
+  width: 20%;
+  height: 92%;
+  float: left;
+}
+.left-column h2:last-of-type, .left-column h3:last-child {
+  color: #000;
+}
+.right-column {
+  width: 75%;
+  float: right;
+  padding-top: 1em;
+}
+.pull-left {
+  float: left;
+  width: 47%;
+}
+.pull-right {
+  float: right;
+  width: 47%;
+}
+.pull-right ~ * {
+  clear: both;
+}
+img, video, iframe {
+  max-width: 100%;
+}
+blockquote {
+  border-left: solid 5px lightgray;
+  padding-left: 1em;
+}
+.remark-slide table {
+  margin: auto;
+  border-top: 1px solid #666;
+  border-bottom: 1px solid #666;
+}
+.remark-slide table thead th { border-bottom: 1px solid #ddd; }
+th, td { padding: 5px; }
+.remark-slide thead, .remark-slide tfoot, .remark-slide tr:nth-child(even) { background: #eee }
+
+@page { margin: 0; }
+@media print {
+  .remark-slide-scaler {
+    width: 100% !important;
+    height: 100% !important;
+    transform: scale(1) !important;
+    top: 0 !important;
+    left: 0 !important;
+  }
+}

docs/proposal_presentation.Rmd

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
+---
+title: "What the Git is going on here? <br>"
+subtitle: "<br>RStudio Capstone Project Proposal"
+#author: "Juno Chen, Ian Flores Siaca, Rayce Rossum, Richie Zitomer"
+date: "2019/04/24"
+output:
+  xaringan::moon_reader:
+    lib_dir: libs
+    css: xaringan-themer.css
+    nature:
+      highlightStyle: github
+      highlightLines: true
+      countIncrementalSlides: false
+---
+
+class: inverse, center, middle
+
+# Introduction
+
+```{r setup, include=FALSE}
+options(htmltools.dir.version = FALSE)
+library(xaringanthemer)
+duo(primary_color = "#D8CEC5", secondary_color = "#49475B")
+```
+
+---
+# Introduction
+
+- Git is a version control system that tracks changes to files
+- People use Git to collaborate, from software engineering (SE) to data science (DS)
+- However, when using Git we might encounter some problems
+
+--
+
+<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/1.png' height='350'>
+
+---
+# Introduction
+
+<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/2.png' height='350'>
+
+---
+# Introduction
+
+- RStudio is interested in developing a new tool for Git users
+- For this, we want to understand how people use Git
+  - What works for workflows
+  - What is hindering workflows
+  - **What are those workflows?**
+
+- We only have data to answer one of these questions
+  - Access to commit history
+
+---
+# Introduction - Getting the data
+
+- GitHub API
+  - Sampling & Rate Limiting
+- GHTorrent (GitHub Torrent)
+  - Mines the GitHub API for all the latest pushes
+  - Tracks all of the repos and makes them available in a MySQL database
+  - This amounts to 4 TB of data overall
+--
+
+![](https://cdn-images-1.medium.com/max/1200/1*A8liBoeAwAZg7rDu394jYg.png)
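The "Sampling & Rate Limiting" point can be made concrete with a small helper. This is a minimal sketch, not the project's code: it assumes the response headers are available as a plain dict keyed by `X-RateLimit-Remaining` and `X-RateLimit-Reset` (the rate-limit headers returned by the GitHub REST API), and the function name `seconds_until_reset` is hypothetical.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before sending the next request.

    `headers` is assumed to be a plain dict of response headers.
    `X-RateLimit-Reset` is a UTC epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    return max(0.0, float(reset_at - now))
```

A sampling loop could call this after each response and sleep for the returned duration before continuing.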
+
+---
+
+# Introduction - Getting the data
+
+- Multiple tables containing information about projects, commits, users, issues, etc.
+- Pipeline process:
+  - Sample 1 million projects from the DB
+  - Get the commits for all the projects
+  - Get the parents of the commits for all the projects
+  - Save to buckets for export and storage
+- Reproducibility in scope
+  - SQL Versioning
+  - Data Versioning
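The first three pipeline steps can be sketched against a toy database. Everything here is an assumption made for illustration: the in-memory SQLite database stands in for the MySQL database, the three tables (`projects`, `commits`, `commit_parents`) are a simplified, hypothetical schema, and the helper names are invented. The export-to-buckets step is left out.

```python
import sqlite3

# Toy stand-in for the real database (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE commits (id INTEGER PRIMARY KEY, project_id INTEGER, sha TEXT);
CREATE TABLE commit_parents (commit_id INTEGER, parent_id INTEGER);
INSERT INTO projects VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
INSERT INTO commits VALUES
    (10, 1, 'a1'), (11, 1, 'a2'), (12, 1, 'a3'),
    (20, 2, 'b1'), (21, 2, 'b2');
INSERT INTO commit_parents VALUES (11, 10), (12, 11), (21, 20);
""")


def sample_projects(conn, n):
    """Step 1: draw a random sample of n project ids."""
    rows = conn.execute(
        "SELECT id FROM projects ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()
    return [row[0] for row in rows]


def commit_edges(conn, project_ids):
    """Steps 2 and 3: return (project_id, parent_id, commit_id) rows."""
    if not project_ids:
        return []
    placeholders = ",".join("?" for _ in project_ids)
    query = f"""
        SELECT c.project_id, cp.parent_id, cp.commit_id
        FROM commits AS c
        JOIN commit_parents AS cp ON cp.commit_id = c.id
        WHERE c.project_id IN ({placeholders})
        ORDER BY c.project_id, cp.commit_id
    """
    return conn.execute(query, list(project_ids)).fetchall()
```

Calling `commit_edges(conn, sample_projects(conn, 2))` yields one parent-to-child edge per row, which is the input the graph representation needs.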
+
+---
+
+# Introduction - Data Structure
+
+- How do we represent a history of commits?
+
+--
+
+#### Graphs
+- A Git history is not just any type of graph; it is a Directed Acyclic Graph (DAG)
+- Nodes/Vertices --> Commits
+- Edges --> Connection from one commit to another
+
+<img src='https://upload.wikimedia.org/wikipedia/commons/c/c6/Topological_Ordering.svg' height='300'>
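To make the DAG representation concrete, here is a minimal sketch that stores a commit history as a mapping from each commit to its parents and derives a topological order (Kahn's algorithm). The commit names and the `topological_order` helper are hypothetical, and the sketch assumes every parent also appears as a key of the mapping.

```python
from collections import deque


def topological_order(parents):
    """Order commits so that every parent comes before its children.

    `parents` maps each commit to the list of its parent commits.
    Raises ValueError if the history contains a cycle (not a DAG).
    """
    children = {commit: [] for commit in parents}
    indegree = {commit: 0 for commit in parents}
    for commit, commit_parents in parents.items():
        for parent in commit_parents:
            children[parent].append(commit)
            indegree[commit] += 1

    # Start from root commits (no parents); sort for a stable result.
    queue = deque(sorted(c for c, degree in indegree.items() if degree == 0))
    order = []
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for child in sorted(children[commit]):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(order) != len(parents):
        raise ValueError("cycle detected: not a DAG")
    return order


# A branch (B, C) that is merged back in D.
history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
```

Here `D` has two parents, which is how a merge commit appears in the graph.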
+
+---
+
+# Introduction - EDA: Simple Repo
+
+![](imgs/branch_test1.png)
+
+---
+# Introduction - EDA: Complex Repo
+
+
+![](imgs/branch_test.png)
+
+---
+# Introduction - Questions
+
+- With the goal of designing a new tool to fix issues with Git, and given the data we have available, we try to answer two questions:
+
+--
+
+### What are common sub-patterns in the way people use Git?
+
+--
+
+### What are workflow patterns across Git repositories?
+
+---
+
+class: inverse, center, middle
+# Analysis
+## What are common sub-patterns in the way people use Git?
+
+---
+
+## Inspiration - genetic data
+
+- compared to the Git workflow representation
+
+- similarity: sequential, i.e. directed
+
+- difference: DNA has fixed length and a fixed set of values (so one-hot encoding applies)
+
+![](imgs/dna_matrix.png)
+
+![](imgs/dna_encoding.png)
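One-hot encoding works for DNA because the alphabet is fixed. A minimal sketch, assuming the four-letter alphabet `ACGT` and a hypothetical helper name `one_hot`:

```python
ALPHABET = "ACGT"


def one_hot(sequence, alphabet=ALPHABET):
    """Encode each symbol as a 0/1 row with a single 1 at its alphabet index."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1
        rows.append(row)
    return rows
```

A commit graph has no such fixed alphabet or fixed length, which is why this encoding does not carry over directly.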
+
+---
+
+## Inspiration - genetic data
+
+- current trend in the study of genetic data
+
+- DeepVariant
+
+- converting DNA sequences to images and feeding them through a convolutional neural network
+
+![](imgs/dna_image.png)
+
+[Source: https://blog.floydhub.com/exploring-dna-with-deep-learning/]
+
+---
+
+## Inspiration - social network analysis (SNA)
+
+- compared to the Git workflow representation
+
+- similarity: directed
+
+- difference: the goal of SNA is to predict whether links exist
+
+- what we can learn from it
+
+- the first step of SNA: learning structural features of a connected graph
+
+- using sequence-generating algorithms, e.g. node2vec
+
+[Source: http://terpconnect.umd.edu/~kpzhang/paper/INFOCOMM2018.pdf]
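The sequence-generating step of node2vec is a second-order biased random walk. The sketch below is a simplified illustration under stated assumptions: the graph is unweighted and stored as a dict of neighbor sets, the toy graph is invented, and `p` and `q` are the return and in-out parameters (a candidate is weighted `1/p` if it is the previous node, `1` if it neighbors the previous node, and `1/q` otherwise).

```python
import random


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one node sequence with a node2vec-style biased walk."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end, stop early
        if len(walk) == 1:
            # The first step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph with symmetric (undirected) adjacency.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
```

The resulting walks play the role of sentences: they would be passed to a skip-gram model to learn one vector per node.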
+---
+
+## Approach - `Node2vec`
+
+.pull-left[- Samples the network neighborhood of each node using biased random walks
+- Based on `Weisfeiler-Lehman Graph Kernels`
+- iterate over nodes and edges, relabel and group, represent the features as a vector]
+
+.pull-right[![](imgs/wl_kernel.png)]
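The "relabel and group" step of a Weisfeiler-Lehman kernel can be sketched as follows. This is an illustrative simplification, not the method's reference implementation: it assumes an undirected graph given as an adjacency dict, keeps the full signature string as the new label instead of compressing it to a short id, and uses a hypothetical function name `wl_features`.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Count every label seen during Weisfeiler-Lehman relabeling.

    At each iteration a node's new label combines its own label with the
    sorted labels of its neighbors. The returned Counter is the feature
    vector in sparse form.
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            neighbor_labels = sorted(current[n] for n in neighbors)
            updated[node] = current[node] + "(" + ",".join(neighbor_labels) + ")"
        current = updated
        features.update(current.values())
    return features


# Two hypothetical graphs with identical starting labels.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
```

Two graphs can then be compared through their feature counts, for example with a dot product over shared labels.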
+
+---
+
+## Approach - `sub2vec`
+
+- learn a feature representation of each subgraph that preserves its properties in the latent feature space
+
+- preserve two properties
+
+- `Neighborhood`: neighborhood information of all the nodes, sets of all paths (annotated by node IDs)
+
+- `Structural`: the subgraph structure (clique, degree, size of subgraph)
+
+--
+
+- advantage: better accuracy, incorporates the properties of entire subgraphs
+
+- disadvantage: assumes unweighted, undirected graphs, but can be extended
+
+![](imgs/sub2vec.png)
+
+[Source: https://link.springer.com/chapter/10.1007/978-3-319-93037-4_14]
+
+---
+
+## Approach - Motifs
+
+- What is a Motif?
+
+- A subgraph that occurs in a network much more frequently than expected by random chance
+
+.pull-left[<img src="imgs/graph_1.png" width="250" /> <img src="imgs/graph_2.png" width="250" />]
+.pull-right[<img src="imgs/degree_distribution.png" width="500" />]
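The motif definition suggests a simple test: count a candidate subgraph in the observed network and compare the count against random graphs. The sketch below makes several assumptions for illustration: the candidate motif is a triangle in an undirected graph, the null model is a uniformly random graph with the same number of nodes and edges (a simpler choice than the degree-preserving randomization often used), and all function names are hypothetical.

```python
import itertools
import random
import statistics


def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of node pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    total = 0
    for u, v, w in itertools.combinations(sorted(adjacency), 3):
        if v in adjacency[u] and w in adjacency[u] and w in adjacency[v]:
            total += 1
    return total


def random_graph(nodes, n_edges, rng):
    """Draw n_edges distinct node pairs uniformly at random."""
    possible = list(itertools.combinations(nodes, 2))
    return rng.sample(possible, n_edges)


def motif_z_score(edges, n_random=200, seed=0):
    """Return (observed count, mean random count, z-score or None)."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    observed = count_triangles(edges)
    baseline = [
        count_triangles(random_graph(nodes, len(edges), rng))
        for _ in range(n_random)
    ]
    mean = statistics.mean(baseline)
    spread = statistics.pstdev(baseline)
    # The z-score is undefined when the random counts do not vary.
    z = (observed - mean) / spread if spread > 0 else None
    return observed, mean, z


# Hypothetical network made of two separate triangles.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
```

A large positive z-score would indicate that the subgraph is over-represented relative to the chosen null model.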
+
+
+---
+class: inverse, center, middle
+# Analysis
+## What are workflow patterns across Git repositories?
+
+---
+
+## Graph2Vec Background
+
+> "[Node2Vec and Sub2Vec] only model local similarity within a confined neighborhood and fails to learn global structural similarities that help to classify similar graphs together"
+
+--
+
+> "a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs."
+
+--
+
+> "graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic."
+
+---
+## Graph2Vec Background
+
+![](imgs/g2v_flow2.png)
+
+[Source: https://arxiv.org/pdf/1707.05005.pdf]
+---
+
+## Clustering Embeddings from Graph2Vec Model
+
+![](imgs/g2v_clusters.png)
+
+[Source: https://www.datascience.com/blog/k-means-clustering]
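Clustering the embeddings could look like the following k-means sketch. It is a simplified illustration rather than the planned implementation: the two-dimensional `embeddings` are invented stand-ins for Graph2Vec output, and centroids are initialized with a deterministic farthest-point rule (an assumption chosen to keep the example reproducible) instead of random starts.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from all picks."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(points, k, max_iterations=50):
    """Return (labels, centroids) after Lloyd's iterations converge."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D embeddings forming two well-separated groups.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
]
```

Each cluster label would then correspond to a group of repositories with similar overall graph structure.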
+
+---
+## Graph2Vec Limitations
+
+> Graph2Vec currently works with undirected graphs, so we will have to modify it to support directed graphs.
+
+--
+
+> Graph2Vec only helps us address the question of workflow patterns across repositories (unless we can find a way to extract the learned subgraphs from the neural network).
+
+
+---
+
+# Projected Timeline
+
+| Milestone | Date |
+|---|---|
+| Proposal Presentation | 4/26 |
+| Proposal Report (to mentor) | 4/30 |
+| Proposal Report (to partner) | 5/3 |
+| End-to-end analysis | 5/10 |
+| Complete workflow patterns across Git repositories | 5/24 |
+| Choose best method for subgraph analysis | 5/31 |
+| Choose and demonstrate output from subgraph analysis | 6/7 |
+| Complete subgraph analysis | 6/14 |
+| Final Presentation | 6/17-18 |
+| Final Report (to mentor) | 6/21 |
+| Final Report (to partner) and Data Product | 6/26 |
+
+---
+class: inverse, middle
+
+# Acknowledgments
+
+- RStudio
+  - Greg Wilson
+
+- UBC-MDS Teaching Team
+  - Tiffany Timbers
+
+- UBC-MDS Students
