|
| 1 | +--- |
| 2 | +title: "What the Git is going on here? <br>" |
| 3 | +subtitle: "<br>RStudio Capstone Project Proposal" |
| 4 | +#author: "Juno Chen, Ian Flores Siaca, Rayce Rossum, Richie Zitomer" |
| 5 | +date: "2019/04/24" |
| 6 | +output: |
| 7 | + xaringan::moon_reader: |
| 8 | + lib_dir: libs |
| 9 | + css: xaringan-themer.css |
| 10 | + nature: |
| 11 | + highlightStyle: github |
| 12 | + highlightLines: true |
| 13 | + countIncrementalSlides: false |
| 14 | +--- |
| 15 | + |
| 16 | +class: inverse, center, middle |
| 17 | + |
| 18 | +# Introduction |
| 19 | + |
| 20 | +```{r setup, include=FALSE} |
| 21 | +options(htmltools.dir.version = FALSE) |
| 22 | +library(xaringanthemer) |
| 23 | +duo(primary_color = "#D8CEC5", secondary_color = "#49475B") |
| 24 | +``` |
| 25 | + |
| 26 | +--- |
| 27 | +# Introduction |
| 28 | + |
| 29 | +- Git is a version control system that tracks changes to files |
| 30 | +- People use Git to collaborate in fields from software engineering (SE) to data science (DS) |
| 31 | +- However, when using Git we might encounter some problems |
| 32 | + |
| 33 | +-- |
| 34 | + |
| 35 | +<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/1.png' height='350'> |
| 36 | + |
| 37 | +--- |
| 38 | +# Introduction |
| 39 | + |
| 40 | +<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/2.png' height='350'> |
| 41 | + |
| 42 | +--- |
| 43 | +# Introduction |
| 44 | + |
| 45 | +- RStudio is interested in developing a new tool for Git users |
| 46 | +- For this we want to understand how people use Git |
| 47 | + - What works for workflows |
| 48 | + - What is hindering workflows |
| 49 | + - **What are those workflows?** |
| 50 | + |
| 51 | +- We only have data to answer one of these questions |
| 52 | + - Access to commit history |
| 53 | + |
| 54 | +--- |
| 55 | +# Introduction - Getting the data |
| 56 | + |
| 57 | +- GitHub API |
| 58 | + - Sampling & Rate Limiting |
| 59 | +- GHTorrent |
| 60 | + - Mines the GitHub API for all the latest pushes |
| 61 | + - Tracks all of the repos and makes them available in a MySQL database |
| 62 | + - This means 4TB of overall data |
| 63 | +-- |
| 64 | + |
| 65 | + |
| 66 | + |
| 67 | +--- |
| 68 | + |
| 69 | +# Introduction - Getting the data |
| 70 | + |
| 71 | +- Multiple tables containing information about projects, commits, users, issues, etc. |
| 72 | +- Pipeline process: |
| 73 | + - Sample 1 million projects in the DB |
| 74 | + - Get the commits for all the projects |
| 75 | + - Get the parents of the commits for all the projects |
| 76 | + - Save to Buckets for export and storage |
| 77 | +- Reproducibility in scope |
| 78 | + - SQL Versioning |
| 79 | + - Data Versioning |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +# Introduction - Data Structure |
| 84 | + |
| 85 | +- How do we represent a history of commits? |
| 86 | + |
| 87 | +-- |
| 88 | + |
| 89 | +#### Graphs |
| 90 | +- A Git history is not just any graph; it is a Directed Acyclic Graph (DAG) |
| 91 | + - Nodes/Vertices --> Commits |
| 92 | + - Edges --> Connections from each commit to its parent commit(s) |
| 93 | + |
| 94 | +<img src='https://upload.wikimedia.org/wikipedia/commons/c/c6/Topological_Ordering.svg' height='300'> |
| 95 | + |
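A minimal sketch of the representation above, assuming a hypothetical toy history (the commit names `a` to `e` are made up for illustration). Each commit is a node and each edge links a commit to one of its parents. A topological order exists exactly when the graph has no cycles, which is what makes the history a DAG.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical toy history: each commit maps to its list of parent commits.
# "a" is the root commit and "e" is a merge commit with two parents.
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["c", "d"],
}


def topological_order(parents):
    """Return commits ordered so that every parent precedes its children."""
    # TopologicalSorter expects node -> predecessors, which matches
    # the commit -> parents mapping used here.
    return list(TopologicalSorter(parents).static_order())


def is_dag(parents):
    """Return True when the commit graph contains no cycle."""
    try:
        topological_order(parents)
        return True
    except CycleError:
        return False


order = topological_order(parents)
# Merge commits are the nodes with more than one parent.
merge_commits = [commit for commit, ps in parents.items() if len(ps) > 1]
```

Storing the parents of each commit (rather than the children) mirrors how Git itself records history, so merge commits fall out as the nodes with more than one parent.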
| 96 | +--- |
| 97 | + |
| 98 | +# Introduction - EDA: Simple Repo |
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | +--- |
| 103 | +# Introduction - EDA: Complex Repo |
| 104 | + |
| 105 | + |
| 106 | + |
| 107 | + |
| 108 | +--- |
| 109 | +# Introduction - Questions |
| 110 | + |
| 111 | +- Given the goal of designing a new tool to fix issues with Git, and the data we have available, we try to answer two questions: |
| 112 | + |
| 113 | +-- |
| 114 | + |
| 115 | +### What are common sub-patterns in the way people use Git? |
| 116 | + |
| 117 | +-- |
| 118 | + |
| 119 | +### What are workflow patterns across Git repositories? |
| 120 | + |
| 121 | +--- |
| 122 | + |
| 123 | +class: inverse, center, middle |
| 124 | +# Analysis |
| 125 | +## What are common sub-patterns in the way people use Git? |
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## Inspiration - genetic data |
| 130 | + |
| 131 | +- compared to Git workflow representation |
| 132 | + |
| 133 | + - similarity: both are sequences, i.e. directed |
| 134 | + |
| 135 | + - difference: genetic sequences have fixed length and fixed variation (so one-hot encoding can be applied) |
| 136 | + |
| 137 | +  |
| 138 | + |
| 139 | +  |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## Inspiration - genetic data |
| 144 | + |
| 145 | +- current trend in the study of genetic data |
| 146 | + |
| 147 | + - DeepVariant |
| 148 | + |
| 149 | + - converting DNA sequences to images and feeding them through a convolutional neural network |
| 150 | + |
| 151 | +  |
| 152 | + |
| 153 | +[Source: https://blog.floydhub.com/exploring-dna-with-deep-learning/] |
| 154 | + |
| 155 | +--- |
| 156 | + |
| 157 | +## Inspiration - social network analysis (SNA) |
| 158 | + |
| 159 | +- compared to Git workflow representation |
| 160 | + |
| 161 | + - similarity: directed |
| 162 | + |
| 163 | + - difference: the goal of SNA is to predict whether a link exists |
| 164 | + |
| 165 | +- can learn from |
| 166 | + |
| 167 | + - the first step of SNA: learning structural features of connected graph |
| 168 | + |
| 169 | + - using sequence generating algorithms: node2vec |
| 170 | + |
| 171 | +[Source: http://terpconnect.umd.edu/~kpzhang/paper/INFOCOMM2018.pdf] |
| 172 | +--- |
| 173 | + |
| 174 | +## Approach - `Node2vec` |
| 175 | + |
| 176 | +.pull-left[- Samples the network neighborhood of each node using biased random walks |
| 177 | +- Based on `Weisfeiler-Lehman Graph Kernels` |
| 178 | + - iterate nodes and edges, relabel and group, represent the features in a vector] |
| 179 | + |
| 180 | + .pull-right[] |
| 181 | + |
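A minimal sketch of the biased random walk that node2vec uses to sample neighborhoods, written with the standard library only. The toy graph and the function name `biased_walk` are hypothetical. The return parameter `p` and in-out parameter `q` follow the node2vec paper: stepping back to the previous node gets weight `1/p`, stepping to a node adjacent to the previous node gets weight `1`, and stepping further away gets weight `1/q`. The graph is treated as undirected here, which is an assumption and not how a commit DAG would necessarily be handled.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased random walk of up to `length` nodes."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move outward
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected toy graph as node -> set of neighbors.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
walk = biased_walk(adjacency, start="a", length=10, p=0.5, q=2.0)
```

In node2vec, many such walks are collected per node and then fed to a skip-gram model, much like sentences are fed to word2vec.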
| 182 | +--- |
| 183 | + |
| 184 | +## Approach - `sub2vec` |
| 185 | + |
| 186 | +- learn a feature representation of each subgraph, maximize properties in the latent feature space |
| 187 | + |
| 188 | +- preserve two properties |
| 189 | + |
| 190 | + - `Neighborhood`: neighborhood information of all the nodes, sets of all paths (annotated by node IDs) |
| 191 | + |
| 192 | + - `Structural`: the subgraph structure (clique, degree, size of subgraph) |
| 193 | + |
| 194 | +-- |
| 195 | + |
| 196 | +- advantage: better accuracy; incorporates the properties of entire subgraphs |
| 197 | + |
| 198 | +- disadvantage: assumes unweighted, undirected graphs, but can be extended |
| 199 | + |
| 200 | + |
| 201 | + |
| 202 | +[Source: https://link.springer.com/chapter/10.1007/978-3-319-93037-4_14] |
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## Approach - Motifs |
| 207 | + |
| 208 | +- What is a Motif? |
| 209 | + |
| 210 | + - A subgraph that occurs in a network much more frequently than expected by random chance |
| 211 | + |
| 212 | + .pull-left[<img src="imgs/graph_1.png" width="250" /> <img src="imgs/graph_2.png" width="250" />] |
| 213 | + .pull-right[<img src="imgs/degree_distribution.png" width="500" />] |
| 214 | + |
| 215 | + |
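A minimal sketch of the motif idea above: count one small pattern in a graph and compare that count with the counts in random graphs. The pattern chosen here (a "feed-forward" triple `a->b`, `b->c`, `a->c`), the toy edge list, and all function names are assumptions for illustration. The null model simply keeps the number of nodes and edges fixed; more careful null models also preserve the degree distribution, and the random graphs here are not guaranteed to be acyclic.

```python
import random
from itertools import permutations
from statistics import mean, pstdev


def count_feed_forward(edges):
    """Count node triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = sorted({n for edge in edge_set for n in edge})
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )


def random_edges(nodes, n_edges, rng):
    """Sample n_edges distinct directed edges without self-loops."""
    candidates = [(a, b) for a in nodes for b in nodes if a != b]
    return rng.sample(candidates, n_edges)


def motif_z_score(edges, n_random=200, seed=0):
    """Compare the observed pattern count with counts in random graphs."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    observed = count_feed_forward(edges)
    null_counts = [
        count_feed_forward(random_edges(nodes, len(edges), rng))
        for _ in range(n_random)
    ]
    spread = pstdev(null_counts)
    if spread == 0:
        return None  # no variation in the null model, z-score undefined
    return (observed - mean(null_counts)) / spread


# Hypothetical toy graph with two feed-forward triples.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
z = motif_z_score(toy_edges)
```

A large positive z-score suggests the pattern appears more often than chance would predict, which is the working definition of a motif on this slide.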
| 216 | +--- |
| 217 | +class: inverse, center, middle |
| 218 | +# Analysis |
| 219 | +## What are workflow patterns across Git repositories? |
| 220 | + |
| 221 | +--- |
| 222 | + |
| 223 | +## Graph2Vec Background |
| 224 | + |
| 225 | +> "[Node2Vec and Sub2Vec] only model local similarity within a confined neighborhood and fails to learn global structural similarities that help to classify similar graphs together" |
| 226 | + |
| 227 | +-- |
| 228 | + |
| 229 | +> "a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs." |
| 230 | + |
| 231 | +-- |
| 232 | + |
| 233 | +> "graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic." |
| 234 | + |
| 235 | +--- |
| 236 | +## Graph2Vec Background |
| 237 | + |
| 238 | + |
| 239 | + |
| 240 | +[Source: https://arxiv.org/pdf/1707.05005.pdf] |
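The graph2vec paper cited above extracts rooted subgraphs with Weisfeiler-Lehman relabeling, the same procedure summarized earlier as "iterate nodes and edges, relabel and group, represent the features in a vector". Below is a minimal sketch of that relabeling, assuming hypothetical toy graphs and a uniform starting label. Real implementations usually compress the labels (for example by hashing); plain strings are kept here so the result is easy to read.

```python
from collections import Counter


def wl_features(adj, labels, iterations=2):
    """Count the node labels produced by Weisfeiler-Lehman relabeling.

    adj maps each node to its neighbors; labels maps each node to an
    initial string label. The returned Counter is the feature vector.
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        relabeled = {}
        for node in adj:
            # New label = own label plus the sorted labels of the neighbors.
            neighbor_labels = sorted(current[n] for n in adj[node])
            relabeled[node] = current[node] + "(" + ",".join(neighbor_labels) + ")"
        current = relabeled
        features.update(current.values())
    return features


# Hypothetical toy graphs: a path and a star, each with four nodes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
# The same path with different node names.
renamed_path = {"w": ["x"], "x": ["w", "y"], "y": ["x", "z"], "z": ["y"]}

path_features = wl_features(path, {n: "c" for n in path})
star_features = wl_features(star, {n: "c" for n in star})
renamed_features = wl_features(renamed_path, {n: "c" for n in renamed_path})
```

Graphs with the same structure produce the same label counts regardless of node names, while the path and the star already differ after one iteration.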
| 241 | +--- |
| 242 | + |
| 243 | +## Clustering Embeddings from Graph2Vec Model |
| 244 | + |
| 245 | + |
| 246 | + |
| 247 | +[Source: https://www.datascience.com/blog/k-means-clustering] |
| 248 | + |
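A minimal sketch of the clustering step, assuming each repository has already been embedded as a numeric vector. The 2-D `embeddings` below are made-up values, not output from Graph2Vec, and the deterministic farthest-point initialization is a simplification chosen so the example is reproducible (standard k-means usually starts from random or k-means++ centroids).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Pick the first point, then repeatedly the point farthest from all picks."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    centroids = init_centroids(points, k)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Move every centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Hypothetical 2-D embeddings for six repositories.
embeddings = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
labels, centers = kmeans(embeddings, k=2)
```

Repositories that receive the same label would then be inspected together to see whether they share a recognizable workflow.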
| 249 | +--- |
| 250 | +## Graph2Vec Limitations |
| 251 | + |
| 252 | +> Graph2Vec currently works only with undirected graphs, so we will have to modify it to support directed graphs. |
| 253 | + |
| 254 | +-- |
| 255 | + |
| 256 | +> Graph2Vec only helps us address the question of workflow patterns across repositories (unless we can find a way to extract the learned subgraphs from the neural network). |
| 257 | + |
| 258 | + |
| 259 | +--- |
| 260 | + |
| 261 | +# Projected Timeline |
| 262 | + |
| 263 | +| Milestone | Date | |
| 264 | +|---|---| |
| 265 | +| Proposal Presentation | 4/26 | |
| 266 | +| Proposal Report (to mentor) | 4/30 | |
| 267 | +| Proposal Report (to partner) | 5/3 | |
| 268 | +| End-to-end analysis | 5/10 | |
| 269 | +| Complete workflow patterns across Git repositories | 5/24 | |
| 270 | +| Choose best method for subgraph analysis | 5/31 | |
| 271 | +| Choose and demonstrate output from subgraph analysis | 6/7 | |
| 272 | +| Complete subgraph analysis | 6/14 | |
| 273 | +| Final Presentation | 6/17-18 | |
| 274 | +| Final Report (to mentor) | 6/21 | |
| 275 | +| Final Report (to partner) and Data Product | 6/26 | |
| 276 | + |
| 277 | +--- |
| 278 | +class: inverse, middle |
| 279 | + |
| 280 | +# Acknowledgments |
| 281 | + |
| 282 | +- RStudio |
| 283 | + - Greg Wilson |
| 284 | + |
| 285 | +- UBC-MDS Teaching Team |
| 286 | + - Tiffany Timbers |
| 287 | + |
| 288 | +- UBC-MDS Students |