Commit 2f4348e

add snakemake rule for pulling data into repo
1 parent b1300b6 commit 2f4348e

3 files changed: +31 -8 lines changed

Snakefile

Lines changed: 9 additions & 7 deletions
@@ -1,8 +1,15 @@
 configfile: "config.json"
 
+rule get_ght_data:
+    params:
+        download_url = config["data_download_url"]
+    output:
+        output_file = "data/commits_by_org.feather"
+    shell: "python src/github_analysis/make_report.py -du {params.download_url} -of {output.output_file}"
+
 rule run_analysis:
     input:
-        data_path = "/Users/richiezitomer/Documents/RStudio-Data-Repository/clean_data/commits_by_org.feather"
+        data_path = "data/commits_by_org.feather"
     output:
         results_path = directory("results/")
     params:
@@ -20,12 +27,7 @@ rule run_analysis:
 
 rule generate_images:
     input:
-        data_path="/Users/richiezitomer/Documents/RStudio-Data-Repository/clean_data/commits_by_org.feather",
+        data_path="data/commits_by_org.feather",
         embedding_path="results/embeddings.csv"
     shell:
         "python src/github_analysis/make_report.py -dp {input.data_path} -ep {input.embedding_path}"
-
-
-# Commented out because repo is currently over bandwidth: https://help.github.com/en/articles/about-storage-and-bandwidth-usage
-#rule clone_data_repo:
-# shell: "git clone https://github.com/UBC-MDS/RStudio-Data-Repository.git"

config.json

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
-{"python_hash_seed": 0,
+{"data_download_url": "https://api.figshare.com/v2/file/download/15593951",
+ "python_hash_seed": 0,
  "n_workers": 1,
  "n_projects": 1000,
  "min_commits": "None",
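
Snakemake's `configfile:` directive loads this JSON file into the global `config` dictionary, which is how `config["data_download_url"]` resolves inside the Snakefile. Outside Snakemake, the same lookup could be sketched in plain Python as below. The helper name `load_download_url` and its error message are hypothetical and not part of this commit.

```python
import json


def load_download_url(config_path="config.json"):
    """Return the data_download_url entry from a JSON config file.

    Hypothetical helper, not part of the commit: it mirrors the
    config["data_download_url"] lookup used in the Snakefile.
    """
    with open(config_path) as f:
        config = json.load(f)
    try:
        return config["data_download_url"]
    except KeyError:
        # Name the file so a missing key is easy to trace.
        raise KeyError(
            f"'data_download_url' is missing from {config_path}"
        ) from None
```

Failing with a message that names the config file makes a missing key easier to diagnose than a bare `KeyError`.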
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+import argparse
+import shutil
+
+import requests
+
+
+def download_file(download_URL, filename):
+    """Stream the file at download_URL to filename and return filename."""
+    with requests.get(download_URL, stream=True) as r:
+        with open(filename, 'wb') as f:
+            shutil.copyfileobj(r.raw, f)
+    return filename
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-du", "--download_URL", help="The URL to download the file.", default='https://api.figshare.com/v2/file/download/15593951')
+    parser.add_argument("-of", "--output_file", help="The path where the downloaded file is saved.", default='data/commits_by_org.feather')
+    args = parser.parse_args()
+    download_file(args.download_URL, args.output_file)
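
The download script writes straight to the output path, so an interrupted transfer could leave a truncated file at `data/commits_by_org.feather`. A minimal sketch of a more defensive variant follows. It is not the commit's code: the name `download_file_atomic`, the `timeout` default, and the swap from `requests` to the standard library's `urllib.request` are all assumptions made to keep the example dependency-free.

```python
import os
import shutil
import tempfile
import urllib.request


def download_file_atomic(download_url, filename, timeout=60):
    """Download download_url to filename via a temporary file.

    Hypothetical variant of download_file: the name, the timeout value,
    and the use of urllib.request instead of requests are assumptions.
    """
    target_dir = os.path.dirname(filename) or "."
    os.makedirs(target_dir, exist_ok=True)  # e.g. create data/ if missing

    # Write to a temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, \
                urllib.request.urlopen(download_url, timeout=timeout) as response:
            shutil.copyfileobj(response, tmp)
        # Move the finished file into place only after a full download.
        os.replace(tmp_path, filename)
    except BaseException:
        # Remove the partial file on any failure, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return filename
```

For HTTP URLs, `urllib.request.urlopen` raises `urllib.error.HTTPError` on 4xx and 5xx responses, so an error page is not saved as if it were the dataset.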
