Commit 26c45dc

30 version bump (#31)

* Version bump to `lxml` library c/o dependabot
* Update to R scripts, to reduce memory usage
* Correction to `runner-startup.sh` to ensure QA files are sent to GCP bucket

1 parent 1fff98e commit 26c45dc

File tree

9 files changed: +630 −353 lines changed

cloud/functions/update_sources/src/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@ wheel==0.35.0
 # Additional local requirements
 google-cloud-storage==1.33.0
 BeautifulSoup4==4.9.3
-lxml==4.6.3
+lxml==4.6.5

 python-json-logger==2.0.1
 google-cloud-logging==2.0.2

cloud/vm/README_RScripts.md

Lines changed: 68 additions & 0 deletions

@@ -0,0 +1,68 @@
# Overview of R scripts used to generate time series data for Faster Indicators

This guide gives an overview of the main R scripts used.

# Table of Contents
1. [data_impute_run.R](#data_impute_runr)
2. [data_seats_run.R](#data_seats_runr)
3. [data_impute_and_seats_functions.R](#data_impute_and_seats_functionsr)

## `data_impute_run.R`

This script is in charge of downloading data from BigQuery, tidying and imputing the data, and outputting QA files.
Each location is processed separately and stored in its own `.Rda` file, e.g. `TfL-images.Rda`. The CSV file
`list_of_location_datasets.csv` lists the location names of the datasets currently processed weekly and the latest
date present in each dataset. This file is used so the R script knows which locations to download and process for
the weekly query.
The CSV file `Traffic_cameras_locations_and_start_dates.csv` lists the locations present in BigQuery and their
corresponding start dates. This file is used so the R script knows when a new location has been added to BigQuery
and from when to download its data.

A copy of each of the following QA files is produced for every location processed. A copy is also produced for any
new location added in BigQuery that does not yet have at least 5 weeks of data:
* `status_report_YYMMDD_location_name.png`
* `Status_report_per_camera_YYMMDD_location_name.pdf`
* `Status_report_per_camera_long_version_YYMMDD_location_name.pdf`
* `Status_report_per_location_YYMMDD_location_name.pdf`
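The QA filenames above all embed the run date and location name. A minimal sketch of how such names could be built (the `sprintf` pattern and helper name are illustrative assumptions, not the script's actual implementation):

```r
# Hypothetical helper: build a dated QA filename for a location.
qa_name <- function(prefix, location, ext = "pdf") {
  sprintf("%s_%s_%s.%s", prefix, format(Sys.Date(), "%y%m%d"), location, ext)
}

qa_name("Status_report_per_camera", "TfL-images")
# e.g. "Status_report_per_camera_240115_TfL-images.pdf" (date varies by run)
```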

Below is an outline of the code for script `data_impute_run.R`:

* Get the list of locations present in BigQuery
* Does `cache/Traffic_cameras_locations_and_start_dates.csv` exist?
  * No – download the locations and corresponding start dates from BigQuery and create the CSV
  * Yes – read in the CSV
* Compare the locations from the CSV with those downloaded from BigQuery. Are there new locations?
  * Yes – get the corresponding start date for each new location from BigQuery and update the CSV
* Does `cache/list_of_location_datasets.csv` exist?
  * No – for each new data source, is there at least five weeks of data?
    * Yes – download data up to ‘today – 8’ (i.e. download up to but not including the latest week of data),
      tidy, impute and save as an Rda file. Add the location name and last processed date to
      `cache/list_of_location_datasets.csv`.
    * No – download the latest week and output PDFs for QA purposes.
  * Yes – read in the existing `cache/list_of_location_datasets.csv`. Check whether the locations in
    `cache/list_of_location_datasets.csv` match those in
    `cache/Traffic_cameras_locations_and_start_dates.csv`. Do the locations match?
    * No – check the start date of each new location: is there at least five weeks of data?
      * Yes – download data up to ‘today – 8’ (i.e. download up to but not including the latest week of data),
        tidy, impute and save as an Rda file. Add the location name and last processed date to
        `cache/list_of_location_datasets.csv`.
      * No – download the latest week and output PDFs for QA purposes.
* Does `cache/list_of_location_datasets.csv` exist?
  * Yes – read in the CSV. For each location in the CSV, download the latest week of data, tidy it and output the
    QA files. Read in the existing dataset for the location, merge the last 4 weeks of data onto the new data and
    impute missing values in the new data. Append the new week of data onto the existing dataset and save the Rda.
    Save the updated last processed date to the CSV.
  * No – exit the script
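The cache-checking flow above can be sketched as follows. This is an illustrative outline only: the file paths come from this README, but the helper functions (`download_locations_from_bigquery`, `process_new_location`, `output_qa_pdfs`, `update_weekly`) and CSV column names are hypothetical placeholders, not the script's real API.

```r
locations_cache <- "cache/Traffic_cameras_locations_and_start_dates.csv"
datasets_cache  <- "cache/list_of_location_datasets.csv"

# Get the list of locations present in BigQuery, caching it locally
if (!file.exists(locations_cache)) {
  locations <- download_locations_from_bigquery()          # hypothetical helper
  write.csv(locations, locations_cache, row.names = FALSE)
} else {
  locations <- read.csv(locations_cache)
}

if (file.exists(datasets_cache)) {
  datasets <- read.csv(datasets_cache)
  # Locations in BigQuery but not yet tracked in the weekly-processing CSV
  new_locs <- setdiff(locations$location, datasets$location)
  for (loc in new_locs) {
    start_date <- locations$start_date[locations$location == loc]
    if (Sys.Date() - as.Date(start_date) >= 35) {   # at least five weeks of data?
      process_new_location(loc)   # download to 'today - 8', tidy, impute, save .Rda, update CSV
    } else {
      output_qa_pdfs(loc)         # QA output only until enough history exists
    }
  }
  update_weekly(datasets)         # latest week for every location already in the CSV
}
```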

## `data_seats_run.R`

This script is run after `data_impute_run.R` and is in charge of aggregating the data and performing SEATS. Each
location is aggregated and has SEATS applied separately. The SEATS results are then merged together to produce
the CSV file for Faster Indicators. Finally, a PDF showing a four-week history is produced from this data.

This script outputs 2 files:
* `Trafcam_data_YYMMDD.csv`
* `Four_week_history_YYMMDD.pdf`
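A minimal sketch of the per-location SEATS step, assuming the `seasonal` R package (an X-13ARIMA-SEATS wrapper) is used; the actual script may use a different interface, and `location_series` is a hypothetical named list of monthly `ts` objects:

```r
library(seasonal)

# Apply SEATS to one location's aggregated series and return the adjusted values
adjust_location <- function(x) {
  m <- seas(x)   # SEATS is the default adjustment method in seasonal::seas()
  final(m)       # seasonally adjusted series
}

# Merge the per-location results into one table for the Faster Indicators CSV
# adjusted <- lapply(location_series, adjust_location)
# merged   <- do.call(cbind, adjusted)
# write.csv(merged, sprintf("Trafcam_data_%s.csv", format(Sys.Date(), "%y%m%d")))
```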

## `data_impute_and_seats_functions.R`

This script houses all the functions used in the other two scripts.
