# Overview of R scripts used to generate time series data for Faster Indicators

This guide gives an overview of the main R scripts used.

# Table of Contents
1. [data_impute_run.R](#data_impute_runr)
2. [data_seats_run.R](#data_seats_runr)
3. [data_impute_and_seats_functions.R](#data_impute_and_seats_functionsr)
| 9 | + |
| 10 | +## `data_impute_run.R` |
| 11 | + |
| 12 | +This script is in charge of downloading data from BigQuery, tidying and imputing the data and outputting QA files. |
| 13 | +Each location is processed separately and will be stored in separate `.Rda` files, e.g. `TfL-images.Rda`. The CSV |
| 14 | +file `list_of_location_datasets.csv` will list the location names of datasets currently being processed weekly and |
| 15 | +their corresponding latest date in each dataset. This file is used so the R script knows which locations to download |
| 16 | +and process for the weekly query. |
| 17 | +The CSV file `Traffic_cameras_locations_and_start_dates.csv` lists the locations present in BigQuery and their |
| 18 | +corresponding start dates. This file is used so the R script knows when a new location is added to BigQuery and when |
| 19 | +to download the data. |
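
As a sketch of how the weekly run might consume the tracking CSV (the helper name and column names here are assumptions, not necessarily those used in the real script):

```r
# Sketch: read the weekly tracking file, or signal a first run if it is
# missing. Column names ("location", "last_processed") are assumptions.
get_locations_to_process <- function(path = "cache/list_of_location_datasets.csv") {
  if (!file.exists(path)) {
    return(NULL)  # first run: no locations have been processed yet
  }
  locations <- read.csv(path, stringsAsFactors = FALSE)
  locations$last_processed <- as.Date(locations$last_processed)
  locations
}
```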
| 20 | + |
| 21 | +There will be a copy of each of the following QA files for each location processed. There will also be a copy for any |
| 22 | +new locations added in BigQuery but which do not have at least 5 weeks of data: |
| 23 | +* `status_report_YYMMDD_location_name.png` |
| 24 | +* `Status_report_per_camera_YYMMDD_location_name.pdf` |
| 25 | +* `Status_report_per_camera_long_version_YYMMDD_location_name.pdf` |
| 26 | +* `Status_report_per_location_YYMMDD_location_name.pdf` |
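
The QA filenames embed the run date as a `YYMMDD` stamp; a minimal sketch of building them (the helper name is hypothetical):

```r
# Sketch: build the four dated QA filenames for one location.
# The %y%m%d stamp matches the YYMMDD pattern in the list above.
qa_filenames <- function(location_name, run_date = Sys.Date()) {
  stamp <- format(run_date, "%y%m%d")
  c(
    sprintf("status_report_%s_%s.png", stamp, location_name),
    sprintf("Status_report_per_camera_%s_%s.pdf", stamp, location_name),
    sprintf("Status_report_per_camera_long_version_%s_%s.pdf", stamp, location_name),
    sprintf("Status_report_per_location_%s_%s.pdf", stamp, location_name)
  )
}
```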
| 27 | + |
| 28 | +Below is an outline of the code for script `data_impute_run.R`: |
| 29 | + |
| 30 | +* Get list of locations present in BigQuery |
| 31 | +* `cache/Traffic_cameras_locations_and_start_dates.csv` exists? |
| 32 | + * No – download locations and corresponding start dates from BigQuery and create CSV |
| 33 | + * Yes – Read in CSV |
| 34 | +* Compare locations from CSV and those downloaded from BigQuery. Are there new locations? |
| 35 | + * Yes – get corresponding start date for new locations from BigQuery and update CSV |
| 36 | +* `cache/list_of_location_datasets.csv` exists? |
| 37 | + * No – for each new data source is there at least five weeks of data? |
| 38 | + * Yes – download data up to ‘today – 8’ (i.e. download up to but not including the latest week of data), tidy, |
| 39 | + impute and save as Rda file. Add location name and last processed date to |
| 40 | + `cache/list_of_location_datasets.csv`. |
| 41 | + * No – Download latest week and output PDFs for QA purposes. |
| 42 | + * Yes – Read in existing `cache/list_of_location_datasets.csv`. Check if locations in |
| 43 | + `cache/list_of_location_datasets.csv` match with the ones in |
| 44 | + `cache/Traffic_cameras_locations_and_start_dates.csv`. Do locations match? |
| 45 | + * No – check start data of new location, is there at least five weeks of data? |
| 46 | + * Yes – download data up to ‘today – 8’ (i.e. download up to but not including the latest week of data), |
| 47 | + tidy, impute and save as Rda file. Add location name and last processed date to |
| 48 | + `cache/list_of_location_datasets.csv`. |
| 49 | + * No – Download latest week and output PDFs for QA purposes. |
| 50 | +* `cache/list_of_location_datasets.csv` exists? |
| 51 | + * Yes – read in CSV. For each location in CSV download latest week of data, tidy and output QA files. Read in |
| 52 | + existing dataset for location, merge last 4 weeks of data onto new data and impute missing data in new data. |
| 53 | + Append new week of data onto existing dataset and save Rda. Save updated last processed data to CSV. |
| 54 | + * No – exit script |
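
The impute-and-append step in the weekly branch above can be sketched as follows. The data frame columns (`date`, `count`) and the mean-fill imputation are stand-in assumptions; the real imputation lives in `data_impute_and_seats_functions.R`:

```r
# Sketch: the last 4 weeks of the existing dataset provide context for
# imputing the new week, and only the new rows are appended back on.
append_new_week <- function(existing, new_week) {
  context  <- existing[existing$date > max(existing$date) - 28, ]
  combined <- rbind(context, new_week)
  # Stand-in imputation: fill missing counts with the context mean.
  combined$count[is.na(combined$count)] <- mean(combined$count, na.rm = TRUE)
  rbind(existing, combined[combined$date > max(context$date), ])
}
```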
| 55 | + |
| 56 | +## `data_seats_run.R` |
| 57 | + |
| 58 | +This script is run after `data_imputes_run.R` and is in charge of aggregating the data and performing SEATS. Each |
| 59 | +location is aggregated and has SEATS applied separately. The results of SEATS then gets merged together to produce |
| 60 | +the CSV file for Faster Indicators. Next, a PDF showing a four week history is produced using this data. |
| 61 | + |
| 62 | +This script outputs 2 files: |
| 63 | +* `Trafcam_data_YYMMDD.csv` |
| 64 | +* `Four_week_history_YYMMDD.pdf` |
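
As an illustration of the SEATS step only (not the production code: the real aggregation and series construction are defined in the scripts themselves), the `seasonal` R package wraps X-13ARIMA-SEATS and applies SEATS by default:

```r
library(seasonal)  # wraps X-13ARIMA-SEATS; SEATS is the default adjustment

# Illustrative stand-in series; the real scripts build one series per
# location from the imputed traffic-camera counts.
m <- seas(AirPassengers)
adjusted <- final(m)  # extract the seasonally adjusted series
```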
| 65 | + |
| 66 | +## `data_impute_and_seats_functions.R` |
| 67 | + |
| 68 | +This script houses all the functions used in the other two scripts. |
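
Presumably the two run scripts load these helpers with `source()` before doing any work (an assumption; the loading mechanism is not documented here):

```r
# Assumed to appear near the top of data_impute_run.R and data_seats_run.R:
source("data_impute_and_seats_functions.R")
```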