- Generate a forecast for the latest day.
- If the delta between the actual value and the forecast exceeds a preset threshold, an alert is raised.
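As a minimal sketch of this rule (illustrative only; the function and argument names below are assumptions, not the actual alerting module's interface):

```python
# Illustrative sketch of the alerting rule described above.
# The real alerting module's interface may differ.
def should_alert(actual: float, forecast: float, threshold: float) -> bool:
    """Return True when the delta between actual and forecast exceeds the threshold."""
    return abs(actual - forecast) > threshold


# Example: actual=120, forecast=100, threshold=15 -> alert is raised
assert should_alert(actual=120, forecast=100, threshold=15)
```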
The repo is structured into a dataloader module (R) and an alerting module (Python). A DBT project currently serves the single purpose of maintaining lookup files.
The anomaly detection dataloader module is not intended to run directly on large tables (hundreds of GB or more). The module is currently a prototype and the data load mechanism may change over time. Tables generated by the dataloader therefore cannot be expected to exist for long, and their data may be corrupted or truncated at any time due to ongoing development. It is important to design anomaly configurations in a way that keeps backfills from becoming prohibitively expensive.
The best way to generate smaller, more stable intermediate tables is to use the monitoring module from the monitoring repo. Those tables are refreshed automatically every day and are better supported than data loaded through the anomaly detection dataloader module.
Forecasts are configured in config.yml. There are two modes:
- Specify date and forecast column
- Manually specify an SQL script
Mode #1 is only possible if the source table already has the required columns and allows a SUM/COUNT over the forecast column with a GROUP BY on the date column. If this is not possible, a custom SQL script can be provided; it must return exactly two columns: `timestamp` and `value`.
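A minimal `config.yml` sketch of the two modes might look like this (all field names are illustrative assumptions, not the actual schema):

```yaml
forecasts:
  # Mode 1: point at an existing table and let the loader aggregate it
  - name: daily_orders
    dataset: my_dataset          # assumed field names
    table: orders
    date_column: order_date
    forecast_column: order_count
    aggregation: SUM
  # Mode 2: provide a custom SQL script returning `timestamp` and `value`
  - name: daily_revenue
    sql: |
      SELECT order_date AS timestamp, SUM(revenue) AS value
      FROM my_dataset.orders
      GROUP BY order_date
```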
For production purposes a Dockerfile has been provided.
For development and for manually generating forecasts, you may use the R project, which includes `generate_forecasts.R`, a wrapper to quickly generate a full load of forecasts.
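For example, assuming an R installation with the project's dependencies, a full load could be generated from the project root (exact arguments, if any, may differ):

```
Rscript generate_forecasts.R
```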
Each anomaly configuration requires a unique definition for loading the data. The data definition is uniquely specified by:
- dataset
- table
- timestamp column
- value column
- period length
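For illustration, such a data definition could be expressed along these lines (key names and the period format are assumptions, not the actual schema); two configurations sharing all five values would load the same data:

```yaml
# Illustrative only: key names and value formats are assumptions.
dataset: my_dataset
table: orders
timestamp_column: order_date
value_column: order_count
period_length: "1 day"
```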
A dashboard has been created to explore new monitoring opportunities, tweak existing anomaly detection configurations, and debug anomalies after alerts have been triggered.
The infrastructure is deployed using Terraform. The Terraform code related to Cloud Build is located in the `cloudbuild` folder.
The `environments` folder contains the deployment infrastructure for each environment (dev, staging, release (new tag)).
IMPORTANT: All changes related to permissions, service accounts (SA), BigQuery, and GCS should be made in the `gfw-terraform-gcp` project. In the `environments` folder, we only add configuration for Google Cloud services such as Cloud Run, Cloud Scheduler, Pub/Sub, etc.
To deploy the Cloud Build configuration, run the following inside the `cloudbuild` folder (this is the only Terraform command you need to run manually, and only when something in the `cloudbuild` folder changes):
```
# only the first time
terraform init

terraform apply
```
The Cloud Build configuration defines two triggers:
- one for pushes to the `dev` and `main` branches
- one for pushes to any tag (releasing a new version)
Cloud Build will build the Docker image and deploy the infrastructure for the corresponding environment, depending on the branch:
- `dev` branch: Cloud Build will deploy the infrastructure for the dev environment (`dev` folder).
- `main` branch: Cloud Build will deploy the infrastructure for the main environment (`main` folder).
- new tag: Cloud Build will deploy the infrastructure for the release environment (`release` folder).
- any other branch: Cloud Build will skip the `apply` step.
The `deploy` folder contains the Terraform code for the infrastructure. It contains the following folders:
- `environments`: Contains the Terraform code for each environment (dev, staging, release).
- `template`: Contains the Terraform code for the services template.
Each environment has its own folder containing the Terraform code for that environment. It contains the following files:
- `main.tf`: Contains the Terraform code for the environment. This file imports the template and specifies the different variables for the environment.
- `backend.tf`: Contains the Terraform backend configuration for the environment.
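An illustrative sketch of an environment's `main.tf` (the module name, source path, and variable set are assumptions based on the description above, not the actual code):

```hcl
module "services" {
  # assumed relative path from deploy/environments/<env> to the template folder
  source = "../../template"

  # selects which environment the template deploys
  environment = "dev"
}
```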
The `template` folder contains the Terraform code for the services template. It contains the following files:
- `main.tf`: Contains the GCP services that you want to deploy. It uses the `environment` variable to determine which environment to deploy to.
- `variables.tf`: Contains the Terraform variables for the template.
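Putting this together, the `deploy` folder layout looks roughly like this (illustrative; only items mentioned above are shown):

```
deploy/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── backend.tf
│   └── ...          # one folder per environment
└── template/
    ├── main.tf
    └── variables.tf
```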