diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 05c83d8e..9271c383 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -31,7 +31,7 @@ Steps to reproduce the behavior:

 ## Related Issue

-
+

 ## Additional context
diff --git a/.github/workflows/docpublish.yml b/.github/workflows/docpublish.yml
new file mode 100644
index 00000000..c6d8842b
--- /dev/null
+++ b/.github/workflows/docpublish.yml
@@ -0,0 +1,27 @@
+name: Publish documentation
+
+on:
+  push:
+    branches:
+      - master
+
+jobs:
+  publish-documentation:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v6
+      - name: Build docs
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          make build-docs
+      - name: Push docs to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v4
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_branch: gh-pages
+          publish_dir: ./site
+          user_name: "github-actions[bot]"
+          user_email: "github-actions[bot]@users.noreply.github.com"
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
new file mode 100644
index 00000000..f9044668
--- /dev/null
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,25 @@
+default_install_hook_types:
+  - pre-commit
+
+default_stages:
+  - pre-commit
+
+repos:
+- repo: https://github.com/codespell-project/codespell
+  rev: v2.4.1
+  hooks:
+  - id: codespell
+    name: Run codespell to check for common misspellings in files
+    # config section is within pyproject.toml
+    language: python
+    types: [ text ]
+    args:
+
+      - --ignore-words=spelling_wordlist.txt
+      - --write-changes
+    exclude: >
+      (?x)
+      uv.lock$|
+      ^dags/fixtures/data_questionnaire.csv$
+    additional_dependencies:
+      - tomli
diff --git a/Makefile b/Makefile
index 83453dfe..80197994 100644
--- a/Makefile
+++ b/Makefile
@@ -27,3 +27,9 @@ deploy-prod:

 down-prod:
 	docker-compose -f ./docker-compose.yml down
+
+build-docs:
+	uv run --group docs mkdocs build
+
+serve-docs:
+	uv run --group docs mkdocs serve
diff --git a/contrib/README.md b/contrib/README.md
index dd4238f5..592b2f42 100644
--- a/contrib/README.md
+++ b/contrib/README.md
@@ -2,7 +2,7 @@

 ## Upload KKTIX

-![](../docs/kktix.png)
+![kktix](../docs/images/kktix.png)

 1. Navigate to KKTIX's attendees page
 2. Download the CSV
diff --git a/dags/ods/kktix_ticket_orders/udfs/kktix_api.py b/dags/ods/kktix_ticket_orders/udfs/kktix_api.py
index c8ba0650..55c10b9a 100644
--- a/dags/ods/kktix_ticket_orders/udfs/kktix_api.py
+++ b/dags/ods/kktix_ticket_orders/udfs/kktix_api.py
@@ -46,7 +46,7 @@ def main(**context):
 def _extract(year: int, timestamp: float) -> list[dict]:
     """
     get data from KKTIX's API
-    1. condition_filter_callb: use this callbacl to filter out unwanted event!
+    1. condition_filter_callb: use this callback to filter out unwanted events!
     2. right now schedule_interval_seconds is a hardcoded value!
     """
     event_raw_data_array: list[dict] = []
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index 30c69248..fedf3375 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -2,74 +2,124 @@

 ## How to Contribute to this Project

-1. Clone this repository:
+### 1. Clone this repository:

-    ```bash
-    git clone https://github.com/pycontw/pycon-etl
-    ```
+```bash
+git clone https://github.com/pycontw/pycon-etl
+```

-2. Create a new branch:
+### 2. Create a new branch:

-    ```bash
-    git checkout -b <branch-name>
-    ```
+Please check out your branch from the latest master branch before making any code changes.

-3. Make your changes.
+```bash
+# Check out the master branch
+git checkout master

-    > **NOTICE:** We are still using Airflow v1, so please read the official document [Apache Airflow v1.10.15 Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/) to ensure your changes are compatible with our current version.

+# Ensure that you're on the latest master branch
+git pull origin master

-    If your task uses an external service, add the connection and variable in the Airflow UI.

+# Create a new branch
+git checkout -b <branch-name>
+```

-4. Test your changes in your local environment:
+### 3. Make your changes.

-    - Ensure the DAG file is loaded successfully.
-    - Verify that the task runs successfully.
-    - Confirm that your code is correctly formatted and linted.
-    - Check that all necessary dependencies are included in `requirements.txt`.
+If your task uses an external service, add the corresponding connections and variables in the Airflow UI.

-5. Push your branch:
+### 4. Test your changes in your local environment:

-    ```bash
-    git push origin <branch-name>
-    ```
+- Ensure that the DAG files are loaded successfully (see the check below).
+- Verify that the tasks run without errors.
+- Confirm that your code is properly formatted and linted. See the [Convention](#convention) section for more details.
+- Check that all necessary dependencies are included in the `pyproject.toml` file.
+  - Airflow dependencies are managed by [uv].
+- Ensure that all required documentation is provided.
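+
+One quick way to check that the DAG files load, assuming the Airflow CLI is available in your virtual environment, is:
+
+```bash
+# List all DAGs that Airflow can parse
+uv run airflow dags list
+
+# Show the import errors raised while parsing the DAG files, if any
+uv run airflow dags list-import-errors
+```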

-6. Create a Pull Request (PR).
+### 5. Push your branch:

-7. Wait for the review and merge.
+```bash
+git push origin <branch-name>
+```

-8. Write any necessary documentation.
+### 6. Create a Pull Request (PR).

-## Release Management
+If additional steps are required after merging and deploying (e.g., adding new connections or variables), please list them in the PR description.

-Please use [GitLab Flow](https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/); otherwise, you cannot pass Docker Hub CI.
+### 7. Wait for the review and merge.

-## Dependency Management
+## Log in to the Airflow Web UI

-Airflow dependencies are managed by [uv]. For more information, refer to the [Airflow Installation Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/installation.html).
+Ask maintainers for your Airflow account and credentials.

-## Code Convention
+

-### Airflow DAG
+## Convention

-- Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.
-
-    - Examples:
-        1. `ods/opening_crawler`: Crawlers written by @Rain. These openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
-        2. `ods/survey_cake`: A manually triggered uploader that uploads questionnaires to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire.
+### Airflow DAGs
+- Please refer to [「大數據之路:阿里巴巴大數據實戰」 讀書心得](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.

 - Table name convention: ![img](https://miro.medium.com/max/1400/1*bppuEKMnL9gFnvoRHUO8CQ.png)

-### Format
+### Code Formatting

-Please use `make format` to format your code before committing, otherwise, the CI will fail.
+Please run `make format` to ensure your code is properly formatted before committing; otherwise, the CI will fail.

 ### Commit Message

 - It is recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).
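+
+For example, a typical flow, assuming Commitizen is installed as a development dependency, looks like:
+
+```bash
+# Stage your changes
+git add .
+
+# Let Commitizen walk you through a Conventional Commits message
+uv run cz commit
+```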

-### CI/CD
-
-Please check the [.github/workflows](.github/workflows) directory for details.
-
-[uv]: https://docs.astral.sh/uv/
\ No newline at end of file
+## Release Management (CI/CD)
+
+We use [Python CI] and [Docker Image CI] to ensure our code quality meets specific standards and that Docker images can be published automatically.
+
+When a pull request is created, [Python CI] checks whether the code quality is satisfactory. At the same time, we build a `cache` image using `Dockerfile` and a `test` image with `Dockerfile.test`, which are then pushed to the [GCP Artifact Registry].
+
+After a pull request is merged into the `master` branch, the two image tags mentioned above are created, along with a new `staging` tag for the image generated from `Dockerfile`.
+
+Once we verify that the `staging` image functions correctly, we merge the `master` branch into the `prod` branch and push it with the following commands.
+
+```bash
+git checkout prod
+git pull origin prod
+
+git merge origin/master
+
+git push origin prod
+```
+
+This triggers the [Docker Image CI] again to update the `cache`, `test`, and `staging` images, as well as to create a `latest` image that we will later use for deploying to our production instance. See the [Deployment Guide](./DEPLOYMENT.md) for the following steps.
+
+```mermaid
+---
+config:
+  theme: 'base'
+  gitGraph:
+    mainBranchName: 'prod'
+    tagLabelFontSize: '25px'
+    branchLabelFontSize: '20px'
+---
+  gitGraph
+    commit id:"latest features" tag:"latest"
+    branch master
+    commit id:"staging features" tag:"staging"
+    checkout prod
+    commit id:"prod config"
+    checkout master
+    branch feature-1
+    commit id: "new features" tag:"cache" tag:"test"
+```
+
+[uv]: https://docs.astral.sh/uv/
+[Python CI]: https://github.com/pycontw/pycon-etl/actions/workflows/python.yml
+[Docker Image CI]: https://github.com/pycontw/pycon-etl/actions/workflows/dockerimage.yml
+[GCP Artifact Registry]: https://cloud.google.com/artifact-registry/
diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index c2ed3cd8..759904d3 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -1,16 +1,29 @@
 # Deployment Guide

-1. Login to the data team's server:
-    1. Run: `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
-    2. Services:
-        * ETL: `/srv/pycon-etl`
-        * Metabase is located here: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
+## Deployment Steps

-2. Pull the latest codebase to this server: `git pull`
+### 1. Log in to the data team's GCE server

-3. Add credentials to the `.env.production` file (only needs to be done once).
+```bash
+gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"
+```
+
+* Location of the services:
+  * ETL (Airflow): `/srv/pycon-etl`
+  * Metabase: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
+
+### 2. Pull the latest codebase and image to this server
+
+```bash
+git checkout prod
+git pull origin prod
+
+docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:latest
+```
+
+### 3. Add credentials to the `.env.production` file (only needs to be done once).

-4. Start the services:
+### 4. Restart the services:

 ```bash
 # Start production services
@@ -18,4 +31,22 @@ docker-compose -f ./docker-compose.yml up

 # Stop production services
 # docker-compose -f ./docker-compose.yml down
-```
\ No newline at end of file
+```
+
+### 5. Check whether the services are up
+
+```bash
+# For Airflow, the following services should be included:
+# * airflow-api-server
+# * airflow-dag-processor
+# * airflow-scheduler
+# * airflow-triggerer
+docker ps
+
+# Check the resource usage if needed
+docker stats
+```
+
+### 6. Log in to the service
+
+For security reasons, our Airflow instance is not publicly accessible. You will need an authorized GCP account to perform port forwarding for the webserver and an authorized Airflow account to access it.
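+
+A minimal sketch of the port forwarding, assuming the webserver listens on the default port 8080, looks like this:
+
+```bash
+# Forward local port 8080 to the Airflow webserver on the GCE instance
+gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" -- -NL 8080:localhost:8080
+```
+
+You can then open http://localhost:8080 and sign in with your Airflow account.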
diff --git a/docs/MAINTENANCE.md b/docs/MAINTENANCE.md
index 74774cc0..bea00d4d 100644
--- a/docs/MAINTENANCE.md
+++ b/docs/MAINTENANCE.md
@@ -2,23 +2,31 @@

 ## Disk Space

+
+
 Currently, the disk space is limited, so please check the disk space before running any ETL jobs. This section will be deprecated if we no longer encounter out-of-disk issues.

-1. Find the largest folders:
-    ```bash
-    du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
-    ```
-2. Show the folder size:
-    ```bash
-    du -hs xxxx
-    ```
-3. Delete the large folders identified.
-4. Check disk space:
-    ```bash
-    df -h
-    ```
+### 1. Find the largest folders:
+
+```bash
+du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
+```
+
+### 2. Show the folder size:
+
+```bash
+du -hs <folder>
+```
+
+### 3. Delete the large folders identified.
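+
+Rather than deleting `overlay2` folders by hand, it is usually safer to let Docker reclaim the space itself, for example:
+
+```bash
+# Remove stopped containers, dangling images, unused networks, and build cache
+docker system prune
+
+# If more space is needed, also remove all unused (not just dangling) images
+# docker image prune -a
+```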
+
+### 4. Check disk space:
+
+```bash
+df -h
+```

 ## Token Expiration
diff --git a/docs/README.md b/docs/README.md
index 4a1b0f4a..0f156eeb 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,20 +1,11 @@
-# PyConTW ETL
+# PyCon TW ETL

 ![Python CI](https://github.com/pycontw/PyCon-ETL/workflows/Python%20CI/badge.svg)
 ![Docker Image CI](https://github.com/pycontw/PyCon-ETL/workflows/Docker%20Image%20CI/badge.svg)

 Using Airflow to implement our ETL pipelines.

-## Table of Contents
-
-- [Prerequisites](#prerequisites)
-- [Installation](#installation)
-- [Configuration](#configuration)
-- [BigQuery (Optional)](#bigquery-optional)
-- [Running the Project](#running-the-project)
-    - [Local Environment with Docker](#local-environment-with-docker)
-    - [Production](#production)
-- [Contact](#contact)
+[TOC]

 ## Prerequisites

@@ -29,50 +20,48 @@ We use [uv] to manage dependencies and virtual environment.

 Below are the steps to create a virtual environment using [uv]:

-1. Create a Virtual Environment with Dependencies Installed
+### 1. Create a Virtual Environment with Dependencies Installed

-    To create a virtual environment, run the following command:
+To create a virtual environment, run the following command:

-    ```bash
-    uv sync
-    ```
+```bash
+uv sync
+```

-    By default, [uv] sets up the virtual environment in `.venv`
+By default, [uv] sets up the virtual environment in `.venv`.

-2. Activate the Virtual Environment
+### 2. Activate the Virtual Environment

-    After creating the virtual environment, activate it using the following command:
+After creating the virtual environment, activate it using the following command:

-    ```bash
-    source .venv/bin/activate
-    ```
+```bash
+source .venv/bin/activate
+```

-3. Deactivate the Virtual Environment
+### 3. Deactivate the Virtual Environment

-    When you're done working in the virtual environment, you can deactivate it with:
+When you're done working in the virtual environment, you can deactivate it with:

-    ```bash
-    deactivate
-    ```
+```bash
+deactivate
+```

 ## Configuration

 1. For development or testing, run `cp .env.template .env.staging`. For production, run `cp .env.template .env.production`.
-
-2. Follow the instructions in `.env.` and fill in your secrets.
-   If you are running the staging instance for development as a sandbox and do not need to access any specific third-party services, leaving `.env.staging` as-is should be fine.
+2. Follow the instructions in `.env.staging` or `.env.production` and fill in your secrets. If you are running the staging instance for development as a sandbox and do not need to access any specific third-party services, leaving `.env.staging` as-is should be fine.

 > Contact the maintainer if you don't have these secrets.
-
+>
 > **⚠ WARNING: About .env**
 > Please do not use the .env file for local development, as it might affect the production tables.

 ### BigQuery (Optional)

-Set up the Authentication for GCP:
-
-- After running `gcloud auth application-default login`, you will get a credentials.json file located at `$HOME/.config/gcloud/application_default_credentials.json`. Run `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have it.
-- service-account.json: Please contact @david30907d via email or Discord. You do not need this json file if you are running the sandbox staging instance for development.
+- Set up authentication for GCP (a quick check is sketched below):
+  - After running `gcloud auth application-default login`, you will get a credentials.json file located at `$HOME/.config/gcloud/application_default_credentials.json`.
+  - Run `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have it.
+- `service-account.json`: Please contact @david30907d via email or Discord. You do not need this json file if you are running the sandbox staging instance for development.
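+
+To verify that the Application Default Credentials are picked up, one quick check is:
+
+```bash
+# Prints an access token if the ADC setup works
+gcloud auth application-default print-access-token
+```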

 ## Running the Project

 If you are a developer 👨‍💻, please check the [Contributing Guide](./CONTRIBUTING.md).

@@ -80,9 +69,23 @@ If you are a maintainer 👨‍🔧, please check the [Maintenance Guide](./MAINTENANCE.md).

-### Local Environment with Docker
+### Local Development with uv
+
+```bash
+# Point the database to the local "sqlite/airflow.db"
+# Run "uv run airflow db migrate" first if the file does not exist
+export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:///`pwd`/sqlite/airflow.db

-For development/testing:
+# Point the Airflow home to the current directory
+export AIRFLOW_HOME=`pwd`
+
+# Run standalone airflow
+# Note that there may be slight differences between using this command and running through docker compose.
+# However, the difference should not be noticeable in most cases.
+uv run airflow standalone
+```
+
+### Local Development with docker-compose

 ```bash
 # Build the local dev/test image
@@ -97,9 +100,11 @@ make build-dev

 # Start dev/test services
 make deploy-dev

 # Stop dev/test services
 make down-dev
 ```

 > The difference between production and dev/test compose files is that the dev/test compose file uses a locally built image, while the production compose file uses the image from Docker Hub.

-If you are an authorized maintainer, you can pull the image from the GCP Artifact Registry.
+#### Use images from the Artifact Registry
+
+If you are an authorized maintainer, you can pull the image from the [GCP Artifact Registry].

-Docker client must be configured to use the GCP Artifact Registry.
+The Docker client must be configured to use the [GCP Artifact Registry] first:

 ```bash
 gcloud auth configure-docker asia-east1-docker.pkg.dev
 ```

 Then, pull the image:

 ```bash
 docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:{tag}
 ```

-There are several tags available:
+Available tags:

 - `cache`: cache the image for faster deployment
 - `test`: for testing purposes, including the test dependencies
 - `staging`: when pushing to the staging environment
 - `latest`: when pushing to the production environment

-### Production
-
-Please check the [Production Deployment Guide](./DEPLOYMENT.md).
-
 ## Contact

 [PyCon TW Volunteer Data Team - Discord](https://discord.com/channels/752904426057892052/900721883383758879)
diff --git a/docs/airflow.png b/docs/airflow.png
deleted file mode 100644
index ec97a8ea..00000000
Binary files a/docs/airflow.png and /dev/null differ
diff --git a/docs/kktix.png b/docs/images/kktix.png
similarity index 100%
rename from docs/kktix.png
rename to docs/images/kktix.png
diff --git a/docs/youtube-connection.png b/docs/youtube-connection.png
deleted file mode 100644
index a682e8c7..00000000
Binary files a/docs/youtube-connection.png and /dev/null differ
diff --git a/mkdocs.yml b/mkdocs.yml
index 6ae6df72..fd222dea 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -8,3 +8,9 @@ nav:
   - Contributing: "CONTRIBUTING.md"
   - Maintenance: "MAINTENANCE.md"
   - Deployment: "DEPLOYMENT.md"
+markdown_extensions:
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
diff --git a/spelling_wordlist.txt b/spelling_wordlist.txt
new file mode 100644
index 00000000..42394bed
--- /dev/null
+++ b/spelling_wordlist.txt
@@ -0,0 +1 @@
+Checkin