Rewrite documentation and publish it to github-pages using mkdocs #168

Status: Open. Wants to merge 10 commits into base `master`.
2 changes: 1 addition & 1 deletion .github/pull_request_template.md
@@ -31,7 +31,7 @@ Steps to reproduce the behavior:
<!--A clear and concise description of what you expected to happen-->

## Related Issue
<!--If applicable, refernce to the issue related to this pull request.-->
<!--If applicable, reference the issue related to this pull request.-->

## Additional context
<!--Add any other context or screenshots about the pull request here.-->
27 changes: 27 additions & 0 deletions .github/workflows/docpublish.yml
@@ -0,0 +1,27 @@
name: Publish documentation

on:
  push:
    branches:
      - master

jobs:
  publish-documentation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install the latest version of uv
        uses: astral-sh/setup-uv@v6
      - name: Build docs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          make build-docs
      - name: Push docs to GitHub Pages
        uses: peaceiris/actions-gh-pages@v4
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_branch: gh-pages
          publish_dir: ./site
          user_name: "github-actions[bot]"
          user_email: "github-actions[bot]@users.noreply.github.com"
25 changes: 25 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,25 @@
default_install_hook_types:
  - pre-commit

default_stages:
  - pre-commit

repos:
  - repo: https://github.com/codespell-project/codespell
    rev: v2.4.1
    hooks:
      - id: codespell
        name: Run codespell to check for common misspellings in files
        # config section is within pyproject.toml
        language: python
        types: [ text ]
        args:
          - --ignore-words=spelling_wordlist.txt
          - --write-changes
        exclude: >
          (?x)
          uv.lock$|
          ^dags/fixtures/data_questionnaire.csv$
        additional_dependencies:
          - tomli
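
The multi-line `exclude` value relies on Python's verbose-regex flag `(?x)`, which strips the insignificant whitespace, so the effective pattern is just the two alternatives joined by `|`. As a rough sanity check (a sketch using `grep -E`, which is close enough for this pattern), paths surviving the inverted match are the ones codespell will still scan:

```shell
# Effective exclude pattern once (?x) strips the layout whitespace
pattern='uv.lock$|^dags/fixtures/data_questionnaire.csv$'

# Paths NOT matching the pattern are still checked by codespell
printf '%s\n' uv.lock dags/fixtures/data_questionnaire.csv README.md docs/CONTRIBUTING.md \
  | grep -Ev "$pattern"
# prints:
#   README.md
#   docs/CONTRIBUTING.md
```
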
6 changes: 6 additions & 0 deletions Makefile
@@ -27,3 +27,9 @@ deploy-prod:

down-prod:
	docker-compose -f ./docker-compose.yml down

build-docs:
	uv run --group docs mkdocs build

serve-docs:
	uv run --group docs mkdocs serve
2 changes: 1 addition & 1 deletion contrib/README.md
@@ -2,7 +2,7 @@

## Upload KKTIX

![](../docs/kktix.png)
![kktix](../docs/images/kktix.png)

1. Navigate to KKTIX's attendees page
2. Download the CSV
2 changes: 1 addition & 1 deletion dags/ods/kktix_ticket_orders/udfs/kktix_api.py
@@ -46,7 +46,7 @@ def main(**context):
def _extract(year: int, timestamp: float) -> list[dict]:
"""
get data from KKTIX's API
1. condition_filter_callb: use this callbacl to filter out unwanted event!
1. condition_filter_callb: use this callback to filter out unwanted event!
2. right now schedule_interval_seconds is a hardcoded value!
"""
event_raw_data_array: list[dict] = []
134 changes: 92 additions & 42 deletions docs/CONTRIBUTING.md
@@ -2,74 +2,124 @@

## How to Contribute to this Project

1. Clone this repository:
### 1. Clone this repository:

```bash
git clone https://github.com/pycontw/pycon-etl
```
```bash
git clone https://github.com/pycontw/pycon-etl
```

2. Create a new branch:
### 2. Create a new branch:

```bash
git checkout -b <branch-name>
```
Please check out your branch from the latest master branch before making any code changes.

3. Make your changes.
```bash
# Checkout to the master branch
git checkout master

> **NOTICE:** We are still using Airflow v1, so please read the official document [Apache Airflow v1.10.15 Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/) to ensure your changes are compatible with our current version.
# Ensure that you're on the latest master branch
git pull origin master

If your task uses an external service, add the connection and variable in the Airflow UI.
# Create a new branch
git checkout -b <branch-name>
```

4. Test your changes in your local environment:
### 3. Make your changes.

- Ensure the DAG file is loaded successfully.
- Verify that the task runs successfully.
- Confirm that your code is correctly formatted and linted.
- Check that all necessary dependencies are included in `requirements.txt`.
If your task uses an external service, add the connection and variable in the Airflow UI.

5. Push your branch:
### 4. Test your changes in your local environment:

```bash
git push origin <branch-name>
```
- Ensure that the DAG files are loaded successfully.
- Verify that the tasks run without errors.
- Confirm that your code is properly formatted and linted. See the [Convention](#convention) section for more details.
- Check that all necessary dependencies are included in the `pyproject.toml` file.
- Airflow dependencies are managed by [uv].
- Ensure that all required documentation is provided.

6. Create a Pull Request (PR).
### 5. Push your branch:

7. Wait for the review and merge.
```bash
git push origin <branch-name>
```

8. Write any necessary documentation.
### 6. Create a Pull Request (PR).

## Release Management
If additional steps are required after merging and deploying (e.g., adding new connections or variables), please list them in the PR description.

Please use [GitLab Flow](https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/); otherwise, you cannot pass Docker Hub CI.
### 7. Wait for the review and merge.

## Dependency Management
## Log in to the Airflow Web UI

Airflow dependencies are managed by [uv]. For more information, refer to the [Airflow Installation Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/installation.html).
Ask maintainers for your Airflow account and credentials.

## Code Convention
<!--TODO: GitHub or Google OAuth login may be a good way to reduce maintenance overhead, but it requires more setup. -->

### Airflow DAG
## Convention

- Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.

- Examples:
1. `ods/opening_crawler`: Crawlers written by @Rain. These openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
2. `ods/survey_cake`: A manually triggered uploader that uploads questionnaires to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire.
### Airflow Dags

- Please refer to [「大數據之路:阿里巴巴大數據實戰」 讀書心得](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.
- Table name convention:
![img](https://miro.medium.com/max/1400/1*bppuEKMnL9gFnvoRHUO8CQ.png)

### Format
### Code Formatting

Please run `make format` to ensure your code is properly formatted before committing; otherwise, the CI will fail.

### Commit Message

It is recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).

### CI/CD

Please check the [.github/workflows](.github/workflows) directory for details.

[uv]: https://docs.astral.sh/uv/
## Release Management (CI/CD)

We use [Python CI] and [Docker Image CI] to ensure our code quality meets specific standards and that Docker images can be published automatically.

When a pull request is created, [Python CI] checks whether the code quality is satisfactory. At the same time, we build a `cache` image using `Dockerfile` and a `test` image with `Dockerfile.test`, which are then pushed to the [GCP Artifact Registry].

After a pull request is merged into the `master` branch, the two images mentioned above are rebuilt, along with a new `staging` tag for the image generated from `Dockerfile`.

Once we verify that the `staging` image functions correctly, we merge the `master` branch into the `prod` branch through the following commands.

<!--TODO: This is not ideal. The "master" and "prod" branches should be protected and should not allow human pushes. We should create a GitHub Action for this.-->

```bash
git checkout prod
git pull origin prod

git merge origin/master

git push origin prod
```

This triggers the [Docker Image CI] again to update the `cache`, `test`, and `staging` images, and to create a `latest` image that is later deployed to our production instance. See the [Deployment Guide](./DEPLOYMENT.md) for the steps that follow.

```mermaid
---
config:
theme: 'base'
gitGraph:
mainBranchName: 'prod'
tagLabelFontSize: '25px'
branchLabelFontSize: '20px'
---
gitGraph
commit id:"latest features" tag:"latest"
branch master
commit id:"staging features" tag:"staging"
checkout prod
commit id:"prod config"
checkout master
branch feature-1
commit id: "new features" tag:"cache" tag:"test"
```

[uv]: https://docs.astral.sh/uv/
[Python CI]: https://github.com/pycontw/pycon-etl/actions/workflows/python.yml
[Docker Image CI]: https://github.com/pycontw/pycon-etl/actions/workflows/dockerimage.yml
[GCP Artifact Registry]: https://cloud.google.com/artifact-registry/
49 changes: 40 additions & 9 deletions docs/DEPLOYMENT.md
@@ -1,21 +1,52 @@
# Deployment Guide

1. Login to the data team's server:
1. Run: `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
2. Services:
* ETL: `/srv/pycon-etl`
* Metabase is located here: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
## Start Deploying

2. Pull the latest codebase to this server: `git pull`
### 1. Log in to the data team's GCE server

3. Add credentials to the `.env.production` file (only needs to be done once).
```bash
gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"
```

* Location of the Services:
* ETL (airflow): `/srv/pycon-etl`
* Metabase: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`

### 2. Pull the latest codebase and image to this server

```bash
git checkout prod
git pull origin prod

docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:latest
```

### 3. Add credentials to the `.env.production` file (only needs to be done once).

4. Start the services:
### 4. Restart the services:

```bash
# Start production services
docker-compose -f ./docker-compose.yml up

# Stop production services
# docker-compose -f ./docker-compose.yml down
```

### 5. Check whether the services are up

```bash
# For Airflow, the following services should be included:
# * airflow-api-server
# * airflow-dag-processor
# * airflow-scheduler
# * airflow-triggerer
docker ps

# Check the resource usage if needed
docker stats
```

### 6. Log in to the service

For security reasons, our Airflow instance is not publicly accessible. You will need an authorized GCP account to perform port forwarding for the webserver and an authorized Airflow account to access it.
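
A hypothetical port-forward invocation is sketched below. The zone, VM name, and project are taken from the deployment steps above, but the webserver port (8080) and the ssh flags are assumptions, so adjust them to the real setup:

```shell
# Assumption: the Airflow webserver listens on localhost:8080 on the VM.
# -N opens the tunnel without a remote shell; -L forwards the local port.
gcloud compute ssh --zone "asia-east1-b" "data-team" \
    --project "pycontw-225217" -- -N -L 8080:localhost:8080
```

With the tunnel up, browse to `http://localhost:8080` and sign in with the Airflow account provided by the maintainers.
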
34 changes: 21 additions & 13 deletions docs/MAINTENANCE.md
@@ -2,23 +2,31 @@

## Disk Space

<!--TODO: we could probably make this check a DAG-->

Currently, the disk space is limited, so please check the disk space before running any ETL jobs.

This section will be deprecated if we no longer encounter out-of-disk issues.

1. Find the largest folders:
```bash
du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
```
2. Show the folder size:
```bash
du -hs xxxx
```
3. Delete the large folders identified.
4. Check disk space:
```bash
df -h
```
### 1. Find the largest folders:

```bash
du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
```

### 2. Show the folder size:

```bash
du -hs <folder>
```

### 3. Delete the large folders identified.

### 4. Check disk space:

```bash
df -h
```
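
The four steps can be rehearsed safely on a scratch directory first. All paths below are made up for the demo; the real target is `/var/lib/docker/overlay2`:

```shell
# Make a scratch tree with one large and one small folder
demo=$(mktemp -d)
mkdir -p "$demo/big" "$demo/small"
head -c 1048576 /dev/zero > "$demo/big/blob"   # 1 MiB
head -c 1024    /dev/zero > "$demo/small/blob" # 1 KiB

# Step 1: rank everything by size, largest first
du -a "$demo" | sort -n -r | head -n 5

# Step 2: human-readable size of one suspect folder
du -hs "$demo/big"

# Step 3: delete it, then Step 4: re-check free space
rm -rf "$demo"
df -h | head -n 2
```
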

## Token Expiration
