Commit a7e5913

docs: rewrite documentation, fix typos and introduce spell-check to avoid typos

1 parent 5f13802 · commit a7e5913

File tree: 13 files changed (+215 additions, -101 deletions)

.github/pull_request_template.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -31,7 +31,7 @@ Steps to reproduce the behavior:
 <!--A clear and concise description of what you expected to happen-->

 ## Related Issue
-<!--If applicable, refernce to the issue related to this pull request.-->
+<!--If applicable, reference to the issue related to this pull request.-->

 ## Additional context
 <!--Add any other context or screenshots about the pull request here.-->
````

.pre-commit-config.yaml

Lines changed: 25 additions & 0 deletions

````diff
@@ -0,0 +1,25 @@
+default_install_hook_types:
+  - pre-commit
+
+default_stages:
+  - pre-commit
+
+repos:
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.4.1
+    hooks:
+      - id: codespell
+        name: Run codespell to check for common misspellings in files
+        # config section is within pyproject.toml
+        language: python
+        types: [ text ]
+        args:
+          - --ignore-words=spelling_wordlist.txt
+          - --write-changes
+        exclude: >
+          (?x)
+          uv.lock$|
+          ^dags/fixtures/data_questionnaire.csv$
+        additional_dependencies:
+          - tomli
````
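
With this hook in place, contributors can run the spell-check locally before pushing. A minimal sketch of the workflow, assuming `pre-commit` itself is already installed (e.g. via `pipx install pre-commit` or `uv tool install pre-commit`; the install method is an assumption, not part of this commit):

```bash
# Register the hooks from .pre-commit-config.yaml into .git/hooks
pre-commit install

# Run codespell (and any other configured hooks) against the entire repository
pre-commit run --all-files
```

After `pre-commit install`, the codespell hook also runs automatically on every `git commit`, and the `--write-changes` argument lets it fix straightforward misspellings in place.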

contrib/README.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -2,7 +2,7 @@

 ## Upload KKTIX

-![](../docs/kktix.png)
+![kktix](../docs/images/kktix.png)

 1. Navigate to KKTIX's attendees page
 2. Download the CSV
````

dags/ods/kktix_ticket_orders/udfs/kktix_api.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -46,7 +46,7 @@ def main(**context):
 def _extract(year: int, timestamp: float) -> list[dict]:
     """
     get data from KKTIX's API
-    1. condition_filter_callb: use this callbacl to filter out unwanted event!
+    1. condition_filter_callb: use this callback to filter out unwanted event!
     2. right now schedule_interval_seconds is a hardcoded value!
     """
     event_raw_data_array: list[dict] = []
````

docs/CONTRIBUTING.md

Lines changed: 76 additions & 42 deletions

````diff
@@ -2,74 +2,108 @@

 ## How to Contribute to this Project

-1. Clone this repository:
+### 1. Clone this repository:

-    ```bash
-    git clone https://github.com/pycontw/pycon-etl
-    ```
+```bash
+git clone https://github.com/pycontw/pycon-etl
+```

-2. Create a new branch:
+### 2. Create a new branch:

-    ```bash
-    git checkout -b <branch-name>
-    ```
+Please check out your branch from the latest master branch before making any code change.

-3. Make your changes.
+```bash
+# Check out the master branch
+git checkout master

-> **NOTICE:** We are still using Airflow v1, so please read the official document [Apache Airflow v1.10.15 Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/) to ensure your changes are compatible with our current version.
+# Ensure that you're on the latest master branch
+git pull origin master

-If your task uses an external service, add the connection and variable in the Airflow UI.
+# Create a new branch
+git checkout -b <branch-name>
+```

-4. Test your changes in your local environment:
+### 3. Make your changes.

-    - Ensure the DAG file is loaded successfully.
-    - Verify that the task runs successfully.
-    - Confirm that your code is correctly formatted and linted.
-    - Check that all necessary dependencies are included in `requirements.txt`.
+If your task uses an external service, add the connection and variable in the Airflow UI.

-5. Push your branch:
+### 4. Test your changes in your local environment:

-    ```bash
-    git push origin <branch-name>
-    ```
+- Ensure that the DAG files are loaded successfully.
+- Verify that the tasks run without errors.
+- Confirm that your code is properly formatted and linted. See the [Convention](#convention) section for more details.
+- Check that all necessary dependencies are included in the `pyproject.toml` file.
+  - Airflow dependencies are managed by [uv].
+- Ensure that all required documentation is provided.

-6. Create a Pull Request (PR).
+### 5. Push your branch:

-7. Wait for the review and merge.
+```bash
+git push origin <branch-name>
+```

-8. Write any necessary documentation.
+### 6. Create a Pull Request (PR).

-## Release Management
+If additional steps are required after merging and deploying (e.g., adding new connections or variables), please list them in the PR description.

-Please use [GitLab Flow](https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/); otherwise, you cannot pass Docker Hub CI.
+### 7. Wait for the review and merge.

-## Dependency Management
+## Convention

-Airflow dependencies are managed by [uv]. For more information, refer to the [Airflow Installation Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.15/installation.html).
+### Airflow DAGs
+- Please refer to [「大數據之路:阿里巴巴大數據實戰」讀書心得](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.
+- Table name convention:
+  ![img](https://miro.medium.com/max/1400/1*bppuEKMnL9gFnvoRHUO8CQ.png)

-## Code Convention
+### Code Formatting
+Please run `make format` to ensure your code is properly formatted before committing; otherwise, the CI will fail.

-### Airflow DAG
+### Commit Message
+It is recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).

-- Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.
+## Release Management (CI/CD)
+We use [Python CI] and [Docker Image CI] to ensure our code quality meets specific standards and that Docker images can be published automatically.

-- Examples:
-    1. `ods/opening_crawler`: Crawlers written by @Rain. These openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
-    2. `ods/survey_cake`: A manually triggered uploader that uploads questionnaires to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire.
+When a pull request is created, [Python CI] checks whether the code quality is satisfactory. At the same time, we build a `cache` image using `Dockerfile` and a `test` image with `Dockerfile.test`, which are then pushed to the [GCP Artifact Registry].

-- Table name convention:
-    ![img](https://miro.medium.com/max/1400/1*bppuEKMnL9gFnvoRHUO8CQ.png)
+After a pull request is merged into the `master` branch, the two image tags mentioned above are created, along with a new `staging` tag for the image generated from `Dockerfile`.

-### Format
+Once we verify that the `staging` image functions correctly, we merge the `master` branch into the `prod` branch with the following commands.

-Please use `make format` to format your code before committing, otherwise, the CI will fail.
+<!--TODO: This is not ideal. The "master" and "prod" branches should be protected and should not allow human pushes. We should create a GitHub action for this.-->

-### Commit Message
+```bash
+git checkout prod
+git pull origin prod

-It is recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).
+git merge origin/master
+
+git push origin prod
+```

-### CI/CD
+This triggers the [Docker Image CI] again to update the `cache`, `test`, and `staging` images, as well as to create a `latest` image that we will later use for deploying to our production instance. See the [Deployment Guide](./DEPLOYMENT.md) for the following steps.

-Please check the [.github/workflows](.github/workflows) directory for details.
+```mermaid
+---
+config:
+  theme: 'base'
+  gitGraph:
+    mainBranchName: 'prod'
+    tagLabelFontSize: '25px'
+    branchLabelFontSize: '20px'
+---
+gitGraph
+  commit id:"latest features" tag:"latest"
+  branch master
+  commit id:"staging features" tag:"staging"
+  checkout prod
+  commit id:"prod config"
+  checkout master
+  branch feature-1
+  commit id: "new features" tag:"cache" tag:"test"
+```

-[uv]: https://docs.astral.sh/uv/
+[uv]: https://docs.astral.sh/uv/
+[Python CI]: https://github.com/pycontw/pycon-etl/actions/workflows/python.yml
+[Docker Image CI]: https://github.com/pycontw/pycon-etl/actions/workflows/dockerimage.yml
+[GCP Artifact Registry]: https://cloud.google.com/artifact-registry/
````
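
As a companion to the Commitizen recommendation in the diff above, a minimal sketch of the commit flow; the install method (`pipx install commitizen` or `uv tool install commitizen`) and the sample message are assumptions, not part of this commit:

```bash
# Stage your work, then let Commitizen prompt you through a conventional commit message
git add .
cz commit

# Optionally verify that an existing message follows the conventional-commit rules
cz check --message "docs: rewrite documentation and introduce spell-check"
```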

docs/DEPLOYMENT.md

Lines changed: 39 additions & 9 deletions

````diff
@@ -1,21 +1,51 @@
 # Deployment Guide

-1. Login to the data team's server:
-    1. Run: `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
-    2. Services:
-        * ETL: `/srv/pycon-etl`
-        * Metabase is located here: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
+## Start Deploying

-2. Pull the latest codebase to this server: `git pull`
+### 1. Log in to the data team's GCE server

-3. Add credentials to the `.env.production` file (only needs to be done once).
+```bash
+gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"
+```
+
+* Location of the services:
+  * ETL (Airflow): `/srv/pycon-etl`
+  * Metabase: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
+
+### 2. Pull the latest codebase and image to this server
+
+```bash
+git checkout prod
+git pull origin prod
+
+docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:latest
+```

-4. Start the services:
+### 3. Add credentials to the `.env.production` file (only needs to be done once).
+
+### 4. Restart the services:

 ```bash
 # Start production services
 docker-compose -f ./docker-compose.yml up

 # Stop production services
 # docker-compose -f ./docker-compose.yml down
-```
+```
+
+### 5. Check whether the services are up
+
+```bash
+# For Airflow, the following services should be included:
+# * airflow-api-server
+# * airflow-dag-processor
+# * airflow-scheduler
+# * airflow-triggerer
+docker ps
+
+# Check the resource usage if needed
+docker stats
+```
+
+### 6. Log in to the service
+For security reasons, our Airflow instance is not publicly accessible. You will need an authorized GCP account to perform port forwarding for the webserver and an authorized Airflow account to access it.
````
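
For the port forwarding mentioned in step 6, a sketch along these lines should work; the webserver port (8080) and the ssh flags are assumptions layered on top of the commit, not part of it:

```bash
# Forward local port 8080 to the Airflow webserver on the GCE instance;
# everything after "--" is passed through to ssh (-N: no remote shell, -L: local forward)
gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" \
    -- -N -L 8080:localhost:8080

# Then open http://localhost:8080 and sign in with an authorized Airflow account
```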

docs/MAINTENANCE.md

Lines changed: 20 additions & 13 deletions

````diff
@@ -2,23 +2,30 @@

 ## Disk Space

+<!--TODO: we probably can make this check a dag-->
+
 Currently, the disk space is limited, so please check the disk space before running any ETL jobs.

 This section will be deprecated if we no longer encounter out-of-disk issues.

-1. Find the largest folders:
-    ```bash
-    du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
-    ```
-2. Show the folder size:
-    ```bash
-    du -hs xxxx
-    ```
-3. Delete the large folders identified.
-4. Check disk space:
-    ```bash
-    df -h
-    ```
+### 1. Find the largest folders:
+
+```bash
+du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
+```
+
+### 2. Show the folder size:
+
+```bash
+du -hs
+```
+
+### 3. Delete the large folders identified.
+### 4. Check disk space:
+
+```bash
+df -h
+```

 ## Token Expiration
````
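
Not part of this commit, but worth noting for the disk-space steps above: when the large folders under `/var/lib/docker/overlay2` belong to unused images or stopped containers, Docker's own pruning commands are the usual safe route rather than deleting overlay folders by hand. A sketch:

```bash
# Summarize how much space images, containers, and volumes consume
docker system df

# Remove stopped containers, dangling images, unused networks, and build cache
docker system prune

# More aggressive: also remove unused volumes (check first that nothing needs them)
# docker system prune --volumes
```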
