Commit 4f797b8

Merge pull request #538 from pietroalbini/cratesio-db-maintenance
Add crates.io database maintenance checklist
2 parents 53abab6 + 449dfc9 commit 4f797b8

2 files changed: 137 additions, 0 deletions

src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
 - [How to run a design meeting](./compiler/steering-meeting/how-to-run-design.md)
 - [crates.io](./crates-io/README.md)
 - [Crate removal](./crates-io/crate-removal.md)
+- [Database maintenance](./crates-io/db-maintenance.md)
 - [docs.rs](./docs-rs/README.md)
 - [Adding dependencies to the build environment](./docs-rs/add-dependencies.md)
 - [Developing without docker-compose](./docs-rs/no-docker-compose.md)

src/crates-io/db-maintenance.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Database maintenance

There are times when Heroku needs to perform maintenance on our database
instances, for example to apply system updates or upgrade to a newer database
server.

We must **not** let Heroku run maintenances automatically during the scheduled
maintenance window, to avoid disrupting production users (move the maintenance
window if necessary). This page contains the instructions on how to perform the
maintenance with the minimum amount of disruption.

# Primary database

Performing maintenance on the primary database requires us to temporarily put
the application in read-only mode. Heroku performs maintenances by creating a
hidden database follower and switching over to it, so we need to prevent writes
on the primary to let the follower catch up.

Maintenance should take less than 5 minutes of read-only time, but we should
still announce it ahead of time on our status page. This is a sample message we
can use:

> The crates.io team will perform a database maintenance on YYYY-MM-DD from
> hh:mm to hh:mm UTC.
>
> We expect this to take less than 5 minutes to complete. During maintenance
> crates.io will only be available in read-only mode: downloading crates and
> visiting the website will still work, but logging in, publishing crates,
> yanking crates or changing owners will not work.
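
The placeholders in the sample message could be filled in with a small shell
helper. This is only a sketch: the function name `maintenance_announcement` and
its three arguments are hypothetical, and it only prints the text. Posting it
to the status page remains a manual step.

```shell
# Hypothetical helper: print the announcement for a given date and UTC window.
# Usage: maintenance_announcement YYYY-MM-DD hh:mm hh:mm
maintenance_announcement() {
    if [ "$#" -ne 3 ]; then
        echo "usage: maintenance_announcement YYYY-MM-DD hh:mm hh:mm" >&2
        return 1
    fi
    # The heredoc body reproduces the sample message, with $1..$3 substituted.
    cat <<EOF
The crates.io team will perform a database maintenance on $1 from
$2 to $3 UTC.

We expect this to take less than 5 minutes to complete. During maintenance
crates.io will only be available in read-only mode: downloading crates and
visiting the website will still work, but logging in, publishing crates,
yanking crates or changing owners will not work.
EOF
}
```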

## Primary Database Checklist

**1 hour before the maintenance**

1. Go into the Heroku Scheduler and disable the job enqueueing the downloads
   count updater. You can "disable" it by changing its schedule so that it
   does not run during the maintenance window. The job uses a lot of database
   resources, and we should not run it during maintenance.

**5 minutes before the maintenance**

2. Scale the background worker to 0 instances:

   ```
   heroku ps:scale -a crates-io background_worker=0
   ```
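
Before moving on, it may be worth confirming that the scale-down took effect.
The sketch below wraps that check in a shell function. The function name is
hypothetical, and the parsing assumes that `heroku ps:scale` without arguments
prints the current formation as space-separated `type=quantity:size` entries,
so verify the output format before relying on it.

```shell
# Sketch: succeed only if the formation reports zero background workers.
# Assumes output such as "background_worker=0:Standard-1X web=2:Standard-1X"
# (the exact format is an assumption, not taken from the checklist).
background_worker_stopped() {
    heroku ps:scale -a crates-io \
        | tr ' ' '\n' \
        | grep -Eq '^background_worker=0(:|$)'
}
```

Running `background_worker_stopped && echo "worker stopped"` prints the
confirmation only when the check passes.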

**At the start of the maintenance**

3. Update the status page with this message:

   > Scheduled maintenance on our database is starting.
   >
   > We expect this to take less than 5 minutes to complete. During maintenance
   > crates.io will only be available in read-only mode: downloading crates and
   > visiting the website will still work, but logging in, publishing crates,
   > yanking crates or changing owners will not work.

3. Configure the application to be in read-only mode without the follower:

   ```
   heroku config:set -a crates-io READ_ONLY_MODE=1 DB_OFFLINE=follower
   ```

   The follower is taken offline because, while Heroku tries to prevent
   connections to the primary database from failing during maintenance, we
   observed that the same does not apply to the follower database, and there
   could be brief periods during which the follower is not available.

3. Confirm the application is in read-only mode by trying to publish a crate
   and log in (both should fail).

3. Run the database maintenance:

   ```
   heroku pg:maintenance:run --force -a crates-io
   ```

3. Confirm all the databases are online:

   ```
   heroku pg:info -a crates-io
   ```

3. Confirm the primary database fully recovered (the query should return `f`,
   meaning false):

   ```
   echo "SELECT pg_is_in_recovery();" | heroku pg:psql -a crates-io DATABASE
   ```

3. Switch off read-only mode:

   ```
   heroku config:unset -a crates-io READ_ONLY_MODE
   ```

   **WARNING:** the Heroku Dashboard's UI is misleading when removing an
   environment variable. A red badge with a "-" (minus) in it means the
   variable was *successfully removed*; it does not mean that removing the
   variable failed. Failures are indicated with a red badge with an "x"
   (cross) in it.

3. Confirm the application is working by trying to publish a crate and logging
   in.

3. Update the status page and mark the maintenance as completed with this
   message:

   > Scheduled maintenance finished successfully.

   The message is posted at this point rather than at the end of the
   checklist because this is when production users are no longer impacted by
   the maintenance.

3. Scale the background worker up again:

   ```
   heroku ps:scale -a crates-io background_worker=1
   ```

3. Confirm the follower database is available:

   ```
   echo "SELECT 1;" | heroku pg:psql -a crates-io READ_ONLY_REPLICA
   ```

3. Enable connections to the follower:

   ```
   heroku config:unset -a crates-io DB_OFFLINE
   ```

3. Re-enable the background job disabled during step 1.
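
The confirmation steps in the checklist above could be wrapped in shell
functions, so that each check yields an exit status instead of output to read
by eye. This is a sketch under assumptions: the function names are
hypothetical, the parsing assumes psql's default table output (where a boolean
false is printed as `f`), and `read_only_enabled` treats any non-empty
`READ_ONLY_MODE` value as enabled, which may not match how the application
reads the variable.

```shell
# Sketch only: function names are hypothetical, and the output parsing
# assumes psql's default table formatting.

# Succeeds once the primary is no longer in recovery (psql prints "f").
primary_recovered() {
    echo "SELECT pg_is_in_recovery();" \
        | heroku pg:psql -a crates-io DATABASE \
        | grep -Eq '^[[:space:]]*f[[:space:]]*$'
}

# Succeeds if the follower answers a trivial query with a single row.
follower_available() {
    echo "SELECT 1;" \
        | heroku pg:psql -a crates-io READ_ONLY_REPLICA \
        | grep -q '(1 row)'
}

# Succeeds while READ_ONLY_MODE is set to a non-empty value (assumption:
# any non-empty value counts as "enabled").
read_only_enabled() {
    [ -n "$(heroku config:get -a crates-io READ_ONLY_MODE)" ]
}
```

For instance, `primary_recovered && heroku config:unset -a crates-io READ_ONLY_MODE`
would switch off read-only mode only after the recovery check passes.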

# Follower database

Instructions and checklists for follower database maintenance aren't written
yet.
