| 1 | +# Database maintenance |
| 2 | + |
| 3 | +There are times when Heroku needs to perform maintenance on our database |
| 4 | +instances, for example to apply system updates or to upgrade to a newer |
| 5 | +database server. |
| 6 | + |
| 7 | +We must **not** let Heroku run maintenances on its own during the scheduled |
| 8 | +maintenance window, to avoid disrupting production users (move the maintenance |
| 9 | +window if necessary). This page contains instructions for performing the |
| 10 | +maintenance ourselves with the minimum amount of disruption. |
| 11 | + |
| 12 | +# Primary database |
| 13 | + |
| 14 | +Performing maintenance on the primary database requires us to temporarily put |
| 15 | +the application in read-only mode. Heroku performs maintenances by creating a |
| 16 | +hidden database follower and switching over to it, so we need to prevent writes |
| 17 | +on the primary to let the follower catch up. |
| 18 | + |
| 19 | +Maintenance should take less than 5 minutes of read-only time, but we should |
| 20 | +still announce it ahead of time on our status page. This is a sample message we |
| 21 | +can use: |
| 22 | + |
| 23 | +> The crates.io team will perform a database maintenance on YYYY-MM-DD from |
| 24 | +> hh:mm to hh:mm UTC. |
| 25 | +> |
| 26 | +> We expect this to take less than 5 minutes to complete. During maintenance |
| 27 | +> crates.io will only be available in read-only mode: downloading crates and |
| 28 | +> visiting the website will still work, but logging in, publishing crates, |
| 29 | +> yanking crates or changing owners will not work. |
| 30 | +
|
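
Before picking the date and time for the announcement, it may help to check what maintenance Heroku currently reports for the primary. The following is a minimal sketch, assuming the Heroku CLI is installed and logged in, and that the primary is attached as `DATABASE` (the name used by the checklist commands). `pending_maintenance` is a hypothetical helper name, not part of the crates.io tooling.

```shell
# Print the maintenance information Heroku reports for the primary database.
# pending_maintenance is a hypothetical helper name; DATABASE is the attachment
# name used by the checklist commands.
pending_maintenance() {
  heroku pg:maintenance -a crates-io DATABASE
}
```

Calling `pending_maintenance` prints whatever Heroku reports; it does not change the maintenance window or start a maintenance.
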
| 31 | +## Primary Database Checklist |
| 32 | + |
| 33 | +**1 hour before the maintenance** |
| 34 | + |
| 35 | +1. Go into the Heroku Scheduler and disable the job enqueueing the downloads |
| 36 | + count updater. You can "disable" it by changing its schedule so that it does |
| 37 | + not run during the maintenance window. The job uses a lot of database |
| 38 | + resources, and we should not run it during maintenance. |
| 39 | + |
| 40 | +**5 minutes before the maintenance** |
| 41 | + |
| 42 | +2. Scale the background worker to 0 instances: |
| 43 | + |
| 44 | + ``` |
| 45 | + heroku ps:scale -a crates-io background_worker=0 |
| 46 | + ``` |
| 47 | + |
| 48 | +**At the start of the maintenance** |
| 49 | + |
| 50 | +3. Update the status page with this message: |
| 51 | + |
| 52 | + > Scheduled maintenance on our database is starting. |
| 53 | + > |
| 54 | + > We expect this to take less than 5 minutes to complete. During maintenance |
| 55 | + > crates.io will only be available in read-only mode: downloading crates and |
| 56 | + > visiting the website will still work, but logging in, publishing crates, |
| 57 | + > yanking crates or changing owners will not work. |
| 58 | +
|
| 59 | +3. Configure the application to be in read-only mode without the follower: |
| 60 | + |
| 61 | + ``` |
| 62 | + heroku config:set -a crates-io READ_ONLY_MODE=1 DB_OFFLINE=follower |
| 63 | + ``` |
| 64 | + |
| 65 | + The follower is disabled because, while Heroku tries to prevent connections |
| 66 | + to the primary database from failing during maintenance, we observed that |
| 67 | + the same does not apply to the follower database: there can be brief periods |
| 68 | + when the follower is not available. |
| 69 | + |
| 70 | +3. Confirm the application is in read-only mode by trying to publish a crate |
| 71 | + and logging in. |
| 72 | + |
| 73 | +3. Run the database maintenance: |
| 74 | + |
| 75 | + ``` |
| 76 | + heroku pg:maintenance:run --force -a crates-io |
| 77 | + ``` |
| 78 | + |
| 79 | +3. Confirm all the databases are online: |
| 80 | + |
| 81 | + ``` |
| 82 | + heroku pg:info -a crates-io |
| 83 | + ``` |
| 84 | + |
| 85 | +3. Confirm the primary database fully recovered (should output `f`, meaning false): |
| 86 | + |
| 87 | + ``` |
| 88 | + echo "SELECT pg_is_in_recovery();" | heroku pg:psql -a crates-io DATABASE |
| 89 | + ``` |
| 90 | + |
| 91 | +3. Switch off read-only mode: |
| 92 | + |
| 93 | + ``` |
| 94 | + heroku config:unset -a crates-io READ_ONLY_MODE |
| 95 | + ``` |
| 96 | + |
| 97 | + **WARNING:** the Heroku Dashboard's UI is misleading when removing an |
| 98 | + environment variable. A red badge with a "-" (minus) in it means the |
| 99 | + variable was *successfully removed*; it does not mean that removing the |
| 100 | + variable failed. Failures are indicated by a red badge with an "x" (cross) in it. |
| 101 | + |
| 102 | +3. Confirm the application is working by trying to publish a crate and logging |
| 103 | + in. |
| 104 | + |
| 105 | +3. Update the status page and mark the maintenance as completed with this |
| 106 | + message: |
| 107 | + |
| 108 | + > Scheduled maintenance finished successfully. |
| 109 | + |
| 110 | + The message is posted at this point, not at the end of the checklist, because |
| 111 | + production users are no longer impacted by the maintenance from here on. |
| 112 | + |
| 113 | +3. Scale the background worker up again: |
| 114 | + |
| 115 | + ``` |
| 116 | + heroku ps:scale -a crates-io background_worker=1 |
| 117 | + ``` |
| 118 | + |
| 119 | +3. Confirm the follower database is available: |
| 120 | + |
| 121 | + ``` |
| 122 | + echo "SELECT 1;" | heroku pg:psql -a crates-io READ_ONLY_REPLICA |
| 123 | + ``` |
| 124 | + |
| 125 | +3. Enable connections to the follower: |
| 126 | + |
| 127 | + ``` |
| 128 | + heroku config:unset -a crates-io DB_OFFLINE |
| 129 | + ``` |
| 130 | + |
| 131 | +3. Re-enable the background job disabled during step 1. |
| 132 | + |
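
Two of the checks in the checklist above, the configuration set before the maintenance and the recovery check after it, can be scripted so they do not have to be repeated by hand. The following is a minimal sketch, assuming a POSIX shell and the Heroku CLI, and assuming that `psql` prints a boolean false as a line containing only `f`. The function names, the attempt count, and the delay are hypothetical and not part of the crates.io tooling.

```shell
# Succeeds only if the config vars match what the checklist sets before the
# maintenance is run: READ_ONLY_MODE non-empty and DB_OFFLINE=follower.
# This inspects config vars only; it does not prove the app is serving
# requests in read-only mode.
read_only_configured() {
  roc_mode="$(heroku config:get READ_ONLY_MODE -a crates-io)"
  roc_offline="$(heroku config:get DB_OFFLINE -a crates-io)"
  [ -n "$roc_mode" ] && [ "$roc_offline" = "follower" ]
}

# Polls the primary until pg_is_in_recovery() reports false.
# $1: maximum number of attempts (hypothetical default: 30)
# $2: seconds to sleep between attempts (hypothetical default: 10)
# Returns 0 once the primary is out of recovery, 1 if attempts run out.
wait_for_primary() {
  wfp_attempts="${1:-30}"
  wfp_delay="${2:-10}"
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_attempts" ]; do
    if echo "SELECT pg_is_in_recovery();" \
        | heroku pg:psql -a crates-io DATABASE 2>/dev/null \
        | grep -q '^[[:space:]]*f[[:space:]]*$'; then
      return 0
    fi
    wfp_i=$((wfp_i + 1))
    sleep "$wfp_delay"
  done
  return 1
}
```

For example, `read_only_configured && echo ok` could be run just before starting the maintenance, and `wait_for_primary 30 10` just before switching off read-only mode. Neither helper replaces the manual confirmation of trying to publish a crate and logging in.
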
| 133 | +# Follower database |
| 134 | + |
| 135 | +Instructions and checklists for follower database maintenance aren't written |
| 136 | +yet. |