| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: DNS Outage on 2023-01-25 |
| 4 | +author: Jan David Nose |
| 5 | +team: The Rust Infrastructure Team <https://www.rust-lang.org/governance/teams/infra> |
| 6 | +--- |
| 7 | + |
| 8 | +On Wednesday, 2023-01-25 at 09:15 UTC, we deployed changes to the production |
| 9 | +infrastructure for crates.io. During the deployment, the DNS record for |
| 10 | +`static.crates.io` failed to resolve for an estimated 10-15 minutes. |
| 11 | +Users experienced build failures during this time, because crates could not be |
| 12 | +downloaded. Around 09:30 UTC, the DNS record started to propagate again, and |
| 13 | +by 09:40 UTC traffic had returned to normal levels. |
| 14 | + |
| 15 | +## Root Cause of the Outage |
| 16 | + |
| 17 | +The Rust infrastructure is managed with Terraform, an infrastructure-as-code |
| 18 | +tool for configuring and provisioning infrastructure. The [Infrastructure team] recently made |
| 19 | +changes to this configuration to separate the `staging` and `production` |
| 20 | +environments for crates.io so that both can be deployed independently of each |
| 21 | +other. |
| 22 | + |
| 23 | +We used this separation to develop and test the infrastructure for a second |
| 24 | +Content Delivery Network (CDN) for `static.crates.io` in the `staging` |
| 25 | +environment. When the configuration was ready, we |
| 26 | +[scheduled and announced](https://blog.rust-lang.org/inside-rust/2023/01/24/content-delivery-networks.html) |
| 27 | +the rollout for January 25th. |
| 28 | + |
| 29 | +The deployment to `production` contained two changes that were developed, |
| 30 | +deployed, and tested individually on `staging`: a new TLS certificate for the |
| 31 | +current Content Delivery Network and updated DNS records. |
| 32 | + |
| 33 | +When we deployed this configuration to `production`, Terraform first removed the |
| 34 | +current certificate and DNS records. It then started to issue a new certificate, |
| 35 | +which took around 10 minutes. During this time, there was no DNS record for |
| 36 | +`static.crates.io` and users experienced build failures. After the new |
| 37 | +certificate was provisioned, Terraform recreated the DNS records. |
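The ordering described above can be illustrated with a small Python model. This is only a sketch, not Terraform code and not the actual plan that was applied: the `Step` type, the `dns_downtime` helper, and both step sequences are hypothetical, and every duration except the roughly 10 minutes for issuing the certificate is an assumption (DNS changes are treated as instantaneous).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One operation in a hypothetical deployment plan."""
    action: str    # "destroy" or "create"
    resource: str  # "certificate" or "dns_record"
    minutes: int   # assumed duration of the step


def dns_downtime(steps):
    """Return the minutes during which no DNS record exists."""
    downtime = 0
    dns_present = True
    for step in steps:
        if step.resource == "dns_record":
            # Assumption: DNS changes themselves take no time in this model.
            dns_present = step.action == "create"
            continue
        if not dns_present:
            downtime += step.minutes
    return downtime


# Both changes in one deployment, in the order described in the post:
# the DNS record is gone while the new certificate is being issued.
combined = [
    Step("destroy", "dns_record", 0),
    Step("destroy", "certificate", 0),
    Step("create", "certificate", 10),
    Step("create", "dns_record", 0),
]

# Assumed alternative: the certificate change is deployed first,
# and the DNS change follows in a separate deployment.
separate = [
    Step("destroy", "certificate", 0),
    Step("create", "certificate", 10),
    Step("destroy", "dns_record", 0),
    Step("create", "dns_record", 0),
]

print(dns_downtime(combined))  # 10
print(dns_downtime(separate))  # 0
```

The model deliberately ignores DNS caching and the brief gap while a record is replaced, so the zero in the second case means "no gap while waiting for the certificate", not a guarantee of zero downtime.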
| 38 | + |
| 39 | +## Resolution |
| 40 | + |
| 41 | +The outage resolved itself after Terraform finished the deployment and created a |
| 42 | +new DNS record for `static.crates.io`. For some users, the outage lasted a few |
| 43 | +minutes longer because of caching in their DNS servers. |
| 44 | + |
| 45 | +## Postmortem |
| 46 | + |
| 47 | +The outage could have been avoided by deploying the changes to the TLS |
| 48 | +certificate and DNS records individually. We have identified two reasons why |
| 49 | +this did not happen, as well as lessons that we are taking away from it. |
| 50 | + |
| 51 | +This was one of the first times that we used the new tooling around environments |
| 52 | +to deploy changes to `production`. One of its features is that the `production` |
| 53 | +environment is locked to a specific Git commit. When deploying in the past, we |
| 54 | +set this to the latest commit on `master`. This was done here as well, with the |
| 55 | +consequence that the deployment applied multiple changes simultaneously. |
| 56 | + |
| 57 | +Another way to look at this is that `production` and `staging` diverged too much |
| 58 | +over time, because we did not deploy the changes when we merged them into the |
| 59 | +main branch. Had we deployed each change as it was merged, the DNS change |
| 60 | +would have been deployed in isolation. But given the importance of |
| 61 | +crates.io to the Rust ecosystem, we were hesitant to deploy multiple times |
| 62 | +without announcing the changes to the community first. |
| 63 | + |
| 64 | +The lessons that we are taking away from this incident are as follows: |
| 65 | + |
| 66 | + - We need to document the process of deploying changes to production, in |
| 67 | + particular how to pick the Git commit and how to review the changeset. |
| 68 | + Defining a process will enable us to iterate and improve it over time, and |
| 69 | + avoid the same issue in the future. |
| 70 | + - Changes that have been developed and tested in isolation on `staging` should |
| 71 | + be deployed individually and in sequence to `production`. We need to add |
| 72 | + this to the documentation. |
| 73 | + - When we merge changes into the main branch, we need to ensure that they get |
| 74 | + deployed to `production` as well. This avoids a drift between the |
| 75 | + configuration in Git and what is deployed. |
| 76 | + |
| 77 | +[infrastructure team]: https://www.rust-lang.org/governance/teams/infra |