Commit 3d6ed4b

Merge pull request #1071 from jdno/dns-postmortem
Create postmortem for DNS outage
2 parents 017a8ae + 8d63fde commit 3d6ed4b

1 file changed

+77
-0
lines changed

---
layout: post
title: DNS Outage on 2023-01-25
author: Jan David Nose
team: The Rust Infrastructure Team <https://www.rust-lang.org/governance/teams/infra>
---

On Wednesday, 2023-01-25 at 09:15 UTC, we deployed changes to the production
infrastructure for crates.io. During the deployment, the DNS record for
`static.crates.io` failed to resolve for an estimated 10 to 15 minutes. Users
experienced build failures during this time because crates could not be
downloaded. Around 09:30 UTC, the DNS record started to propagate again, and
by 09:40 UTC traffic had returned to normal levels.

## Root Cause of the Outage

The Rust infrastructure is managed with Terraform, a tool for configuring and
provisioning infrastructure as code. The [Infrastructure team] recently made
changes to this configuration to separate the `staging` and `production`
environments for crates.io so that both can be deployed independently of each
other.

We used this separation to develop and test the infrastructure for a second
Content Delivery Network (CDN) for `static.crates.io` in the `staging`
environment. When the configuration was ready, we
[scheduled and announced](https://blog.rust-lang.org/inside-rust/2023/01/24/content-delivery-networks.html)
the rollout for January 25th.

The deployment to `production` contained two changes that had been developed,
deployed, and tested individually on `staging`: a new TLS certificate for the
current CDN and updated DNS records.

When we deployed this configuration to `production`, Terraform first removed the
current certificate and DNS records. It then started to issue a new certificate,
which took around 10 minutes. During this time, there was no DNS record for
`static.crates.io` and users experienced build failures. After the new
certificate was provisioned, Terraform recreated the DNS records.

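
The ordering described above can be illustrated with a small toy model. This is only a sketch and not how Terraform plans or applies changes: the `Step` type, the resource labels `"dns"` and `"certificate"`, and every duration except the roughly 10 minutes for issuing the certificate are assumptions made for illustration. It adds up how long the DNS record is missing when both changes are applied in one deployment, compared with applying them separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One action in a deployment, applied in list order."""
    action: str     # "delete" or "create"
    resource: str   # hypothetical label such as "dns" or "certificate"
    minutes: float  # assumed duration of the step


def minutes_without(resource: str, steps: list[Step]) -> float:
    """Total time during which `resource` does not exist.

    The resource is assumed to exist before the first step. It counts as
    missing from the end of its delete step until the end of its create step.
    """
    missing = False
    gap = 0.0
    for step in steps:
        if missing:
            gap += step.minutes
        if step.resource == resource:
            if step.action == "delete":
                missing = True   # gone once the delete has finished
            elif step.action == "create":
                missing = False  # back once the create has finished
    return gap


# Combined deployment, in the order described in the text.
combined = [
    Step("delete", "certificate", 0.0),
    Step("delete", "dns", 0.0),
    Step("create", "certificate", 10.0),  # "around 10 minutes" per the text
    Step("create", "dns", 0.5),           # assumed duration
]

# The same changes as two separate deployments (assumed ordering).
certificate_only = [
    Step("delete", "certificate", 0.0),
    Step("create", "certificate", 10.0),
]
dns_only = [
    Step("delete", "dns", 0.0),
    Step("create", "dns", 0.5),
]

print(minutes_without("dns", combined))
print(minutes_without("dns", certificate_only))
print(minutes_without("dns", dns_only))
```

Under these assumed durations, the DNS record is missing for 10.5 minutes in the combined deployment, because the slow certificate step runs while the record is already gone. In the separate deployments, it is missing for 0.5 minutes at most. The model only tracks DNS; it says nothing about what happens to TLS while a certificate is being replaced.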
## Resolution

The outage resolved itself after Terraform finished the deployment and created a
new DNS record for `static.crates.io`. For some users, the outage lasted a few
minutes longer because of caching in their DNS servers.

## Postmortem

The outage could have been avoided by deploying the changes to the TLS
certificate and DNS records individually. We have identified two reasons why
this did not happen, as well as lessons that we are taking from the incident.

This was one of the first times that we used the new tooling around environments
to deploy changes to `production`. One of its features is that the `production`
environment is locked to a specific Git commit. When deploying in the past, we
set this to the latest commit on `master`. We did the same here, with the
consequence that the deployment applied multiple changes simultaneously.

Another way to look at this is that `production` and `staging` diverged too much
over time, because we did not deploy the changes when we merged them into the
main branch. If we had deployed each change when it was merged, we would have
isolated the DNS change. But given the importance of crates.io to the Rust
ecosystem, we were hesitant to deploy multiple times without announcing the
changes to the community first.

The lessons that we are taking away from this incident are as follows:

- We need to document the process of deploying changes to production, in
  particular how to pick the Git commit and how to review the changeset.
  Defining a process will enable us to iterate and improve it over time, and
  avoid the same issue in the future.
- Changes that have been developed and tested in isolation on `staging` should
  be deployed individually and in sequence to `production`. We need to add
  this to the documentation.
- When we merge changes into the main branch, we need to ensure that they get
  deployed to `production` as well. This avoids a drift between the
  configuration in Git and what is deployed.

[infrastructure team]: https://www.rust-lang.org/governance/teams/infra
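
As one possible way to support the "review the changeset" lesson, a saved plan can be scanned for deletions before it is applied. The sketch below is not part of the team's tooling. It assumes the plan has been exported with `terraform show -json` and relies only on the `resource_changes` entries and their `change.actions` lists from that output. The function name, the resource type `example_dns_record`, and the addresses in the sample data are hypothetical.

```python
import json


def destructive_changes(plan: dict, watched_types: set[str]) -> list[str]:
    """Return addresses of watched resources that the plan would delete.

    A replacement is reported as ["delete", "create"] or
    ["create", "delete"], so checking for "delete" covers both plain
    removals and replacements.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") in watched_types and "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical excerpt of a JSON plan; types and addresses are made up.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "example_dns_record.static",
     "type": "example_dns_record",
     "change": {"actions": ["delete", "create"]}},
    {"address": "example_certificate.static",
     "type": "example_certificate",
     "change": {"actions": ["delete", "create"]}},
    {"address": "example_dns_record.other",
     "type": "example_dns_record",
     "change": {"actions": ["update"]}}
  ]
}
""")

for address in destructive_changes(sample_plan, {"example_dns_record"}):
    print(f"review before applying: {address} would be deleted or replaced")
```

A check like this does not decide whether a deletion is safe. It only surfaces the resources that a reviewer should look at before choosing which commit to deploy.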
