Commit f361958

docs: explain all the triggers, ACM and ALB implementations, fix headings
1 parent ebf2c8b commit f361958

1 file changed

+48
-24
lines changed

docs/dev/custom-domains.md

Lines changed: 48 additions & 24 deletions
@@ -1,17 +1,16 @@
11
# The Hitchhiker's Guide to Custom Domains
22
Lets territory owners attach a domain to their territories.
33
TODO: change the title
4-
TODO: add links to various documentations
54

65
Index:
76
TODO
87

98
# Middleware
109
Every time we hit a custom domain, middleware checks if it's allowed via a cached list of `ACTIVE` domains, coupled with their `subName`.
1110
If it's allowed, we redirect and rewrite to give custom domains a seamless territory-centered SN experience.
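The allow-list check described above can be sketched as a lookup against the cached `ACTIVE` domains. This is only an illustration: the `CachedDomain` shape and the `lookupSubName` helper are hypothetical names, not taken from the SN codebase.

```typescript
// Hypothetical shape of one cached entry; only `ACTIVE` domains are cached.
interface CachedDomain {
  domainName: string
  subName: string
}

// Returns the territory name for an allowed custom domain, or null otherwise.
function lookupSubName (cache: CachedDomain[], host: string): string | null {
  // Drop an optional port and normalize case before comparing.
  const hostname = (host.split(':')[0] ?? '').toLowerCase()
  for (const entry of cache) {
    if (entry.domainName.toLowerCase() === hostname) return entry.subName
  }
  return null
}
```

A `null` result means the host is not an allowed custom domain, so the middleware would not apply any custom domain handling to the request.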
12-
###### Main middleware
11+
##### Main middleware
1312
Referral cookies and security headers are applied the same way as before on SN, except that they now live in their own functions, so that we can also apply them to the response produced by `customDomainMiddleware`.
14-
###### customDomainMiddleware
13+
##### customDomainMiddleware
1514
An `x-stacker-news-subname` header carrying the `subName` is injected into the request headers so that SN code knows which territory is attached to the custom domain.
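As a rough sketch of the header injection, assuming headers are modelled as a plain object (the real middleware works on the framework's request object, and `withSubNameHeader` is a hypothetical helper):

```typescript
const SUBNAME_HEADER = 'x-stacker-news-subname'

// Returns a copy of the request headers with the territory name injected.
// The input object is left untouched.
function withSubNameHeader (
  headers: Record<string, string>,
  subName: string
): Record<string, string> {
  return { ...headers, [SUBNAME_HEADER]: subName }
}
```

Downstream code could then read `headers['x-stacker-news-subname']` to learn which territory the request belongs to.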
1615

1716
Since SN has several paths that depend on the `sub` parameter or the `~subName/` paths, it manipulates the URL to always stay on the right territory:
@@ -43,7 +42,7 @@ domain is PENDING
4342
IMPLICIT: DNS is OK
4443
2. Certificate Management
4544
KO: critical AWS error, schedule up to 3 retries
46-
every step can throw an error
45+
every step can throw an error
4746
4847
CONDITION: certificate is not issued
4948
2a. issue certificate
@@ -71,7 +70,7 @@ domain is PENDING
7170
It uses the `Resolver` class from `node:dns/promises` to resolve CNAME records on a domain.
7271

7372
If the CNAME record is correct, it logs a `DomainVerificationAttempt` tied to the `DomainVerificationRecord`, with status `VERIFIED`. A trigger then propagates this status to the connected `DomainVerificationRecord`.
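The comparison at the heart of the CNAME check might look like the sketch below. `cnameMatches` is a hypothetical helper, and the normalization rules (case-insensitive, trailing dot ignored) are assumptions rather than documented SN behavior.

```typescript
// Compares resolved CNAME targets with the expected target.
// `records` is the kind of string array that `resolveCname` from the
// `Resolver` class in `node:dns/promises` resolves to.
function cnameMatches (records: string[], expected: string): boolean {
  // Assumption: compare case-insensitively and ignore a trailing root dot.
  const normalize = (name: string): string => name.toLowerCase().replace(/\.$/, '')
  const target = normalize(expected)
  return records.some(record => normalize(record) === target)
}
```

In the job, the `records` argument would come from awaiting `resolver.resolveCname(domain)`, with a failed lookup presumably treated as not verified yet.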
74-
###### dnsmasq
73+
##### dnsmasq
7574
In local development, **dnsmasq** is used as a DNS server to mock records for the domain verification job.
7675
To have a dedicated IP for the `node:dns` Resolver, the `worker` container is part of a dedicated docker network that gives dnsmasq the `172.30.0.2` IP address.
7776

@@ -85,81 +84,106 @@ The domain verification job also handles critical AWS operations, such as:
8584
- certificate polling
8685
- certificate attachment to ELB
8786

88-
###### Certificate issuance
87+
##### Certificate issuance
8988
After the DNS checks, if we don't already have a certificate, we request a new certificate for the domain from ACM.
9089
ACM returns a `certificateArn`, the unique ID of an ACM certificate, which we immediately use to check the certificate's status. This information is then stored in the `DomainCertificate` table.
9190

9291
If we can't request a certificate, check its status, or store it in the DB, the step throws an error so that pgboss can retry the job.
9392

94-
###### Certificate validation values
93+
##### Certificate validation values
9594
ACM needs to verify domain ownership in order to validate the certificate; in this case we use the DNS method.
9695

9796
We ask ACM for the DNS records so that we can store them as a `DomainVerificationRecord` and present them to the user. Finally, we re-schedule the job so that the user can adjust their DNS configuration.
9897

9998
If we can't get the validation values or store them in the DB, the step throws an error so that pgboss can retry the job.
10099

101-
###### Certificate validation polling
100+
##### Certificate validation polling
102101
We asked ACM for a certificate, got its validation values and presented them to the user. Now we need to poll ACM to know if the verification was successful.
103102

104103
Since we're directly checking the certificate status, we also update `DomainCertificate` in our DB with the new status.
105104

106105
AWS timings are unpredictable: if the verification returns a negative result, we re-schedule the job to repeat this step.
107-
108106
If we can't contact ACM, the step throws an error so that pgboss can retry the job.
109107

110-
###### Certificate attachment to the ALB listener
108+
##### Certificate attachment to the ALB listener
111109
This is the last AWS step in our domain verification job: it attaches a fully verified ACM certificate to our load balancer listener.
112110

113111
The ALB listener is the gatekeeper of the Application Load Balancer (ALB): it determines how incoming requests are routed to the target servers.
114112

115113
In the case of Stacker News, the domain points directly at the load balancer listener. This means that we can ask the user to point their `CNAME` record to `stacker.news`, and that we can serve their ACM certificate directly from the load balancer.
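Taken together, the AWS steps above run in a fixed order, and each job run resumes from the first step that is still incomplete. A minimal sketch of that ordering, assuming a hypothetical `DomainState` summary (none of these field or step names come from the SN codebase):

```typescript
// Hypothetical summary of what has been completed so far for a domain.
interface DomainState {
  dnsVerified: boolean
  certificateArn: string | null
  hasValidationRecords: boolean
  certificateIssued: boolean
  attachedToListener: boolean
}

type Step =
  | 'VERIFY_DNS'
  | 'ISSUE_CERTIFICATE'
  | 'GET_VALIDATION_VALUES'
  | 'POLL_VALIDATION'
  | 'ATTACH_TO_LISTENER'
  | 'DONE'

// Picks the first step that is still incomplete, in the order described above.
function nextStep (state: DomainState): Step {
  if (!state.dnsVerified) return 'VERIFY_DNS'
  if (state.certificateArn === null) return 'ISSUE_CERTIFICATE'
  if (!state.hasValidationRecords) return 'GET_VALIDATION_VALUES'
  if (!state.certificateIssued) return 'POLL_VALIDATION'
  if (!state.attachedToListener) return 'ATTACH_TO_LISTENER'
  return 'DONE'
}
```

Keeping the decision in a pure function like this makes the ordering easy to test without touching ACM or the load balancer.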
116114

117-
- how jobs are sent
118-
- on errors use the received job id to see if we couldn't verify a domain after 3 attempts, we put it on hold and delete any jobs left
119-
120115
### End of the job
121116
When we finish a step in the domain verification job and the resulting status is still `PENDING`, we re-schedule the job using pgboss' `sendDebounced`.
122117

123-
Since we use a singletonKey to avoid same-domain concurrent jobs, and you can't schedule another job if one is already running, sendDebounced will try to schedule a job when it can, e.g. when the job finishes or after 30 seconds.
118+
Since we use a `singletonKey` to avoid same-domain concurrent jobs, and you can't schedule another job if one is already running, `sendDebounced` will try to schedule a job when it can, e.g. when the job finishes or after 30 seconds.
124119

125120
### Error handling
126121
If something throws an error, we catch it to log the attempt and then re-throw it to let pgboss retry up to 3 times.
127122
Using the `jobId` that we pass with each job, we can check whether we've reached 3 retries via pgboss' `getJobById`. If we have, we put the domain on `HOLD`, stopping and deleting future jobs tied to this domain.
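The retry-or-hold decision can be sketched as follows. The `JobInfo` shape is an assumption: the name of the retry counter on a fetched job may differ between pgboss versions, so treat `retryCount` as a placeholder.

```typescript
const MAX_RETRIES = 3

// Hypothetical slice of the job record returned when looking a job up by id.
interface JobInfo {
  retryCount: number
}

type ErrorOutcome = 'RETRY' | 'HOLD'

// After a step throws: keep retrying until the limit is reached,
// then put the domain on HOLD.
function outcomeAfterError (job: JobInfo | null): ErrorOutcome {
  if (job !== null && job.retryCount >= MAX_RETRIES) return 'HOLD'
  return 'RETRY'
}
```

In this sketch a job that can't be found is treated as retryable, which is a design choice of the example rather than documented behavior.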
128123

129124
### Domain Verification logger
130-
We need to be able to track where, when and what went wrong during domain verification. To do this, every step of the job calls logAttempt
131-
###### logAttempt
125+
We need to be able to track where, when and what went wrong during domain verification. To do this, every step of the job calls `logAttempt`.
126+
##### logAttempt
132127
This is a simple function that logs a message returned by a domain verification step in the DB.
133128
Some steps, like DNS and SSL verification, call `logAttempt` and also pass the relevant `DomainVerificationRecord`, which triggers a synchronization of its `status` with the result of the step.
134129

135130
# AWS
136131
### ACM certificates
137-
TODOs
138-
- on domain or domain certificate removal detach and delete certificate
139-
- implementation
140-
- implementation thinking
132+
We don't expect territory owners to set up their own SSL certificates, but their custom domain needs SSL to reach Stacker News.
133+
With ACM we can request a certificate for a domain and serve it via the Application Load Balancer, effectively giving the custom domain an SSL certificate.
134+
135+
The implemented functions are non-destructive to the original Stacker News configuration:
136+
- Request Certificate
137+
- Describe Certificate
138+
- Get Certificate Status
139+
- Delete Certificate
140+
141+
We request a certificate intentionally asking for DNS validation as it's the most reliable method of verification, and also fits nicely with the CNAME record we ask the user to insert in their DNS configuration.
142+
143+
Describe Certificate is crucial for getting the SSL validation values that the user needs to put in their domain's DNS configuration, and also for getting the **status** of a certificate.
144+
145+
We can only delete a certificate if we have the `certificateArn` (unique ID). In fact, when a domain gets deleted, we trigger a job that takes the `certificateArn` as a parameter before we lose it forever, trying up to 3 times if ACM gives an error.
146+
147+
In local development, Localstack provides bare-minimum ACM operations and OK responses.
141148

142149
### AWS Application Load Balancer
143-
TODO
150+
The Application Load Balancer distributes the incoming requests across Stacker News servers. We can attach up to **25** ACM certificates (per default quota).
151+
152+
After creating and verifying an ACM certificate, the next step is to attach this certificate to our ALB Listener, so that the load balancer can serve the right certificate for the right domain.
153+
154+
Since we can't use Localstack to mock the ALB locally, there's a new class called `MockELBv2`: it provides bare-minimum response mocks for attach/detach operations.
155+
156+
As the ALB is really important to reach stacker.news, we only implemented Attach/Detach certificate functions that take a specific `certificateArn` (unique ID). This way we can't possibly mess with the default ALB configuration.
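To illustrate why ARN-scoped attach/detach is safe, here is a small in-memory model of a listener's certificate list. It is not the project's `MockELBv2`; the `ListenerCertificates` class and its methods are hypothetical, and the limit simply mirrors the default quota of 25 mentioned above.

```typescript
const DEFAULT_CERT_QUOTA = 25

// In-memory model of the extra certificates attached to one listener.
class ListenerCertificates {
  // The listener's default certificate is never modified by attach/detach.
  readonly defaultCertificateArn: string
  private readonly attached: string[] = []

  constructor (defaultCertificateArn: string) {
    this.defaultCertificateArn = defaultCertificateArn
  }

  // Attaches one certificate by ARN; returns false when the quota is reached.
  attach (certificateArn: string): boolean {
    if (this.attached.indexOf(certificateArn) !== -1) return true
    if (this.attached.length >= DEFAULT_CERT_QUOTA) return false
    this.attached.push(certificateArn)
    return true
  }

  // Detaches only the given ARN; unknown ARNs are ignored.
  detach (certificateArn: string): void {
    const index = this.attached.indexOf(certificateArn)
    if (index !== -1) this.attached.splice(index, 1)
  }

  list (): string[] {
    return this.attached.slice()
  }
}
```

Because both operations take a single `certificateArn`, there is no code path in this model that can replace or remove the default certificate.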
144157

145158
# plpgSQL functions and triggers
146159
### Clear Long Held Domains
147160
Every midnight, the `clearLongHeldDomains` job gets executed to remove domains that have been on `HOLD` for more than 30 days.
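The selection rule behind this cleanup can be sketched as a filter. The actual logic may well live in SQL; the `HeldDomain` shape and the use of `updatedAt` as the moment the domain was put on `HOLD` are assumptions made for the example.

```typescript
const HOLD_LIMIT_MS = 30 * 24 * 60 * 60 * 1000 // 30 days in milliseconds

// Hypothetical row shape; `updatedAt` stands in for "when the domain was put on HOLD".
interface HeldDomain {
  domainName: string
  status: string
  updatedAt: Date
}

// Returns the domains that have been on HOLD for more than 30 days.
function longHeldDomains (domains: HeldDomain[], now: Date): HeldDomain[] {
  return domains.filter(domain =>
    domain.status === 'HOLD' &&
    now.getTime() - domain.updatedAt.getTime() > HOLD_LIMIT_MS
  )
}
```

Passing `now` in explicitly keeps the function deterministic, so the 30-day boundary can be tested with fixed dates.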
148161

162+
Removing a domain also removes its certificate, which triggers **Ask ACM to delete certificate**.
163+
149164
### Update `DomainVerificationRecord` status
150165
The `DomainVerification` job logs every step into `DomainVerificationAttempt`. For steps that involve DNS records, like the `CNAME` record or ACM validation records, a connection between `DomainVerificationAttempt` and `DomainVerificationRecord` gets established.
151166

152167
If the result of a DNS verification on the `CNAME` record is `VERIFIED`, it triggers an update of the `status` field on the related `DomainVerificationRecord`, keeping the record **statuses** in sync with the `DomainVerification` job results.
153168

154169

155170
### HOLD domain on territory STOP
156-
TODO
171+
Let's say the territory owner doesn't renew their territory, and they have a custom domain attached to it. We can't let the custom domain access Stacker News, as the domain may have been transferred or may be out of the original owner's control.
172+
173+
A territory stop triggers a function that puts the custom domain on `HOLD`, effectively stopping the custom domain functionality.
174+
175+
If the territory owner comes back and renews, they have to repeat the Domain Verification process just to make sure that everything is alright. The verification values will be the same and the certificate hasn't been deleted, so it should just take 30 seconds.
157176

158177
### Clear domain on territory takeover
159-
TODO
178+
If a new territory owner comes up, we delete every trace of the custom domain. This also deletes its certificates, verification attempts, DNS records and customizations.
179+
180+
The main reason is safety: since we don't delete this data when a territory gets stopped, in the hope that the original territory owner renews it, it's best to delete everything on takeover, above all the validation values. This will also trigger **Ask ACM to delete certificate**.
160181

161182
### Ask ACM to delete certificate
162-
TODO
183+
Whenever a domain or domain certificate gets deleted, we run a job called `deleteCertificateExternal`.
184+
It detaches the ACM certificate from our ALB listener and then deletes the ACM certificate from ACM.
185+
186+
It's a necessary step to ensure that we don't waste AWS resources and also provide safety regarding the custom domain access to Stacker News.
163187

164188
# Neat stuff
165189
