Welcome to the source code repo of my DNS check API implementation!
The goal of this project is to perform DNS queries and give the client an idea of whether the domain name they provided is healthy. To determine health, we consider the following criteria:
- Does it resolve? (Yes/No).
- How long does it take to resolve? (A number representing milliseconds)
- What TTL (time-to-live) settings are applied? (Numbers in seconds for the Start of Authority and the individual "A" records).
For each metric, we have the following states:
- HEALTHY - everything's alright, no need to worry about this aspect.
- WARNING - there's room for improvement, but it is not necessarily a problem. For TTL, for example, we use this status: we don't know what kind of website sits behind the domain name, so we can only give general advice, and none of that should "break" a check.
- UNHEALTHY - the value is bad, action/optimization is required.
The project's entry point is a RESTful API at https://cf-worker-router.kristofsiket.workers.dev. Requests can be made using the following signature:
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/:domain?region=selected_region
An example:
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/github.com?region=india
The regions that can be selected are india, europe, usa and australia.
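For example, a client can call the endpoint with plain `fetch`. This is only a minimal sketch: the base URL and region values are the documented ones, while the function name and error handling are illustrative.

```ts
// Minimal client sketch for the deployed checker. Requires a runtime with a
// global fetch (Node.js 18+ or a browser).
const BASE_URL = 'https://cf-worker-router.kristofsiket.workers.dev';

async function checkDns(domain: string, region: 'india' | 'europe' | 'usa' | 'australia') {
  const response = await fetch(`${BASE_URL}/check-dns/${domain}?region=${region}`);
  if (!response.ok) {
    throw new Error(`DNS check failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Usage: print the health block of the result (see the example response below).
checkDns('github.com', 'india').then((result) => {
  console.log(JSON.stringify(result.metrics.health, null, 2));
});
```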
Response: an example response looks like this (the result of a query from India for the domain of a Hungarian economic magazine):
```json
{
  "domain": "portfolio.hu",
  "result": [
    {
      "address": "195.70.35.159",
      "ttl": 30
    }
  ],
  "serverInfo": {
    "soa": {
      "nsname": "ns.portfolio.hu",
      "hostmaster": "root.portfolio.hu",
      "serial": 2019040322,
      "refresh": 86400,
      "retry": 7200,
      "expire": 3600000,
      "minttl": 300
    },
    "ns": [
      "ns.portfolio.hu",
      "ns-slave.m.glbns.com"
    ]
  },
  "metrics": {
    "resolutionTime": 308.3278307914734,
    "health": {
      "resolution": {
        "status": "HEALTHY",
        "message": "DNS resolution is OK!"
      },
      "latency": {
        "status": "UNHEALTHY",
        "message": "DNS resolution time is too high (308.3278307914734ms)!"
      },
      "ttlSoa": {
        "status": "WARNING",
        "message": "TTL seems to be a bit low (300 seconds)! Consider increasing it to at least 3600 seconds for better resolution performance!"
      },
      "ttlARecords": [
        {
          "status": "WARNING",
          "message": "TTL seems to be a bit low (30 seconds)! Consider increasing it to at least 300 seconds for better resolution performance!"
        }
      ]
    }
  }
}
```
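For reference, the response above can be modeled with the following TypeScript types. This is a sketch derived from the example payload: the field names follow the JSON exactly, but the type names themselves are illustrative and may differ from the project's internal ones.

```ts
// Health states as documented above.
type HealthStatus = 'HEALTHY' | 'WARNING' | 'UNHEALTHY';

interface HealthCheck {
  status: HealthStatus;
  message: string;
}

// Shape of the /check-dns response, inferred from the example payload.
interface DnsCheckResponse {
  domain: string;
  result: Array<{ address: string; ttl: number }>; // resolved A records with TTLs (seconds)
  serverInfo: {
    soa: {
      nsname: string;
      hostmaster: string;
      serial: number;
      refresh: number;
      retry: number;
      expire: number;
      minttl: number; // seconds
    };
    ns: string[];
  };
  metrics: {
    resolutionTime: number; // milliseconds
    health: {
      resolution: HealthCheck;
      latency: HealthCheck;
      ttlSoa: HealthCheck;
      ttlARecords: HealthCheck[];
    };
  };
}
```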
As you can see, I put two projects inside this repository to keep things simple. I will explain the role of each project in the Architecture section; for now, let's focus on starting and trying out the app locally.
Only the dns-check-api project is relevant for this scenario. It is a Node.js REST API built with the Hapi.js framework.
- Check out the repo (e.g. `gh repo clone kristof-siket/dns-check-service`)
- Install dependencies (`npm i`)
- Start the Hapi server: `npm run dev`
- Test the app with the local version of the above-mentioned example: `GET http://localhost:3000/check-dns/github.com`
Notice that you don't need to pass the region in this case - this is because the API runs on your local computer, so it will always query from your location using your internet provider's DNS resolver.
In case of incompatibility issues: I ran the service locally with Node.js v20.10.0 and npm 10.2.5. Happy testing!
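To give an idea of how the Hapi side is wired up, here is a minimal sketch of a server exposing the same route shape. This is not the actual project code: `performDnsCheck` is a hypothetical placeholder for the real check logic.

```ts
import Hapi from '@hapi/hapi';

// Hypothetical placeholder standing in for the real DNS check logic.
async function performDnsCheck(domain: string): Promise<unknown> {
  return { domain, result: [], metrics: {} };
}

async function start() {
  const server = Hapi.server({ port: 3000, host: 'localhost' });

  // Same route shape as the local example above: GET /check-dns/{domain}
  server.route({
    method: 'GET',
    path: '/check-dns/{domain}',
    handler: async (request) => performDnsCheck(request.params.domain),
  });

  await server.start();
  console.log(`Server running at ${server.info.uri}`);
}

start();
```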
Now, let me elaborate a bit on the details of the project. First, let's delve into the different metrics the API returns to the client, then talk a bit about the solution architecture.
The check collects essential metrics about the queried domain to determine its health status.
- Resolution status: a simple yes/no metric - whether the domain name resolves or not. If it doesn't resolve, the status is `UNHEALTHY`, as the user needs to act immediately when one of their domain names stops resolving.
- Resolution time: the time it takes to resolve the domain name. This is more complex for two reasons: the measured time is just a number, so we have to tell the user whether it is good or not, and DNS is a distributed system, so we will receive different results from different locations.
  - For the first problem, I did some research on healthy resolution times. Most sources say that anything below 120-150ms is quite good, but based on my experiments, a first (uncached) resolution can realistically take around 300ms. Therefore I introduced two thresholds: 120ms for a `WARNING` status and 300ms for an `UNHEALTHY` status (a small sketch of these thresholds follows this list).
  - I solved the second problem outside of the application code. In my opinion, the DNS check service shouldn't care about its own location; its single responsibility is to send a DNS query and report its observations to the client. Instead, I deployed multiple instances of the service to different parts of the world. Clients access the service through a Router (API Gateway) which receives the selected region from the client, and the service instance living in that region does the heavy lifting. More about this in the Architecture section.
- TTL metrics (TTL of the SOA and TTL of individual records): one thing clients can optimize in their DNS setup is caching. TTL (time-to-live) tells resolvers how long the resolved values of the domain name should be cached. There is a dramatic difference between cached and non-cached resolution times: while a non-cached resolution takes around 20-50ms in a healthy case, a cached one can take less than a millisecond! A "good" TTL value highly depends on the goals, so I went through a few articles suggesting values for the SOA (which provides a minimum TTL) and for individual "A" records (the actual TTL that counts down). However, the check never gives an `UNHEALTHY` status for a TTL value, as the right value depends heavily on how often changes are applied on the network level; instead, it gives a `WARNING` based on the guideline values (this part of course requires some fine-tuning).
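The thresholds described above can be summarized in a small sketch. The numbers are the ones mentioned in this section and in the example response messages; the function names and messages are illustrative, not the project's actual code.

```ts
type HealthStatus = 'HEALTHY' | 'WARNING' | 'UNHEALTHY';

interface HealthCheck {
  status: HealthStatus;
  message: string;
}

// Resolution-time thresholds from this section: below 120ms is fine,
// 120-300ms yields a WARNING, and 300ms or above is UNHEALTHY.
function classifyLatency(resolutionTimeMs: number): HealthCheck {
  if (resolutionTimeMs < 120) {
    return { status: 'HEALTHY', message: 'DNS resolution time is OK!' };
  }
  if (resolutionTimeMs < 300) {
    return { status: 'WARNING', message: `DNS resolution time is a bit high (${resolutionTimeMs}ms)!` };
  }
  return { status: 'UNHEALTHY', message: `DNS resolution time is too high (${resolutionTimeMs}ms)!` };
}

// TTL checks only ever return HEALTHY or WARNING (never UNHEALTHY), since a
// "good" TTL depends on how often the zone changes. The guideline minimums in
// the example messages are 3600s for the SOA minimum TTL and 300s for A records.
function classifyTtl(ttlSeconds: number, recommendedMinimum: number): HealthCheck {
  if (ttlSeconds >= recommendedMinimum) {
    return { status: 'HEALTHY', message: 'TTL looks fine.' };
  }
  return {
    status: 'WARNING',
    message: `TTL seems to be a bit low (${ttlSeconds} seconds)! Consider increasing it to at least ${recommendedMinimum} seconds for better resolution performance!`,
  };
}
```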
- A healthy domain
kiszamolo.hu is a Hungarian magazine hosted at Kinsta. Its domain is managed by Cloudflare, so it should be in good shape. Try querying it from different regions (it will be very fast from Europe but still acceptable from Australia):
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/kiszamolo.hu?region=europe
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/kiszamolo.hu?region=australia
Their TTLs are a bit low because they need to apply changes quite often, so that warning doesn't indicate a real problem.
- A European domain that is slow from far regions
portfolio.hu is also a Hungarian magazine, but instead of using Cloudflare, they manage their own DNS. Their latency from Europe is quite good, while it is quite poor from India and Australia.
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/portfolio.hu?region=europe
GET https://cf-worker-router.kristofsiket.workers.dev/check-dns/portfolio.hu?region=india
Please note that if you query a single domain from a single region multiple times in a row, you will see very low latency because the domain name gets cached. We hint to the client that the resolution comes from the cache if it takes less than 2ms!
As I've already suggested, this system is realized in a distributed way. It has two layers:
- A Router, deployed as a Cloudflare Worker and written with the Hono framework (a general API framework for the Edge). It lives on the Edge - a globally distributed network that serves requests from a location close to the end user. The Router does nothing but select the right DNS Check API instance based on the region provided in the request (a minimal sketch follows this list).
- A DNS Check API, written in Node.js, TypeScript and Hapi.js, that actually performs the DNS queries. It is deployed to Kinsta, which runs applications on Google Kubernetes Engine under the hood.
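As an illustration of the Router layer, a Hono worker along these lines can forward a request to the instance of the selected region. This is a sketch only: the upstream URLs are placeholders (the real instance hostnames are listed in the region overview below), and the fallback region is an assumption.

```ts
import { Hono } from 'hono';

// Placeholder upstream URLs; the real Kinsta instance hostnames are listed in
// the region overview below.
const REGION_UPSTREAMS: Record<string, string> = {
  usa: 'https://dns-check-usa.example.com',
  india: 'https://dns-check-india.example.com',
  australia: 'https://dns-check-australia.example.com',
  europe: 'https://dns-check-europe.example.com',
};

const app = new Hono();

app.get('/check-dns/:domain', async (c) => {
  const domain = c.req.param('domain');
  // Defaulting to 'europe' is just an assumption for this sketch.
  const region = c.req.query('region') ?? 'europe';

  const upstream = REGION_UPSTREAMS[region];
  if (!upstream) {
    return c.json({ error: `Unknown region: ${region}` }, 400);
  }

  // Forward the check to the regional instance and pass its response through.
  const response = await fetch(`${upstream}/check-dns/${domain}`);
  return c.json(await response.json());
});

export default app;
```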
In real life, the check would most probably be scheduled, so a scheduler component should also be added. For now, this proof of concept only focuses on triggering the check in a selected region and retrieving the results.
The available regions are the following (regions are GCP regions):
- USA: dns-check-service-1yy3b.kinsta.app (us-east1)
- India: dns-check-service-delhi-hw17z.kinsta.app (asia-south2)
- Australia: dns-check-service-sydney-3zeg3.kinsta.app (australia-southeast1)
- Netherlands: dns-check-service-1yy3b.kinsta.app (europe-west4)
As this is a proof of concept, there is naturally room for improvement:
- Storing historical data: currently the check provides an on-the-fly health status, but it doesn't use previous check results as a reference, only static thresholds. By storing the measurements, we could, for example, set WARNING for values that deviate from the historical average by more than the standard deviation.
- Fine-tuning current thresholds: by analyzing historical data, we could make the current thresholds more adaptive.
- Increase measurement accuracy: currently we use the `dns` package and measure its query runtime via the Node.js `perf_hooks` API (`performance.now()`). In my experiments this is quite accurate, but we could implement the measurement at a lower level to make it even more accurate (a sketch of the current approach follows this list).
- Get DNSSEC results: security is also a crucial point in DNS health analysis, so adding e.g. the dnssec-js package could add some value.
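For reference, combining the `dns` promises API with `performance.now()` looks roughly like this. This is a sketch of the approach described above, not the project's actual code; the function name and the returned object shape are illustrative.

```ts
import { promises as dns } from 'node:dns';
import { performance } from 'node:perf_hooks';

// Resolve the A records (with TTLs) and the SOA record of a domain, measuring
// how long the A lookup takes with perf_hooks.
async function measureDns(domain: string) {
  const start = performance.now();
  // { ttl: true } makes resolve4 return objects of shape { address, ttl }.
  const aRecords = await dns.resolve4(domain, { ttl: true });
  const resolutionTime = performance.now() - start; // milliseconds

  const soa = await dns.resolveSoa(domain);

  return { domain, result: aRecords, serverInfo: { soa }, resolutionTime };
}

// Usage: print a measurement for github.com.
measureDns('github.com').then((measurement) => {
  console.log(JSON.stringify(measurement, null, 2));
});
```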