Replies: 13 comments
-
503s are definitely outages. Please report these to operational monitoring.
-
@StephenClouse thanks. Is this project being actively maintained?
-
Yes, the API is an actively developed operational system.
-
@StephenClouse I experienced these outages as well and am happy to report them as I encounter them. How can I report them to operational monitoring?
-
@charliesneath https://weather-gov.github.io/api/reporting-issues
-
Does anyone know why they are having these issues or when it might be fixed? I've been getting a lot of 503s for the past couple of weeks.
-
NCO is working on infrastructure improvements that should resolve the 503s, but they will take through July to fully complete.
-
Thanks @scadergit. I filed a ticket with NCO, and until this morning the thread was just other people submitting the same issues. This update, however, came through this morning, in line with your update as well:
-
That message is a little misleading (albeit unintentionally; think telephone game). The upgrade is not the cause of the issue, but it did change how a key issue is communicated.

Those who have been around a while know that the nginx "Yours truly" error pages were a familiar sight. Those came from nginx timing out while waiting on the API to respond, which itself came from the API waiting on the database to respond (not the application itself). The upgrade implemented a timeout so the API returns a machine-friendly 503 instead of the unfriendly HTML error page. These are the now-common 503s (granted, 503s existed before for other reasons, like the missing forecast data that happened last week); that error covers anything that prevents the API from fulfilling the request. They've also increased because the timeout is 10 seconds, which is intentionally lower than the nginx timeout. Given the performance requirements of the API, and general user expectations, responses that keep connections open longer than 10 seconds ultimately cause the API to stumble over itself.

So, circling back around: the upgrade didn't cause the 503s, it's just being more transparent and more machine-friendly when it can't fulfill a request because of the timeout. If you still see an nginx error message, it's because the endpoint makes multiple requests to the database; each of those took less than the new timeout, but collectively they took longer than the nginx timeout. We might change that default error message to a machine-friendly response as well.

The infrastructure upgrade is the actual resolution needed to improve the latency of the database. That is what we're hoping happens by end of July.
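For readers less familiar with this pattern, here is a minimal sketch of the timeout-then-503 behavior described above. It assumes a hypothetical Express app with a stand-in queryDatabase() helper; it is not the actual api.weather.gov implementation:

```javascript
// Minimal sketch of the pattern described above, NOT the actual
// api.weather.gov implementation. Assumes Express and a stand-in
// queryDatabase() helper that simulates a slow data store.
const express = require("express");

const app = express();
const DB_TIMEOUT_MS = 10000; // intentionally lower than the nginx proxy timeout

// Stand-in for the real database call; here it just waits a random amount of time.
function queryDatabase(params) {
  const latency = Math.random() * 15000;
  return new Promise((resolve) =>
    setTimeout(() => resolve({ gridpoint: params, latencyMs: latency }), latency)
  );
}

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("database timeout")), ms)
    ),
  ]);
}

app.get("/gridpoints/:wfo/:point", async (req, res) => {
  try {
    const data = await withTimeout(queryDatabase(req.params), DB_TIMEOUT_MS);
    res.json(data);
  } catch (err) {
    // Return a machine-friendly 503 instead of letting the proxy time out
    // and serve an HTML error page.
    res.status(503).type("application/problem+json").send(
      JSON.stringify({ title: "Service Unavailable", detail: String(err) })
    );
  }
});

app.listen(3000);
```

The key design point is the one described above: the application timeout (10 seconds here) is shorter than the proxy timeout, so clients get a structured 503 from the API rather than an HTML error page from nginx.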
-
@scadergit
-
No infrastructure upgrades yet (that's another department with different priorities), but we have spent considerable effort finding performance improvements in what we do manage (software/configuration). There were changes earlier this week that made a big difference, but not enough to be visible externally. More refinements went in today that should help.
-
Ok, thanks for the quick reply! Just let me know when there are any changes to the API (software/hardware) that may help reduce the number of 503s.
-
I've been experiencing 503s as well, but I've noticed that they tend to clear themselves up after several seconds. This self-clearing behavior is so consistent that I've had a lot of success implementing a one-time automatic retry in my code after a 1-2 second delay (one-time because I don't want to hammer the servers with requests if they're truly down). Here's an example URL: https://api.weather.gov/gridpoints/LOT/59,73

I can't say for sure, but it seems like it happens after the forecast is updated for a specific gridpoint (typically 15 minutes after the hour from this WFO) and you're the first one to load the forecast. To try to prove out this theory I've been finding really small towns (fewer people likely to be loading that specific forecast/gridpoint) and loading them up. It's not 100% of the time, but it's pretty often that the first fetch will fail with a 503 and a second fetch a few seconds later will succeed.

Here are the response headers I received for two consecutive fetches to the endpoint above, about 3 seconds apart per the timestamps in the data.

Failed response:

Successful response:
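For anyone who wants to try the same approach, here is a minimal sketch of that one-time retry, assuming a runtime with the global fetch API (Node 18+ or a browser); the URL is the gridpoint from this comment and the delay is the 1-2 seconds mentioned above, not an official recommendation:

```javascript
// Sketch of a single automatic retry on a 503, as described above.
// Assumes a runtime with the global fetch API (Node 18+ or a browser).
const RETRY_DELAY_MS = 2000; // retry once after a short delay; don't hammer the servers

async function fetchForecastWithRetry(url) {
  const first = await fetch(url);
  if (first.status !== 503) {
    return first;
  }
  // The 503s described above usually clear within a few seconds,
  // so wait briefly and try exactly once more.
  await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
  return fetch(url);
}

// Example usage with the gridpoint from this comment.
fetchForecastWithRetry("https://api.weather.gov/gridpoints/LOT/59,73")
  .then((res) => console.log(res.status))
  .catch(console.error);
```

Capping the retry at a single attempt keeps the extra load negligible while still riding out the short-lived 503s described above.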
-
Describe the bug
I'm not sure if this is considered an "outage" or not, but I receive a lot of 503s from the forecast API, and am wondering if there's awareness of this?
It's incredibly inconsistent, but if I make requests for, say, 10 forecast locations in a row, 90%+ of the time at least one of those requests will fail with a 503, usually more than one.
To Reproduce
Access a forecast URL such as https://api.weather.gov/gridpoints/CYS/108,12/forecast and note that it will often fail with a 503 (screenshot below; see also the sketch at the end of this report).
CorrelationId:
557b9407-22a7-4be2-ad8e-964793886976
Expected behavior
No 503s, or only rarely.
Screenshots

Environment
JavaScript requests, and hitting the URL directly in the browser.
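Purely as an illustration of the reproduction steps above, here is a minimal sketch assuming Node 18+ with the global fetch API; the attempt count of 10 just mirrors the example in the description:

```javascript
// Sketch of the reproduction described above: hit the forecast endpoint
// several times in a row and count how many attempts come back as 503.
// Assumes Node 18+ (global fetch); the attempt count is illustrative only.
const url = "https://api.weather.gov/gridpoints/CYS/108,12/forecast";

async function countFailures(attempts = 10) {
  let failures = 0;
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(url);
    if (res.status === 503) failures++;
    console.log(`attempt ${i + 1}: HTTP ${res.status}`);
  }
  console.log(`${failures}/${attempts} requests returned 503`);
}

countFailures().catch(console.error);
```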