Static runner hanging on aws ecs #4460
Description
Describe the bug
I have a runner installed on AWS ECS using waypoint runner install, pointing to the prod HCP Waypoint server.
Currently, every remote operation behaves like this:
$ wp deploy
» Deploying acmeapp1...
» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
If you interrupt this command, the job will still run in the background.
According to waypoint job list, we're waiting for the static runner to take the StartTask job.
Here are the runner's most recent logs, according to cloudwatch:
2023-01-25T18:16:57.441-05:00 | 2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
It's currently 2023-01-27, so it looks like the HCP server went down briefly at 2023-01-26T18:11:00.128-05:00, and the runner has been stuck ever since.
I've tcping'd the runner's health check port 1234, and it's still open.
I'd like to get in there and take a thread dump, but it looks like enabling exec on aws ecs is non-trivial, and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267
My money is on here:
waypoint/internal/runner/accept.go, line 206 in 5d8a671:
streamCtxLock.Lock()
Or here:
waypoint/internal/runner/accept.go, line 221 in 5d8a671:
if r.waitStateGreater(&r.stateConfig, stateGen) {
If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.
Workaround
Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.
NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with waypoint job cancel first.
Steps to Reproduce
- Run a static runner on ecs
- Wait for an eventual hang
Expected behavior
The Waypoint runner should not hang. After the server comes back, it should reconnect and resume accepting jobs.
Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:
- Waypoint CLI Version: 0.10.5
- Waypoint Server Platform and Version: (like docker, nomad, kubernetes): HCP
Additional context
If anyone else sees this, add a 👍