This repository was archived by the owner on Jan 8, 2024. It is now read-only.

Static runner hanging on aws ecs #4460

@izaaklauer

Describe the bug
I have a runner installed on aws ecs using waypoint runner install, pointing to the prod HCP waypoint server.

Currently, every remote operation behaves like this:

$ wp deploy

» Deploying acmeapp1...

» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
  If you interrupt this command, the job will still run in the background.

According to waypoint job list, we're waiting for the static runner to take the StartTask job.

Here are the runner's most recent logs, according to cloudwatch:



| CloudWatch timestamp | Log message |
| -- | -- |
| 2023-01-25T18:16:57.441-05:00 | 2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |

It's currently 2023-01-27 and the runner has logged nothing since those warnings, so it looks like the HCP server went down briefly at 2023-01-26T18:11:00.128-05:00 and that caused the runner to become stuck.

I've tcping'd the runner's health check port 1234, and it's still open.

I'd like to get in there and take a thread dump, but it looks like enabling exec on aws ecs is non-trivial, and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html

Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267

My money is on here:

streamCtxLock.Lock()

Or here:

if r.waitStateGreater(&r.stateConfig, stateGen) {

If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.

Workaround

Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.

NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with waypoint job cancel first.

Steps to Reproduce

  • Run a static runner on ecs
  • Wait for an eventual hang

Expected behavior
The Waypoint runner should not hang. After the server comes back, it should reconnect and resume accepting jobs.

Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:

  • Waypoint CLI Version: 0.10.5
  • Waypoint Server Platform and Version: HCP

Additional context
If anyone else sees this, add a 👍
