Static runner hanging on aws ecs

**Describe the bug**
I have a static runner installed on AWS ECS using `waypoint runner install`, pointing at the production HCP Waypoint server.

Currently, every remote operation behaves like this:

```
$ wp deploy

» Deploying acmeapp1...

» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
  If you interrupt this command, the job will still run in the background.
```

According to `waypoint job list`, we're waiting for the static runner to take the StartTask job. 

Here are the runner's most recent logs, according to CloudWatch (first timestamp is CloudWatch's, in local time; second is the runner's, in UTC):

```
2023-01-25T18:16:57.441-05:00  2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
```

It's currently 2023-01-27 and the runner has logged nothing since, so it looks like the HCP server went down briefly at `2023-01-26T18:11:00.128-05:00` and the runner has been stuck ever since.

I've run `tcping` against the runner's health check port `1234`, and it's still open, so the process itself is still alive.

I'd like to get in there and take a thread dump, but it looks like enabling exec on aws ecs is non-trivial, and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
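For reference, a Go process can produce the equivalent of a thread dump from inside itself, which would sidestep the need for exec if the runner wrote one to its own log. This is only a minimal sketch using the standard `runtime/pprof` package; the helper name `goroutineDump` is hypothetical and nothing like it is known to exist in the Waypoint runner today.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// goroutineDump returns the stack traces of all goroutines in this process.
// Hypothetical helper for illustration; not part of Waypoint.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	// debug=2 prints stacks in the same format as an unrecovered panic.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
		os.Exit(1)
	}
	// Writing to stderr means the dump would land in the task's log stream.
	fmt.Fprint(os.Stderr, dump)
}
```

Called from a timer or a stall detector, something like this would put the stacks into CloudWatch alongside the existing runner logs.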

Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267

My money is on here: https://github.com/hashicorp/waypoint/blob/5d8a6712aaa838c34b950f6cf5cccf9e64137e6d/internal/runner/accept.go#L206

Or here: https://github.com/hashicorp/waypoint/blob/5d8a6712aaa838c34b950f6cf5cccf9e64137e6d/internal/runner/accept.go#L221

If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.

**Workaround**

Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.

NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with `waypoint job cancel` first.


**Steps to Reproduce**

- Run a static runner on ecs
- Wait for an eventual hang

**Expected behavior**
The Waypoint runner should not hang. After a brief server outage it should reconnect and resume accepting jobs.

**Waypoint Platform Versions**
Additional version and platform information to help triage the issue if
applicable:

* Waypoint CLI Version: 0.10.5
* Waypoint Server Platform and Version: (like `docker`, `nomad`, `kubernetes`): HCP

**Additional context**
If anyone else sees this, add a 👍

Static runner hanging on aws ecs #4460
