
Half-open HBONE connections cause 503 errors when Pod IPs are reused in Ambient Mode #1637

@Dylan-KW

Description


Problem Summary

In Istio Ambient Mode with a waypoint proxy enabled, we're experiencing intermittent 503 Upstream connection termination errors during pod-to-pod communication (both pods are in the mesh). The issue occurs when Pod IP addresses are reused after the AWS VPC CNI's 30-second cooldown period: stale half-open HBONE connections held by the waypoint proxy cause traffic to reach the new pod, which responds with RST packets.

Environment

Istio Version: 1.27.1
Kubernetes Version: 1.31.9
Cloud Provider: AWS EKS
CNI: AWS VPC CNI
Mode: Ambient Mode
Network Constraints: Limited IP pool causing frequent IP address reuse

Issue Details

Current Behavior

  1. The waypoint proxy appears to maintain half-open HBONE connections to ztunnel after the original pod is terminated.
  2. Through tcpdump analysis, I observed that a reused IP address receives HTTP/2 DATA frames intended for the previous connection.
  3. When a new pod is created with a recycled IP address (after the 30-second cooldown), it receives traffic intended for the previous pod.
  4. The new pod/ztunnel responds with RST packets immediately after receiving these HTTP/2 DATA frames, since the connection state doesn't match.
  5. This results in 503 Upstream connection termination errors.
  6. The waypoint's retry mechanism (reset-before-request) doesn't appear to handle this scenario effectively.

Expected Behavior

Waypoint should detect stale connections and close them proactively (this seems related to ztunnel issue #1191).
When receiving RST packets, waypoint should retry the request on a fresh connection.

Help Wanted

  1. Root cause validation
  • Are we correctly identifying this as a half-open connection issue?
  • Could other factors be contributing to this behavior?
  • Has anyone else observed similar HTTP/2 DATA frames being sent to reused IPs?
  2. Mitigation strategies: I'm seeking guidance on the following potential solutions.
    A. Retry policy adjustment: would changing the retry policy from reset-before-request to reset resolve this issue?
    B. Keepalive configuration: can aggressive keepalive settings help detect and clean up half-open connections?
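
For reference, a minimal sketch of what I'd try for (A) and (B), assuming the waypoint honors standard Istio traffic policy. The resource names, the host `reviews.example.svc.cluster.local`, and the specific timer values are placeholders for illustration, not tested recommendations:

```yaml
# (A) Retry on any stream reset, not only resets before the request
# is sent. "reset" and "connect-failure" are standard Envoy retry_on
# values exposed through the VirtualService retry policy.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: retry-on-reset                        # placeholder name
spec:
  hosts:
    - reviews.example.svc.cluster.local       # placeholder host
  http:
    - route:
        - destination:
            host: reviews.example.svc.cluster.local
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: reset,connect-failure
---
# (B) Aggressive TCP keepalive so half-open upstream connections are
# probed and torn down quickly instead of lingering until reuse.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: keepalive-probe                       # placeholder name
spec:
  host: reviews.example.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 10s      # idle time before the first probe
          interval: 5s   # interval between probes
          probes: 3      # unacknowledged probes before the connection is closed
```

Whether these take effect on the waypoint's HBONE connections (as opposed to ordinary sidecar upstreams) is exactly what I'm unsure about, so confirmation either way would help.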
