
Half-open HBONE connections cause 503 errors when Pod IPs are reused in Ambient Mode #1637

@Dylan-KW

Description


Problem Summary

In Istio Ambient Mode with a waypoint proxy enabled, we're experiencing intermittent 503 Upstream connection termination errors during pod-to-pod communication (both pods are in the mesh). The issue occurs when Pod IP addresses are reused after the AWS VPC CNI's 30-second cooldown period: stale half-open HBONE connections held by the waypoint proxy cause traffic to reach the new pod, which responds with RST packets.

Environment

Istio Version: 1.27.1
Kubernetes Version: 1.31.9
Cloud Provider: AWS EKS
CNI: AWS VPC CNI
Mode: Ambient Mode
Network Constraints: Limited IP pool causing frequent IP address reuse

Issue Details

Current Behavior

  1. The waypoint proxy appears to maintain half-open HBONE connections to ztunnel after the original pod is terminated.
  2. Through tcpdump analysis, I observed that a reused IP address receives HTTP/2 DATA frames intended for the previous connection.
  3. When a new pod is created with a recycled IP address (after the 30-second cooldown), it receives traffic intended for the previous pod.
  4. The new pod/ztunnel responds with RST packets immediately after receiving these HTTP/2 DATA frames, since the connection state doesn't match.
  5. This results in 503 Upstream connection termination errors.
  6. The waypoint's retry mechanism (reset-before-request) doesn't appear to handle this scenario effectively.

Expected Behavior

Waypoint should detect stale connections and close them proactively (this seems related to ztunnel issue #1191).
When receiving RST packets, waypoint should retry the request on a fresh connection.

Help Wanted

  1. Root cause validation
  • Are we correctly identifying this as a half-open connection issue?
  • Could other factors be contributing to this behavior?
  • Has anyone else observed similar HTTP/2 DATA frames being sent to reused IPs?
  2. Mitigation strategies: I'm seeking guidance on the following potential solutions.
    A. Retry policy adjustment: would changing the retry policy from reset-before-request to reset resolve this issue?
    B. Keepalive configuration: can aggressive keepalive settings help detect and clean up half-open connections?
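
For reference, a minimal sketch of what I'd try for (A) and (B), assuming the waypoint honors standard Istio traffic policy. The resource names, the host `reviews.example.svc.cluster.local`, and the specific timer values are placeholders for illustration, not tested recommendations:

```yaml
# (A) Retry on any stream reset, not only resets before the request
# is sent. "reset" and "connect-failure" are standard Envoy retry_on
# values exposed through the VirtualService retry policy.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: retry-on-reset                        # placeholder name
spec:
  hosts:
    - reviews.example.svc.cluster.local       # placeholder host
  http:
    - route:
        - destination:
            host: reviews.example.svc.cluster.local
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: reset,connect-failure
---
# (B) Aggressive TCP keepalive so half-open upstream connections are
# probed and torn down quickly instead of lingering until reuse.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: keepalive-probe                       # placeholder name
spec:
  host: reviews.example.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 10s      # idle time before the first probe
          interval: 5s   # interval between probes
          probes: 3      # unacknowledged probes before the connection is closed
```

Whether these take effect on the waypoint's HBONE connections (as opposed to ordinary sidecar upstreams) is exactly what I'm unsure about, so confirmation either way would help.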
