Description
Problem Summary
In Istio Ambient Mode with a waypoint proxy enabled, we are seeing intermittent 503 "upstream connection termination" errors during pod-to-pod communication (both pods are in the mesh). The errors occur when a Pod IP address is reused after the AWS VPC CNI's 30-second cooldown period: the waypoint proxy appears to hold stale half-open HBONE connections to the old address, and traffic sent over them is answered with RST packets.
Environment
Istio Version: 1.27.1
Kubernetes Version: 1.31.9
Cloud Provider: AWS EKS
CNI: AWS VPC CNI
Mode: Ambient Mode
Network Constraints: Limited IP pool causing frequent IP address reuse
Issue Details
Current Behavior
- We believe the waypoint proxy keeps half-open HBONE connections to ztunnel after the original pod is terminated.
- tcpdump analysis shows that the reused IP address receives HTTP/2 DATA frames that belong to the previous connection.
- We observed that RST packets are sent immediately after the reused IP receives these HTTP/2 DATA frames
- When a new pod is created with a recycled IP address (after 30s cooldown), it receives traffic intended for the previous pod
- The new pod/ztunnel responds with RST packets as the connection state doesn't match
- This results in 503 Upstream connection termination errors
- The waypoint's retry mechanism (reset-before-request) does not appear to handle this scenario effectively.
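One possible explanation for the last point, offered as a hypothesis rather than a confirmed diagnosis: as I read the Envoy `retry_on` documentation, `reset-before-request` only retries when the request headers have not yet been sent upstream, whereas `reset` retries whenever the upstream does not respond at all. Since the capture shows DATA frames reaching the reused IP, the request had presumably already been written when the RST arrived. The Python sketch below is a simplified model of that decision. `ResetEvent` and `should_retry` are hypothetical names used for illustration only, not Envoy or Istio APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResetEvent:
    """State of an upstream request at the moment the connection is reset."""
    request_headers_sent: bool  # were request headers already written upstream?
    response_started: bool      # had any part of the response arrived?


def should_retry(retry_on: str, event: ResetEvent) -> bool:
    """Simplified model of the two retry policies discussed in this issue."""
    if event.response_started:
        # Both policies only cover the case where the upstream never responded.
        return False
    if retry_on == "reset":
        return True
    if retry_on == "reset-before-request":
        # Assumption: retried only if nothing was sent upstream yet.
        return not event.request_headers_sent
    raise ValueError(f"policy not modelled here: {retry_on}")


if __name__ == "__main__":
    # Scenario from the capture: data was already sent, then an RST came back.
    stale = ResetEvent(request_headers_sent=True, response_started=False)
    print(should_retry("reset-before-request", stale))  # False
    print(should_retry("reset", stale))                 # True
```

If this model is right, `reset` would retry these failures, but it would also retry requests that the upstream may have partially processed, which matters for non-idempotent calls.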
Expected Behavior
The waypoint should detect stale connections and close them proactively (this seems related to ztunnel issue #1191).
When it receives RST packets, the waypoint should retry the request on a fresh connection.
Help Wanted
- Root cause validation
  - Are we correctly identifying this as a half-open connection issue?
  - Could there be other factors contributing to this behavior?
  - Has anyone else observed similar HTTP/2 DATA frames being sent to reused IPs?
- Mitigation strategies: I'm seeking guidance on the following potential solutions
  - A. Retry policy adjustment: would changing the retry policy from reset-before-request to reset resolve this issue?
  - B. Keepalive configuration: can aggressive keepalive settings help detect and clean up half-open connections?
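For option B, a back-of-the-envelope check may help frame the question. With TCP keepalive, the worst-case time to drop a dead idle peer is roughly the idle time plus the probe interval multiplied by the probe count. The Python sketch below compares that window with the 30-second cooldown from this report and shows the corresponding socket options. The aggressive values (10s idle, 5s interval, 3 probes) are assumptions for illustration. The function names are hypothetical, and whether such settings actually reach the waypoint-to-ztunnel HBONE connections in Istio is not confirmed here.

```python
import socket

IP_REUSE_COOLDOWN_S = 30  # AWS VPC CNI cooldown reported in this issue


def keepalive_detection_seconds(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate worst-case time for TCP keepalive to drop a dead idle peer."""
    return idle_s + interval_s * probes


def enable_keepalive(sock: socket.socket, idle_s: int, interval_s: int, probes: int) -> None:
    """Turn on TCP keepalive and tune the timers where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tunables = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tunables:
        opt = getattr(socket, name, None)  # constants are platform-specific
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    # Linux defaults (7200s idle, 75s interval, 9 probes) far exceed the cooldown.
    print(keepalive_detection_seconds(7200, 75, 9) < IP_REUSE_COOLDOWN_S)  # False
    # Hypothetical aggressive settings that fit inside the 30s window.
    print(keepalive_detection_seconds(10, 5, 3) < IP_REUSE_COOLDOWN_S)  # True
```

Keepalive probes are only sent on idle connections, so this would cover pooled connections that sit unused while the IP changes hands. If a probe reaches the new pod after the IP is reused, it would most likely be answered with an RST, which should also tear down the stale connection.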