Description
There is a bug in the interaction between the apiserver-network-proxy agent and server components when server count metadata is propagated without the lease controller. Agents fail to establish connections to all available servers, especially during a rolling update in which the servers scale from N to M.
Environment: Proxy Server using non-lease-controller mode to communicate server count.
Problem:
The agent determines how many proxy server connections to establish based on the serverCount metadata received from the server it connects to. During a rolling update, older server pods may still exist and return stale metadata (e.g., old serverCount=2, even though actual server count has increased to 6). Here's what happens in such a case:
- Agent initially receives server count 2, hence it connects to 2 servers (e.g., server A and server B).
- The rollout begins; new servers are added (servers C, D, E, etc.).
- The agent connects to more servers (e.g., C, D, E), for a total of 5 connections.
- Later, while retrying through the load balancer in front of the proxy servers, the agent may be routed again to an old pod that has not yet terminated (e.g., server A). That pod returns outdated metadata: serverCount=2.
- This overwrites the agent's internal view of the server count (from 6 → 2).
- Once the old pods terminate, the connections to A and B close and the agent is left with 3 active connections (C, D, E). The agent believes it has met or exceeded the required number (3 ≥ 2).
- No further connections are attempted, despite the presence of new servers (e.g., A', B', F).
- As a result, the agent ends up with stale/incomplete connections, and does not recover until a restart.
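The sequence above can be reproduced with a small toy model. This is only a sketch of the behavior described in this issue, not the agent's actual code: the `Agent` class, its method names, and the "last handshake wins" rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy model of the agent's bookkeeping (hypothetical, not the real clientset)."""

    server_count: int = 1  # last serverCount seen in any handshake
    connected: set = field(default_factory=set)

    def on_connect(self, server_id: str, reported_count: int) -> None:
        # Assumed behavior from the issue: the most recent handshake wins,
        # even when it comes from an old pod carrying stale metadata.
        self.server_count = reported_count
        self.connected.add(server_id)

    def on_disconnect(self, server_id: str) -> None:
        self.connected.discard(server_id)

    def wants_more_connections(self) -> bool:
        return len(self.connected) < self.server_count


def simulate_rollout() -> Agent:
    agent = Agent()
    # Before the rollout: old pods A and B both report serverCount=2.
    agent.on_connect("A", 2)
    agent.on_connect("B", 2)
    # During the rollout: new pods C, D, E report serverCount=6.
    for server_id in ("C", "D", "E"):
        agent.on_connect(server_id, 6)
    # The load balancer routes a retry back to old pod A: stale count again.
    agent.on_connect("A", 2)
    # The old pods terminate and their connections close.
    agent.on_disconnect("A")
    agent.on_disconnect("B")
    return agent


if __name__ == "__main__":
    agent = simulate_rollout()
    # prints: ['C', 'D', 'E'] 2 False
    print(sorted(agent.connected), agent.server_count, agent.wants_more_connections())
```

In this model the agent ends with three connections and a target of two, so `wants_more_connections()` stays false and nothing prompts it to look for the remaining servers.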
Sample error log
I0504 19:34:38.561148 1 client.go:210] "Connect to server" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668"
I0504 19:34:38.561167 1 clientset.go:213] "Server count change suggestion by server" current=2 serverID="67afde44-3666-4d1d-ba11-f15ed3a45668" actual=6
I0504 19:34:38.561174 1 clientset.go:222] "sync added client connecting to proxy server" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668"
I0504 19:34:38.561198 1 client.go:321] "Start serving" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:39.405847 1 client.go:528] "remote connection EOF" connectionID=12
I0504 19:34:39.619253 1 client.go:210] "Connect to server" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0"
I0504 19:34:39.619270 1 clientset.go:222] "sync added client connecting to proxy server" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0"
I0504 19:34:39.619295 1 client.go:321] "Start serving" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:39.630021 1 client.go:528] "remote connection EOF" connectionID=23
I0504 19:34:40.430729 1 client.go:528] "remote connection EOF" connectionID=15
I0504 19:34:40.725064 1 client.go:210] "Connect to server" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873"
I0504 19:34:40.725091 1 clientset.go:222] "sync added client connecting to proxy server" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873"
I0504 19:34:40.725125 1 client.go:321] "Start serving" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:41.837433 1 client.go:210] "Connect to server" serverID="66dea63d-42e9-4b18-80bd-bd3bfe7225e7"
I0504 19:34:41.837458 1 clientset.go:213] "Server count change suggestion by server" current=6 serverID="66dea63d-42e9-4b18-80bd-bd3bfe7225e7" actual=2
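To spot this pattern in a longer agent log, the "Server count change suggestion by server" lines can be extracted and checked for a drop in the count. The helper below is a hypothetical sketch, not part of the project; the regular expression matches only the message format shown in the sample log, and a decrease may also be legitimate during a real scale-down.

```python
import re

# Matches the message format shown in the sample log above.
SUGGESTION = re.compile(
    r'"Server count change suggestion by server" '
    r'current=(?P<current>\d+) serverID="(?P<server_id>[^"]+)" actual=(?P<actual>\d+)'
)


def count_suggestions(log_text):
    """Return (server_id, current, actual) for each server-count suggestion line."""
    found = []
    for line in log_text.splitlines():
        match = SUGGESTION.search(line)
        if match:
            found.append(
                (match["server_id"], int(match["current"]), int(match["actual"]))
            )
    return found


def decreases(suggestions):
    """Suggestions that lower the count, a hint that a stale server may have answered."""
    return [s for s in suggestions if s[2] < s[1]]


# The two suggestion lines from the sample log.
SAMPLE = '''I0504 19:34:38.561167 1 clientset.go:213] "Server count change suggestion by server" current=2 serverID="67afde44-3666-4d1d-ba11-f15ed3a45668" actual=6
I0504 19:34:41.837458 1 clientset.go:213] "Server count change suggestion by server" current=6 serverID="66dea63d-42e9-4b18-80bd-bd3bfe7225e7" actual=2'''

if __name__ == "__main__":
    for server_id, current, actual in decreases(count_suggestions(SAMPLE)):
        print(f"server {server_id} lowered the count from {current} to {actual}")
```

On the sample, only the second suggestion (from server 66dea63d-42e9-4b18-80bd-bd3bfe7225e7, 6 to 2) is reported as a decrease.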
This behavior is not self-healing, because the stale metadata continues to prevent the agent from reconnecting to the full server set. Consequently:
- Agent A might end up connected to 3 servers.
- Agent B might end up connected to 4.
- The actual server count, however, is 6.
The issue does not occur when leases are used to determine the required number of connections.