Agent maintains stale server count metadata leading to incomplete connections in rolling update scenario (non-lease controller mode) #739

@ipochi

Description

I observed a bug in the interaction between the apiserver-network-proxy agent and server components when server count metadata is propagated without the lease controller. Agents fail to establish connections to all available servers, especially during a rolling update in which the number of servers changes from N to M.

Environment: Proxy Server using non-lease-controller mode to communicate server count.

Problem:

The agent determines how many proxy server connections to establish based on the serverCount metadata received from the server it connects to. During a rolling update, older server pods may still exist and return stale metadata (e.g., old serverCount=2, even though actual server count has increased to 6). Here's what happens in such a case:

  1. The agent initially receives serverCount=2, so it connects to 2 servers (e.g., server A and server B).
  2. The rollout begins; new servers are added (servers C, D, E, etc.).
  3. The new servers report serverCount=6, and the agent connects to more servers (e.g., C, D, E), for 5 total connections.
  4. Later, the agent may connect again to an old pod (e.g., server A, not yet terminated), because it dials the load balancer in front of the proxy servers. In that case, the agent receives outdated metadata: serverCount=2.
  5. This overwrites the agent's internal view of the server count (from 6 → 2).
  6. The old pods A and B eventually terminate, leaving the agent with 3 active connections (C, D, E). The agent believes it has met or exceeded the required number (3 ≥ 2).
  7. No further connections are attempted, despite the presence of new servers (e.g., A', B', F).
  8. As a result, the agent ends up with stale/incomplete connections and does not recover until it is restarted.
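
The sequence above can be reproduced with a small model. This is a minimal sketch, not the agent's actual code: the `agent` type and its `connect`, `disconnect`, and `needsMore` methods are hypothetical names, and the only behavior assumed is the one described in this issue, namely that the most recently received serverCount replaces the previous value.

```go
package main

import "fmt"

// agent is a hypothetical, simplified model of the agent's bookkeeping.
// It is not the real apiserver-network-proxy client set.
type agent struct {
	serverCount int             // last server count reported by any server
	conns       map[string]bool // server IDs with an active connection
}

// connect records a connection and adopts the count reported by that
// server ("last response wins", as described in step 5).
func (a *agent) connect(serverID string, reportedCount int) {
	a.conns[serverID] = true
	a.serverCount = reportedCount
}

// disconnect drops a connection, e.g. when an old pod terminates.
func (a *agent) disconnect(serverID string) {
	delete(a.conns, serverID)
}

// needsMore reports whether the agent would keep dialing new servers.
func (a *agent) needsMore() bool {
	return len(a.conns) < a.serverCount
}

// simulate replays steps 1-7 and returns the agent's final state.
func simulate() (conns, count int, needsMore bool) {
	a := &agent{conns: map[string]bool{}}
	a.connect("A", 2) // step 1: old servers report 2
	a.connect("B", 2)
	a.connect("C", 6) // step 3: new servers report 6
	a.connect("D", 6)
	a.connect("E", 6)
	a.connect("A", 2) // step 4: old pod A answers again with stale count
	a.disconnect("A") // step 6: old pods terminate
	a.disconnect("B")
	return len(a.conns), a.serverCount, a.needsMore()
}

func main() {
	conns, count, more := simulate()
	// Prints: connections=3 serverCount=2 needsMore=false
	fmt.Printf("connections=%d serverCount=%d needsMore=%t\n", conns, count, more)
}
```

In this model the agent stops at 3 connections because `needsMore` compares against the stale value 2 rather than the real count of 6.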

Sample error log

I0504 19:34:38.561148       1 client.go:210] "Connect to server" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668"
I0504 19:34:38.561167       1 clientset.go:213] "Server count change suggestion by server" current=2 serverID="67afde44-3666-4d1d-ba11-f15ed3a45668" actual=6
I0504 19:34:38.561174       1 clientset.go:222] "sync added client connecting to proxy server" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668"
I0504 19:34:38.561198       1 client.go:321] "Start serving" serverID="67afde44-3666-4d1d-ba11-f15ed3a45668" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:39.405847       1 client.go:528] "remote connection EOF" connectionID=12
I0504 19:34:39.619253       1 client.go:210] "Connect to server" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0"
I0504 19:34:39.619270       1 clientset.go:222] "sync added client connecting to proxy server" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0"
I0504 19:34:39.619295       1 client.go:321] "Start serving" serverID="043f14d0-7731-4f88-a69e-374b6162cfb0" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:39.630021       1 client.go:528] "remote connection EOF" connectionID=23
I0504 19:34:40.430729       1 client.go:528] "remote connection EOF" connectionID=15
I0504 19:34:40.725064       1 client.go:210] "Connect to server" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873"
I0504 19:34:40.725091       1 clientset.go:222] "sync added client connecting to proxy server" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873"
I0504 19:34:40.725125       1 client.go:321] "Start serving" serverID="92f4d6f3-c994-4930-9c9c-4857e7b3e873" agentID="0af9065f-f093-4687-b578-cc836b2088df"
I0504 19:34:41.837433       1 client.go:210] "Connect to server" serverID="66dea63d-42e9-4b18-80bd-bd3bfe7225e7"
I0504 19:34:41.837458       1 clientset.go:213] "Server count change suggestion by server" current=6 serverID="66dea63d-42e9-4b18-80bd-bd3bfe7225e7" actual=2

This behavior is not self-healing, because the stale metadata continues to prevent the agent from reconnecting to the full server set. Consequently, agent A might end up connected to 3 servers and agent B to 4, while the actual server count is 6.

The issue is not seen when leases are used to establish the number of connections required.
