Description
We have a workload that works like the diagram below:
- The nginx container is the entrypoint for the k8s Service. It performs consistent hashing on the request query parameter camera_id and then routes to pod_ip:port.
- We have another deployment that continuously watches for pod IP changes and reloads the nginx container when they happen (the hosts file it regenerates is sketched below).
The requests here are long-polling connections:
user <->[1] (pod) <->[2] cameras
- Cameras hold long-polling connections to a pod.
- User requests land on the same pod thanks to nginx consistent hashing.
- Once a camera has polled a user's request and executed it, it responds to the pod, which finishes the request.
However, if the camera fails to poll the request or fails to respond, the pod returns a 504 Timeout.
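For reference, the hosts files that the watcher deployment regenerates are just flat lists of pod endpoints, included into the upstream blocks shown in the next section; something like this (IPs and port are illustrative):

# /tmp/vproxy_nginx_hosts.conf (illustrative contents)
# Regenerated by the watcher deployment whenever pod IPs change, followed by an nginx reload.
server 10.0.1.12:8080;
server 10.0.1.34:8080;
server 10.0.2.56:8080;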
Nginx settings
upstream vproxy_hash {
    hash $cameraId consistent;              # consistent hashing on the camera id
    include /tmp/vproxy_nginx_hosts.conf;   # pod_ip:port entries, regenerated by the watcher deployment
    keepalive 500;                          # cache up to 500 idle upstream connections per worker
    keepalive_timeout 360s;                 # close idle upstream connections after 360s
}
upstream vproxy_remotesh {
    hash $cameraId consistent;
    include /tmp/vproxy_nginx_remotesh_hosts.conf;
    keepalive 500;
    keepalive_timeout 360s;
}
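For completeness, the server side that references these upstreams looks roughly like the sketch below. The location paths, port, and timeout values are illustrative rather than our exact config, and it assumes $cameraId is derived from the camera_id query parameter.

# Simplified sketch of the server side (paths, port and timeouts illustrative)
map $arg_camera_id $cameraId {
    default $arg_camera_id;              # hash key taken from the ?camera_id=... query parameter
}

server {
    listen 80;

    location / {
        proxy_pass http://vproxy_hash;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";  # don't forward "Connection: close" to the upstream
        proxy_read_timeout 360s;         # long-poll window; nginx returns 504 when this expires
        proxy_send_timeout 360s;
    }

    location /remotesh/ {
        proxy_pass http://vproxy_remotesh;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 360s;
    }
}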
Issues
After we onboarded ztunnel (L4 mesh), the 504 Timeout error rate somehow went up.
I've tried:
- POOL_UNUSED_RELEASE_TIMEOUT -> 400s (greater than the 360s keepalive_timeout mentioned above; possibly this could help? Currently monitoring it.)
- DEFAULT_POOL_MAX_STREAMS_PER_CONNECTION -> 300.
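We apply both overrides as environment variables on the ztunnel DaemonSet, roughly like the fragment below. The variable names are copied verbatim from the list above, so please double-check them (and the duration format) against your ztunnel version's config; how they get injected also depends on how ztunnel was installed (e.g. Helm values).

# Fragment of the ztunnel DaemonSet container spec (assumed injection method;
# variable names copied verbatim from the settings listed above)
env:
  - name: POOL_UNUSED_RELEASE_TIMEOUT
    value: "400s"
  - name: DEFAULT_POOL_MAX_STREAMS_PER_CONNECTION
    value: "300"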
I don't think it's related to ztunnel performance: ztunnel resource usage looks fair (2 CPU cores, 9 GiB memory) on an 8xlarge node (32 CPU, 64 GiB memory), and the node only contains pods from this workload.
Could it be something in how ztunnel handles upstream/downstream connections? Maybe [this].
Or maybe I've missed some other potential bottleneck.
If you have any tips for debugging this, please let me know; I'd really appreciate it.