Description
We have a workload that works like the diagram below:
- The nginx container is the entrypoint for the k8s Service. It performs consistent hashing on the request query parameter camera_id and then routes to pod_ip:port.
- We have another deployment that continuously watches for pod IP changes and reloads the nginx container when they happen (the hosts file it regenerates is sketched below).
The requests here are long-polling connections:
user <->[1] (pod) <->[2] cameras
- Cameras hold long-polling connections to a pod.
- User requests land on the same pod thanks to nginx consistent hashing.
- Once a camera has polled a user's request and executed it, it responds to the pod, which finishes the request.
However, if the camera fails to poll the request or fails to respond, the pod returns a 504 Timeout.
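For reference, the hosts files that the watcher deployment regenerates are just flat lists of pod endpoints, included into the upstream blocks shown in the next section; something like this (IPs and port are illustrative):

# /tmp/vproxy_nginx_hosts.conf (illustrative contents)
# Regenerated by the watcher deployment whenever pod IPs change, followed by an nginx reload.
server 10.0.1.12:8080;
server 10.0.1.34:8080;
server 10.0.2.56:8080;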
Nginx settings
upstream vproxy_hash {
    hash $cameraId consistent;              # consistent hashing on the camera id
    include /tmp/vproxy_nginx_hosts.conf;   # pod_ip:port entries, regenerated by the watcher deployment
    keepalive 500;                          # cache up to 500 idle upstream connections per worker
    keepalive_timeout 360s;                 # close idle upstream connections after 360s
}
upstream vproxy_remotesh {
    hash $cameraId consistent;
    include /tmp/vproxy_nginx_remotesh_hosts.conf;
    keepalive 500;
    keepalive_timeout 360s;
}
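For completeness, the server side that references these upstreams looks roughly like the sketch below. The location paths, port, and timeout values are illustrative rather than our exact config, and it assumes $cameraId is derived from the camera_id query parameter.

# Simplified sketch of the server side (paths, port and timeouts illustrative)
map $arg_camera_id $cameraId {
    default $arg_camera_id;              # hash key taken from the ?camera_id=... query parameter
}

server {
    listen 80;

    location / {
        proxy_pass http://vproxy_hash;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";  # don't forward "Connection: close" to the upstream
        proxy_read_timeout 360s;         # long-poll window; nginx returns 504 when this expires
        proxy_send_timeout 360s;
    }

    location /remotesh/ {
        proxy_pass http://vproxy_remotesh;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 360s;
    }
}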
Issues
After we onboarded ztunnel (L4 mesh), the 504 Timeout error rate somehow went up.
I've tried:
- POOL_UNUSED_RELEASE_TIMEOUT -> 400s (greater than the 360s keepalive_timeout mentioned above; possibly this could help? Currently monitoring it.)
- DEFAULT_POOL_MAX_STREAMS_PER_CONNECTION -> 300.
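We apply both overrides as environment variables on the ztunnel DaemonSet, roughly like the fragment below. The variable names are copied verbatim from the list above, so please double-check them (and the duration format) against your ztunnel version's config; how they get injected also depends on how ztunnel was installed (e.g. Helm values).

# Fragment of the ztunnel DaemonSet container spec (assumed injection method;
# variable names copied verbatim from the settings listed above)
env:
  - name: POOL_UNUSED_RELEASE_TIMEOUT
    value: "400s"
  - name: DEFAULT_POOL_MAX_STREAMS_PER_CONNECTION
    value: "300"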
I don't think it's related to ztunnel performance: ztunnel resource usage looks fair (2 CPU cores, 9 GiB memory) on an 8xlarge node (32 CPU, 64 GiB memory), and the node only contains pods from this workload.
Could it be something in how ztunnel handles upstream/downstream connections? Maybe [this].
Or maybe I've missed some other potential bottleneck.
If you have any tips for debugging this, please let me know; I'd really appreciate it.