Higher than normal error rate after onboarding ztunnel to nginx workloads #1612

@Yufeireal

Description

Our workload works as follows:

  1. The nginx container is the entrypoint for the k8s Service.
  2. It performs consistent hashing on the request's camera_id query parameter and then routes to pod_ip:port.
    2.a A separate deployment continuously watches for pod_ip changes and reloads the nginx container when they happen.
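
To make the routing step concrete, here is a minimal Python sketch of consistent hashing on camera_id. It is an illustration only: the `HashRing` class, the MD5-based hash, the replica count, and the example pod addresses are all assumptions for this sketch, not nginx's actual implementation of `hash ... consistent`.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, backends, replicas=160):
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend is placed on the ring at `replicas` pseudo-random points.
        ring = []
        for backend in backends:
            for i in range(replicas):
                ring.append((self._hash(f"{backend}#{i}"), backend))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(key):
        # Hypothetical hash choice: first 8 bytes of MD5 as an integer.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, camera_id):
        # Pick the first ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(camera_id))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    # Hypothetical pod_ip:port values standing in for the generated hosts file.
    ring = HashRing(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
    print(ring.lookup("cam-42"))
```

The property that matters for this workload is that the same camera_id always maps to the same pod, and that when a pod is removed only the keys that were on that pod move elsewhere.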

The requests here are long-polling connections:
user <->[1] (pod) <->[2] cameras

  1. Cameras hold long-polling connections to a pod.
  2. User requests land on the same pod through nginx consistent hashing.
  3. Once a camera has polled the user's request and executed it, it responds to the pod, which completes the request.

However, if the camera fails to poll the request or fails to respond to it, the pod returns 504 Timeout.

Nginx settings

upstream vproxy_hash {
    hash $cameraId consistent;
    include /tmp/vproxy_nginx_hosts.conf;
    keepalive 500;
    keepalive_timeout 360s;
}

upstream vproxy_remotesh {
    hash $cameraId consistent;
    include /tmp/vproxy_nginx_remotesh_hosts.conf;

    keepalive 500;
    keepalive_timeout 360s;
}

Issues

After we onboarded ztunnel (L4 mesh), the 504 Timeout error rate went up.

I've tried:
POOL_UNUSED_RELEASE_TIMEOUT -> 400s (greater than the 360s keepalive_timeout above; this might help, and I'm currently monitoring it)
DEFAULT_POOL_MAX_STREAMS_PER_CONNECTION -> 300

I don't think this is related to ztunnel performance. ztunnel resource usage looks fair: 2 CPU cores and 9 GiB of memory on an 8xlarge node (32 CPU, 64 GiB memory), and the node only hosts pods from this workload.

It might be something in how ztunnel handles upstream/downstream connections? Maybe [this].
Or I may have missed some other potential bottleneck.

If you have any tips for debugging, please let me know. I'd really appreciate it.
