dynamic-load-balance channel can not work #2257


Open · c98 opened this issue Apr 24, 2025 · 4 comments

@c98 commented Apr 24, 2025

Bug Report

Version

v0.12.3

Platform

x86_64-unknown-linux

Crates

Description

I use tonic as a gRPC client with Channel::balance_channel(10). When the client calls the server (multiple nodes) for the first time, it works. After a short wait (about 1 min) the TCP connection status turns to CLOSE_WAIT (i.e. the server closes the connection because there are no active streams), and from then on, later calls to the server fail like this:

2025-04-24 17:55:57.317958 ERROR grpc{trace_id="2125102617454885569858015d101d"}: src/grpc/handler.rs:47: err=Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) }
2025-04-24 17:55:57.318031 ERROR grpc{trace_id="2125102617454885569858015d101d"}: src/grpc/handler.rs:63: Status {
    code: Cancelled,
    message: "operation was canceled",
    source: Some(
        tonic::transport::Error(
            Transport,
            hyper::Error(
                Canceled,
                "connection closed",
            ),
        ),
    ),
}

TCP Status:
[image: netstat screenshot showing the connection in CLOSE_WAIT]

I don't know if there is something wrong with my usage; please help me confirm. The following is how the channel is created, thank you.

fn create_channel<K>(vipkey: K) -> ChannelType
where
    K: AsRef<str> + Send + 'static,
{
    let (channel, tx) = Channel::balance_channel(10);
    let channel = ServiceBuilder::new()
        .timeout(Duration::from_secs(3))
        .service(channel);

    tokio::spawn(async move {
        let mut prev_hosts: Vec<Arc<Host>> = vec![];
        let vipkey = vipkey.as_ref();
        // Poll service discovery every 10s and diff against the previous host list.
        loop {
            let hosts = vs::srv_hosts(vipkey).await.unwrap_or_default();
            let hosts: Vec<_> = hosts.into_iter().filter(|h| h.valid).collect();
            // Insert endpoints for hosts that appeared since the last poll.
            for host in hosts.iter() {
                if prev_hosts
                    .iter()
                    .any(|h| h.ip == host.ip && h.port == host.port)
                {
                    continue;
                }
                let key = format!("http://{}:{}", host.ip.as_str(), host.port);
                let ep = Endpoint::from_shared(key.clone()).unwrap();
                let change = Change::Insert(key, ep);
                if let Err(err) = tx.send(change).await {
                    error!("send change error: {:?}", err);
                }
            }

            // Remove endpoints for hosts that disappeared from discovery.
            for host in prev_hosts.iter() {
                if hosts.iter().any(|h| h.ip == host.ip && h.port == host.port) {
                    continue;
                }
                let key = format!("http://{}:{}", host.ip.as_str(), host.port);
                let change = Change::Remove(key);
                if let Err(err) = tx.send(change).await {
                    error!("send change error: {:?}", err);
                }
            }
            prev_hosts = hosts;
            tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
        }
    });

    channel
}
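
For what it's worth, one mitigation I am going to try (an assumption on my part, not a confirmed fix) is enabling HTTP/2 keep-alive pings on each Endpoint before inserting it, so an otherwise idle connection keeps exchanging frames instead of being closed by the server. A minimal sketch (make_endpoint is just an illustrative helper):

use std::time::Duration;
use tonic::transport::Endpoint;

// Sketch only: build an endpoint with HTTP/2 keep-alive pings enabled.
// Whether this prevents the server-side idle close seen above is an assumption.
fn make_endpoint(uri: String) -> Result<Endpoint, tonic::transport::Error> {
    Ok(Endpoint::from_shared(uri)?
        // send an HTTP/2 PING every 30s, even while no streams are active
        .http2_keep_alive_interval(Duration::from_secs(30))
        .keep_alive_while_idle(true)
        // treat the connection as dead if a PING ack takes longer than 10s
        .keep_alive_timeout(Duration::from_secs(10)))
}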
@c98 changed the title from "balance channel TCP status to CLOSE_WAIT and operation was canceled." to "dynamic-load-balance channel can not work" on May 5, 2025
@c98 (Author) commented May 5, 2025

This issue reproduces 100% of the time.

When connecting to multiple C++ gRPC server endpoints, only one connection stays ESTABLISHED; the others go to CLOSE_WAIT.

The code is based on examples/src/dynamic_load_balance/client.rs:

pub mod pb {
    tonic::include_proto!("grpc.examples.unaryecho");
}

use pb::{echo_client::EchoClient, EchoRequest};
use tonic::transport::channel::Change;
use tonic::transport::Channel;
use tonic::transport::Endpoint;
use tracing::level_filters::LevelFilter;
use tracing_subscriber::fmt;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::Layer;

use tokio::time::timeout;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::registry()
        .with(
            fmt::layer()
                .with_ansi(true)
                .with_file(true)
                .with_line_number(true)
                .with_thread_ids(false)
                .with_target(false)
                .with_filter(LevelFilter::TRACE),
        )
        .init();


    let e1 = Endpoint::from_static("http://127.0.0.1:50051");
    let e2 = Endpoint::from_static("http://11.124.250.156:9090");

    // `tx` is the sender half used to push endpoint Changes into the balancer.
    let (channel, tx) = Channel::balance_channel(10);
    let mut client = EchoClient::new(channel);

    tokio::spawn(async move {
        println!("Added first endpoint");
        let change = Change::Insert("1", e1);
        let res = tx.send(change).await;
        println!("{:?}", res);
        println!("Added second endpoint");
        let change = Change::Insert("2", e2);
        let res = tx.send(change).await;
        println!("{:?}", res);
    });

    {
        tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
        let request = tonic::Request::new(EchoRequest {
            message: "hello".into(),
        });

        let call = client.unary_echo(request);
        if let Ok(resp) = timeout(tokio::time::Duration::from_secs(10), call).await {
            println!("RESPONSE={:?}", resp);
        } else {
            println!("did not receive value within 10 secs");
        }
    }

    println!("... Bye");
    tokio::time::sleep(tokio::time::Duration::from_secs(5000)).await;
    Ok(())
}
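
To make the failure easier to observe, the single-request block above can be swapped for a loop that keeps calling the server (a sketch, not part of the original example); once the 9090 connection enters CLOSE_WAIT, the calls that the balancer routes to that endpoint start failing with the Cancelled error:

    // Illustrative driver loop (replaces the single-request block above):
    // issue one echo call per second so requests keep being spread across
    // both endpoints by the balancer.
    loop {
        let request = tonic::Request::new(EchoRequest {
            message: "hello".into(),
        });
        match client.unary_echo(request).await {
            Ok(resp) => println!("RESPONSE={:?}", resp.into_inner().message),
            Err(status) => println!("ERROR code={:?}: {}", status.code(), status.message()),
        }
        tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    }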

11.124.250.156:9090 is the C++ gRPC server endpoint.
After the client issues a request, the TCP status is:

$netstat -antlp | grep :5005
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:50051         0.0.0.0:*               LISTEN      112413/dynamic-load 
tcp        0      0 127.0.0.1:50052         0.0.0.0:*               LISTEN      112413/dynamic-load 
tcp        0      0 127.0.0.1:50051         127.0.0.1:42554         ESTABLISHED 112413/dynamic-load 
tcp        0      0 127.0.0.1:42554         127.0.0.1:50051         ESTABLISHED 50880/dynamic-load- 

$netstat -antlp | grep :9090
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp       52      0 11.165.57.131:46340     11.124.250.156:9090     ESTABLISHED 50880/dynamic-load-

After about 2 minutes, the TCP status is:

$netstat -antlp | grep :9090
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp       53      0 11.165.57.131:46340     11.124.250.156:9090     CLOSE_WAIT  50880/dynamic-load- 

From now on, any client request that the balancer routes to the 11.124.250.156:9090 endpoint fails with the error described above. (CLOSE_WAIT means the server has already sent its FIN; the non-zero Recv-Q suggests the client application is no longer reading from that socket.)

@LucioFranco (Member) commented

Yeah, I think this is def a bug with tower::balance I see you made an issue there already. We should use that one to track this and then we can bump a new release when it gets "fixed" there. Though this is probably a pretty big foundational issue with balance. We are working on new grpc code with a new load balancer but that is still quite a ways out.

@c98 (Author) commented May 6, 2025

> Yeah, I think this is def a bug with tower::balance I see you made an issue there already. We should use that one to track this and then we can bump a new release when it gets "fixed" there. Though this is probably a pretty big foundational issue with balance. We are working on new grpc code with a new load balancer but that is still quite a ways out.

Thanks for your reply; I'll keep tracking it.

@c98 (Author) commented May 7, 2025

tower-rs/tower#824 (comment)

@LucioFranco I have updated my investigation of this issue; the details are in the comment referenced above. The problem is quite involved. What do you think?
