-
Bug Report

Version

I also tried with axum v0.7.9.
Also tried with the branch from #3129, but to no avail.

Platform

Darwin Greyjoy.local 24.3.0 Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000 arm64

Crates

Description

This might be related to #3112, but I'm using graceful shutdown, so it might be distinct. I'm about 99% sure this is due to me doing something wrong, but I've been beating my head against this for several hours today and can't figure it out.

The issue I'm having is that axum doesn't seem to be shutting down gracefully (or at all) when I'm passing in a CancellationToken. I'm fairly new to this.

My code looks something like this:

use axum::{Router, routing::get};
use reqwest::ClientBuilder;
use std::time::Duration;
use tokio::{net::TcpListener, signal};
use tokio_util::sync::CancellationToken;
use tower::ServiceBuilder;
use tower_http::timeout::TimeoutLayer;

pub async fn listen(listener: TcpListener, cancel_token: CancellationToken) -> eyre::Result<()> {
    let layer = ServiceBuilder::new().layer(TimeoutLayer::new(Duration::from_secs(5)));
    let app = Router::new()
        .route("/health", get(|| async { "ok" }))
        .layer(layer);
    axum::serve(listener, app)
        .with_graceful_shutdown(async move { cancel_token.cancelled().await })
        .await
        .unwrap();
    println!("Shutdown");
    Ok(())
}

async fn request(target_addr: &str) -> eyre::Result<()> {
    let url = format!("http://{}/health", target_addr);
    ClientBuilder::new()
        .timeout(Duration::from_secs(5))
        .connect_timeout(Duration::from_secs(1))
        .pool_idle_timeout(Duration::from_secs(1))
        .pool_max_idle_per_host(0)
        .http2_keep_alive_interval(None)
        .tcp_keepalive(None)
        .build()
        .expect("Failed to create HTTP client")
        .get(&url)
        .body(r#"{}"#)
        .header("Content-Type", "application/json")
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}

#[tokio::main]
async fn main() -> eyre::Result<()> {
    let server_port = std::env::args().nth(1).expect("Server port not provided");
    let peer_port = std::env::args().nth(2).expect("Peer port not provided");
    let cancel_token = CancellationToken::new();
    let cancel_token2 = cancel_token.clone();
    let cancel_token3 = cancel_token.clone();
    tokio::spawn(async move {
        let addr = format!("127.0.0.1:{}", server_port);
        listen(TcpListener::bind(&addr).await.unwrap(), cancel_token)
            .await
            .unwrap();
    });
    tokio::spawn(async move {
        let addr = format!("127.0.0.1:{}", peer_port);
        loop {
            if let Err(error) = request(&addr).await {
                eprintln!("Error: {:?}", error);
            }
            println!("Request sent");
            tokio::time::sleep(Duration::from_secs(1)).await;
            if cancel_token2.is_cancelled() {
                println!("Cancel token 2 cancelled");
                break;
            }
        }
    });
    tokio::spawn(async move {
        signal::ctrl_c().await.expect("failed to listen for event");
        cancel_token3.cancel();
    })
    .await
    .unwrap();
    Ok(())
}

Unfortunately, I think there are some other moving parts involved, since this doesn't actually reproduce the issue... it just looks kinda like the code that does. I have a project here: https://github.com/ndouglas/whispers/ that shows the issue reliably. It's a new project (I just started it yesterday), and it's fairly small and pretty simple. I know that Hell is Other People's Projects, so I wish I could create a minimal reproducible example, but I think this hinges on the behavior of requests that are in flight or connected, and I don't know how to capture that in a snippet. To reproduce, just run the program in 3-4 different terminals.
Wait for it (it should auto-configure networking and the peers should start gossipping); it'll spew a good amount of text related to mDNS broadcasts. After a few seconds, go from tab to tab and hit Ctrl-C. At least one of the instances will fail to shut down, while others will exit successfully. It'll log a few things and it'll seem like it shut down... but it didn't. If you open it up, you'll see something like this:

(screenshot)

I believe this number is equal to the number of incoming connections at that moment, but I can't verify that. If you jump into that task, you'll see it doesn't have any wakers.

(screenshot)

I've tried a lot of stuff: timeouts and connection timeouts, timeouts on the requests, moving cancel tokens around, building a shutdown manager, adding and removing functions, running with and without the graceful shutdown, etc. I'm just not sure what (if anything) I'm doing wrong.
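For reference, here's a self-contained sketch of the behavior I think is involved (this is not code from my project; the /slow route, the port, and the durations are made up): once the shutdown future fires, serve stops accepting new connections but still waits for in-flight requests to finish before it returns.

use std::time::Duration;

use axum::{Router, routing::get};
use tokio::net::TcpListener;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let cancel_token = CancellationToken::new();

    // Hypothetical slow endpoint: a request to it stays in flight for 30 seconds.
    let app = Router::new().route(
        "/slow",
        get(|| async {
            tokio::time::sleep(Duration::from_secs(30)).await;
            "done"
        }),
    );

    let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
    let addr = listener.local_addr().unwrap();

    // Fire a request that will still be in flight when we cancel.
    tokio::spawn(async move {
        let _ = reqwest::get(format!("http://{}/slow", addr)).await;
    });

    // Cancel one second in, while the request above is still running.
    let shutdown_token = cancel_token.clone();
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(1)).await;
        shutdown_token.cancel();
    });

    // The shutdown future resolves after one second, but `serve` itself does not
    // return until the in-flight request finishes (roughly 30 seconds here),
    // because graceful shutdown waits for open connections to drain.
    axum::serve(listener, app)
        .with_graceful_shutdown(async move { cancel_token.cancelled().await })
        .await
        .unwrap();

    println!("serve returned");
}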
-
What sorts of connections are these? Are you using long polling, SSE, or websockets? All of these might block shutdown of the connections, AFAIK (I don't know the details, though).
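If it is something like SSE, one way to keep those connections from pinning the server open is to end the stream when your shutdown token fires. A rough sketch (hypothetical handler and route names, not from your project):

use std::{convert::Infallible, time::Duration};

use axum::{
    Router,
    extract::State,
    response::sse::{Event, Sse},
    routing::get,
};
use futures::{Stream, StreamExt};
use tokio_stream::wrappers::IntervalStream;
use tokio_util::sync::CancellationToken;

// Hypothetical SSE endpoint: it ticks once a second, but the stream ends as soon
// as the shared CancellationToken fires, so the connection closes and graceful
// shutdown can complete.
async fn events(
    State(cancel_token): State<CancellationToken>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let ticks = IntervalStream::new(tokio::time::interval(Duration::from_secs(1)))
        .map(|_| Ok::<_, Infallible>(Event::default().data("tick")))
        // End the stream when shutdown starts instead of holding the connection open.
        .take_until(async move { cancel_token.cancelled().await });
    Sse::new(ticks)
}

fn app(cancel_token: CancellationToken) -> Router {
    Router::new()
        .route("/events", get(events))
        .with_state(cancel_token)
}

The same idea applies to websockets and long polling: make the long-lived future also watch the token so the connection ends when shutdown begins.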
-
I didn't check the project you linked, but in the code you posted here you're not actually listening for ctrl-c; you start listening for it only after the server has already finished.

#[tokio::main]
async fn main() -> eyre::Result<()> {
    let cancel_token = CancellationToken::new();

    listen(
        TcpListener::bind("127.0.0.1:0").await.unwrap(),
        cancel_token.clone(),
    )
    .await // THIS await prevents this function from making any more progress.
    .unwrap();

    // This task is not spawned until after the `listen` call has finished, so the
    // cancellation token is never cancelled.
    tokio::spawn(async move {
        signal::ctrl_c().await.expect("failed to listen for event");
        cancel_token.cancel();
    })
    .await
    .unwrap();

    Ok(())
}

You want to spawn the cancellation task first (but don't await it) and then await the listener/serve future. That said, you might be running into other issues too, based on what you're describing.
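In other words, something like this (a sketch reusing the listen function and names from your snippet):

#[tokio::main]
async fn main() -> eyre::Result<()> {
    let cancel_token = CancellationToken::new();

    // Spawn the signal listener first and do NOT await its JoinHandle here.
    let ctrl_c_token = cancel_token.clone();
    tokio::spawn(async move {
        signal::ctrl_c().await.expect("failed to listen for event");
        ctrl_c_token.cancel();
    });

    // Now await the server; it runs until ctrl-c cancels the token.
    listen(
        TcpListener::bind("127.0.0.1:0").await.unwrap(),
        cancel_token,
    )
    .await?;

    Ok(())
}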
-
Hi @mladedav @jplatte, I figured it out. I was deadlocking myself with DashMap. I don't think axum had anything to do with it, just me misusing something I was new to and confusing the symptoms.

It took me like literally 5 minutes to figure it out today. Friday and yesterday I think my head just wasn't in the right space. (Got rejected at the CEO approval stage for a job I'd been super excited about for ~2 months.)

Thank you both so much for looking at my discussion, and for offering suggestions. I deeply appreciate it. Have a lovely day 🙂
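For anyone who finds this later: the general shape of the mistake (not the literal code from my project) was holding a DashMap guard while doing another operation that needs the same shard lock, e.g.:

use dashmap::DashMap;

fn main() {
    let map: DashMap<u32, String> = DashMap::new();
    map.insert(1, "hello".to_string());

    // `get` returns a guard that holds a read lock on the key's shard...
    let value = map.get(&1).unwrap();

    // ...and `insert` on that same shard needs a write lock, so with `value`
    // still alive this call blocks forever: a single-threaded self-deadlock.
    map.insert(1, format!("{}!", *value));
}

The general fix is to drop the guard (or clone the value out) before touching the map again, and to avoid holding guards across .await points.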