-
Thank you very much for your excellent work on this library.

Question

My guess is that there is a limit on concurrent requests somehow, and thus the client is blocking on either send or receive, but I am unclear where / why.

Background
```go
conn, err := grpc.NewClient("unix:///tmp/shacl_validator.sock",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
	return SitemapCrawlStats{}, fmt.Errorf("failed to connect to gRPC server: %w", err)
}
defer conn.Close()
grpcClient := protoBuild.NewShaclValidatorClient(conn)
```
- My Tonic Server Code
```rust
use std::fs;
use std::path::Path;

use tokio::net::UnixListener;
use tokio::signal;
use tokio_stream::wrappers::UnixListenerStream;
use tonic::transport::Server;
// plus the generated ShaclValidatorServer and this project's Validator type

#[tokio::main(flavor = "multi_thread")] // defaults to the number of CPUs on the system
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "/tmp/shacl_validator.sock";

    // Remove the socket file if it already exists
    if Path::new(path).exists() {
        fs::remove_file(path)?;
    }
    std::fs::create_dir_all(Path::new(path).parent().unwrap())?;

    let uds = UnixListener::bind(path)?;
    let uds_stream = UnixListenerStream::new(uds);

    // the validator just runs one CPU-intensive function to validate a payload, then returns
    let validator = Validator::default();
    println!("Starting gRPC server on {}", path);

    // Run the server and listen for Ctrl+C
    let server = Server::builder()
        .add_service(ShaclValidatorServer::new(validator))
        .serve_with_incoming_shutdown(uds_stream, async {
            signal::ctrl_c()
                .await
                .expect("failed to install Ctrl+C handler");
        });

    let result = server.await;

    // Clean up the socket file on shutdown
    if Path::new(path).exists() {
        println!("Cleaning up socket file at {}", path);
        fs::remove_file(path)?;
    }
    result?;
    Ok(())
}
```
Timing Background

Thank you very much!
-
Nearly all of my experience with tonic is via http2. For reference, at my day job we have a Go SDK customers use (on their own servers) that connects via normal TLS http2 to remote servers (that I own) running tonic. P999 latency for payloads of this size is on the order of 1-3 milliseconds, assuming they're in the same cloud and depending on whether or not they're in the same AZ. I'm not saying this to brag, but to reassure you that multilingual gRPC with < 1s latency is a very reasonable ask 😆 ❤️

First and easiest, a bit of tokio trivia: your main is fine, except that it runs your server "off the runtime." This can cause delays, mostly for new connections (possibly more). Try changing one line:

```rust
let result = tokio::spawn(server).await;
```

This makes your server run on a runtime thread, so any task it spawns will be from within the runtime, and that makes the tasks start much more quickly. I don't think this is costing you a second, though.

Second, and probably a wild goose chase: is your Go client making only one connection for all the concurrency? I've had to go to great lengths to make high concurrency on one connection perform well.

Third, and probably the root: 200 milliseconds is too long to hold a cooperative multitasking thread. Tokio doesn't know which task you want to execute first, so in the abstract they run in unordered fashion, and your completed work items delay other work items. In Rust, tasks don't have any way to implicitly, preemptively yield in the middle of their work. They will occupy their thread unconditionally until they reach an `.await` point.

**Make your cpu work async:** Make that 200-300ms "validation" code asynchronous. Assuming you're doing a bunch of computation in some loop, you would put an await in there every thousand iterations or something to that effect, like this:

```rust
tokio::task::consume_budget().await;
```

**Run your cpu work off the runtime:** Not all thread-bound work is created equal. Consider how easily a CPU accomplishes thread::sleep() or reading bytes from a drive, versus a hard loop computing the nth digit of pi. I am not sure what kind of work your validation is, so if I had to give you a silver-bullet recommendation, I'd ask you to move that 200-300ms to a different thread pool. You can do that with:

```rust
let validation_result = tokio::task::spawn_blocking(move || {
    run_200ms_validation(the_request)
});
```

or by creating a dedicated thread pool of your own.

Lastly, some obligatory engineering theory, because I can't help myself: if you have 100 computations that each take 0.2 seconds, that's 20 seconds of CPU time that has to come from somewhere. By Little's Law, a 16-core CPU can only retire a CPU-bound 200ms task at a rate of 16 cores / 0.2 s per task = 80 requests per second. If you make requests at a higher rate than that, you are queueing and your latency will increase, irrespective of libraries, tech stacks, architectures, or anything else.
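To make the "make your cpu work async" option concrete, here is a minimal sketch of what a cooperatively yielding validation loop could look like. This is not the author's actual code; `Triple`, `ValidationReport`, and `check_triple` are hypothetical stand-ins for whatever the real validation iterates over:

```rust
// Hypothetical stand-ins for the real validation types (assumptions,
// not the actual SHACL code discussed in this thread).
struct Triple;

#[derive(Default)]
struct ValidationReport {
    violations: usize,
}

fn check_triple(_triple: &Triple, _report: &mut ValidationReport) {
    // ... real constraint checking would go here ...
}

// Sketch of the "make your cpu work async" option: the CPU loop stays on the
// tokio runtime but voluntarily yields every thousand iterations.
async fn validate_cooperatively(triples: &[Triple]) -> ValidationReport {
    let mut report = ValidationReport::default();
    for (i, triple) in triples.iter().enumerate() {
        check_triple(triple, &mut report);
        if i % 1000 == 0 {
            // consume_budget() only yields back to the scheduler once the
            // task's cooperative budget is exhausted, so calling it often is cheap.
            tokio::task::consume_budget().await;
        }
    }
    report
}
```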
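For the "run your cpu work off the runtime" option, a dedicated pool can also be built with rayon instead of using spawn_blocking. A minimal sketch, assuming the rayon and tokio crates and keeping the reply's hypothetical `run_200ms_validation` placeholder (not the author's actual code):

```rust
use tokio::sync::oneshot;

// Run one piece of CPU-bound work on a dedicated rayon pool and hand the
// result back to async code over a oneshot channel.
fn spawn_on_pool<T, F>(pool: &rayon::ThreadPool, work: F) -> oneshot::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    pool.spawn(move || {
        // The receiver may already be gone (e.g. the request was cancelled),
        // so a failed send is simply ignored.
        let _ = tx.send(work());
    });
    rx
}

// Usage sketch (the pool would be built once at startup and stored,
// not rebuilt per request):
//
//   let pool = rayon::ThreadPoolBuilder::new().build()?; // sized to the CPU count by default
//   let rx = spawn_on_pool(&pool, move || run_200ms_validation(the_request));
//   let report = rx.await.expect("validation task dropped");
```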
-
Thank you so much for your answer! That indeed made a huge speedup, and after some additional caching on the schema validation I am doing, I am now getting < 5 ms round trip times. This is awesome! Thanks! Context for what I did for others:
```rust
// Run the server and listen for Ctrl+C
let server = Server::builder()
    // this service is the one I custom defined in tonic
    .add_service(ShaclValidatorServer::new(validator))
    .serve_with_incoming_shutdown(uds_stream, async {
        signal::ctrl_c()
            .await
            .expect("failed to install Ctrl+C handler");
    });
// Make sure that the server runs on the runtime
let result = tokio::spawn(server).await?;
```
```rust
#[tonic::async_trait]
impl ShaclValidator for Validator {
    /// Validates the triples in the request using both dataset-oriented and location-oriented validation.
    /// Returns a ValidationReply with the validation result.
    async fn validate(
        &self,
        request: Request<TurtleValidationRequest>,
    ) -> Result<Response<ValidationReply>, Status> {
        println!("Received request");
        let start = std::time::Instant::now();
        let req = request.into_inner();

        let triples = Arc::new(req.triples);
        let dataset_triples = triples.clone();
        let location_triples = triples.clone();
        let dataset_schema = self.dataset_schema.clone();
        let location_schema = self.location_schema.clone();

        // takes about 200ms but now occurs in such a way that the main async runtime is not blocked
        let dataset_handle = tokio::task::spawn_blocking(move || {
            let start_validation = std::time::Instant::now();
            let res = validate_triples(&dataset_schema, &dataset_triples);
            println!("Dataset validation took: {:?}", start_validation.elapsed());
            res
        });

        // takes about 200ms but now occurs in such a way that the main async runtime is not blocked
        let location_handle = tokio::task::spawn_blocking(move || {
            let start_validation = std::time::Instant::now();
            let res = validate_triples(&location_schema, &location_triples);
            println!("Location validation took: {:?}", start_validation.elapsed());
            res
        });

        let (dataset_validation_report, location_validation_report) =
            tokio::try_join!(dataset_handle, location_handle)
                .map_err(|e| Status::internal(format!("Join error: {:?}", e)))?;
```

This got it down to around 400 ms; I think some of the concurrent allocations were still making it slow?
println!("Received request");
let start = std::time::Instant::now();
let req = request.into_inner();
let dataset_validation_report = self.validate_dataset_oriented(&req.triples);
let location_validation_report = self.validate_location_oriented(&req.triples);
println!("Validation took {:?}", start.elapsed()); This ended up getting me <3ms on average for my round trip
Thank you very much again; I appreciate it!