Strange CPU Usage/Scheduling Behavior #4753
-
When moving from an 8-core to a 32-core machine, I see the same behavior. Despite 4x the number of cores and an overall faster completion time, there is still an unexplained period where the CPUs sit at 0% utilization, sandwiched between periods of 100% usage.
-
Sorry, did you say 0.19.1? Or is this 1.19.1?
-
Thanks, that's a typo. I'm on the latest: 1.19.1.
-
You say that you are using mpsc channels. Are your channels bounded? Perhaps your threads are sleeping because they are trying to send a message on a bounded channel that doesn't have space for the message?
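For illustration, a minimal sketch (not the poster's code) of the failure mode being suggested: with a bounded `tokio::sync::mpsc` channel, `send().await` parks the sending task until the receiver frees a slot, so a stalled consumer shows up as idle CPUs rather than a busy core.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded channel with room for only 2 in-flight messages.
    let (tx, mut rx) = mpsc::channel::<u64>(2);

    let producer = tokio::spawn(async move {
        for i in 0..10u64 {
            // If the receiver is not draining, this await parks the task:
            // the thread goes idle rather than spinning at 100% CPU.
            tx.send(i).await.expect("receiver dropped");
            println!("sent {i}");
        }
    });

    // A deliberately slow consumer, so the producer spends most of its
    // time waiting for channel capacity.
    while let Some(i) = rx.recv().await {
        tokio::time::sleep(Duration::from_millis(200)).await;
        println!("received {i}");
    }

    producer.await.unwrap();
}
```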
-
Thanks, they are indeed bounded. I had a limit of 1000 on all three of the channels. If one of them were full, I'd expect at least one core to show high usage while the backlog is drained. To test, I increased the bound on all the channels to 1,000,000, which is much larger than the total number of messages sent throughout execution. There was no change in behavior: still unexplained periods of 0% CPU usage. I'll see if I can swap out the computation with some boilerplate and share the code.
-
Here is a minimal reproducing sample: https://github.com/xanderdunn/tokio-sample. I'm sure it's architected sub-optimally, so if any feedback comes to mind when you take a look, please let me know.
-
That's still a lot of code for me to go through, so I've only skimmed some of the files. However, the widespread usage of locks makes me uncomfortable. I would investigate whether you're holding a lock somewhere that is preventing the threads from making progress. I have a draft for a blog post about this kind of thing. I haven't finished it and the project is currently on hold, but you can read it here: link
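One common shape of that advice, sketched here with placeholder types (this is not the blog post's code): keep the lock private to a small struct and expose synchronous methods, so a guard is never held across an `.await`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Wraps the lock so callers never see the guard and therefore
/// cannot accidentally hold it across an `.await`.
#[derive(Clone, Default)]
struct SharedResults {
    inner: Arc<Mutex<HashMap<u64, String>>>,
}

impl SharedResults {
    fn insert(&self, key: u64, value: String) {
        // Lock, mutate, and release inside one synchronous call.
        self.inner.lock().unwrap().insert(key, value);
    }

    fn get(&self, key: u64) -> Option<String> {
        self.inner.lock().unwrap().get(&key).cloned()
    }
}
```

Async code then calls these methods directly; since the guard never crosses an `.await`, a held lock cannot keep a worker thread parked while the task is suspended.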
-
This is a great post, thanks very much for sharing! Some points that stood out for me:

- I definitely have some encapsulation and cleanup of my lock usage to do.
- I am essentially doing this: I am storing two channel …

I don't yet know if this is the source of my issue, but I'll work on these directions.
-
I encapsulated all of my lock usage into structs as described in your article. See here. Unfortunately, this doesn't appear to have affected the 0% CPU usage stalling behavior. However, the code is much nicer; the `with_*` approach is very nice! I also tried switching out all of my … This is in line with my above observation:

> If no write locks are being acquired, I wouldn't expect locks to be degrading performance.
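For readers following along, the `with_*` pattern mentioned here typically looks something like the sketch below (placeholder types, not the actual commit): the caller passes a closure, and the lock is acquired and released entirely inside the method.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Default)]
struct NodeState {
    peers: Arc<RwLock<HashMap<u32, String>>>,
}

impl NodeState {
    /// Run `f` with read access to the map; the guard never escapes.
    fn with_peers<R>(&self, f: impl FnOnce(&HashMap<u32, String>) -> R) -> R {
        let guard = self.peers.read().unwrap();
        f(&guard)
    }

    /// Run `f` with write access to the map.
    fn with_peers_mut<R>(&self, f: impl FnOnce(&mut HashMap<u32, String>) -> R) -> R {
        let mut guard = self.peers.write().unwrap();
        f(&mut guard)
    }
}

fn example(state: &NodeState) -> usize {
    // The closure runs synchronously while the lock is held,
    // so there is no way to `.await` with the guard alive.
    state.with_peers(|peers| peers.len())
}
```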
-
I've narrowed it down to specifically a lag between the sending and receiving of messages over the network's bidirectional gRPC streams, even in the absence of all computationally intensive work, so I posted on tonic, since it's doing the network communication here.
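One way to confirm this kind of lag, sketched here as an assumption rather than the actual measurement used, is to stamp each message when it is handed to the outgoing stream and log the difference when it arrives (this assumes sender and receiver clocks are comparable, which holds when all containers run on one host).

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Wire format carrying the payload plus the moment it was enqueued.
/// (In a real protobuf message this would be an extra uint64 field.)
struct StampedMsg {
    sent_unix_micros: u128,
    payload: Vec<u8>,
}

fn now_micros() -> u128 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before unix epoch")
        .as_micros()
}

fn stamp(payload: Vec<u8>) -> StampedMsg {
    StampedMsg { sent_unix_micros: now_micros(), payload }
}

fn log_lag(msg: &StampedMsg) {
    let lag = now_micros().saturating_sub(msg.sent_unix_micros);
    // Lag far above a few milliseconds on a single host points at
    // queuing between sender and receiver, not at the computation.
    println!("message lag: {} µs ({} bytes)", lag, msg.payload.len());
}
```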
-
This issue ended up being specific to my use of Docker containers. When I perform exactly the same test on my local machine without containers, there is no lag in messaging at all. I haven't figured out what about the Docker containers is causing the problem, but I am at least unblocked for my performance testing.
-
Thanks for this stellar library. I only began using it last week, and it's likely going to make its way into production for us.
I have a gRPC server using tonic with bidirectional streams. To test it, we use Docker Compose to spin up 75 nodes, each running the same Rust binary on an 8-core machine. They perform some work in `tokio::spawn(async ...)` and `tokio::task::spawn_blocking` tasks, share the results with each other over gRPC bidirectional `tokio_stream::wrappers::ReceiverStream`s, and then exit. Each node computes 225 cryptographic values and sends its values to every other node via `ReceiverStream`s, for a total of 75*225 values computed and sent over streams.

I'm seeing some strange behavior where, in the middle of the computationally intense process of creating and sharing all of those values, the CPU usage just goes to 0% and nothing happens for 10-30 seconds. Then, after some time and for no apparent reason, the processors suddenly max out again until the job completes. See a video of this here: https://youtu.be/c9UQPLjj6jM. The video starts after node setup has completed and the computationally and message intensive portion we're interested in begins. You'll see 100% CPU usage, followed by a suddenly silent period of 0% usage. This abrupt 100% -> 0% -> 100% happens a couple of times.
At smaller numbers of nodes, like 6 or 20, I do not see this behavior. The CPUs max out until the task is complete. This is what I would expect.
Things I'm checking:

- I have an `Arc<RwLock<HashMap>>` object, but I put a log statement before the acquiring of every `.write()`, and none of them occur during this phase of the program, only during startup.
- The heavy computation is done in `tokio::task::spawn_blocking` tasks and the results are put on mpsc channels so that async tasks can make use of them. All of the computationally expensive work should be in blocking tasks. The total number of calls to `tokio::task::spawn_blocking` across all 75 nodes on a successful end-to-end run is 450 (not all are alive simultaneously).

I realize this is impossible to debug without a minimal reproducing example, but I wonder if any debugging ideas come to mind.
tokio 1.19.1. Ubuntu 20.04.
Is it possible worker thread scheduling may be expected to perform this way under high CPU load?
Things I could try next:

- `spawn_blocking` …

It's a relatively simple program: a single file with 640 lines of code. If I can't make any progress, I should be able to replace the computation with some boilerplate and copy the tokio/tonic logic to share.
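Since the post describes the program's shape rather than listing code, here is a rough sketch of that shape under stated assumptions (the names and the computation are placeholders, and in the real service the stream would be returned from a tonic streaming handler rather than consumed locally): heavy work runs in `tokio::task::spawn_blocking`, results flow through a bounded `mpsc` channel, and the receiver is wrapped in a `tokio_stream::wrappers::ReceiverStream`.

```rust
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

/// Placeholder for the expensive cryptographic computation.
fn compute_value(i: u32) -> Vec<u8> {
    i.to_be_bytes().to_vec()
}

/// Compute `n` values on the blocking pool and expose them as a stream.
/// In the real service this stream would be handed to a tonic
/// bidirectional-streaming handler instead of being consumed locally.
fn value_stream(n: u32) -> ReceiverStream<Vec<u8>> {
    let (tx, rx) = mpsc::channel(1000);

    tokio::spawn(async move {
        for i in 0..n {
            // CPU-heavy work stays off the async worker threads.
            let value = tokio::task::spawn_blocking(move || compute_value(i))
                .await
                .expect("blocking task panicked");

            // Backpressure: this await parks if the consumer falls behind.
            if tx.send(value).await.is_err() {
                break; // receiver gone, stop producing
            }
        }
    });

    ReceiverStream::new(rx)
}

#[tokio::main]
async fn main() {
    let mut stream = value_stream(225);
    let mut count = 0usize;
    while let Some(_value) = stream.next().await {
        count += 1;
    }
    println!("received {count} values");
}
```

The bounded channel is what gives the stream backpressure; with an effectively unbounded channel (as tested earlier in the thread) the producer never parks waiting for capacity.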