What's the reason behind significant latency between CPU and TPU data transmission (way higher than "data size" / "available bandwidth")? #29480
JianmingTONG asked this question in General
We are currently working on a project involving both a custom JAX C++ kernel (running on CPU) and a JAX kernel on TPU. In our setup, the CPU kernel performs data preprocessing, and the processed results are transferred to the TPU for further computation using jax.device_put(). The corresponding code snippet is shown in Figure 1 below.
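The actual snippet is in Figure 1; since the image is not reproduced here, the following is a minimal sketch of the pattern described, with illustrative names (`preprocess_on_cpu` stands in for the custom C++ kernel, and the sketch falls back to the default device when no TPU is available):

```python
import numpy as np
import jax
import jax.numpy as jnp

# Illustrative stand-in for the custom CPU-side C++ preprocessing kernel.
def preprocess_on_cpu(batch: np.ndarray) -> np.ndarray:
    return (batch - batch.mean()) / (batch.std() + 1e-8)

# Use a TPU if one is attached; otherwise fall back to the default device.
try:
    target = jax.devices("tpu")[0]
except RuntimeError:
    target = jax.devices()[0]

@jax.jit
def compute(x):
    # Illustrative TPU-side computation.
    return jnp.sum(x * x)

batch = np.random.rand(1024, 1024).astype(np.float32)
host_result = preprocess_on_cpu(batch)              # runs on the host CPU
device_array = jax.device_put(host_result, target)  # host -> device copy
out = compute(device_array)
```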
To better understand the performance characteristics of this pipeline, we collected TPU and CPU execution traces. Please find attached two screenshots illustrating four iterations of the CPU–TPU processing flow:
Figure 2 shows a high-level trace:
Figure 3 provides a zoomed-in view of the region marked by the black oval in Figure 2:
Our main question concerns the overhead observed immediately after jax.device_put(): the end-to-end transfer latency is far higher than data size divided by available bandwidth would predict.
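One detail worth ruling out when interpreting such traces: JAX dispatches operations (including jax.device_put) asynchronously, so a wall-clock measurement that does not block can attribute the copy cost to whatever op is timed next. A minimal timing sketch, with the array size and device choice purely illustrative:

```python
import time
import numpy as np
import jax

x = np.random.rand(2048, 2048).astype(np.float32)  # ~16 MiB, illustrative
dev = jax.devices()[0]  # would be a TPU device in the actual setup

start = time.perf_counter()
y = jax.device_put(x, dev)
y.block_until_ready()  # device_put is async; wait for the copy to finish
elapsed = time.perf_counter() - start
```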
For context, we’ve attached the relevant codebase and a README with reproduction instructions: https://drive.google.com/file/d/1yTJVLM-PPaKcpCjDXlCbof3DUErH9c2C/view?usp=sharing
We would greatly appreciate your insights on optimizing this data transfer and minimizing preprocessing bottlenecks!
Best,
Jianming and Jingtian
Figure 1 (code snippet; image not reproduced here)
Figure 2 (high-level trace; image not reproduced here)
Figure 3 (zoomed-in view; image not reproduced here)