Add debug latency logs for all-purpose compute #862


Open: wants to merge 2 commits into main

Conversation

jayantsing-db (Collaborator) commented Jun 24, 2025

NO_CHANGELOG=true

Description

Until telemetry is rolled out, add debug latency logs to analyze the Thrift server HTTP calls.

Lower-level HTTP transport logs

2025-06-24 13:54:09 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksHttpTTransport#flush - Connection [26021ed8-74eb-47fe-9c9f-89c4310daa9b] Header refresh latency: 1537ms

2025-06-24 13:54:10 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksHttpTTransport#flush - Connection [900a5c8d-332c-45a8-b9c7-962bfbb89d45] HTTP request latency: 1346ms

Thrift Accessor logs

2025-06-24 13:54:10 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksThriftAccessor#getThriftResponse - Connection [900a5c8d-332c-45a8-b9c7-962bfbb89d45] Thrift request latency (TOpenSessionReq): 2967ms

2025-06-24 14:07:25 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksThriftAccessor#execute - Connection [303aebdd-ccf7-4152-b003-945aaf2bebcb] Statement [49294d43-56d0-4170-bb70-3707733e07ed|0ee7cdf5-d339-4e04-9408-c67c73440830] Session [5acd47b3-dc29-490c-5acd-47b3dc29490c (AllPurpose cluster with clusterId {0613-231111-59eitsdx} and orgId {6051921418418893})] Thrift polling latency: 1ms

2025-06-24 14:07:35 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksThriftAccessor#execute - Connection [1949a7a5-17d2-4b89-9761-091d0edc14d0] Statement [9fc42a20-7ddc-4ddc-8933-15c6983946c5|4553e64a-45bc-4295-8934-5954ebff4780] Session [f4159b0e-883f-4136-f415-9b0e883f4136 (AllPurpose cluster with clusterId {0613-231111-59eitsdx} and orgId {6051921418418893})] Thrift fetch latency: 342ms

2025-06-24 14:07:25 FINE com.databricks.jdbc.dbclient.impl.thrift.DatabricksThriftAccessor#execute - Connection [303aebdd-ccf7-4152-b003-945aaf2bebcb] Statement [49294d43-56d0-4170-bb70-3707733e07ed|0ee7cdf5-d339-4e04-9408-c67c73440830] Session [5acd47b3-dc29-490c-5acd-47b3dc29490c (AllPurpose cluster with clusterId {0613-231111-59eitsdx} and orgId {6051921418418893})] Thrift execute latency (SQL): 725ms
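The log lines above all follow the same shape: a timestamp, a FINE-level logger, the connection (and optionally statement/session) identifiers, and an elapsed time in milliseconds. A minimal, hypothetical sketch of the pattern that could produce such lines (the helper name and signature are illustrative, not the driver's actual API):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: capture a start timestamp, run the call, then emit a
// FINE log with the elapsed milliseconds and the connection identifier,
// matching the format of the sample log lines above.
public class LatencyLogSketch {
  private static final Logger LOGGER = Logger.getLogger(LatencyLogSketch.class.getName());

  static String timedCall(String connectionId, String operation, Runnable call) {
    long start = System.currentTimeMillis();
    call.run();
    long elapsedMs = System.currentTimeMillis() - start;
    String message =
        String.format("Connection [%s] %s latency: %dms", connectionId, operation, elapsedMs);
    LOGGER.log(Level.FINE, message);
    return message;
  }

  public static void main(String[] args) {
    // Example: time a no-op standing in for an HTTP request.
    System.out.println(timedCall("900a5c8d", "HTTP request", () -> {}));
  }
}
```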

Testing

Manual testing

Additional Notes to the Reviewer


Please ensure that the NEXT_CHANGELOG.md file is updated with any relevant changes.
If this is not necessary for your PR, please include the following in your PR description:
NO_CHANGELOG=true
and rerun the job.


github-actions bot commented Jun 24, 2025

📊 Code Coverage Report

Overall project: 86.18% (-0.07%) 🟢
Files changed:   90.98% 🟢

File                           Coverage
DatabricksThriftAccessor.java  85.11% (-1.33%) 🟢
DatabricksHttpTTransport.java  80.40% (-5.20%) 🔴

@@ -312,6 +401,20 @@ DatabricksResultSet executeAsync(
throw new DatabricksSQLException(response.status.errorMessage, response.status.sqlState);
}
} catch (DatabricksSQLException | TException e) {
long executeAsyncEndTime = System.currentTimeMillis();
Reviewer (Collaborator): Each of these is laced with DatabricksMetricsTimedProcessor; can we maybe use that to log the timings?

jayantsing-db (Author): Yes, I would ideally have used that, but that annotation accesses the connection context, statement ID, etc. through the context holder, which is broken. Many of these logs need that info.

Reviewer (Collaborator):

> context holder which is broken

The aim is just to print the time these functions take, not to exportLatencyLog(executionTime); maybe we just log the time and the function name there?
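The reviewer's suggestion of reusing DatabricksMetricsTimedProcessor to log only a method name and elapsed time could, in spirit, look like a timing proxy. The sketch below is hypothetical (the class, interface, and wiring are illustrative, not the driver's actual implementation); it shows why such a wrapper cannot see statement or session IDs unless they are passed in from outside:

```java
import java.lang.reflect.Proxy;

// Hypothetical sketch of an annotation-style timed processor: a dynamic proxy
// that logs each invoked method's name and elapsed time. Note it only sees the
// Method object, which is why statement/session IDs would have to come from a
// context holder (the broken piece discussed in this thread).
public class TimedProxySketch {
  @SuppressWarnings("unchecked")
  static <T> T wrap(T target, Class<T> iface) {
    return (T)
        Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] {iface},
            (proxy, method, args) -> {
              long start = System.currentTimeMillis();
              try {
                return method.invoke(target, args);
              } finally {
                long elapsedMs = System.currentTimeMillis() - start;
                System.out.println(method.getName() + " latency: " + elapsedMs + "ms");
              }
            });
  }

  // Illustrative interface standing in for a Thrift accessor.
  interface Accessor {
    String fetch();
  }

  public static void main(String[] args) {
    Accessor timed = wrap((Accessor) () -> "rows", Accessor.class);
    System.out.println(timed.fetch());
  }
}
```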

jayantsing-db (Author): I think the statement ID, session ID, etc. are important for further analysis, for example correlating with the cluster logs. Just the method name would give us client-side timings with no way to relate them to the cluster for detailed analysis.

Reviewer (Collaborator): The complete logs shared by the customer will anyway have the initial connection log, context, etc. Is that not the case?

jayantsing-db (Author): No, they won't have statement IDs and session IDs for concurrent connections. And even if they did, without attaching IDs to the latency logs there is no guaranteed way, under concurrent execution, to map a latency log for a method name to the corresponding session ID and statement ID.

Reviewer (Collaborator): The connectionContext bug is a one-off case; we can trade that off against polluting our implementation, I guess. If we are not OK with that, can you create a TODO to remove all of this once the task has been taken up? https://databricks.atlassian.net/browse/PECOBLR-389

jayantsing-db (Author): Added a TODO to remove these once the timed processor annotation has access to the session and operation handles (or the latency issue is resolved).
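The diff hunk reviewed above captures the end time inside the catch block, so the latency is recorded even when executeAsync throws. A condensed, hypothetical sketch of that pattern (class and method names are illustrative, not the driver's code):

```java
// Hypothetical sketch of timing a call on both the success and the failure
// path, as the reviewed diff does for executeAsync: the elapsed time is
// computed even when the underlying Thrift call throws.
public class ExecuteAsyncTimingSketch {
  static long timeCall(Runnable thriftCall) {
    long start = System.currentTimeMillis();
    try {
      thriftCall.run();
      return System.currentTimeMillis() - start; // success path
    } catch (RuntimeException e) {
      // Failure path: capture the end time before rethrowing, mirroring the
      // executeAsyncEndTime capture inside the catch block in the diff.
      long elapsedMs = System.currentTimeMillis() - start;
      System.out.println("executeAsync failed after " + elapsedMs + "ms");
      throw e;
    }
  }

  public static void main(String[] args) {
    System.out.println("latency: " + timeCall(() -> {}) + "ms");
  }
}
```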
