[Feature] Inject invocation_id into user_agent for BigQuery for better tracking in Logs Explorer #1280

@jeremyyeo

Description

@jeremyyeo

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Caveat: not a GCP expert by any means.

When we run any SQL statement (however trivial) on BigQuery, GCP appears to break it out into many individual log events in the Logs Explorer. Let's look at a quick example:

# dbt_project.yml
name: analytics
profile: all
version: "1.0.0"

models:
  analytics:
    +materialized: table

-- models/foo.sql
select 1 c

$ dbt --debug run

03:52:57  Got an exception trying to initialize tracking
03:52:57  Running with dbt=1.10.9
03:52:57  running dbt with arguments {'log_format': 'default', 'log_path': '/Users/jeremy/git/dbt-basic/logs', 'indirect_selection': 'eager', 'version_check': 'True', 'cache_selected_only': 'False', 'use_colors': 'True', 'introspect': 'True', 'send_anonymous_usage_stats': 'True', 'warn_error_options': 'WarnErrorOptionsV2(error=[], warn=[], silence=[])', 'use_experimental_parser': 'False', 'printer_width': '80', 'write_json': 'True', 'invocation_command': 'dbt --debug run', 'log_cache_events': 'False', 'no_print': 'None', 'target_path': 'None', 'fail_fast': 'False', 'partial_parse': 'True', 'warn_error': 'None', 'empty': 'False', 'profiles_dir': '/Users/jeremy/.dbt', 'debug': 'True', 'static_parser': 'True', 'quiet': 'False'}
03:52:59  Registered adapter: bigquery=1.10.1
03:52:59  checksum: 6543b8b248ceda473ef0d611849d5d909085b6b714afa9b515e9635faea7af23, vars: {}, profile: , target: , version: 1.10.9
03:52:59  Unable to do partial parsing because a project config has changed
03:53:00  Wrote artifact WritableManifest to /Users/jeremy/git/dbt-basic/target/manifest.json
03:53:00  Wrote artifact SemanticManifest to /Users/jeremy/git/dbt-basic/target/semantic_manifest.json
03:53:00  Found 1 model, 496 macros
03:53:00  
03:53:00  Concurrency: 4 threads (target='bq')
03:53:00  
03:53:00  Acquiring new bigquery connection 'master'
03:53:00  Acquiring new bigquery connection 'list_cse-sandbox-319708'
03:53:00  Opening a new connection, currently in state init
03:53:02  Re-using an available connection from the pool (formerly list_cse-sandbox-319708, now list_cse-sandbox-319708_dbt_jyeo)
03:53:02  Opening a new connection, currently in state closed
03:53:03  Opening a new connection, currently in state init
03:53:03  Began running node model.analytics.foo
03:53:03  1 of 1 START sql table model dbt_jyeo.foo ...................................... [RUN]
03:53:03  Re-using an available connection from the pool (formerly list_cse-sandbox-319708_dbt_jyeo, now model.analytics.foo)
03:53:03  Began compiling node model.analytics.foo
03:53:03  Writing injected SQL for node "model.analytics.foo"
03:53:03  Began executing node model.analytics.foo
03:53:03  Opening a new connection, currently in state closed
03:53:04  Writing runtime sql for node "model.analytics.foo"
03:53:04  On model.analytics.foo: /* {"app": "dbt", "dbt_version": "1.10.9", "profile_name": "all", "target_name": "bq", "node_id": "model.analytics.foo"} */
    create or replace table `cse-sandbox-319708`.`dbt_jyeo`.`foo`
    OPTIONS()
    as (
select 1 c
    );
03:53:05  BigQuery adapter: https://console.cloud.google.com/bigquery?project=cse-sandbox-319708&j=bq:US:ce121c01-395a-47ca-88e2-28d7d0a7b531&page=queryresults
03:53:07  1 of 1 OK created sql table model dbt_jyeo.foo ................................. [CREATE TABLE (1.0 rows, 0 processed) in 3.91s]
03:53:07  Finished running node model.analytics.foo
03:53:07  Opening a new connection, currently in state closed
03:53:07  Connection 'master' was properly closed.
03:53:07  Connection 'model.analytics.foo' was properly closed.
03:53:07  
03:53:07  Finished running 1 table model in 0 hours 0 minutes and 7.20 seconds (7.20s).
03:53:07  Command end result
03:53:07  Wrote artifact WritableManifest to /Users/jeremy/git/dbt-basic/target/manifest.json
03:53:07  Wrote artifact SemanticManifest to /Users/jeremy/git/dbt-basic/target/semantic_manifest.json
03:53:07  Wrote artifact RunExecutionResult to /Users/jeremy/git/dbt-basic/target/run_results.json
03:53:07  
03:53:07  Completed successfully
03:53:07  
03:53:07  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
03:53:07  Resource report: {"command_name": "run", "command_success": true, "command_wall_clock_time": 9.875144, "process_in_blocks": "0", "process_kernel_time": 0.488256, "process_mem_max_rss": "391938048", "process_out_blocks": "0", "process_user_time": 2.570171}
03:53:07  Command `dbt run` succeeded at 15:53:07.533506 after 9.88 seconds
03:53:07  Flushing usage events

A single create or replace ... statement was sent to BQ.

Observations:

BQ Job History UI

[screenshot: BQ Job History UI]

From the BQ Job History UI, we can see the entry and the associated query + the dbt invocation id - so we can tie that back to the dbt invocation.

GCP Logs Explorer

[screenshot: GCP Logs Explorer entries]

That single SQL statement execution materialized into 7 different log entries in GCP Logs Explorer. (I cannot explain why it became 7 entries; I will call these "GCP Backend Operations".)

Of the 7 entries, 3 of them:

  • methodName: "jobservice.getqueryresults"
  • methodName: "jobservice.jobcompleted"
  • methodName: "jobservice.insert"

will show jobConfiguration.labels.dbt_invocation_id: "5d52c585-4241-4058-b4b6-a5dd73e95959"

The other 4:

  • methodName: "google.cloud.bigquery.v2.JobService.InsertJob"
  • methodName: "google.cloud.bigquery.v2.JobService.InsertJob"
  • methodName: "google.cloud.bigquery.v2.JobService.InsertJob"
  • methodName: "google.cloud.bigquery.v2.JobService.InsertJob"

do not have any key that reflects the invocation id.

All 7 of them have a user-agent key:

requestMetadata.callerSuppliedUserAgent: "dbt-bigquery-1.10.1 gl-python/3.11.9 grpc/1.74.0 gax/2.25.1 gapic/3.35.1 gccl/3.35.1,gzip(gfe)"

The proposal is to inject the invocation id into the user-agent string, so that we would see something like:

requestMetadata.callerSuppliedUserAgent: "dbt-bigquery-1.10.1 5d52c585-4241-4058-b4b6-a5dd73e95959 gl-python/3.11.9 grpc/1.74.0 gax/2.25.1 gapic/3.35.1 gccl/3.35.1,gzip(gfe)"
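With the invocation id embedded in the user agent, a single Logs Explorer filter could isolate every backend operation for one dbt invocation, including the InsertJob entries that carry no labels. A hypothetical filter, assuming the augmented user-agent format above (in Cloud Audit Logs the requestMetadata field sits under protoPayload):

```
protoPayload.requestMetadata.callerSuppliedUserAgent:"5d52c585-4241-4058-b4b6-a5dd73e95959"
```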

P.S. This fan-out of a single statement / BQ job entry into N Logs Explorer entries (aka "GCP Backend Operations") is not specific to dbt; simply running a create statement in the BQ UI itself will do the same.

Describe alternatives you've considered

NA

Who will this benefit?

Large enterprise customers looking to attribute dbt invocations to individual log lines in their audit tooling / Logs Explorer.

Are you interested in contributing this feature?

No response

Anything else?

We can probably import the invocation_id-generating method and inject it here:

client_info=ClientInfo(user_agent=f"dbt-bigquery-{dbt_version.version}"),

Maybe something like:

# dbt-bigquery/src/dbt/adapters/bigquery/clients.py
from dbt_common.invocation import get_invocation_id

...

@BQ_DEFAULT_RETRY
def _create_bigquery_client(credentials: BigQueryCredentials) -> BigQueryClient:
    return BigQueryClient(
        credentials.execution_project,
        create_google_credentials(credentials),
        location=getattr(credentials, "location", None),
        client_info=ClientInfo(user_agent=f"dbt-bigquery-{dbt_version.version} invocation_id:{get_invocation_id()}"),
        client_options=ClientOptions(
            quota_project_id=credentials.quota_project, api_endpoint=credentials.api_endpoint
        ),
    )

...
