[Feature] Allow to set job name in OpenLineage events #25535

Open
dolfinus opened this issue Apr 9, 2025 · 2 comments · May be fixed by #25704
Comments

@dolfinus
Contributor

dolfinus commented Apr 9, 2025

Currently, the OpenLineage integration uses queryId as the jobName field value:
https://github.com/trinodb/trino/blob/474/plugin/trino-openlineage/src/main/java/io/trino/plugin/openlineage/OpenLineageListener.java#L249

This is not very convenient: each queryId is unique, and it means nothing to the end user. Instead, consider allowing users to set a custom jobName using session variables:

SET SESSION openlineage-event-listener.job.name = 'myawesomejob';
X-Trino-Session: openlineage-event-listener.job.name=myawesomejob

Another option is to use X-Trino-Client-Info or X-Trino-Source, but these can contain data populated by low-level clients (Python client, HTTP client, JDBC driver and so on), and they are usually used to identify a particular client or piece of software, not a session.

@dolfinus dolfinus linked a pull request Apr 29, 2025 that will close this issue
@dolfinus
Contributor Author

dolfinus commented Apr 29, 2025

Unfortunately, only system and catalog properties can be overridden this way, and OpenLineage is neither of those - it's an EventListener with static configuration.

Also, using some query context fields as a new default for job name could lead to new issues:

  • X-Trino-Client-Info is an arbitrary string where some clients pass internal data. For example, Airflow TrinoHook passes a JSON value here with some Airflow DAG+Task info, which changes between runs. In contrast, OpenLineage job names should be the same for every run of the same script/query.
  • X-Trino-Source sounds like a more suitable alternative, but it has a default value provided by the integration - trino-cli, trino-python-client, trino-jdbc, airflow, etc. So this static value would be bound to queries triggered by different users, which can produce a total mess in the lineage graph.

In #25704 I've implemented a different approach - a configurable job name with several substitutions (queryId, source, user, principal). So Trino admins may configure the OpenLineage integration depending on how Trino is used in their company/environment:

  • If users have to pass a distinct source, such as X-Trino-Source: My Awesome session, for each logical set of queries (e.g. those produced by the same ETL script/task), admins can use ${source}.
  • If the source of a query doesn't matter, but only the user does, they could use ${user} or ${source}-${user} instead.
  • The default value is still ${queryId}, in case someone relies on the current behavior of the integration.
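
To illustrate the idea, here is a minimal sketch of how such a template could be resolved. It is not the code from #25704: the class name JobNameFormatter, the format method, and the sample context values are hypothetical, and only the four placeholder names (queryId, source, user, principal) come from the description above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not taken from the Trino code base. */
final class JobNameFormatter {
    // Matches placeholders such as ${queryId} or ${source}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    private JobNameFormatter() {}

    /** Replaces every ${name} in the template with the matching value from the context. */
    static String format(String template, Map<String, String> context) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = context.get(name);
            if (value == null) {
                // Assumption: fail fast on unknown placeholders instead of emitting them verbatim.
                throw new IllegalArgumentException("Unknown or missing placeholder: " + name);
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // Sample query context; all values are made up.
        Map<String, String> context = Map.of(
                "queryId", "20250409_101500_00001_abcde",
                "source", "nightly-etl",
                "user", "alice",
                "principal", "alice@EXAMPLE.COM");

        System.out.println(format("${queryId}", context));        // current default: unique per query
        System.out.println(format("${source}-${user}", context)); // stable across runs: nightly-etl-alice
    }
}
```

With the ${source}-${user} template, every run of the same script by the same user resolves to the same job name, which is what OpenLineage expects; with ${queryId}, each run produces a new job.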

This doesn't sound ideal to me, but it can solve this particular issue. Maybe some other source for the job name should be used instead - I'm new to Trino and may be missing something.

@pawel-big-lebowski

@dolfinus Thanks for raising this. Having a timestamp-based query ID as the job name seems more like a bug to me. Being able to configure it would mitigate the issue.
