Skip to content

Simplify Joins on Shared Column Name #1173

Open
@ntjohnson1

Description

@ntjohnson1

Currently joining two dataframes on a column of the same name generates ambiguity/duplicates the column.

ctx = SessionContext()
left_df = ctx.from_pydict({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right_df = ctx.from_pydict({"id": [2, 3, 4], "city": ["New York", "London", "Paris"]})

joined_df = left_df.join(right_df, on="id")
joined_df.select(df.col("id"))

Error:

Exception: DataFusion error: ...snip..."column 'id' is ambiguous"
...snip..."possible column <hash_1>.id"...snip...possible column <hash_2>.id"...

Describe the solution you'd like
It would be nice to either:
1.) Have a pyspark like approach where some explicit differentiator in the api drops the duplicate joined_df = left_df.join(right_df, on=["id"])
2.) Have a means to refer to the specific column that is unambiguous since internally that is tracked joined_df = left_df.join(right_df, on="id").drop(joined_df["id"])

Describe alternatives you've considered
This works just is a bit verbose
joined_df = left_df.join(right_df.with_column_renamed("id", "not_id"), left_on="id", right_on="not_id").drop("not_id")

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions