Open
Description
Currently joining two dataframes on a column of the same name generates ambiguity/duplicates the column.
ctx = SessionContext()
left_df = ctx.from_pydict({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right_df = ctx.from_pydict({"id": [2, 3, 4], "city": ["New York", "London", "Paris"]})
joined_df = left_df.join(right_df, on="id")
joined_df.select(df.col("id"))
Error:
Exception: DataFusion error: ...snip..."column 'id' is ambiguous"
...snip..."possible column <hash_1>.id"...snip...possible column <hash_2>.id"...
Describe the solution you'd like
It would be nice to either:
1.) Have a pyspark like approach where some explicit differentiator in the api drops the duplicate joined_df = left_df.join(right_df, on=["id"])
2.) Have a means to refer to the specific column that is unambiguous since internally that is tracked joined_df = left_df.join(right_df, on="id").drop(joined_df["id"])
Describe alternatives you've considered
This works just is a bit verbose
joined_df = left_df.join(right_df.with_column_renamed("id", "not_id"), left_on="id", right_on="not_id").drop("not_id")