feat: reduce duplicate fields on join #1184

timsaucer · 2025-07-08T11:49:38Z

Which issue does this PR close?

Rationale for this change

In the current version of the code, when you do a join and the two dataframes share a column name in on, you end up with two columns in the output dataframe with ambiguous names. This is an annoyance for users, who have to work around it by renaming the column to join on. This change makes the interface more user friendly.

What changes are included in this PR?

Adds an option, keep_duplicate_keys, for users who do not want duplicate column names dropped
By default, adds a select on the join to keep only the first (left) dataframe column
Small change to a unit test to fix an error where the user would get a deprecation warning when they passed on and not join_on
Small formatter change to fix very large rendering of narrow dataframes
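
As a rough illustration of the second item, the default behavior amounts to building the output column list and skipping right-side columns whose names match a join key. The helper below is a hypothetical, pure-Python model of that selection step. The name joined_columns and its signature are assumptions made for illustration; it is not the code in this PR and does not call the DataFrame API.

```python
from collections.abc import Sequence


def joined_columns(
    left_cols: Sequence[str],
    right_cols: Sequence[str],
    on: Sequence[str],
    keep_duplicate_keys: bool = False,
) -> list[tuple[str, str]]:
    """Return the (side, column) pairs that survive the post-join select.

    Hypothetical model only: it mirrors the described behavior, not the
    actual implementation.
    """
    keys = set(on)
    # Every left-side column is always kept.
    selected = [("left", name) for name in left_cols]
    for name in right_cols:
        # Skip the right-side copy of a join key unless the caller opts out.
        if name in keys and not keep_duplicate_keys:
            continue
        selected.append(("right", name))
    return selected
```

With left columns id, a and right columns id, b joined on id, the model keeps a single id column by default and keeps both when keep_duplicate_keys is True.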

Are there any user-facing changes?

Yes.

DataFrame.join() will now, by default, return only a single column for duplicate on keys. The user can revert to the previous behavior by setting keep_duplicate_keys to True.

kosiew

Left a few suggestions.

Also, it would be good to update docs/source/.../joins.rst to mention keep_duplicate_keys

kosiew · 2025-07-09T06:00:49Z

python/tests/test_dataframe.py

@@ -400,7 +400,6 @@ def test_unnest_without_nulls(nested_df):
    assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9])




It would be good to add tests for both keep_duplicate_keys=False and keep_duplicate_keys=True. In particular, add a test verifying that passing keep_duplicate_keys=True preserves both columns.

kosiew · 2025-07-09T06:03:30Z

python/datafusion/dataframe.py

@@ -678,6 +681,7 @@ def join(
        left_on: str | Sequence[str] | None = None,
        right_on: str | Sequence[str] | None = None,
        join_keys: tuple[list[str], list[str]] | None = None,
+        keep_duplicate_keys: bool = False,


The name keep_duplicate_keys is somewhat confusing: it drops the right-side keys when False.
A more direct name like drop_duplicate_keys: bool = False (default) or deduplicate: bool = False may better express intent.

Additionally, in many DataFrame libraries (Pandas, PySpark), parameters such as suffixes or indicator are used for duplicate-column handling. Consider whether a suffix-based approach (with default ('', '_right')) could be more familiar to users than a boolean drop flag.
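
To make the suffix idea concrete, here is a minimal sketch of how overlapping column names could be resolved with a suffixes pair. The function apply_suffixes is hypothetical and written only for this comment; it does not reproduce the exact semantics of pandas or any other library, and nothing like it exists in this PR.

```python
from collections.abc import Sequence


def apply_suffixes(
    left_cols: Sequence[str],
    right_cols: Sequence[str],
    suffixes: tuple[str, str] = ("", "_right"),
) -> tuple[list[str], list[str]]:
    """Rename columns that appear on both sides by appending a per-side suffix.

    Hypothetical illustration of a suffix-based API, not an existing one.
    """
    overlap = set(left_cols) & set(right_cols)
    left_suffix, right_suffix = suffixes
    # Only names present on both sides are changed; unique names pass through.
    new_left = [c + left_suffix if c in overlap else c for c in left_cols]
    new_right = [c + right_suffix if c in overlap else c for c in right_cols]
    return new_left, new_right
```

With the suggested default, a shared id column would stay id on the left and become id_right on the right, so both copies remain addressable without ambiguity.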

timsaucer added 3 commits July 8, 2025 07:35

Add field to dataframe join to indicate if we should keep duplicate keys

f5c7ed0

Suppress expected warning

fb8096b

Minor: small tables rendered way too large

8e1ed67

timsaucer self-assigned this Jul 8, 2025

timsaucer added the api change label Jul 8, 2025

timsaucer mentioned this pull request Jul 8, 2025

Add support for automatic join column deduplication in DataFrame joins #1185

Closed

kosiew reviewed Jul 9, 2025
