feat: reduce duplicate fields on join #1184

Open. 3 commits to merge into base: main

timsaucer (Contributor)

Which issue does this PR close?

Closes #1173

Rationale for this change

In the current version of the code, when you do a join on a column name that both dataframes share, you end up with two columns in the output dataframe with ambiguous names. This is an annoyance for users, who have to work around it by renaming the column they join on. This change makes the interface more user friendly.

[Screenshot, 2025-07-07]

What changes are included in this PR?

  • Adds an option, keep_duplicate_keys, for users who do not want duplicate column names dropped
  • By default, adds a select on the join to keep only the first (left) dataframe column
  • Small change to a unit test to fix an error where the user would get a deprecation warning when passing on and not join_on
  • Small formatter change to fix very large rendering of narrow dataframes
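To illustrate the first two bullets, here is a minimal pure-Python sketch of the column selection they describe. It is not the PR's implementation: select_output_columns is a hypothetical helper that works on column names only, and only the keep_duplicate_keys flag name comes from the PR.

```python
from collections.abc import Sequence


def select_output_columns(
    left_cols: Sequence[str],
    right_cols: Sequence[str],
    on: Sequence[str],
    keep_duplicate_keys: bool = False,
) -> list[tuple[str, str]]:
    """Return (side, column) pairs for the join output.

    Hypothetical helper for illustration; not part of any library.
    """
    # Every left-hand column is always kept.
    output = [("left", name) for name in left_cols]
    for name in right_cols:
        # Drop the right-hand copy of a join key unless the caller opts out.
        if name in on and not keep_duplicate_keys:
            continue
        output.append(("right", name))
    return output


left = ["id", "name"]
right = ["id", "score"]

default_cols = select_output_columns(left, right, on=["id"])
both_cols = select_output_columns(left, right, on=["id"], keep_duplicate_keys=True)

print(default_cols)
print(both_cols)
```

With the default, only the left "id" survives; with keep_duplicate_keys=True both copies appear, which corresponds to the pre-PR behavior described above.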

Are there any user-facing changes?

Yes.

DataFrame.join() will now, by default, return only a single column for duplicate on keys. The user can restore the previous behavior by setting keep_duplicate_keys to True.

kosiew (Contributor) left a comment

Left a few suggestions.

Also, it would be good to update docs/source/.../joins.rst to mention keep_duplicate_keys

@@ -400,7 +400,6 @@ def test_unnest_without_nulls(nested_df):
    assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9])


Please add tests for both keep_duplicate_keys=False and keep_duplicate_keys=True. In particular, a test should verify that passing keep_duplicate_keys=True preserves both columns.
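A rough sketch of the shape such tests could take. To stay self-contained it runs against toy_inner_join, a hypothetical stand-in for DataFrame.join that operates on plain (columns, rows) tuples. The real tests would call the library's join instead, and the expected column lists here are assumptions based on the PR description.

```python
def toy_inner_join(left, right, on, keep_duplicate_keys=False):
    """Inner-join two (columns, rows) tables on one shared column name.

    Hypothetical stand-in for DataFrame.join, used only to show the test shape.
    """
    left_cols, left_rows = left
    right_cols, right_rows = right
    left_idx = left_cols.index(on)
    right_idx = right_cols.index(on)
    # Positions of the right-hand columns that survive into the output.
    keep = [
        i for i in range(len(right_cols))
        if keep_duplicate_keys or i != right_idx
    ]
    columns = list(left_cols) + [right_cols[i] for i in keep]
    rows = [
        tuple(lrow) + tuple(rrow[i] for i in keep)
        for lrow in left_rows
        for rrow in right_rows
        if lrow[left_idx] == rrow[right_idx]
    ]
    return columns, rows


LEFT = (["id", "name"], [(1, "a"), (2, "b")])
RIGHT = (["id", "score"], [(1, 10), (3, 30)])


def test_join_drops_duplicate_key_by_default():
    columns, rows = toy_inner_join(LEFT, RIGHT, on="id")
    assert columns == ["id", "name", "score"]
    assert rows == [(1, "a", 10)]


def test_join_keeps_both_keys_when_requested():
    columns, rows = toy_inner_join(LEFT, RIGHT, on="id", keep_duplicate_keys=True)
    assert columns == ["id", "name", "id", "score"]
    assert rows == [(1, "a", 1, 10)]


test_join_drops_duplicate_key_by_default()
test_join_keeps_both_keys_when_requested()
```

Pairing one test per flag value keeps the default and the opt-out behavior pinned independently, so a later change to the default would fail exactly one of them.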

@@ -678,6 +681,7 @@ def join(
        left_on: str | Sequence[str] | None = None,
        right_on: str | Sequence[str] | None = None,
        join_keys: tuple[list[str], list[str]] | None = None,
        keep_duplicate_keys: bool = False,

The name keep_duplicate_keys is somewhat confusing: it drops the right-side keys when False.
A more direct name like drop_duplicate_keys: bool = False (default) or deduplicate: bool = False may better express intent.

Additionally, in many DataFrame libraries (Pandas, PySpark), the terms suffixes or indicator are used for duplicate-column handling. Consider whether a suffix-based approach (with default ('', '_right')) could be more familiar to users than a boolean drop flag.
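For comparison, a minimal sketch of what a suffix-based approach might look like at the column-name level. apply_suffixes is a hypothetical helper, not an existing API, and it assumes the suffix is applied to every name shared by both sides, join keys included. The ('_x', '_y') pair mirrors the default of the suffixes parameter in pandas' merge.

```python
def apply_suffixes(left_cols, right_cols, suffixes=("", "_right")):
    """Rename columns present on both sides by appending a per-side suffix.

    Hypothetical helper for illustration. It does not guard against a
    suffixed name colliding with an existing column.
    """
    shared = set(left_cols) & set(right_cols)
    left_suffix, right_suffix = suffixes
    left_out = [c + left_suffix if c in shared else c for c in left_cols]
    right_out = [c + right_suffix if c in shared else c for c in right_cols]
    return left_out + right_out


renamed = apply_suffixes(["id", "name"], ["id", "score"])
pandas_style = apply_suffixes(["id", "name"], ["id", "score"], suffixes=("_x", "_y"))

print(renamed)       # ['id', 'name', 'id_right', 'score']
print(pandas_style)  # ['id_x', 'name', 'id_y', 'score']
```

Unlike a boolean flag, this keeps both key columns but gives them unambiguous names, so the user can still select or drop either one afterwards.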

Successfully merging this pull request may close these issues.

Simplify Joins on Shared Column Name