
Add support for automatic join column deduplication in DataFrame joins #1185

Closed
kosiew wants to merge 18 commits

kosiew
Contributor

@kosiew kosiew commented Jul 8, 2025

Which issue does this PR close?

Rationale for this change

This change improves usability and interoperability by adding an optional deduplicate flag (deduplicate=True) to DataFrame joins. In many real-world datasets, the join keys have the same name in both tables. Before this PR, joining such tables produced conflicting column names in the output, which users had to disambiguate or rename by hand.

This feature aligns with the behavior in other DataFrame libraries like PySpark, making joins easier and more intuitive for users by optionally removing duplicate join columns from the right-hand side.

What changes are included in this PR?

  • Introduced a deduplicate argument to the DataFrame.join() method.
  • Refactored join key resolution into a _prepare_join helper method.
  • Implemented _deduplicate_right logic to rename right-hand join keys with unique aliases.
  • Automatically dropped the aliased join columns after the join.
  • Applied coalesce to the left and aliased right join keys so that right and full (outer) joins do not leave null key values.
  • Updated the user guide with documentation and examples of disambiguation and deduplication.
  • Added comprehensive tests for:
    • Basic deduplication
    • Multi-column joins with deduplication
    • Select behavior post-join
    • All supported join types (inner, left, right, full)
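
The rename, join, coalesce, and drop steps listed above can be sketched without the library. This is a minimal pure-Python model operating on lists of dicts, not the PR's implementation: the function name `join_deduplicated`, the `__right_` alias prefix, and the helper `_columns` are all hypothetical, and the sketch assumes that only the join keys share names between the two sides.

```python
from typing import Any

Row = dict[str, Any]


def _columns(rows: list[Row]) -> list[str]:
    """Collect column names in first-seen order (hypothetical helper)."""
    cols: list[str] = []
    for row in rows:
        for name in row:
            if name not in cols:
                cols.append(name)
    return cols


def join_deduplicated(
    left: list[Row], right: list[Row], on: list[str], how: str = "inner"
) -> list[Row]:
    """Model of a join that keeps each join key only once."""
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how}")

    # Step 1: rename the right-hand join keys to unique aliases.
    alias = {key: f"__right_{key}" for key in on}
    right_renamed = [{alias.get(c, c): v for c, v in row.items()} for row in right]
    left_cols = _columns(left)
    right_cols = _columns(right_renamed)

    # Step 2: join on left key == aliased right key (null keys never match).
    joined: list[Row] = []
    matched_right: set[int] = set()
    for l_row in left:
        hit = False
        for i, r_row in enumerate(right_renamed):
            if all(l_row[k] is not None and l_row[k] == r_row[alias[k]] for k in on):
                joined.append({**l_row, **r_row})
                matched_right.add(i)
                hit = True
        if not hit and how in {"left", "full"}:
            joined.append({**l_row, **{c: None for c in right_cols}})
    if how in {"right", "full"}:
        for i, r_row in enumerate(right_renamed):
            if i not in matched_right:
                joined.append({**{c: None for c in left_cols}, **r_row})

    # Step 3: coalesce each key with its alias, then drop the alias columns.
    aliases = set(alias.values())
    result: list[Row] = []
    for row in joined:
        out = {c: v for c, v in row.items() if c not in aliases}
        for key in on:
            if out[key] is None:
                out[key] = row[alias[key]]
        result.append(out)
    return result


if __name__ == "__main__":
    people = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    scores = [{"id": 2, "score": 20}, {"id": 3, "score": 30}]
    for row in join_deduplicated(people, scores, on=["id"], how="full"):
        print(row)
```

The coalesce step matters only for right and full joins: an unmatched right-hand row has a null left key, so without coalescing, dropping the alias would lose the key value entirely.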

Are these changes tested?

✅ Yes. Tests are included to verify:

  • Join outputs match expectations with and without deduplication.
  • Correct schema after deduplication.
  • Select operations behave as expected post-join.
  • All supported join types are covered.
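
The schema expectation behind these checks can be illustrated with a small library-independent sketch. The helpers `joined_columns` and `duplicated` are hypothetical names introduced here for illustration; they only model which column names a join would expose with and without deduplication and are not the PR's test code.

```python
from collections import Counter


def joined_columns(
    left_cols: list[str],
    right_cols: list[str],
    on: list[str],
    deduplicate: bool = False,
) -> list[str]:
    """Column names of the join output: left columns, then right columns."""
    if not deduplicate:
        # Without deduplication, shared join keys appear twice.
        return list(left_cols) + list(right_cols)
    keys = set(on)
    # With deduplication, the right-hand copies of the join keys are omitted.
    return list(left_cols) + [c for c in right_cols if c not in keys]


def duplicated(cols: list[str]) -> list[str]:
    """Names that occur more than once, i.e. the ambiguous columns."""
    return sorted(name for name, count in Counter(cols).items() if count > 1)


if __name__ == "__main__":
    plain = joined_columns(["id", "name"], ["id", "score"], on=["id"])
    dedup = joined_columns(["id", "name"], ["id", "score"], on=["id"], deduplicate=True)
    print(plain, duplicated(plain))
    print(dedup, duplicated(dedup))
```

A test in this style asserts that the deduplicated schema has no repeated names, which is what makes a later select on the join key unambiguous.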

Are there any user-facing changes?

✅ Yes. A new deduplicate keyword argument is available for DataFrame.join(). When enabled, it automatically removes duplicate join columns from the right DataFrame, simplifying common workflows and avoiding column naming conflicts.

This feature is backward-compatible and opt-in.

📘 User documentation has been updated with detailed usage examples, best practices, and behavior notes.

kosiew added 14 commits July 4, 2025 19:17
- Added a `deduplicate` boolean parameter to `DataFrame.join` that,
  when True, drops duplicate join columns from the right DataFrame after join.
- Implemented helper methods `_resolve_join_keys` and `_prepare_deduplicate`
  to normalize join key arguments and handle column renaming and dropping.
- Updated join logic to rename duplicate join columns in right DataFrame,
  join with renamed columns, and drop renamed duplicates post-join.
- Added tests `test_join_deduplicate` and `test_join_deduplicate_multi` covering
  deduplication of single and multiple join columns.
- Extended documentation with example usage of `deduplicate` for disambiguating columns.

Also added Copilot and agent instructions files describing Python and Rust style guidelines,
pre-commit usage, testing, and code organization conventions for the DataFusion Python project.
- Verify that selecting columns works correctly after a join with deduplicate=True
- Confirm the joined DataFrame contains only matching IDs
- Test selecting single and multiple columns post-join to ensure correct data retrieval
handle deduplication for right/full joins by coalescing join keys
refactor join preparation to lower complexity
update tests to use supported sort API and full join keyword
fix lint issues
@timsaucer
Contributor

Would you mind taking a look at #1184? It's an alternate approach that reuses the logic of drop_columns on the Rust side instead of adding all of the logic on the Python side. What do you think?

@kosiew kosiew marked this pull request as ready for review July 9, 2025 05:23
@kosiew
Contributor Author

kosiew commented Jul 9, 2025

Closed because of #1184

@kosiew kosiew closed this Jul 9, 2025
Development

Successfully merging this pull request may close these issues.

Simplify Joins on Shared Column Name
2 participants