Add support for automatic join column deduplication in DataFrame joins #1185

kosiew · 2025-07-08T13:39:23Z

Which issue does this PR close?

Closes Simplify Joins on Shared Column Name #1173

Rationale for this change

This change improves usability and interoperability by adding an optional deduplicate=True flag to DataFrame joins. In many real-world datasets, join keys exist in both tables with the same name. Prior to this PR, this resulted in conflicting column names in the output and required manual disambiguation or column renaming.

This feature aligns with the behavior in other DataFrame libraries like PySpark, making joins easier and more intuitive for users by optionally removing duplicate join columns from the right-hand side.

What changes are included in this PR?

- Introduced a `deduplicate` argument to the `DataFrame.join()` method.
- Refactored join key resolution into a `_prepare_join` helper method.
- Implemented `_deduplicate_right` logic that renames right-hand join keys to unique aliases.
- Dropped the aliased join columns automatically after the join.
- Applied `coalesce` to the join keys so that right and full joins do not leave null key values.
- Updated the user guide with documentation and examples of disambiguation and deduplication.
- Added tests for:
  - Basic deduplication
  - Multi-column joins with deduplication
  - Select behavior post-join
  - All supported join types (inner, left, right, full)
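
The rename, join, coalesce, and drop sequence described in the list above can be sketched in plain Python over lists of dicts. This is only an illustration of the idea, not the code in this PR: the names `unique_alias` and `join_deduplicated`, the `__right_` alias prefix, and the row-as-dict representation are all assumptions, and the sketch assumes non-empty inputs whose only shared column names are the join keys.

```python
from typing import Any

Row = dict[str, Any]


def unique_alias(name: str, taken: set[str]) -> str:
    """Return a temporary column name that does not collide with `taken`."""
    alias = f"__right_{name}"  # hypothetical prefix, chosen for this sketch
    counter = 0
    while alias in taken:
        counter += 1
        alias = f"__right_{name}_{counter}"
    return alias


def join_deduplicated(
    left: list[Row], right: list[Row], on: list[str], how: str = "inner"
) -> list[Row]:
    """Join on same-named keys and keep a single copy of each key column."""
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how!r}")

    # Schemas are read from the first row (assumes non-empty, uniform inputs).
    left_cols = list(left[0]) if left else []
    right_cols = list(right[0]) if right else []

    # 1. Rename the right-hand join keys to collision-safe aliases.
    taken = set(left_cols) | set(right_cols)
    aliases: dict[str, str] = {}
    for key in on:
        aliases[key] = unique_alias(key, taken)
        taken.add(aliases[key])
    renamed = [{aliases.get(c, c): v for c, v in row.items()} for row in right]

    left_null = {c: None for c in left_cols}
    right_null = {aliases.get(c, c): None for c in right_cols}

    # 2. Join on left key == aliased right key (nulls never match).
    out: list[Row] = []
    matched_right: set[int] = set()
    for l_row in left:
        hit = False
        for i, r_row in enumerate(renamed):
            if all(
                l_row[k] is not None and l_row[k] == r_row[aliases[k]] for k in on
            ):
                out.append({**l_row, **r_row})
                matched_right.add(i)
                hit = True
        if not hit and how in {"left", "full"}:
            out.append({**l_row, **right_null})
    if how in {"right", "full"}:
        for i, r_row in enumerate(renamed):
            if i not in matched_right:
                out.append({**left_null, **r_row})

    # 3. Coalesce each key with its alias, then drop the alias column.
    for row in out:
        for key, alias in aliases.items():
            right_value = row.pop(alias)
            if row[key] is None:
                row[key] = right_value
    return out
```

Without the coalesce in step 3, a right-only row in a right or full join would keep a null `id` once the aliased column is dropped, which is why the key has to be filled in before the drop.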

Are these changes tested?

✅ Yes. Tests are included to verify:

- Join outputs match expectations with and without deduplication.
- Correct schema after deduplication.
- Select operations behave as expected post-join.
- All supported join types are covered.

Are there any user-facing changes?

✅ Yes. A new deduplicate keyword argument is available for DataFrame.join(). When enabled, it automatically removes duplicate join columns from the right DataFrame, simplifying common workflows and avoiding column naming conflicts.

This feature is backward-compatible and opt-in.

📘 User documentation has been updated with detailed usage examples, best practices, and behavior notes.

- Added a `deduplicate` boolean parameter to `DataFrame.join` that, when True, drops duplicate join columns from the right DataFrame after the join.
- Implemented helper methods `_resolve_join_keys` and `_prepare_deduplicate` to normalize join key arguments and handle column renaming and dropping.
- Updated join logic to rename duplicate join columns in the right DataFrame, join with the renamed columns, and drop the renamed duplicates post-join.
- Added tests `test_join_deduplicate` and `test_join_deduplicate_multi` covering deduplication of single and multiple join columns.
- Extended documentation with example usage of `deduplicate` for disambiguating columns.

Also added Copilot and agent instructions files describing Python and Rust style guidelines, pre-commit usage, testing, and code organization conventions for the DataFusion Python project.
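
As a rough illustration of what normalizing join key arguments can involve, the sketch below accepts either a shared `on` argument or a `left_on`/`right_on` pair and returns both key lists. It is not the `_resolve_join_keys` helper from this PR: the function name `resolve_join_keys`, the fields of this `JoinKeys` dataclass, the `shared_names` method, and the error messages are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JoinKeys:
    """Normalized join keys (hypothetical shape, for illustration only)."""

    left: list[str]
    right: list[str]

    def shared_names(self) -> list[str]:
        """Keys with the same name on both sides, i.e. deduplication candidates."""
        return [l for l, r in zip(self.left, self.right) if l == r]


def _as_list(value: str | list[str]) -> list[str]:
    # Accept a single column name or a sequence of names.
    return [value] if isinstance(value, str) else list(value)


def resolve_join_keys(
    on: str | list[str] | None = None,
    left_on: str | list[str] | None = None,
    right_on: str | list[str] | None = None,
) -> JoinKeys:
    """Turn the accepted argument combinations into one pair of key lists."""
    if on is not None:
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = _as_list(on)
        left_keys, right_keys = keys, list(keys)
    elif left_on is not None and right_on is not None:
        left_keys, right_keys = _as_list(left_on), _as_list(right_on)
    else:
        raise ValueError("pass `on`, or both `left_on` and `right_on`")

    if not left_keys:
        raise ValueError("at least one join key is required")
    if len(left_keys) != len(right_keys):
        raise ValueError(
            f"left_on has {len(left_keys)} keys but right_on has {len(right_keys)}"
        )
    return JoinKeys(left_keys, right_keys)
```

With keys in this normalized form, only the pairs returned by `shared_names()` would need renaming and dropping; differently named key pairs do not collide in the output.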

- Verify that selecting columns works correctly after a join with deduplicate=True
- Confirm the joined DataFrame contains only matching IDs
- Test selecting single and multiple columns post-join to ensure correct data retrieval


timsaucer · 2025-07-08T19:31:59Z

Would you mind taking a look at #1184? It's an alternate approach which basically reuses the logic of drop_columns on the Rust side instead of adding all the logic on the Python side. What do you think?


kosiew · 2025-07-09T05:47:43Z

Closed because of #1184

kosiew added 14 commits July 4, 2025 19:17

refactor: style loading logic in DataFrameHtmlFormatter

460bae9

test: add deduplicate join selection tests for DataFrame

42f5a72

docs: enhance joins.rst with details on DataFrame naming and deduplication behavior

fa80aa6

feat: introduce JoinKeys dataclass for improved join key handling in DataFrame

7d7146c

docs: enhance _prepare_deduplicate docstring with detailed parameter and return information

f246224

feat: implement collision-safe temporary aliases for join column renaming in DataFrame

0ad1b00

fix: update sorting in test_join_deduplicate_multi to include multiple columns

01af791

test: add deduplication tests for all join types in test_join_deduplicate_all_types

4dd5369

test: enhance join deduplication tests with schema validation and individual column selection

ec769fe

fix: improve error messages for join key validation in DataFrame

19d69ca

feat: enhance join operation preparation with JoinPreparation class

0bb81df

fix join deduplicate and tests

3dc96f3

- handle deduplication for right/full joins by coalescing join keys
- refactor join preparation to lower complexity
- update tests to use supported sort API and full join keyword
- fix lint issues

kosiew added 3 commits July 9, 2025 11:42

docs: add example for selecting columns after deduplication in joins

628dc15

fix: add SessionContext import for DataFrame creation example in joins documentation

ba54f7d

fix: add safety comment for type checking in join keys assignment

584c600

kosiew marked this pull request as ready for review July 9, 2025 05:23

fix Ruff errors

17ce6eb

kosiew closed this Jul 9, 2025
