Add support for automatic join column deduplication in DataFrame joins #1185
Closed
- Added a `deduplicate` boolean parameter to `DataFrame.join` that, when True, drops duplicate join columns from the right DataFrame after join.
- Implemented helper methods `_resolve_join_keys` and `_prepare_deduplicate` to normalize join key arguments and handle column renaming and dropping.
- Updated join logic to rename duplicate join columns in right DataFrame, join with renamed columns, and drop renamed duplicates post-join.
- Added tests `test_join_deduplicate` and `test_join_deduplicate_multi` covering deduplication of single and multiple join columns.
- Extended documentation with example usage of `deduplicate` for disambiguating columns.

Also added Copilot and agent instructions files describing Python and Rust style guidelines, pre-commit usage, testing, and code organization conventions for the DataFusion Python project.
- Verify that selecting columns works correctly after a join with deduplicate=True
- Confirm the joined DataFrame contains only matching IDs
- Test selecting single and multiple columns post-join to ensure correct data retrieval
…and return information
…ming in DataFrame
…ividual column selection
- handle deduplication for right/full joins by coalescing join keys
- refactor join preparation to lower complexity
- update tests to use supported sort API and full join keyword
- fix lint issues
Would you mind taking a look at #1184? It's an alternate approach which basically reuses the logic of
Closed because of #1184
Which issue does this PR close?
Rationale for this change
This change improves usability and interoperability by adding an optional `deduplicate=True` flag to DataFrame joins. In many real-world datasets, join keys exist in both tables with the same name. Prior to this PR, this resulted in conflicting column names in the output and required manual disambiguation or column renaming. This feature aligns with the behavior in other DataFrame libraries like PySpark, making joins easier and more intuitive for users by optionally removing duplicate join columns from the right-hand side.
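To make the naming conflict concrete, here is a minimal plain-Python model of an inner join over lists of row dicts. It is only an illustration of the behavior described above, not the DataFusion implementation: the `inner_join` function and the `id_right` suffix are hypothetical names chosen for this sketch.

```python
# Plain-Python model of an inner join; NOT the DataFusion implementation.
# `inner_join` and the `<key>_right` suffix are hypothetical, for illustration.

def inner_join(left, right, on, deduplicate=False):
    """Join two lists of row dicts on the column named `on`.

    Without deduplication, the right-hand key is kept under a suffixed
    name, so the output carries two copies of the same key. With
    deduplication, the right-hand copy is dropped.
    """
    joined = []
    for lrow in left:
        for rrow in right:
            if lrow[on] != rrow[on]:
                continue
            row = dict(lrow)
            for name, value in rrow.items():
                if name == on:
                    if not deduplicate:
                        row[f"{on}_right"] = value  # redundant copy of the key
                else:
                    row[name] = value  # assumes non-key names do not clash
            joined.append(row)
    return joined


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 1, "score": 10}, {"id": 3, "score": 30}]

plain = inner_join(left, right, on="id")
deduped = inner_join(left, right, on="id", deduplicate=True)
```

In this model `plain` carries both `id` and `id_right` for the matching row, while `deduped` keeps a single `id` column, which is the shape the proposed flag aims for.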
What changes are included in this PR?
- `deduplicate` argument to the `DataFrame.join()` method.
- `_prepare_join` helper method.
- `_deduplicate_right` logic to rename right-hand join keys with unique aliases.
- `coalesce` to resolve `null` values in outer/right joins.
Are these changes tested?
✅ Yes. Tests are included to verify:
Are there any user-facing changes?
✅ Yes. A new `deduplicate` keyword argument is available for `DataFrame.join()`. When enabled, it automatically removes duplicate join columns from the right DataFrame, simplifying common workflows and avoiding column naming conflicts. This feature is backward-compatible and opt-in.
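The rename, join, coalesce, and drop sequence mentioned in this PR matters most for outer joins, where simply dropping the right-hand key would lose the key of unmatched right rows. The sketch below models that sequence in plain Python over lists of row dicts. It is an assumption-laden illustration, not the PR's code: `full_join_deduplicated`, `coalesce`, and the `__right_` alias prefix are hypothetical names, inputs are assumed non-empty with uniform columns, and keys are assumed to be non-null.

```python
# Plain-Python model of a deduplicating full outer join.
# NOT the DataFusion implementation; all names here are hypothetical.

def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def full_join_deduplicated(left, right, on):
    """Full outer join that returns a single, coalesced key column."""
    alias = f"__right_{on}"  # hypothetical unique alias for the right key

    # Step 1: rename the right-hand key so it cannot clash with the left one.
    renamed = [
        {(alias if k == on else k): v for k, v in row.items()} for row in right
    ]
    left_cols = list(left[0])      # assumes non-empty, uniform inputs
    right_cols = list(renamed[0])

    # Step 2: full outer join on left[on] == right[alias], padding with None.
    rows = []
    matched = set()  # indices of right rows that found a partner
    for lrow in left:
        found = False
        for i, rrow in enumerate(renamed):
            if lrow[on] == rrow[alias]:
                rows.append({**lrow, **rrow})
                matched.add(i)
                found = True
        if not found:
            rows.append({**lrow, **{c: None for c in right_cols}})
    for i, rrow in enumerate(renamed):
        if i not in matched:
            rows.append({**{c: None for c in left_cols}, **rrow})

    # Step 3: coalesce the two key copies, then drop the aliased one.
    for row in rows:
        row[on] = coalesce(row[on], row.pop(alias))
    return rows


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 1, "score": 10}, {"id": 3, "score": 30}]
result = full_join_deduplicated(left, right, on="id")
```

Here the unmatched right row keeps its key (`id` is 3) because the coalesce step falls back to the aliased right-hand value before the alias is dropped.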
📘 User documentation has been updated with detailed usage examples, best practices, and behavior notes.