Commit 9dd14e0

[data] Update repartition on target_num_rows_per_block documentation (#51433)
## Why are these changes needed?

Update repartition on target_num_rows_per_block: When `target_num_rows_per_block` is set, it only repartitions Dataset blocks that are larger than `target_num_rows_per_block`.

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
1 parent ce4d0ad commit 9dd14e0

1 file changed: +9 -8 lines changed
python/ray/data/dataset.py

Lines changed: 9 additions & 8 deletions
@@ -1373,12 +1373,6 @@ def repartition(
         """Repartition the :class:`Dataset` into exactly this number of
         :ref:`blocks <dataset_concept>`.
 
-        When `target_num_rows_per_block` is set, it repartitions :class:`Dataset`
-        to honor target number of rows per :ref:`blocks <dataset_concept>`. Note
-        that the system will internally figure out the number of rows per
-        :ref:`blocks <dataset_concept>` for optimal execution, based on the
-        `target_num_rows_per_block`.
-
         This method can be useful to tune the performance of your pipeline. To learn
         more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
 
@@ -1408,9 +1402,16 @@ def repartition(
 
         Args:
             num_blocks: Number of blocks after repartitioning.
-            target_num_rows_per_block: The target number of rows per block to
+            target_num_rows_per_block: [Experimental] The target number of rows per block to
                 repartition. Note that either `num_blocks` or
-                `target_num_rows_per_block` must be set, but not both.
+                `target_num_rows_per_block` must be set, but not both. When
+                `target_num_rows_per_block` is set, it only repartitions
+                :class:`Dataset` :ref:`blocks <dataset_concept>` that are larger than
+                `target_num_rows_per_block`. Note that the system will internally
+                figure out the number of rows per :ref:`blocks <dataset_concept>` for
+                optimal execution, based on the `target_num_rows_per_block`. This is
+                the current behavior because of the implementation and may change in
+                the future.
             shuffle: Whether to perform a distributed shuffle during the
                 repartition. When shuffle is enabled, each output block
                 contains a subset of data rows from each input block, which
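
The documented behavior can be illustrated with a small pure-Python model. This is only a sketch and is not Ray's implementation. The helper names `check_repartition_args` and `split_oversized_blocks` are hypothetical, blocks are modeled as plain row counts, and cutting each oversized block into target-sized chunks is an assumption, since the docstring says the system picks the actual row counts internally. It does show two things the updated docstring states: exactly one of `num_blocks` and `target_num_rows_per_block` must be set, and only blocks larger than the target are repartitioned, so smaller blocks are left as they are.

```python
from typing import List, Optional


def check_repartition_args(
    num_blocks: Optional[int], target_num_rows_per_block: Optional[int]
) -> None:
    """Hypothetical helper: require exactly one of the two arguments."""
    if (num_blocks is None) == (target_num_rows_per_block is None):
        raise ValueError(
            "Either `num_blocks` or `target_num_rows_per_block` must be set, "
            "but not both."
        )


def split_oversized_blocks(
    block_sizes: List[int], target_num_rows_per_block: int
) -> List[int]:
    """Hypothetical model: split only blocks larger than the target.

    Blocks at or below the target pass through unchanged and are never merged.
    The chunk sizes used here are an assumption; Ray may choose differently.
    """
    if target_num_rows_per_block <= 0:
        raise ValueError("target_num_rows_per_block must be positive")

    result: List[int] = []
    for rows in block_sizes:
        if rows <= target_num_rows_per_block:
            # Not larger than the target: leave the block alone.
            result.append(rows)
            continue
        full_chunks, remainder = divmod(rows, target_num_rows_per_block)
        result.extend([target_num_rows_per_block] * full_chunks)
        if remainder:
            result.append(remainder)
    return result


check_repartition_args(num_blocks=None, target_num_rows_per_block=10)
print(split_oversized_blocks([3, 10, 25], target_num_rows_per_block=10))
# [3, 10, 10, 10, 5]
```

In this model the 3-row block is not combined with anything to reach the target, so under this behavior `target_num_rows_per_block` acts as an upper bound rather than an exact block size.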
