[data] Update repartition on target_num_rows_per_block documentation (#51433)

srinathk10 · web-flow · commit 9dd14e04d107 · 2025-03-25T01:24:22.000Z
## Why are these changes needed?

Update repartition on target_num_rows_per_block: When `target_num_rows_per_block` is set, it only repartitions Dataset blocks that are larger than `target_num_rows_per_block`.

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
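
For readers of this change, the documented behavior ("only repartitions blocks that are larger than `target_num_rows_per_block`") can be pictured with a small pure-Python model. This is a sketch under stated assumptions, not Ray's implementation: the helper `repartition_large_blocks` is hypothetical, blocks are modeled as plain lists of rows, and oversized blocks are cut into fixed-size chunks, whereas the docstring notes that the system picks the actual number of rows per block internally.

```python
from typing import List


def repartition_large_blocks(
    blocks: List[List[int]], target_num_rows_per_block: int
) -> List[List[int]]:
    """Toy model (hypothetical helper, not a Ray API).

    Blocks with at most `target_num_rows_per_block` rows pass through
    unchanged; larger blocks are split into chunks of at most that size.
    """
    if target_num_rows_per_block <= 0:
        raise ValueError("target_num_rows_per_block must be positive")

    out: List[List[int]] = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            # Small enough already: left as-is, never merged with neighbors.
            out.append(block)
            continue
        # Oversized block: slice it into consecutive chunks.
        for start in range(0, len(block), target_num_rows_per_block):
            out.append(block[start:start + target_num_rows_per_block])
    return out


# Three input blocks with 2, 10, and 3 rows.
blocks = [list(range(2)), list(range(2, 12)), list(range(12, 15))]
result = repartition_large_blocks(blocks, target_num_rows_per_block=4)
print([len(b) for b in result])  # [2, 4, 4, 2, 3]
```

In this model only the 10-row block is split; the 2-row and 3-row blocks are not combined to reach the target, which is the point the updated docstring makes.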
diff --git a/python/ray/data/dataset.py b/python/ray/data/dataset.py
@@ -1373,12 +1373,6 @@ def repartition(
         """Repartition the :class:`Dataset` into exactly this number of
         :ref:`blocks <dataset_concept>`.
 
-        When `target_num_rows_per_block` is set, it repartitions :class:`Dataset`
-        to honor target number of rows per :ref:`blocks <dataset_concept>`. Note
-        that the system will internally figure out the number of rows per
-        :ref:`blocks <dataset_concept>` for optimal execution, based on the
-        `target_num_rows_per_block`.
-
         This method can be useful to tune the performance of your pipeline. To learn
         more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
 
@@ -1408,9 +1402,16 @@ def repartition(
 
         Args:
             num_blocks: Number of blocks after repartitioning.
-            target_num_rows_per_block: The target number of rows per block to
+            target_num_rows_per_block: [Experimental] The target number of rows per block to
                 repartition. Note that either `num_blocks` or
-                `target_num_rows_per_block` must be set, but not both.
+                `target_num_rows_per_block` must be set, but not both. When
+                `target_num_rows_per_block` is set, it only repartitions
+                :class:`Dataset` :ref:`blocks <dataset_concept>` that are larger than
+                `target_num_rows_per_block`. Note that the system will internally
+                figure out the number of rows per :ref:`blocks <dataset_concept>` for
+                optimal execution, based on the `target_num_rows_per_block`. This is
+                the current behavior because of the implementation and may change in
+                the future.
             shuffle: Whether to perform a distributed shuffle during the
                 repartition. When shuffle is enabled, each output block
                 contains a subset of data rows from each input block, which