
Commit 0757f2a

djspiewak authored and yhuang-db committed
[SPARK-52240] Corrected row index usage when exploding packed arrays in vectorized reader
This PR fixes an issue in the vectorized Parquet reader when executing the `explode` function on nested arrays that cut across two or more pages. It's probably possible to minimize this slightly more, but I wasn't able to find a smaller reproducer. It's also worth noting that this issue illustrates a current gap in the lower-level unit tests for the vectorized reader, which don't appear to test much related to output vector offsets.

The bug in question was a simple typo: the output row offset was used to dereference nested array lengths rather than the input row offset. This only matters for the `explode` function, and then only when resuming the same operation on a second page. This case (and all related cases) is, at present, untested. I added a high-level test and an example `.parquet` file which reproduces the issue and verifies the fix, but it would be ideal if more tests were added at a lower level. It is very likely that other similar bugs are present within the vectorized reader as it relates to nested substructures remapped during the query pipeline.

### What changes were proposed in this pull request?

A fairly straightforward typo fix in the code.

### Why are the changes needed?

The vectorized Parquet reader does not correctly handle this case.

### Does this PR introduce _any_ user-facing change?

Aside from fixing the vectorized reader? No.

### How was this patch tested?

A unit test (well, more of an integration test) is included in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46928 from djspiewak/bug/packed-list-vectorized.

Authored-by: Daniel Spiewak <dspiewak@nvidia.com>
Signed-off-by: Chao Sun <chao@openai.com>
1 parent 561c21d commit 0757f2a

File tree

3 files changed: +11 −1 lines changed

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaLengthByteArrayReader.java

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ public void readBinary(int total, WritableColumnVector c, int rowId) {
       ByteBufferOutputWriter outputWriter = ByteBufferOutputWriter::writeArrayByteBuffer;
       int length;
       for (int i = 0; i < total; i++) {
-        length = lengthsVector.getInt(rowId + i);
+        length = lengthsVector.getInt(currentRow + i);
         try {
           buffer = in.slice(length);
         } catch (EOFException e) {
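To see why the one-character fix matters, consider what the two offsets mean: `rowId` is the global position in the output vector and keeps growing across pages, while `currentRow` is the position within the current page's lengths vector and resets when a new page is loaded. The following is a minimal standalone sketch (not Spark code; `readLengths`, the arrays, and the page sizes are all hypothetical) illustrating how indexing a page-local lengths vector with the global output offset goes wrong when an operation resumes on a second page:

```java
import java.util.Arrays;

public class OffsetSketch {
    // Loosely mimics the reader's readBinary loop: copy `total` lengths
    // from a page-local vector into a global output array.
    static void readLengths(int[] lengthsVector, int total, int currentRow,
                            int rowId, int[] out) {
        for (int i = 0; i < total; i++) {
            // The fix: dereference lengths with the page-local input offset.
            out[rowId + i] = lengthsVector[currentRow + i];
            // Using the output offset instead (lengthsVector[rowId + i])
            // indexes past the page once rowId outgrows it.
        }
    }

    public static void main(String[] args) {
        int[] page1 = {3, 1, 2};
        int[] page2 = {2, 4};
        int[] out = new int[5];

        // First page: rowId and currentRow coincide, so the bug is hidden.
        readLengths(page1, 3, 0, 0, out);
        // Resumed on a second page: rowId = 3 but currentRow resets to 0.
        // lengthsVector[rowId + i] would throw ArrayIndexOutOfBoundsException
        // here (or silently misread lengths on a larger page).
        readLengths(page2, 2, 0, 3, out);

        System.out.println(Arrays.toString(out)); // [3, 1, 2, 2, 4]
    }
}
```

Since plain scans consume pages in lockstep (`rowId` equals `currentRow`), only an operation like `explode`, which remaps offsets mid-stream, exposes the divergence.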
Binary file not shown.

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala

Lines changed: 10 additions & 0 deletions
@@ -1307,6 +1307,16 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
     }
   }

+  test("explode nested lists crossing a rowgroup boundary") {
+    withAllParquetReaders {
+      checkAnswer(
+        readResourceParquetFile("test-data/packed-list-vectorized.parquet")
+          .selectExpr("explode(DIStatus.command_status.actions_status)")
+          .selectExpr("col.result"),
+        List.fill(4992)(Row("SUCCESS")))
+    }
+  }
+
   test("read dictionary encoded decimals written as INT64") {
     withAllParquetReaders {
       checkAnswer(
