Skip to content

Commit 54546be

Browse files
authored
Fix a potential segfault in PQ reader's number of rows per source calculation (#18906)
This PR fixes a potential segfault from illegal memory access (out of range vector index) in computation of number of rows read from each source in parquet reader. The segfault should manifest when the `num_rows` option leads us to not read from all data sources. Consider this example: we have 5 files with 1000 rows each, we set our skip rows to 1500 and num rows to 1000. The `partial_sum_nrows_source` vector (at L725 in `reader_impl.cpp`) would look like this: [0, 500, 1000, 1000, 1000, 1000]. Here, computing the `end_iter` (end_row = 1000) with std::upper_bound will give us `std::end(..)` instead of `std::begin(..) + 2` which should lead to a segfault. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) - Vukasin Milovanovic (https://github.com/vuule) - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #18906
1 parent 4655b4b commit 54546be

File tree

2 files changed

+7
-10
lines changed

2 files changed

+7
-10
lines changed

cpp/src/io/parquet/reader_impl.cpp

Lines changed: 6 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -727,17 +727,14 @@ std::vector<size_t> reader::impl::calculate_output_num_rows_per_source(size_t co
727727
// Binary search start_row and end_row in exclusive_sum_num_rows_per_source vector
728728
auto const start_iter =
729729
std::upper_bound(partial_sum_nrows_source.cbegin(), partial_sum_nrows_source.cend(), start_row);
730-
auto const end_iter =
731-
(end_row == _file_itm_data.global_skip_rows + _file_itm_data.global_num_rows)
732-
? partial_sum_nrows_source.cend() - 1
733-
: std::upper_bound(start_iter, partial_sum_nrows_source.cend(), end_row);
730+
auto const end_iter = std::lower_bound(start_iter, partial_sum_nrows_source.cend(), end_row);
734731

735732
// Compute the array offset index for both iterators
736-
auto const start_idx = std::distance(partial_sum_nrows_source.cbegin(), start_iter);
737-
auto const end_idx = std::distance(partial_sum_nrows_source.cbegin(), end_iter);
738-
739-
CUDF_EXPECTS(start_idx <= end_idx,
740-
"Encountered invalid source files indexes for output chunk row bounds");
733+
auto const start_idx = std::distance(partial_sum_nrows_source.cbegin(), start_iter);
734+
auto const end_idx = std::distance(partial_sum_nrows_source.cbegin(), end_iter);
735+
auto const num_sources = static_cast<cudf::size_type>(partial_sum_nrows_source.size());
736+
CUDF_EXPECTS(start_idx <= end_idx and start_idx < num_sources and end_idx < num_sources,
737+
"Encountered out of range output table chunk row bounds");
741738

742739
// If the entire chunk is from the same source file, then the count is simply num_rows
743740
if (start_idx == end_idx) {

cpp/tests/io/parquet_reader_test.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2390,7 +2390,7 @@ TEST_F(ParquetReaderTest, NumRowsPerSource)
23902390
// Read rows_to_read rows skipping rows_to_skip (> two sources) from ten data sources
23912391
{
23922392
auto constexpr rows_to_skip = 25'999;
2393-
auto constexpr rows_to_read = 47'232;
2393+
auto constexpr rows_to_read = 10'900;
23942394

23952395
auto constexpr nsources = 10;
23962396
std::vector<std::string> const datasources(nsources, filepath);

0 commit comments

Comments
 (0)