Missing lower_bound/upper_bound incorrectly treated as though all values for column in partition are NULL #1354

Open
@Nathan-Fenner

Description

Apache Iceberg Rust version

Current, v0.4.0

Describe the bug

The current implementation for partition filtering treats a missing lower_bound/upper_bound value as though all rows are null:

Current manifest_evaluator.rs:

    fn less_than(
        &mut self,
        reference: &BoundReference,
        datum: &Datum,
        _predicate: &BoundPredicate,
    ) -> crate::Result<bool> {
        let field = self.field_summary_for_reference(reference);
        match &field.lower_bound {
            Some(bound) if datum <= bound => ROWS_CANNOT_MATCH,
            Some(_) => ROWS_MIGHT_MATCH,
            None => ROWS_CANNOT_MATCH,
        }
    }

This means that if bounds statistics were not computed for a given partition file, that file is always excluded from scans that filter on the column with a comparison, even when it contains matching rows.

For comparison, the Java implementation handles this correctly:

      T lower = lowerBound(term);
      if (null == lower || NaNUtil.isNaN(lower)) {
        // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
        return ROWS_MIGHT_MATCH;
      }

A null lower bound is treated there as meaning that rows might match, so the file is kept.
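For illustration, here is a minimal sketch of what the corrected `less_than` arm could look like. It is not the actual patch: `FieldSummary` is a hypothetical, simplified stand-in that stores the bound as a plain `i64` instead of a `Datum`, and the function takes the summary directly instead of resolving it from a `BoundReference`. The NaN check from the Java version is left out because integers cannot be NaN.

```rust
// Simplified stand-ins for the constants used in manifest_evaluator.rs.
const ROWS_MIGHT_MATCH: bool = true;
const ROWS_CANNOT_MATCH: bool = false;

/// Hypothetical, simplified field summary: the bound is a plain integer
/// here, whereas the real summary stores an optional `Datum`.
struct FieldSummary {
    lower_bound: Option<i64>,
}

/// Sketch of evaluating `column < datum` against a field summary.
fn less_than(field: &FieldSummary, datum: i64) -> bool {
    match field.lower_bound {
        // Every value is >= lower_bound, so if datum <= lower_bound
        // no value can be strictly less than datum.
        Some(bound) if datum <= bound => ROWS_CANNOT_MATCH,
        Some(_) => ROWS_MIGHT_MATCH,
        // No statistics available: nothing can be ruled out.
        None => ROWS_MIGHT_MATCH,
    }
}
```

The only behavioral change relative to the snippet quoted above is the `None` arm, which now returns `ROWS_MIGHT_MATCH`.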

To Reproduce

No response

Expected behavior

A partition file with a missing lower_bound or upper_bound for a column should not be excluded from scans that filter on that column with </<=/>/>=.
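To make the expected behavior concrete across all four operators, here is a rough sketch under simplifying assumptions: `Comparison` and `might_match` are hypothetical names that do not exist in the crate, and bounds are plain `i64` values instead of `Datum`. Which bound each operator consults follows the usual inclusive-evaluation logic (`<` and `<=` use the lower bound, `>` and `>=` use the upper bound).

```rust
/// Hypothetical enum naming the four comparison predicates.
#[derive(Clone, Copy)]
enum Comparison {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Returns true if a file whose column values lie in [lower, upper]
/// might contain a row satisfying `column <op> datum`.
/// A missing bound (`None`) never excludes the file.
fn might_match(op: Comparison, lower: Option<i64>, upper: Option<i64>, datum: i64) -> bool {
    match op {
        // `<` and `<=` can only be ruled out by the lower bound.
        Comparison::Lt => lower.map_or(true, |lo| lo < datum),
        Comparison::LtEq => lower.map_or(true, |lo| lo <= datum),
        // `>` and `>=` can only be ruled out by the upper bound.
        Comparison::Gt => upper.map_or(true, |hi| hi > datum),
        Comparison::GtEq => upper.map_or(true, |hi| hi >= datum),
    }
}
```

With both bounds missing, `might_match` returns `true` for every operator and every `datum`, which is the behavior this issue asks for.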

Willingness to contribute

I can contribute a fix for this bug independently

Labels

bug
