[BUG]: Datetime profiling fails on datetime.max and ignores round=False option #475

@bsr-the-mngrm

Description

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When calling DQProfiler.profile_table() with options={"round": False} on a Delta table that contains datetime.datetime.max values, the profiler crashes with:

OverflowError: date value out of range

The crash occurs in _round_datetime during profiling. Independently, the round=False option is silently ignored due to a logic bug, so the rounding path runs even when the user disables it (see root causes below).

Relevant traceback:

  File .../profiler.py, line 680, in DQProfiler._round_datetime
    return value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
OverflowError: date value out of range

Root causes:

  1. _get_min_max() receives the opts dictionary, but does not pass it to _adjust_min_max_limits(). As a result, _round_value() always uses opts={"round": True}, even when the user specifies round=False.

  2. _round_datetime() does not safely handle values near datetime.datetime.max, so attempting to round “up” overflows (a plain-Python reproduction follows below).
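
For reference, the overflow reproduces in plain Python with no Spark involved; this is the same arithmetic the traceback shows _round_datetime performing when rounding “up”:

    import datetime

    value = datetime.datetime.max.replace(microsecond=0)  # 9999-12-31 23:59:59
    # Truncate to midnight, then add one day -- the result would land in
    # year 10000, which datetime cannot represent:
    value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
    # OverflowError: date value out of range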

Expected Behavior

  • options={"round": False} should prevent all rounding, including datetime rounding, throughout the profiling logic.
  • If a value is datetime.max, the rounding logic should detect this and skip the + timedelta(days=1) step that overflows.
  • Ideally, the function should cap values at datetime.max or log a warning, rather than fail.

Suggested Fixes

  • Propagate opts to _adjust_min_max_limits() to respect user config.
  • Update _round_datetime() to detect near-overflow conditions and skip rounding or cap safely (a sketch follows below).
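
As a rough sketch of the second fix, assuming the signature implied by the traceback above (the real dqx source may differ), an overflow guard could look like this:

    import datetime

    # Illustrative paraphrase of the profiler internals; the name and
    # signature follow the traceback in this issue, not the dqx source.
    def _round_datetime(value: datetime.datetime, direction: str) -> datetime.datetime:
        truncated = value.replace(hour=0, minute=0, second=0, microsecond=0)
        if direction == "down":
            return truncated
        if direction == "up":
            # For any value on the last representable day,
            # truncated + timedelta(days=1) would exceed datetime.max.
            if truncated > datetime.datetime.max - datetime.timedelta(days=1):
                return datetime.datetime.max  # cap instead of overflowing
            return truncated + datetime.timedelta(days=1)
        return value

    # Rounding datetime.max "up" now caps instead of raising OverflowError:
    assert _round_datetime(datetime.datetime.max, "up") == datetime.datetime.max

The first fix (forwarding the opts argument from _get_min_max() to _adjust_min_max_limits()) is a one-line change and is not sketched here.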

Steps To Reproduce

  1. Create a Delta table with a timestamp column containing a value equal to or close to datetime.datetime.max:
    import datetime
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame([
        (datetime.datetime.max.replace(microsecond=0),),  # 9999-12-31 23:59:59
    ], ["_end_date"])
    
    df.write.format("delta").mode("overwrite").saveAsTable("your_catalog.your_schema.your_table")
  2. Use databricks.labs.dqx to profile this table, specifying options={"round": False}:
    from databricks.labs.dqx.profiler.profiler import DQProfiler
    from databricks.sdk import WorkspaceClient
    
    ws = WorkspaceClient()
    profiler = DQProfiler(ws)
    
    summary_stats, profiles = profiler.profile_table(
        table="your_catalog.your_schema.your_table",
        columns=["_end_date"],
        options={"round": False}
    )
  3. Observe the crash:
    OverflowError: date value out of range

Cloud

Azure

Operating System

Linux

Relevant log output

OverflowError: date value out of range
  File <command-MASKED>, line 7
      4 ws = WorkspaceClient()
      5 profiler = DQProfiler(ws)
----> 7 summary_stats, profiles = profiler.profile_table(
      8     table="MASKED.MASKED.MASKED",
      9     columns=["MASKED", "MASKED", "_end_date"],  # specify columns to profile
     10     options={"round": False}
     11 )
     13 print("Summary Statistics:", summary_stats)
     14 print("Generated Profiles:", profiles)

  File /MASKED/lib/python3.12/site-packages/databricks/labs/dqx/profiler/profiler.py, line 680, in DQProfiler._round_datetime
    678     return value.replace(hour=0, minute=0, second=0, microsecond=0)
    679 if direction == "up":
--> 680     return value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
    681 return value

Labels

bug (Something isn't working)
