Status: Open
Labels: bug (Something isn't working)
Description
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
When calling `DQProfiler.profile_table()` with `options={"round": False}` on a Delta table that contains `datetime.datetime.max` values, the profiler crashes with `OverflowError: date value out of range`. The crash occurs in `_round_datetime` during profiling. In addition, the `round=False` option is ignored entirely due to a logic bug in the code.
Relevant traceback:

```
File .../profiler.py, line 680, in DQProfiler._round_datetime
    return value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
OverflowError: date value out of range
```
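The overflow is reproducible outside DQX with a few lines of standard-library Python; this sketch mirrors the expression on line 680 of the traceback:

```python
import datetime

# Mirror the expression from the traceback: round a timestamp "up" to the
# next midnight. For datetime.max, the next midnight is outside the
# representable range, so the addition raises OverflowError.
value = datetime.datetime.max.replace(microsecond=0)
try:
    value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
    raised = False
except OverflowError as exc:
    raised = True
    print(exc)  # date value out of range
```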
Root causes:
- `_get_min_max()` receives the `opts` dictionary but does not pass it to `_adjust_min_max_limits()`. As a result, `_round_value()` always uses `opts={"round": True}`, even when the user specifies `round=False`.
- `_round_datetime()` does not safely handle values near `datetime.datetime.max`, causing an overflow when attempting to round "up".
Expected Behavior
- `options={"round": False}` should prevent all rounding, including datetime rounding, throughout the profiling logic.
- If a value is `datetime.max`, the rounding logic should detect this and avoid the `+ timedelta(days=1)` that results in overflow.
- Ideally, the function should cap values at `datetime.max` or log a warning, rather than fail.
Suggested Fixes
- Propagate `opts` to `_adjust_min_max_limits()` to respect user config.
- Update `_round_datetime()` to detect near-overflow conditions and skip rounding or cap safely.
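One possible shape for the second fix, sketched as a standalone function (`round_datetime_safe` is a hypothetical name; the signature and the `do_round` parameter are assumptions, not the current `_round_datetime` API):

```python
import datetime

def round_datetime_safe(value: datetime.datetime, direction: str,
                        do_round: bool = True) -> datetime.datetime:
    """Round a timestamp to a day boundary, honoring round=False and
    capping at datetime.max instead of overflowing."""
    if not do_round:
        return value
    floor = value.replace(hour=0, minute=0, second=0, microsecond=0)
    if direction == "down":
        return floor
    if direction == "up":
        # Rounding up from the last representable day would overflow; cap.
        if floor > datetime.datetime.max - datetime.timedelta(days=1):
            return datetime.datetime.max
        return floor + datetime.timedelta(days=1)
    return value
```

For `datetime.max` the "up" direction now returns `datetime.max` instead of raising, and `do_round=False` short-circuits all rounding.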
Steps To Reproduce
- Create a Delta table with a timestamp column containing a value equal to or close to `datetime.datetime.max`:

```python
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(datetime.datetime.max.replace(microsecond=0),)],
    ["_end_date"],
)
df.write.format("delta").mode("overwrite").saveAsTable("your_catalog.your_schema.your_table")
```

- Use `databricks.labs.dqx` to profile this table, specifying `options={"round": False}`:

```python
from databricks.labs.dqx import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile_table(
    table="your_catalog.your_schema.your_table",
    columns=["_end_date"],
    options={"round": False},
)
```

- Observe the crash: `OverflowError: date value out of range`
Cloud: Azure
Operating System: Linux
Relevant log output

```
OverflowError: date value out of range
File <command-MASKED>, line 7
      4 ws = WorkspaceClient()
      5 profiler = DQProfiler(ws)
----> 7 summary_stats, profiles = profiler.profile_table(
      8     table="MASKED.MASKED.MASKED",
      9     columns=["MASKED", "MASKED", "_end_date"],  # specify columns to profile
     10     options={"round": False}
     11 )
     13 print("Summary Statistics:", summary_stats)
     14 print("Generated Profiles:", profiles)
File /MASKED/lib/python3.12/site-packages/databricks/labs/dqx/profiler/profiler.py, line 680, in DQProfiler._round_datetime
    678     return value.replace(hour=0, minute=0, second=0, microsecond=0)
    679 if direction == "up":
--> 680     return value.replace(hour=0, minute=0, second=0, microsecond=0) + datetime.timedelta(days=1)
    681 return value
```