GH-41476: [Python][C++] Impossible to specify is_adjusted_to_utc for Time type when writing to Parquet #47316
…AdjustedToUTC field for the Time Parquet logical type
I think this is something that should be fixed on the Spark side per the discussion from the old PR that you've mentioned.
Hi @wgtmac, Thanks for sharing your thoughts on this. I agree with you that the best case scenario would be for the Apache Spark community to extend the Spark Parquet reader to support the The decision to default to Given that the Parquet specification allows for writing local Given the complexity of this issue, does anyone feel that it would be helpful to ask for clarification from the broader Parquet community about this? It appears others have been confused about the purpose of the I really appreciate hearing everyone's thoughts on this. This is definitely a nuanced issue, and I am comfortable with whatever direction the community collectively feels is most appropriate. However, in my personal opinion, this would be a worthwhile change. Thanks! Best,
The previous PR on this didn't make any progress because we found that this is unclear on the Parquet side: #43268 (comment). Perhaps the right direction is to remove
Rationale for this change
As of today, it's not possible to write Parquet `TIME` data whose `isAdjustedToUTC` parameter is `false`. Instead, `isAdjustedToUTC` is hard-coded to `true` here.

Unfortunately, some Parquet consumers only support `TIME` data if the `isAdjustedToUTC` parameter is `false`, meaning they cannot import Parquet `TIME` data generated by our Parquet Writer. For example, the apache/spark Parquet reader only supports Parquet `TIME` columns if `isAdjustedToUTC=false` and `units=MICROSECONDS`.

Adding support for writing `TIME` data with `isAdjustedToUTC` set to `false` would unblock users who need to write Spark-compatible Parquet data.

What changes are included in this PR?
- Add `write_time_adjusted_to_utc` as a property to `parquet::ArrowWriterProperties`. If `true`, all `TIME` columns have their `isAdjustedToUTC` parameters set to `true`. Otherwise, `isAdjustedToUTC` is set to `false` for all `TIME` columns. This property is `true` by default.
- Add `enable_write_time_adjusted_to_utc()` and `disable_write_time_adjusted_to_utc()` methods to `parquet::ArrowWriterProperties::Builder`.

Are these changes tested?
Yes. I added test case `ParquetTimeAdjustedToUTC` to test suite `TestConvertArrowSchema`.

Are there any user-facing changes?
Yes. Users can now write Parquet `TIME` columns whose `isAdjustedToUTC` parameter is `false`.

NOTE
`is_adjusted_to_utc` for Time type when writing to Parquet #41476
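
To make the described semantics concrete, here is a minimal, self-contained sketch of how the proposed property and builder methods might behave. Everything in the `sketch` namespace is a stand-in written for illustration: it is not the real `parquet::ArrowWriterProperties` implementation, and the `TimeLogicalType`, `TimeTypeFor`, and `ReadableBySpark` helpers are hypothetical names that do not exist in Arrow or Spark. Only the method names and the `true` default are taken from the PR description.

```cpp
#include <string>

// Stand-in types for illustration only; these are NOT the real parquet:: classes.
namespace sketch {

class ArrowWriterProperties {
 public:
  class Builder {
   public:
    // Method names follow the PR description.
    Builder* enable_write_time_adjusted_to_utc() {
      write_time_adjusted_to_utc_ = true;
      return this;
    }
    Builder* disable_write_time_adjusted_to_utc() {
      write_time_adjusted_to_utc_ = false;
      return this;
    }
    ArrowWriterProperties build() const;

   private:
    // Per the PR description, the property is true by default.
    bool write_time_adjusted_to_utc_ = true;
  };

  bool write_time_adjusted_to_utc() const { return write_time_adjusted_to_utc_; }

 private:
  explicit ArrowWriterProperties(bool adjusted)
      : write_time_adjusted_to_utc_(adjusted) {}
  bool write_time_adjusted_to_utc_;
};

inline ArrowWriterProperties ArrowWriterProperties::Builder::build() const {
  return ArrowWriterProperties(write_time_adjusted_to_utc_);
}

// Hypothetical, simplified view of the Parquet TIME logical type annotation.
struct TimeLogicalType {
  bool is_adjusted_to_utc;
  std::string unit;  // e.g. "MILLISECONDS", "MICROSECONDS"
};

// Hypothetical helper: the property value is applied to every TIME column.
inline TimeLogicalType TimeTypeFor(const ArrowWriterProperties& props,
                                   const std::string& unit) {
  return TimeLogicalType{props.write_time_adjusted_to_utc(), unit};
}

// Hypothetical helper encoding the Spark reader constraint quoted above:
// isAdjustedToUTC=false and units=MICROSECONDS.
inline bool ReadableBySpark(const TimeLogicalType& t) {
  return !t.is_adjusted_to_utc && t.unit == "MICROSECONDS";
}

}  // namespace sketch
```

Under this sketch, a default-built properties object yields `TIME` columns with `isAdjustedToUTC=true`, which the modeled Spark constraint rejects; calling `disable_write_time_adjusted_to_utc()` on the builder before `build()` yields `isAdjustedToUTC=false`, which it accepts for microsecond columns.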