Conversation

@zhengruifeng (Contributor) commented on Oct 16, 2025

What changes were proposed in this pull request?

Fix decimal rescaling in createDataFrame

Why are the changes needed?

This query works in Spark Classic but fails in Spark Connect.

Classic:

In [1]: import decimal

In [2]: df = spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])

In [3]: df
Out[3]: DataFrame[d: decimal(38,18)]

In [4]: df.show()
+--------------------+
|                   d|
+--------------------+
|1.233999999999999986|
+--------------------+

Connect:

In [1]: import decimal

In [2]: df = spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 2
      1 import decimal
----> 2 spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])

File ~/spark/python/pyspark/sql/connect/session.py:740, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    733     from pyspark.sql.conversion import (
    734         LocalDataToArrowConversion,
    735     )
    737     # Spark Connect will try its best to build the Arrow table with the
    738     # inferred schema in the client side, and then rename the columns and
    739     # cast the datatypes in the server side.
--> 740     _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)

...

ArrowInvalid: Rescaling Decimal value would cause data loss

The root cause is data loss in the Arrow conversion:

In [13]: d = decimal.Decimal(1.234)

In [14]: d
Out[14]: Decimal('1.2339999999999999857891452847979962825775146484375')

In [15]: pa.scalar(d)
Out[15]: <pyarrow.Decimal256Scalar: Decimal('1.2339999999999999857891452847979962825775146484375')>

In [16]: pa.scalar(d).cast(pa.decimal128(38, 18))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[10], line 1
----> 1 pa.scalar(d).cast(pa.decimal128(38, 18))

...

ArrowInvalid: Rescaling Decimal value would cause data loss
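
Rounding the input to the target scale before the cast avoids the loss. A minimal standalone sketch of the idea, using only pyarrow (the printed values follow from the session above):

import decimal
import pyarrow as pa

d = decimal.Decimal(1.234)  # Decimal('1.2339999999999999857891...')

# Quantize to the target scale (18) first. The 19th fractional digit is 7,
# so this rounds up to Decimal('1.233999999999999986'), matching the
# Classic output shown above.
rounded = round(d, 18).normalize()

# The cast is now lossless, since the value already fits in scale 18.
print(pa.scalar(rounded).cast(pa.decimal128(38, 18)))
# <pyarrow.Decimal128Scalar: Decimal('1.233999999999999986')>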

Does this PR introduce any user-facing change?

Yes, the query above works after this PR.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

No.

fix

Diff context under review (the decimal converter; old line vs. new line):

            raise PySparkValueError(f"input for {dataType} must not be None")
        return None
-   return value
+   return round(value, dataType.scale).normalize()
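
For context, a sketch of how such a converter fits together (illustrative only; make_decimal_converter and its arguments are stand-ins inferred from the diff context above, not the exact PR code):

import decimal

from pyspark.errors import PySparkValueError

def make_decimal_converter(dataType, nullable):
    # dataType is a pyspark.sql.types.DecimalType; its scale drives the rounding.
    def convert_decimal(value):
        if value is None:
            if not nullable:
                raise PySparkValueError(f"input for {dataType} must not be None")
            return None
        assert isinstance(value, decimal.Decimal)
        # Round to the column's scale before Arrow sees the value, so the
        # later cast to decimal128(precision, scale) cannot lose digits.
        return round(value, dataType.scale).normalize()

    return convert_decimal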
Member:

Just a question. Does this mean the result of Classic is changed by this PR too?

Member:

I think no. This is only used in Spark Connect, IIUC.

Member:

Oh wait, this is also used in some places in the workers.

Are these safe?

Contributor Author:

I see some other tests failed; let me double-check.

HyukjinKwon previously approved these changes on Oct 16, 2025.
@HyukjinKwon dismissed their stale review on October 16, 2025 at 23:41:

Oh let me double check

@zhengruifeng changed the title from [SPARK-53938][PYTHON][CONNECT] Fix decimal rescaling in createDataFrame to [WIP][SPARK-53938][PYTHON][CONNECT] Fix decimal rescaling in createDataFrame on Oct 17, 2025.
