Conversation

@zhengruifeng (Contributor) commented on Oct 16, 2025

What changes were proposed in this pull request?

Fix decimal rescaling in createDataFrame

Why are the changes needed?

This query works in Spark Classic but fails in Spark Connect.

Classic:

In [1]: import decimal

In [2]: df = spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])

In [3]: df
Out[3]: DataFrame[d: decimal(38,18)]

In [4]: df.show()
+--------------------+
|                   d|
+--------------------+
|1.233999999999999986|
+--------------------+

Connect:

In [1]: import decimal

In [2]: df = spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 2
      1 import decimal
----> 2 spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])

File ~/spark/python/pyspark/sql/connect/session.py:740, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    733     from pyspark.sql.conversion import (
    734         LocalDataToArrowConversion,
    735     )
    737     # Spark Connect will try its best to build the Arrow table with the
    738     # inferred schema in the client side, and then rename the columns and
    739     # cast the datatypes in the server side.
--> 740     _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)

...

ArrowInvalid: Rescaling Decimal value would cause data loss

The root cause is data loss in the Arrow conversion:

In [13]: d = decimal.Decimal(1.234)

In [14]: d
Out[14]: Decimal('1.2339999999999999857891452847979962825775146484375')

In [15]: pa.scalar(d)
Out[15]: <pyarrow.Decimal256Scalar: Decimal('1.2339999999999999857891452847979962825775146484375')>

In [16]: pa.scalar(d).cast(pa.decimal128(38, 18))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[10], line 1
----> 1 pa.scalar(d).cast(pa.decimal128(38, 18))

...

ArrowInvalid: Rescaling Decimal value would cause data loss
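
Rounding the input to the target scale before the cast avoids the loss. A minimal standalone sketch of the idea, using only pyarrow (the printed values follow from the session above):

import decimal
import pyarrow as pa

d = decimal.Decimal(1.234)  # Decimal('1.2339999999999999857891...')

# Quantize to the target scale (18) first. The 19th fractional digit is 7,
# so this rounds up to Decimal('1.233999999999999986'), matching the
# Classic output shown above.
rounded = round(d, 18).normalize()

# The cast is now lossless, since the value already fits in scale 18.
print(pa.scalar(rounded).cast(pa.decimal128(38, 18)))
# <pyarrow.Decimal128Scalar: Decimal('1.233999999999999986')>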

Does this PR introduce any user-facing change?

Yes, the query above works after this PR.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

No.

fix

Diff context under review (the decimal converter; old line vs. new line):

            raise PySparkValueError(f"input for {dataType} must not be None")
        return None
-   return value
+   return round(value, dataType.scale).normalize()
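
For context, a sketch of how such a converter fits together (illustrative only; make_decimal_converter and its arguments are stand-ins inferred from the diff context above, not the exact PR code):

import decimal

from pyspark.errors import PySparkValueError

def make_decimal_converter(dataType, nullable):
    # dataType is a pyspark.sql.types.DecimalType; its scale drives the rounding.
    def convert_decimal(value):
        if value is None:
            if not nullable:
                raise PySparkValueError(f"input for {dataType} must not be None")
            return None
        assert isinstance(value, decimal.Decimal)
        # Round to the column's scale before Arrow sees the value, so the
        # later cast to decimal128(precision, scale) cannot lose digits.
        return round(value, dataType.scale).normalize()

    return convert_decimal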
Member:

Just a question. Does this mean the result of Classic is changed by this PR too?

Member:

I think no. This is only used in Spark Connect, IIUC.

Member:

Oh wait, this is also used in some places in the workers.

Are these safe?

Contributor Author:

I see some other tests failed; let me double-check.

HyukjinKwon previously approved these changes on Oct 16, 2025.
@HyukjinKwon dismissed their stale review on October 16, 2025 at 23:41:

Oh let me double check

@zhengruifeng changed the title from [SPARK-53938][PYTHON][CONNECT] Fix decimal rescaling in createDataFrame to [WIP][SPARK-53938][PYTHON][CONNECT] Fix decimal rescaling in createDataFrame on Oct 17, 2025.
