Skip to content

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented Oct 14, 2025

What changes were proposed in this pull request?

Fixes observations on Spark Connect with plan cache enabled.

Why are the changes needed?

The observations on Spark Connect with plan cache enabled has cases that can't see the observed values.

>>> from pyspark.sql.observation import Observation
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.range(10)
>>> observation = Observation()
>>> observed_df = df.observe(observation, F.count(F.lit(1)).alias("cnt"))
>>>
>>> observed_df.schema  # causes to cache
StructType([StructField('id', LongType(), False)])
>>>
>>> observed_df.show()
+---+
| id|
+---+
...
+---+

>>>
>>> observation.get
{}

This should be:

>>> spark.conf.set('spark.connect.session.planCache.enabled', False)
...
>>> observation.get
{'cnt': 10}

This is because the cached plan by the Analyze request for observed_df.schema doesn't register the new Observation for the actual execution.

The observations should be registered later.

Does this PR introduce any user-facing change?

Yes, the observations will be available.

How was this patch tested?

Modified the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@ueshin
Copy link
Member Author

ueshin commented Oct 16, 2025

Thanks! merging to master.

@ueshin ueshin closed this in eb117a6 Oct 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants