
Conversation

@ueshin (Member) commented Sep 12, 2025

What changes were proposed in this pull request?

Supports complex types on observations.

Why are the changes needed?

Observations did not previously support complex types.

For example:

>>> observation = Observation("struct")
>>> df = spark.range(10).observe(
...     observation,
...     F.struct(F.count(F.lit(1)).alias("rows"), F.max("id").alias("maxid")).alias("struct"),
... )
classic:
>>> df.collect()
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4), Row(id=5), Row(id=6), Row(id=7), Row(id=8), Row(id=9)]
>>> observation.get
{'struct': JavaObject id=o61}
connect:
>>> df.collect()
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkTypeError: [UNSUPPORTED_LITERAL] Unsupported Literal 'struct {
...

Does this PR introduce any user-facing change?

Yes, complex types are now available on observations.

>>> df.collect()
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4), Row(id=5), Row(id=6), Row(id=7), Row(id=8), Row(id=9)]
>>>
>>> observation.get
{'struct': Row(rows=10, maxid=9)}
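Under the hood, the client has to convert the nested literal returned by the server into Python values recursively. A minimal, dependency-free sketch of that idea, using hypothetical stand-in classes rather than the real generated Spark Connect proto messages (and a dict instead of the Row that PySpark actually builds):

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical stand-ins for the Spark Connect literal messages; the real
# ones are generated from spark/connect/expressions.proto.
@dataclass
class StructLiteral:
    names: List[str]
    values: List[Any]

@dataclass
class ArrayLiteral:
    elements: List[Any]

def to_value(lit: Any) -> Any:
    """Recursively convert a literal into a plain Python value.

    PySpark builds a Row for structs; a dict keeps this sketch
    self-contained.
    """
    if isinstance(lit, StructLiteral):
        return {name: to_value(v) for name, v in zip(lit.names, lit.values)}
    if isinstance(lit, ArrayLiteral):
        return [to_value(e) for e in lit.elements]
    return lit  # primitive literals pass through unchanged

# The observed metric from the example above, as a nested literal:
metric = StructLiteral(["rows", "maxid"], [10, 9])
print(to_value(metric))  # {'rows': 10, 'maxid': 9}
```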

How was this patch tested?

Added the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@ueshin (Member Author) commented Sep 12, 2025

cc @heyihong

@ueshin ueshin requested a review from zhengruifeng September 12, 2025 01:52
@@ -436,11 +442,48 @@ def _to_value(
assert dataType is None or isinstance(dataType, DayTimeIntervalType)
return DayTimeIntervalType().fromInternal(literal.day_time_interval)
elif literal.HasField("array"):
elementType = proto_schema_to_pyspark_data_type(literal.array.element_type)
if literal.array.HasField("data_type"):
Member
Curious: when does data_type exist and when does it not?

@ueshin (Member Author)

For Array, for example, data_type was added and element_type was deprecated:

https://github.com/apache/spark/blob/master/sql/connect/common/src/main/protobuf/spark/connect/expressions.proto#L227-L244

but the creator of the message may or may not follow the deprecation, so this code should be able to handle both cases.

Same for Map and Struct.
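The both-cases handling described above could look roughly like this. The stub class below only mimics the proto HasField interface for illustration; it is an assumption, not the actual generated spark.connect.Expression.Literal.Array message:

```python
class ArrayLiteralStub:
    """Hypothetical stand-in for a generated proto message with an
    optional new data_type field alongside a deprecated element_type."""

    def __init__(self, element_type=None, data_type=None):
        self.element_type = element_type  # deprecated field
        self.data_type = data_type        # newer field, may be absent

    def HasField(self, name: str) -> bool:
        # Real proto messages expose HasField for optional fields.
        return getattr(self, name) is not None


def resolve_element_type(array_literal):
    # Prefer the new data_type field when the sender populated it,
    # but fall back to the deprecated element_type otherwise.
    if array_literal.HasField("data_type"):
        return array_literal.data_type
    return array_literal.element_type


print(resolve_element_type(ArrayLiteralStub(element_type="int")))  # int
print(resolve_element_type(
    ArrayLiteralStub(element_type="int", data_type="bigint")))     # bigint
```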

@heyihong (Contributor) commented Sep 13, 2025

To minimize changes, it is not necessary to support these new data type fields. They are still under development and not fully stabilized yet. Currently, the new data type fields are only used in the requests from the Spark Connect Scala Client.

If we really want to support them, I think we need to consider:

  • How can we enable the new data type fields in the responses while maintaining backward compatibility?
  • How is this code path tested?

@ueshin (Member Author) commented Sep 13, 2025

Sure, I'll revert the changes and leave it to you.
