
Conversation

shujingyang-db
Contributor

What changes were proposed in this pull request?

This PR supports df.asTable() for Arrow UDTFs in Spark Connect by correcting the schema creation in ArrowStreamArrowUDTFSerializer.load_stream(). The original code incorrectly used pa.schema(struct.type), which created a schema with the entire struct as a single field, instead of extracting the individual field names with [field.name for field in struct.type] to properly create a RecordBatch from the flattened arrays.

This ensures table arguments are passed to Arrow UDTFs as proper pa.RecordBatch objects.
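Below is a minimal standalone sketch of the before/after difference (the sample struct array is purely illustrative; this is not the serializer code itself):

import pyarrow as pa

# Illustrative struct column with a non-nullable field (made-up data).
struct = pa.array(
    [{"id": 1}, {"id": 2}],
    type=pa.struct([pa.field("id", pa.int64(), nullable=False)]),
)

# Before: build the RecordBatch with the struct's own schema.
before = pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))

# After: build it from the flattened arrays plus the extracted field names.
flattened_arrays = struct.flatten()
field_names = [field.name for field in struct.type]
after = pa.RecordBatch.from_arrays(flattened_arrays, names=field_names)

print(before.schema)  # e.g. id: int64 not null
print(after.schema)   # e.g. id: int64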

Why are the changes needed?

This is a bug fix: without it, table arguments created via df.asTable() do not work for Arrow UDTFs in Spark Connect.

Does this PR introduce any user-facing change?

No

How was this patch tested?

unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@ueshin ueshin (Member) left a comment:

Otherwise, LGTM.

Comment on lines 25 to 29
def test_arrow_udtf_with_table_argument_basic(self):
    super().test_arrow_udtf_with_table_argument_basic()

# TODO(SPARK-53323): Support table arguments in Spark Connect Arrow UDTFs
@unittest.skip("asTable() is not supported in Spark Connect")
def test_arrow_udtf_with_table_argument_and_scalar(self):
    super().test_arrow_udtf_with_table_argument_and_scalar()
Member:
Let's remove these and add pass to the class body.
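A minimal sketch of what the suggested change might look like; the class and base-class names here are hypothetical placeholders, not the actual test classes:

class ArrowUDTFParityTests(ArrowUDTFTestsMixin, ReusedConnectTestCase):
    # Hypothetical class/bases: with table arguments now supported in Spark
    # Connect, the skip overrides are dropped and only pass remains in the body.
    pass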

Comment on lines +221 to +224
+ flattened_arrays = struct.flatten()
+ field_names = [field.name for field in struct.type]
  flattened_batch = pa.RecordBatch.from_arrays(
-     struct.flatten(), schema=pa.schema(struct.type)
+     flattened_arrays, names=field_names
Contributor:
Hmm, but why does it work for Spark classic but not Spark Connect? This code path should be shared by both.

Member:
I checked the schemas generated by both approaches with pyarrow 15.0.2 and 20.0.0:

print(f"{flattened_batch.schema} <=====> {pa.schema(struct.type)}")

and saw in the tests:

id: int64 <=====> id: int64 not null

or

id: int64 <=====> id: int64

Seems like the original schema is more accurate?

@HyukjinKwon HyukjinKwon changed the title [SPARK-53323] Support df.asTable() for Arrow UDTF in Spark Connect [SPARK-53323][CONNECT] Support df.asTable() for Arrow UDTF in Spark Connect Sep 12, 2025