[SPARK-53323][CONNECT] Support df.asTable() for Arrow UDTF in Spark Connect #52320
Conversation
Otherwise, LGTM.
```python
    def test_arrow_udtf_with_table_argument_basic(self):
        super().test_arrow_udtf_with_table_argument_basic()

    # TODO(SPARK-53323): Support table arguments in Spark Connect Arrow UDTFs
    @unittest.skip("asTable() is not supported in Spark Connect")
    def test_arrow_udtf_with_table_argument_and_scalar(self):
        super().test_arrow_udtf_with_table_argument_and_scalar()
```
Let's remove these and add `pass` on the class body.
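The suggestion above can be sketched roughly as follows. The class and mixin names here are hypothetical stand-ins; in the real suite a shared test mixin is combined with a Connect-specific `TestCase`:

```python
import unittest

# Hypothetical stand-in for the shared test mixin in the real suite.
class ArrowUDTFTestsMixin:
    def test_arrow_udtf_with_table_argument_basic(self):
        pass  # the real mixin holds the actual assertions

# Before: every inherited test re-declared just to skip it.
class ArrowUDTFParityTestsBefore(ArrowUDTFTestsMixin, unittest.TestCase):
    @unittest.skip("asTable() is not supported in Spark Connect")
    def test_arrow_udtf_with_table_argument_basic(self):
        super().test_arrow_udtf_with_table_argument_basic()

# After (the reviewer's suggestion): drop the skip overrides and leave the
# class body as `pass`, so every mixin test runs unchanged under Connect.
class ArrowUDTFParityTestsAfter(ArrowUDTFTestsMixin, unittest.TestCase):
    pass
```

With the fix in this PR, the previously skipped tests pass under Spark Connect, so the overrides can simply be deleted.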
flattened_arrays = struct.flatten() | ||
field_names = [field.name for field in struct.type] | ||
flattened_batch = pa.RecordBatch.from_arrays( | ||
struct.flatten(), schema=pa.schema(struct.type) | ||
flattened_arrays, names=field_names |
Hmm, but why does it work for Spark Classic but not Spark Connect? This code path should be shared by both.
I checked the generated schema from both with pyarrow 15.0.2 and 20.0.0:

```python
print(f"{flattened_batch.schema} <=====> {pa.schema(struct.type)}")
```

and saw from the tests:

```
id: int64 <=====> id: int64 not null
```

or

```
id: int64 <=====> id: int64
```

Seems like the original schema is more accurate?
What changes were proposed in this pull request?
This PR supports `df.asTable()` for Arrow UDTFs in Spark Connect by correcting the schema creation in `ArrowStreamArrowUDTFSerializer.load_stream()`. The original code incorrectly used `pa.schema(struct.type)` to build the `RecordBatch` schema, instead of extracting the individual field names with `[field.name for field in struct.type]` and creating the `RecordBatch` from the flattened arrays. This ensures table arguments are passed to Arrow UDTFs as proper `pa.RecordBatch` objects.
Why are the changes needed?
This is a bug fix: without it, `df.asTable()` table arguments do not work for Arrow UDTFs in Spark Connect.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No