feat(yaml): add schema unification for Flatten transform #35672


Closed
wants to merge 9 commits

Conversation

liferoad
Contributor

Implement schema merging for Flatten transform to handle PCollections with different schemas. The unified schema contains all fields from input PCollections, making fields optional to handle missing values. Added a test case to verify the behavior.

Fixes #35666
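The merging rule the description sketches can be written as a small standalone helper. This is hypothetical illustration code, not the PR's implementation, and `unify_schemas` is an invented name: every field from any input schema appears in the output wrapped in Optional, and fields whose types conflict across inputs collapse to Optional[Any].

```python
from typing import Any, Optional


def unify_schemas(schemas):
    """Merge the field dicts of several schemas into one unified schema.

    Every field from any input appears in the output, wrapped in Optional
    (inputs missing the field produce None); fields whose types conflict
    across inputs collapse to Optional[Any].
    """
    all_fields = {}
    for schema in schemas:
        for name, typ in schema.items():
            if name in all_fields and all_fields[name] != Optional[typ]:
                # Same field seen with a different type: widen to Any.
                all_fields[name] = Optional[Any]
            else:
                all_fields[name] = Optional[typ]
    return all_fields
```

For example, flattening inputs with schemas `{'id': int, 'price': float}` and `{'id': int, 'name': str}` yields `{'id': Optional[int], 'price': Optional[float], 'name': Optional[str]}`.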


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

liferoad added 5 commits July 23, 2025 18:43
…nal types

Extract inner types from Optional when unifying schemas to properly handle type unions. Also improve code readability by breaking long lines and clarifying comments.
Fix type resolution for nested generic types by properly extracting inner types when comparing field types. This ensures correct type hints are generated for optional fields in YAML provider.
…ma unification

Handle list types more carefully during schema unification to avoid unsupported Union types. Also ensure iterable values are properly converted to lists when needed for schema compatibility.
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

liferoad added 3 commits July 24, 2025 08:05
… tests

add comprehensive test cases for schema unification in Flatten transform

# Merge all field names and types, making them optional
all_fields = {}
for schema in schemas:
Contributor

Should we check that there aren't conflicting types here (e.g. pcoll1 wants 'foo': int, pcoll2 wants 'foo': str)?

Contributor

Ideally this would yield a Union type.

Contributor Author

I have the tests below to validate that this works with _unify_field_types by treating them as Optional[Any], to simplify the logic for Flatten, given that the Union could otherwise become a very long list (e.g., Optional[Union[int, str, list, ...]]). It would probably be very hard to handle the nested structures.

Contributor

Do we actually need to handle nested structures? Could we just say given:

pcoll1: {'foo': TypeA}
pcoll2: {'foo': TypeB}

outPcoll: {'foo': Union[TypeA, TypeB]}

and ignore the nested representations?
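The reviewer's Union-based alternative can be spelled out as a sketch. This is hypothetical code with an invented name (`unify_as_union`); the PR deliberately avoids Union types:

```python
from typing import Optional, Union


def unify_as_union(existing_type, field_type):
    """Keep both concrete types in a flat Union instead of widening to
    Any; any nested representations inside TypeA/TypeB are left alone."""
    if existing_type == field_type:
        return Optional[existing_type]
    return Optional[Union[existing_type, field_type]]
```

Under this rule, unifying int with str gives Optional[Union[int, str]] rather than Optional[Any].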

Contributor Author

_unify_field_types, for now at least, does not use Union. Whenever two types differ, it uses Optional[Any]. I have some concerns about how accurately we need to infer the schema (e.g., stop at the list level as you suggest, or just do the simplest thing my PR does). I also think we should support specifying the schema explicitly, in which case it would make no sense for us to unify the schemas with our own rules.

Contributor

Assigning reviewers:

R: @claudevdm for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).



Comment on lines +938 to +949
existing_inner = (
    existing_type.__args__[0] if hasattr(existing_type, '__args__') and
    len(existing_type.__args__) == 1 else existing_type)
field_inner = (
    field_type.__args__[0] if hasattr(field_type, '__args__') and
    len(field_type.__args__) == 1 else field_type)

# Handle type unification more carefully
if existing_inner == Any or field_inner == Any:
  return Optional[Any]
elif existing_inner == field_inner:
  return Optional[existing_inner]
Contributor

Will this logic unify Iterable[str], str to Optional[str] since Dict also has args of length 1? I think we want to actually check if the outer type is Optional
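The pitfall raised here is easy to reproduce with the standard typing introspection helpers. This is an illustrative check, independent of the PR code:

```python
import collections.abc
from typing import Iterable, Optional, Union, get_args, get_origin

# Optional[T] is Union[T, None], so it has TWO type arguments; a
# len(__args__) == 1 test never unwraps an Optional.
assert get_args(Optional[int]) == (int, type(None))
assert get_origin(Optional[int]) is Union

# Single-parameter generics like Iterable[str] have exactly ONE type
# argument, so the same test wrongly "unwraps" them to the element type.
assert get_args(Iterable[str]) == (str,)
assert get_origin(Iterable[str]) is collections.abc.Iterable
```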

Contributor Author

The current way is to prioritize List, so it should be Optional[Iterable[str]]. _unify_element_with_schema does this conversion.

Contributor

I see - could you add a comment here explaining that this function expects all iterables to already be coerced to lists? I agree we do that correctly below, but it is probably a little brittle to rely on this without being explicit about the API

Comment on lines +969 to +974
elif existing_is_list:
  # Existing is list, keep it as list type
  return Optional[existing_inner]
elif field_is_list:
  # New field is list, use list type
  return Optional[field_inner]
Contributor

Why is this not just Optional[Union[existing_inner, field_inner]]? Isn't either list or single element valid?

Contributor

Ah, I see we're trying to avoid Union types. Probably this just needs to be encoded as Any then, right?

Contributor Author

That is what line 978 does.

Contributor

Right, but if I'm unifying List[int] and int, right now it unifies to Optional[List[int]], right? But that isn't right if I'm flattening {foo: 1} and {foo: [1,2,3]}
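The coercion the author mentions (_unify_element_with_schema) is not shown in this thread; a minimal sketch of the idea, with an invented helper name, might look like:

```python
def coerce_to_list(value):
    """Wrap scalar values in a single-element list so that rows like
    {'foo': 1} and {'foo': [1, 2, 3]} both fit a List-typed unified
    field; None (a missing field) is passed through unchanged."""
    if value is None or isinstance(value, list):
        return value
    return [value]
```

Applied to each element before type checking, {'foo': 1} becomes {'foo': [1]}, so unifying toward Optional[List[int]] no longer misclassifies scalar rows.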

        existing_type, field_type)
    else:
      # Make field optional since not all PCollections may have it
      all_fields[field_name] = Optional[field_type]
Contributor

Could we keep track of when one of these schema difference conditions is hit and warn?

@liferoad
Contributor Author

/gemini review

Contributor

@gemini-code-assist bot left a comment

Code Review

An excellent addition of schema unification for the Flatten transform. The implementation is well-structured and the accompanying tests are thorough, covering a good range of scenarios.

I've identified a critical issue in the type unification logic that could lead to incorrect schema inference. I've provided a detailed explanation and a suggested replacement for the affected method. Additionally, I've included a minor suggestion to improve the readability of the new test code by reducing duplication.

Once the critical issue is addressed, this will be a solid contribution.

Comment on lines +935 to +978
def _unify_field_types(self, existing_type, field_type):
  """Unify two field types, handling Optional and List types."""
  # Extract inner types from Optional if needed
  existing_inner = (
      existing_type.__args__[0] if hasattr(existing_type, '__args__') and
      len(existing_type.__args__) == 1 else existing_type)
  field_inner = (
      field_type.__args__[0] if hasattr(field_type, '__args__') and
      len(field_type.__args__) == 1 else field_type)

  # Handle type unification more carefully
  if existing_inner == Any or field_inner == Any:
    return Optional[Any]
  elif existing_inner == field_inner:
    return Optional[existing_inner]
  else:
    # Check for list types and prioritize them over other types
    from apache_beam.typehints import typehints as th
    existing_is_list = (
        hasattr(existing_inner, '__origin__') and
        existing_inner.__origin__ in (list, th.List))
    field_is_list = (
        hasattr(field_inner, '__origin__') and
        field_inner.__origin__ in (list, th.List))

    if existing_is_list and field_is_list:
      # Both are list types, unify their element types
      existing_elem = existing_inner.__args__[
          0] if existing_inner.__args__ else Any
      field_elem = field_inner.__args__[0] if field_inner.__args__ else Any
      if existing_elem == field_elem:
        return Optional[th.List[existing_elem]]
      else:
        return Optional[th.List[Any]]
    elif existing_is_list:
      # Existing is list, keep it as list type
      return Optional[existing_inner]
    elif field_is_list:
      # New field is list, use list type
      return Optional[field_inner]
    else:
      # Neither is a list, use Any to avoid unsupported Union
      # types in schema translation
      return Optional[Any]
Contributor

critical

The current implementation of _unify_field_types has several issues that can lead to incorrect schema unification and potential runtime errors:

  1. The logic to extract inner types from Optional is incorrect. It uses len(type.__args__) == 1, which is false for typing.Optional[T] (which is an alias for Union[T, NoneType], having 2 type arguments).
  2. The same logic incorrectly treats list-like types such as th.List[T] as their element type T, because th.List[T].__args__ has a length of 1.
  3. The *_is_list checks are unreliable because they operate on these incorrectly "unwrapped" types.

This can lead to incorrect schema inference, for example, treating a list field as a primitive, or failing to correctly unify list types.

I suggest replacing the method with a more robust implementation using typing.get_origin and typing.get_args for type introspection. This will correctly handle Optional types and list types.

from typing import get_origin, get_args

    def _unify_field_types(self, existing_type, field_type):
      """Unify two field types, handling Optional and List types."""

      existing_origin = get_origin(existing_type) or existing_type
      field_origin = get_origin(field_type) or field_type

      existing_inner = get_args(existing_type)[0] if existing_origin is Optional else existing_type
      field_inner = get_args(field_type)[0] if field_origin is Optional else field_type

      if existing_inner == Any or field_inner == Any:
        return Optional[Any]
      elif existing_inner == field_inner:
        return Optional[existing_inner]
      else:
        existing_is_list = existing_origin in (list, th.List)
        field_is_list = field_origin in (list, th.List)

        if existing_is_list and field_is_list:
          existing_elem = get_args(existing_inner)[0] if get_args(existing_inner) else Any
          field_elem = get_args(field_inner)[0] if get_args(field_inner) else Any
          if existing_elem == field_elem:
            return Optional[th.List[existing_elem]]
          else:
            return Optional[th.List[Any]]
        elif existing_is_list:
          return Optional[existing_inner]
        elif field_is_list:
          return Optional[field_inner]
        else:
          return Optional[Any]

Comment on lines +681 to +709
def check_result(actual):
  expected_ids = {1, 2, 3, 4, 5}
  actual_ids = {
      getattr(row, 'id', row.get('id') if hasattr(row, 'get') else None)
      for row in actual
  }
  assert actual_ids == expected_ids, (
      f"Expected IDs {expected_ids}, got {actual_ids}")

  # Check that all rows have required fields
  for row in actual:
    row_id = getattr(
        row, 'id', row.get('id') if hasattr(row, 'get') else None)
    name = getattr(
        row, 'name', row.get('name') if hasattr(row, 'get') else None)
    assert row_id is not None, f"Missing id field in row {row}"
    assert name is not None, f"Missing name field in row {row}"
    # Optional fields should be present but may be None/empty
    price = getattr(
        row, 'price', row.get('price') if hasattr(row, 'get') else None)
    categories = getattr(
        row,
        'categories',
        row.get('categories') if hasattr(row, 'get') else None)
    assert price is not None or row_id == 3, \
        f"Missing price field in row {row}"
    assert categories is not None or row_id == 4, \
        f"Missing categories field in row {row}"

Contributor

medium

The check_result function contains repetitive logic to access fields from row objects, which can be either beam.Row or dict. This makes the code harder to read and maintain.

To improve this, you can introduce a helper function within check_result to abstract away the field access logic. This will reduce duplication and make the test assertions cleaner and more readable.

Suggested change
Before:

def check_result(actual):
  expected_ids = {1, 2, 3, 4, 5}
  actual_ids = {
      getattr(row, 'id', row.get('id') if hasattr(row, 'get') else None)
      for row in actual
  }
  assert actual_ids == expected_ids, (
      f"Expected IDs {expected_ids}, got {actual_ids}")
  # Check that all rows have required fields
  for row in actual:
    row_id = getattr(
        row, 'id', row.get('id') if hasattr(row, 'get') else None)
    name = getattr(
        row, 'name', row.get('name') if hasattr(row, 'get') else None)
    assert row_id is not None, f"Missing id field in row {row}"
    assert name is not None, f"Missing name field in row {row}"
    # Optional fields should be present but may be None/empty
    price = getattr(
        row, 'price', row.get('price') if hasattr(row, 'get') else None)
    categories = getattr(
        row,
        'categories',
        row.get('categories') if hasattr(row, 'get') else None)
    assert price is not None or row_id == 3, \
        f"Missing price field in row {row}"
    assert categories is not None or row_id == 4, \
        f"Missing categories field in row {row}"

After:

def check_result(actual):
  def get_field(row, field):
    return getattr(
        row, field, row.get(field) if hasattr(row, 'get') else None)

  expected_ids = {1, 2, 3, 4, 5}
  actual_ids = {get_field(row, 'id') for row in actual}
  assert actual_ids == expected_ids, (
      f"Expected IDs {expected_ids}, got {actual_ids}")
  for row in actual:
    row_id = get_field(row, 'id')
    name = get_field(row, 'name')
    assert row_id is not None, f"Missing id field in row {row}"
    assert name is not None, f"Missing name field in row {row}"
    price = get_field(row, 'price')
    categories = get_field(row, 'categories')
    assert price is not None or row_id == 3, \
        f"Missing price field in row {row}"
    assert categories is not None or row_id == 4, \
        f"Missing categories field in row {row}"

@liferoad
Contributor Author

Switched to #35728 to simplify the logic.

@liferoad closed this Jul 29, 2025

Successfully merging this pull request may close these issues.

[Bug]: YAML Flatten incorrectly drops fields when input PCollections' schema are different