Skip to content

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Sep 8, 2025

Which issue does this PR close?

This is part of a series smaller PRs to replace #17281


Rationale for this change

The existing struct-casting logic was implemented in a narrow context (NestedSchemaAdapter) and relied on arrow::compute::cast for non-struct casts. That limited reuse and produced surprising failures when the planner attempted to reconcile mixed or nullable struct fields. This change generalizes cast_column to accept arrow::compute::CastOptions, adds a robust validate_struct_compatibility function (which returns Result<()> on success), preserves parent nulls, and integrates the generalized cast function into SchemaAdapter so casting behavior is consistent across the codebase.

By centralizing struct-casting logic we improve maintainability, make behavior explicit (via CastOptions), and fix multiple user-facing failures where struct fields differed by name, order, or nullability. This change enables reliable struct-to-struct casts during planning and execution and simplifies later extension to other adapters.


What changes are included in this PR?

  • API / implementation

    • Generalized pub fn cast_column(...) to accept a &CastOptions argument and dispatch to arrow::compute::cast_with_options for non-struct targets.

    • Implemented cast_struct_column(...) that:

      • Validates struct compatibility (validate_struct_compatibility) before attempting to cast children.
      • Uses field name lookup rather than positional matching, preserving order or allowing reordering as required by target schema.
      • Fills missing target fields with new_null_array(...) of the target data type.
      • Preserves parent-level null bitmap when building the resulting StructArray.
      • Propagates contextual error messages when a child field cast fails (adds field name to context).
    • Added validate_struct_compatibility(...) returning Result<()> (errors on incompatible types or unsafe nullability changes).

    • Introduced and used DEFAULT_CAST_OPTIONS in tests and in datasource::schema_adapter to provide a consistent default behavior when casting in mapping pipelines.

  • Integration

    • Updated SchemaAdapter/SchemaMapping signatures to accept a CastColumnFn that takes &CastOptions and wired the default mapping to call cast_column(..., &DEFAULT_CAST_OPTIONS).
    • Updated callers in schema_adapter.rs to validate struct compatibility and return Ok(true)/Ok(false) consistently where expected.
  • Tests

    • Added/expanded unit tests for:

      • simple primitive casts with options (test_cast_simple_column, test_cast_column_with_options)
      • struct casts with missing fields and null preservation (test_cast_struct_with_missing_field, test_cast_struct_parent_nulls_retained)
      • incompatible child types (test_cast_struct_incompatible_child_type)
      • nullability compatibility checks (test_validate_struct_compatibility_nullable_to_non_nullable, test_validate_struct_compatibility_non_nullable_to_nullable, test_validate_struct_compatibility_nested_nullable_to_non_nullable)
      • nested structs with reordering, extra and missing fields (test_cast_nested_struct_with_extra_and_missing_fields)
      • arrays and map fields inside structs (test_cast_struct_with_array_and_map_fields)
      • field ordering differences (test_cast_struct_field_order_differs)
  • Documentation

    • Added doc comments explaining CastOptions usage and included a short example showing safe: true making overflow produce NULL rather than an error.

Are these changes tested?

Yes — the patch adds multiple unit tests in datafusion/common/src/nested_struct.rs covering primitive casts with options, struct-to-struct casts (including missing/extra fields, null preservation), nullability validation, nested structs, array/map child fields, and ordering differences. These tests exercise the new cast_column entrypoint and schema adapter integration.

If CI surfaces any flakiness, follow-ups should add targeted property tests or fuzzing for deeply nested or strange datatypes.


Are there any user-facing changes?

  • Better support for struct-to-struct casting at query-plan and execution time. Queries that previously failed due to mismatched struct fields or nullability (for example select {'a': 1} = {'a': 2, 'b': NULL}) should behave more consistently or at least produce clearer errors indicating which child field failed to cast.

  • API changes are limited to internal behavior (the CastColumnFn in SchemaAdapter and cast signatures were extended). Public crate API surfaces remain stable for typical end-users but developers of custom adapters should update conforming closures to accept &CastOptions.


API / Compatibility Notes

  • This PR changes the internal signature of the cast function used by SchemaAdapter (it now accepts &CastOptions). Consumers that construct SchemaMapping by hand in downstream code (or tests) will need to adapt to the new closure shape.

  • No breaking changes to public user-facing Rust API are intended beyond the internal adapter closure signature; nevertheless, reviewers should confirm whether SchemaAdapter is considered public API in any downstream projects and coordinate a deprecation or migration path if needed.


Why Result<bool, _> was amended to Result<(), _>

Originally validate_struct_compatibility(...) returned Result<bool, _> where Ok(true) meant "compatible" and Ok(false) meant "incompatible but non-error". In practice the function should either succeed (meaning compatibility checks passed) or fail with a detailed error explaining why the structs cannot be cast. Returning a bool forced callers to interpret a false value and decide whether to convert it into an error. By changing the return type to Result<()> we make the contract clearer and more idiomatic:

  • Ok(()) indicates compatibility.
  • Err(...) indicates a concrete, actionable reason for incompatibility (e.g., incompatible child types or unsafe nullability change).

This simplifies callers (they can ? the validation) and yields richer error messages for users and better control flow in planning code that needs to bail on incompatible casts. It avoids a two-step "check then error" pattern and makes the function safer to use in chained logic.

@github-actions github-actions bot added common Related to common crate datasource Changes to the datasource crate labels Sep 8, 2025
@kosiew kosiew marked this pull request as ready for review September 8, 2025 06:33
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

Comment on lines +282 to +284
fn arc_field(name: &str, data_type: DataType) -> FieldRef {
Arc::new(field(name, data_type))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find these very helpful. The non_null_field one maybe because it tells me in less verbose way what is being created but I'm okay just reading Arc::new(...)

@adriangb
Copy link
Contributor

adriangb commented Sep 8, 2025

This PR changes the internal signature of the cast function used by SchemaAdapter (it now accepts &CastOptions). Consumers that construct SchemaMapping by hand in downstream code (or tests) will need to adapt to the new closure shape.

And to be clear, this is breaking the recently introduced cast func API right? So impacted users will only be those on recent versions?

@adriangb
Copy link
Contributor

adriangb commented Sep 8, 2025

Btw I approved but let's leave this up for another day or so to see if anyone else has feedback

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look through this and I agree it looks like an improvement to me. Thank you @kosiew and @adriangb

@andygrove
Copy link
Member

@comphead @parthchandra @mbutrovich we should also review this to see how it impacts usage in Comet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants