Generalize struct-to-struct casting with CastOptions and SchemaAdapter integration #17468
This looks good to me!
fn arc_field(name: &str, data_type: DataType) -> FieldRef {
    Arc::new(field(name, data_type))
}
I don't find these very helpful. The `non_null_field` one maybe, because it tells me in a less verbose way what is being created, but I'm okay just reading `Arc::new(...)`.
And to be clear, this is breaking the recently introduced cast func API, right? So impacted users will only be those on recent versions?
Btw I approved, but let's leave this up for another day or so to see if anyone else has feedback.
@comphead @parthchandra @mbutrovich we should also review this to see how it impacts usage in Comet
Which issue does this PR close?
`cast_column` to enable struct → struct casting in more contexts #16579
This is part of a series of smaller PRs to replace #17281.
Rationale for this change
The existing struct-casting logic was implemented in a narrow context (`NestedSchemaAdapter`) and relied on `arrow::compute::cast` for non-struct casts. That limited reuse and produced surprising failures when the planner attempted to reconcile mixed or nullable struct fields. This change generalizes `cast_column` to accept `arrow::compute::CastOptions`, adds a robust `validate_struct_compatibility` function (which returns `Ok(())` on success), preserves parent nulls, and integrates the generalized cast function into `SchemaAdapter` so casting behavior is consistent across the codebase.
By centralizing struct-casting logic we improve maintainability, make behavior explicit (via `CastOptions`), and fix multiple user-facing failures where struct fields differed by name, order, or nullability. This change enables reliable struct-to-struct casts during planning and execution and simplifies later extension to other adapters.
What changes are included in this PR?
API / implementation
- Generalized `pub fn cast_column(...)` to accept a `&CastOptions` argument and dispatch to `arrow::compute::cast_with_options` for non-struct targets.
- Implemented `cast_struct_column(...)` that:
  - validates compatibility (via `validate_struct_compatibility`) before attempting to cast children;
  - fills missing fields with `new_null_array(...)` of the target data type;
  - builds the resulting `StructArray`.
- Added `validate_struct_compatibility(...)` returning `Result<()>` (errors on incompatible types or unsafe nullability changes).
- Introduced and used `DEFAULT_CAST_OPTIONS` in tests and in `datasource::schema_adapter` to provide a consistent default behavior when casting in mapping pipelines.
Integration
- Updated `SchemaAdapter` / `SchemaMapping` signatures to accept a `CastColumnFn` that takes `&CastOptions` and wired the default mapping to call `cast_column(..., &DEFAULT_CAST_OPTIONS)`.
- Updated `schema_adapter.rs` to validate struct compatibility and return `Ok(true)` / `Ok(false)` consistently where expected.
Tests
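Added/expanded unit tests for:
- primitive casts with options (`test_cast_simple_column`, `test_cast_column_with_options`)
- struct-to-struct casts with missing fields and null preservation (`test_cast_struct_with_missing_field`, `test_cast_struct_parent_nulls_retained`)
- incompatible child types (`test_cast_struct_incompatible_child_type`)
- nullability validation (`test_validate_struct_compatibility_nullable_to_non_nullable`, `test_validate_struct_compatibility_non_nullable_to_nullable`, `test_validate_struct_compatibility_nested_nullable_to_non_nullable`)
- nested structs with extra and missing fields (`test_cast_nested_struct_with_extra_and_missing_fields`)
- array and map child fields (`test_cast_struct_with_array_and_map_fields`)
- differing field order (`test_cast_struct_field_order_differs`)
Documentation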
- Documented `CastOptions` usage and included a short example showing `safe: true` making overflow produce `NULL` rather than an error.
Are these changes tested?
Yes. The patch adds multiple unit tests in `datafusion/common/src/nested_struct.rs` covering primitive casts with options, struct-to-struct casts (including missing/extra fields, null preservation), nullability validation, nested structs, array/map child fields, and ordering differences. These tests exercise the new `cast_column` entrypoint and schema adapter integration.
If CI surfaces any flakiness, follow-ups should add targeted property tests or fuzzing for deeply nested or strange datatypes.
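The struct-to-struct behaviour these tests cover can be illustrated with a small, dependency-free model. This is only a sketch: `StructColumn` and `cast_struct_by_name` are hypothetical stand-ins, not the Arrow `StructArray` or the PR's `cast_struct_column`, and matching children by name is an assumption inferred from the test names above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a struct column (not Arrow's `StructArray`):
/// one vector of optional values per child, plus a parent validity mask.
#[derive(Debug, Clone, PartialEq)]
struct StructColumn {
    /// Child fields in declaration order: (name, values).
    children: Vec<(String, Vec<Option<i64>>)>,
    /// Parent-level validity; `false` marks a NULL struct row.
    validity: Vec<bool>,
}

/// Reshape `source` to the target field list, matching children by name
/// (an assumption of this sketch):
/// - target fields missing from the source become all-NULL children;
/// - source fields absent from the target are dropped;
/// - the parent validity mask is carried over unchanged.
fn cast_struct_by_name(source: &StructColumn, target_fields: &[&str]) -> StructColumn {
    let rows = source.validity.len();
    let by_name: HashMap<&str, &Vec<Option<i64>>> = source
        .children
        .iter()
        .map(|(name, values)| (name.as_str(), values))
        .collect();

    let children = target_fields
        .iter()
        .map(|name| {
            let values = match by_name.get(name) {
                // Field exists in the source: reuse its values.
                Some(values) => (*values).clone(),
                // Field is missing: fill with NULLs, one per row.
                None => vec![None; rows],
            };
            (name.to_string(), values)
        })
        .collect();

    StructColumn {
        children,
        validity: source.validity.clone(),
    }
}
```

Casting a column with children `a` and `extra` to the target fields `["b", "a"]` yields an all-NULL `b`, the original `a` in second position, and the same parent validity mask. The sketch omits per-child type casting, which the real function performs.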
Are there any user-facing changes?
Better support for struct-to-struct casting at query-plan and execution time. Queries that previously failed due to mismatched struct fields or nullability (for example `select {'a': 1} = {'a': 2, 'b': NULL}`) should behave more consistently or at least produce clearer errors indicating which child field failed to cast.
API changes are limited to internal behavior (the `CastColumnFn` in `SchemaAdapter` and cast signatures were extended). Public crate API surfaces remain stable for typical end-users, but developers of custom adapters should update conforming closures to accept `&CastOptions`.
API / Compatibility Notes
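As a rough illustration of the closure-shape change for custom adapters, the sketch below models a cast function that receives its options by reference. Everything here is hypothetical: `CastOptions` is a one-field stand-in for `arrow::compute::CastOptions`, and the `CastColumnFn` alias and `default_cast` helper are simplified for illustration and do not reproduce the DataFusion signatures. The only behaviour borrowed from Arrow is the meaning of `safe`: a failed cast yields `NULL` when `safe` is `true` and an error when it is `false`.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for `arrow::compute::CastOptions`; only `safe` is modelled.
#[derive(Debug, Clone, Copy)]
struct CastOptions {
    /// `true`: a failed cast yields NULL; `false`: it yields an error.
    safe: bool,
}

/// Hypothetical closure shape for an adapter cast function that takes the options.
type CastColumnFn = dyn Fn(&[Option<i64>], &CastOptions) -> Result<Vec<Option<i32>>, String>;

/// Narrow Int64 values to Int32, honouring `options.safe` on overflow.
fn cast_i64_to_i32(
    values: &[Option<i64>],
    options: &CastOptions,
) -> Result<Vec<Option<i32>>, String> {
    values
        .iter()
        .map(|value| match value {
            // NULL input stays NULL.
            None => Ok(None),
            Some(v) => match i32::try_from(*v) {
                Ok(narrowed) => Ok(Some(narrowed)),
                // Overflow under `safe: true` becomes NULL.
                Err(_) if options.safe => Ok(None),
                // Overflow under `safe: false` aborts the whole cast.
                Err(_) => Err(format!("value {} overflows Int32", v)),
            },
        })
        .collect()
}

/// A custom adapter would supply a closure of the new shape, for example:
fn default_cast() -> Box<CastColumnFn> {
    Box::new(|values: &[Option<i64>], options: &CastOptions| cast_i64_to_i32(values, options))
}
```

A closure written against the older shape (without the options parameter) would no longer type-check against such an alias, which is the kind of adaptation downstream code needs to make.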
This PR changes the internal signature of the cast function used by `SchemaAdapter` (it now accepts `&CastOptions`). Consumers that construct `SchemaMapping` by hand in downstream code (or tests) will need to adapt to the new closure shape.
No breaking changes to public user-facing Rust API are intended beyond the internal adapter closure signature; nevertheless, reviewers should confirm whether `SchemaAdapter` is considered public API in any downstream projects and coordinate a deprecation or migration path if needed.
Why `Result<bool, _>` was amended to `Result<(), _>`
Originally `validate_struct_compatibility(...)` returned `Result<bool, _>` where `Ok(true)` meant "compatible" and `Ok(false)` meant "incompatible but non-error". In practice the function should either succeed (meaning compatibility checks passed) or fail with a detailed error explaining why the structs cannot be cast. Returning a `bool` forced callers to interpret a `false` value and decide whether to convert it into an error. By changing the return type to `Result<()>` we make the contract clearer and more idiomatic:
- `Ok(())` indicates compatibility.
- `Err(...)` indicates a concrete, actionable reason for incompatibility (e.g., incompatible child types or unsafe nullability change).
This simplifies callers (they can `?` the validation) and yields richer error messages for users and better control flow in planning code that needs to bail on incompatible casts. It avoids a two-step "check then error" pattern and makes the function safer to use in chained logic.