Skip to content

Conversation

rafafrdz
Copy link

@rafafrdz rafafrdz commented Sep 9, 2025

Which issue does this PR close?

part of #15914

Rationale for this change

Migrate Spark functions from https://github.com/lakehq/sail/ to DataFusion engine to unify codebase

What changes are included in this PR?

Fix

  • PATH part: if there is no path pat it returns an empty string (not "/")
  • AUTHORITY part: the authority section must include the port if the URL specifies one. However, at the moment, if the authority contains a well-known port (e.g., 80, 23, etc.), it is omitted and only shown when the port is a non-standard (“custom”) one.

Are these changes tested?

unit-tests and sqllogictests added

Are there any user-facing changes?

Now can be called in queries

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark documentation Improvements or additions to documentation and removed documentation Improvements or additions to documentation labels Sep 9, 2025
@rafafrdz rafafrdz marked this pull request as draft September 9, 2025 12:39
@rafafrdz rafafrdz marked this pull request as ready for review September 9, 2025 14:40
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some CI failures to address

Comment on lines 50 to 49
signature: Signature::one_of(
vec![
TypeSignature::Uniform(
1,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
TypeSignature::Uniform(
2,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
TypeSignature::Uniform(
3,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
],
Volatility::Immutable,
),
signature: Signature::user_defined(Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change interfere with dictionary types now? I think this works on main but doesn't work on this PR:

query T
SELECT parse_url(arrow_cast('http://spark.apache.org/path?query=1', 'Dictionary(Int32, Utf8)'), 'HOST'::string);
----
spark.apache.org

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don’t quite follow… Why should dictionary types be used as arguments for parse_url or try_parse_url? As far as I understand, the spec only expects string types parse_url doc ref

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary type is another way to represent strings, similar to how view type is a different way to represent strings.

Can see here in the doc there's a reference to handling dictionary types:

/// One or arguments of all the same string types.
///
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
///
/// For example, if a function is called with (utf8, large_utf8), all
/// arguments will be coerced to `LargeUtf8`
///
/// For functions that take no arguments (e.g. `random()` use [`TypeSignature::Nullary`]).
String(usize),

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let me check it out then 😃

Copy link
Author

@rafafrdz rafafrdz Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rereading this several times, my understanding is that when you pass a Dictionary with string values, DataFusion attempts to match it against the String signature. However, parse_url is defined to accept only plain string arguments ref. It does not expect any dictionary inputs.

We mark the UDF’s signature as user_defined to enable coercion across string types (Utf8, Utf8View, LargeUtf8), but a dictionary array is still not a string type, so it isn’t coerced, and the call won’t match.

In short, even if the String signature seems to "capture" Dictionary with string values, parse_url will still reject them because the underlying physical type is a dictionary, not a string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, but my point was previously (on main), parse_url seems to be capable of accepting string dictionary types, but with these changes that is no longer possible; is this something to be concerned about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to simplify the type signature to just be:

signature: Signature::one_of(
    vec![
        TypeSignature::String(1),
        TypeSignature::String(2),
        TypeSignature::String(3),
    ],
    Volatility::Immutable,
),

So they get cast to the same string type? This would avoid the need for the large match statement in spark_handled_parse_url as well.

See related comment on other PR: #17195 (comment)

@rafafrdz rafafrdz marked this pull request as draft September 11, 2025 11:04
@rafafrdz rafafrdz marked this pull request as ready for review September 11, 2025 13:59
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some fixes for PATH and AUTHORITY are now included in this PR, so would be good to call this out in the PR body.

Comment on lines 50 to 49
signature: Signature::one_of(
vec![
TypeSignature::Uniform(
1,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
TypeSignature::Uniform(
2,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
TypeSignature::Uniform(
3,
vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
),
],
Volatility::Immutable,
),
signature: Signature::user_defined(Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, but my point was previously (on main), parse_url seems to be capable of accepting string dictionary types, but with these changes that is no longer possible; is this something to be concerned about?

Arc::new(LargeStringArray::from(vals.to_vec())) as ArrayRef
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel quite a few of these tests could be done as slt tests; @alamb do we have a preference on where tests should be done? Should we prefer slt over rust tests, and fallback only to rust if it is something that slt can't handle?

Took a look at https://datafusion.apache.org/contributor-guide/testing.html but it doesn't mention if we have a specific preference, other than slt's being easier to maintain.

@rafafrdz rafafrdz requested a review from Jefffrey September 12, 2025 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spark sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants