feat(spark): implement Spark try_parse_url function #17485
Conversation
Force-pushed from fc60bd0 to 8869878
Looks like some CI failures to address
```diff
-signature: Signature::one_of(
-    vec![
-        TypeSignature::Uniform(
-            1,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-        TypeSignature::Uniform(
-            2,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-        TypeSignature::Uniform(
-            3,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-    ],
-    Volatility::Immutable,
-),
+signature: Signature::user_defined(Volatility::Immutable),
```
Does this change interfere with dictionary types now? I think this works on main but doesn't work on this PR:
```
query T
SELECT parse_url(arrow_cast('http://spark.apache.org/path?query=1', 'Dictionary(Int32, Utf8)'), 'HOST'::string);
----
spark.apache.org
```
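For context on the query above: a `Dictionary(Int32, Utf8)` array stores each distinct string once in a values dictionary plus integer indices, so logically it holds the same strings as a plain `Utf8` array. A minimal pure-Python sketch of that encoding (illustrative only, not Arrow's actual layout or API):

```python
def dictionary_encode(values):
    # Store each distinct string once; represent the array as integer indices.
    dictionary, indices = [], []
    for v in values:
        if v not in dictionary:
            dictionary.append(v)
        indices.append(dictionary.index(v))
    return dictionary, indices

def decode(dictionary, indices):
    # Decoding recovers the original logical string array.
    return [dictionary[i] for i in indices]

urls = ["http://spark.apache.org", "http://spark.apache.org", "http://example.com"]
d, idx = dictionary_encode(urls)
print(d)    # two distinct strings
print(idx)  # one index per original element
assert decode(d, idx) == urls
```

This is why a dictionary-typed column is still a reasonable input to a string function: the logical values are strings even though the physical representation differs.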
Sorry, I don’t quite follow… Why should dictionary types be used as arguments for `parse_url` or `try_parse_url`? As far as I understand, the spec only expects string types (parse_url doc ref).
Dictionary type is another way to represent strings, similar to how the view type is a different way to represent strings.
You can see here in the docs that there's a reference to handling dictionary types:
datafusion/datafusion/expr-common/src/signature.rs (lines 229 to 239 in 351675d)

```rust
/// One or more arguments of all the same string types.
///
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
///
/// For example, if a function is called with (utf8, large_utf8), all
/// arguments will be coerced to `LargeUtf8`
///
/// For functions that take no arguments (e.g. `random()` use [`TypeSignature::Nullary`]).
String(usize),
```
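The precedence rule quoted above can be sketched as follows (Python for illustration; the names and string-based type spellings are not DataFusion's actual API — the real logic lives in DataFusion's type-coercion code):

```python
# Documented precedence, high to low: Utf8View > LargeUtf8 > Utf8.
PRECEDENCE = {"Utf8View": 3, "LargeUtf8": 2, "Utf8": 1}

def unwrap(ty: str) -> str:
    # A Dictionary with a string value type is treated as its value type,
    # e.g. "Dictionary(Int32, Utf8)" -> "Utf8". Null defaults to Utf8.
    if ty.startswith("Dictionary(") and ty.endswith(")"):
        return ty[len("Dictionary("):-1].split(",")[1].strip()
    return "Utf8" if ty == "Null" else ty

def coerce_string_args(types: list[str]) -> str:
    # All arguments are coerced to the highest-precedence string type present.
    return max((unwrap(t) for t in types), key=PRECEDENCE.__getitem__)

print(coerce_string_args(["Utf8", "LargeUtf8"]))                    # LargeUtf8
print(coerce_string_args(["Dictionary(Int32, Utf8)", "Utf8View"]))  # Utf8View
```

Note how the dictionary input participates in coercion via its value type rather than being rejected outright.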
I see, let me check it out then 😃
After rereading this several times, my understanding is that when you pass a Dictionary with string values, DataFusion attempts to match it against the `String` signature. However, `parse_url` is defined to accept only plain string arguments (ref). It does not expect any dictionary inputs.
We mark the UDF’s signature as `user_defined` to enable coercion across string types (`Utf8`, `Utf8View`, `LargeUtf8`), but a dictionary array is still not a string type, so it isn’t coerced, and the call won’t match.
In short, even if the `String` signature seems to "capture" Dictionary with string values, `parse_url` will still reject them because the underlying physical type is a dictionary, not a string.
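A coercion of the kind described here can be sketched like this (a hypothetical Python sketch, not DataFusion's actual `coerce_types` implementation): it coerces across plain string types only, so a dictionary input falls through and the call fails to match.

```python
STRING_TYPES = {"Utf8", "Utf8View", "LargeUtf8"}

def coerce_types(arg_types):
    # Hypothetical sketch: a dictionary input is not a plain string type,
    # so it is rejected instead of being unwrapped to its value type.
    if not all(t in STRING_TYPES for t in arg_types):
        raise TypeError(f"expected string arguments, got {arg_types}")
    # Coerce everything to the highest-precedence string type present.
    for preferred in ("Utf8View", "LargeUtf8", "Utf8"):
        if preferred in arg_types:
            return [preferred] * len(arg_types)

print(coerce_types(["Utf8", "LargeUtf8"]))  # ['LargeUtf8', 'LargeUtf8']

try:
    coerce_types(["Dictionary(Int32, Utf8)", "Utf8"])
except TypeError as e:
    print("rejected:", e)
```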
That's correct, but my point was that previously (on main), `parse_url` seemed to be capable of accepting string dictionary types, and with these changes that is no longer possible; is this something to be concerned about?
Is it possible to simplify the type signature to just be:

```rust
signature: Signature::one_of(
    vec![
        TypeSignature::String(1),
        TypeSignature::String(2),
        TypeSignature::String(3),
    ],
    Volatility::Immutable,
),
```

So they get cast to the same string type? This would avoid the need for the large match statement in `spark_handled_parse_url` as well.
See related comment on other PR: #17195 (comment)
Force-pushed from f0090f8 to bf39835
Looks like some fixes for PATH and AUTHORITY are now included in this PR, so would be good to call this out in the PR body.
```rust
    Arc::new(LargeStringArray::from(vals.to_vec())) as ArrayRef
}

#[test]
```
I feel quite a few of these tests could be done as slt tests; @alamb do we have a preference on where tests should be done? Should we prefer slt over rust tests, and fallback only to rust if it is something that slt can't handle?
Took a look at https://datafusion.apache.org/contributor-guide/testing.html but it doesn't mention a specific preference, other than slt tests being easier to maintain.
Which issue does this PR close?
Part of #15914.

Rationale for this change
Migrate Spark functions from https://github.com/lakehq/sail/ to the DataFusion engine to unify the codebase.

What changes are included in this PR?
- Implement the `try_parse_url` function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.try_parse_url.html
- Fix the `parse_url` function:
  - `PATH` part: if there is no path part, it returns an empty string (not "/").
  - `AUTHORITY` part: the authority section must include the port if the URL specifies one. However, at the moment, if the authority contains a well-known port (e.g., 80, 23), it is omitted and only shown when the port is a non-standard ("custom") one.

Are these changes tested?
Unit tests and sqllogictests added.

Are there any user-facing changes?
The functions can now be called in queries.
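As a reference point for the two `parse_url` fixes described above, Python's `urllib.parse.urlsplit` exhibits the same behavior (this only illustrates the two behaviors named in the PR body; it is not the PR's implementation, and no claim is made that Spark and `urlsplit` agree in every corner case):

```python
from urllib.parse import urlsplit

# PATH: a URL with no path component yields an empty string, not "/".
u = urlsplit("http://spark.apache.org")
print(repr(u.path))  # ''

# AUTHORITY: an explicit port is kept even when it is the scheme's default (80).
u = urlsplit("http://spark.apache.org:80/path")
print(u.netloc)      # 'spark.apache.org:80'
```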