-
Notifications
You must be signed in to change notification settings - Fork 631
Fix placeholder spans #1979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix placeholder spans #1979
Conversation
src/dialect/mod.rs
Outdated
/// Returns true if this dialect allow colon placeholders | ||
/// e.g. `SELECT :var` (JPA named parameters) | ||
fn supports_colon_placeholder(&self) -> bool { | ||
false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, sqlparser's parser combines a colon token followed by a word token into a placeholder, but does not adjust the span of the resulting ast node :/ and allows for whitespace between the colon and the word.
@xitep could you elaborate on this part? I'm not sure I followed the intent, is my understanding correct that currently the parser indeed handles e.g. SELECT :var
? If so what's the requirement for the tokenizer updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hello @iffyio, yes the parser can deal with things like :var
already, but i'm not sure this is a feature or rather just a side effect of the code pointed out above. here's an example:
let q = sqlparser::parser::Parser::new(&sqlparser::dialect::GenericDialect)
.try_with_sql("select :var")
.unwrap()
.parse_query()
.unwrap();
let _ = sqlparser::ast::visit_expressions(&q, |expr: &Expr| {
if let Expr::Value(v) = expr
&& let Value::Placeholder(s) = &v.value
{
println!("value: {v:?}");
println!("placeholder: {s:?})");
}
ControlFlow::<()>::Continue(())
});
// OUTPUT:
// value: ValueWithSpan { value: Placeholder(":var"), span: Span(Location(1,8)..Location(1,9)) }
// placeholder: ":var")
note:
- the span is incorrect, it covers only the colon
- the placeholder represents
:<word>
(which in fact is what i'm after)
however, you'll get the same output when using "select : var"
or even "select : /*foobar*/ var"
as input :-/ this is not what i want.
so the intention is to make :<word>
a first level token, a placeholder.
does this help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that make sense, it seems that the handling of placeholders isn't very consistent by the parser as some are done in tokenizer while others later on. I think it could be good to do a more general hopefully improvement if we feel that ideally placeholders are managed by the tokenizer, the we can parse and represent it consistently/accordingly.
Did a quick pass through existing support, and I can imagine something like the following representation to merge the different representations explicitly, wdyt?
enum PlaceholderKind {
/// Example:
/// ```sql
/// SELECT ?, ?foo
/// ```
SingleQuestionMark,
/// Example:
/// ```sql
/// SELECT @foo
/// ```
SingleAt,
/// Example:
/// ```sql
/// SELECT $1, $foo
/// ```
SingleDollar,
/// Example:
/// ```sql
/// SELECT :42, :foo
/// ```
Colon,
/// Example:
/// ```sql
/// SELECT $$, $$abc$$, $abc$, $$$abc$$$
/// ```
DollarEnclosed {
count: usize
}
}
Token::Placeholder {
kind: PlaceholderKind,
value: Option<String>
}
It would also remove the need for this code as well
datafusion-sqlparser-rs/src/parser/mod.rs
Lines 9637 to 9646 in 15d8bfe
// Not calling self.parse_identifier(false)? because only in placeholder we want to check numbers as idfentifies | |
// This because snowflake allows numbers as placeholders | |
let next_token = self.next_token(); | |
let ident = match next_token.token { | |
Token::Word(w) => Ok(w.into_ident(next_token.span)), | |
Token::Number(w, false) => Ok(Ident::new(w)), | |
_ => self.expected("placeholder", next_token), | |
}?; | |
let placeholder = tok.to_string() + &ident.value; | |
ok_value(Value::Placeholder(placeholder)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i can try it out and upgrade this PR accordingly. here's one thought to be clarified before i set off with it:
Token::Placeholder {
kind: PlaceholderKind,
value: Option<String>
}
this implies to me that "value" would be stripped off of the placeholder demarcation character(s); e.g. for the input ":foo"
-> value
would read Some("foo")
, for the input "$$foo$$"
it would read Some("foo")
, for the input "$"
value
would be None
, correct?
But then, what would sqlparser::ast::Value::Placeholder(..)
look like? Clients surely might want to know exactly which placeholder style/kind was used; since sqlparser
supports various formats, some clients might opt in to reject what they don't want their queries to contain.
all in all, it looks to me that in your proposal we want to end up with value
being the original, full token as identified in the scanned input (which, naturally, is never None
or empty), therefore:
Token::Placeholder {
kind: PlaceholderKind,
value: String, // ~ preferrably &'a str
}
any feedback on this thought?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's correct that the input would be stripped into two, the kind and the value as in your examples.
Value::Placeholder
I imagine would have the same representation mirroring Token::Placeholder
since the goal becomes that placeholder parsing is handled by the tokenizer i.e.Value::Placeholder(Token::Placeholder)
which, naturally, is never None or empty
I think we might not always have a variable name, hence the Option
- sqlite for example uses anonymous ?
in some contexts for placeholders e.g. WHERE city IN (?, ?, ?, ?)
. In that scenario the representation becomes Token::Placeholder{kind: PlaceholderKind::SingleQuestionMark, value: None}
value: String, // ~ preferrably &'a str
ah heads up I think a reference &str
wouldn't be feasible here without introducing lifetimes to the Token/Value enums which would be quite an invasive change and likely not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt ...
DollarEnclosed {
count: usize
}
i'm seeing "$$foo$$" being treated as a "double quoted string", not as a placeholder. if dialect.supports_dollar_placeholder() == true)
then $...
is considered a placeholder, but the dollars at the start and end don't have to be balanced. this make the DollarEnclosed
variant essentially void.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright, after a lot of struggle i came to the conclusion that the approach with first-class placeholder is not feasible in sqlparser
. After having the tokenizer and the model change ready, this test (now failing) convinced me, that my concern cannot be resolved at the level of the tokenizer since sqlparser
needs to be very flexible and determine the meaning only on the level of the parser. you can have a look here at how the change we discussed would have looked like. (this would be the consequential change to the parser.)
fortunately, i have found a far smaller change that satisfies my needs and will update this PR.
src/tokenizer.rs
Outdated
/// Tokenizes an identifier followed immediately after a colon, | ||
/// aka named query parameter, e.g. `:name`. The next char of the | ||
/// processed char stream is to be an alphabetic - panics otherwise. | ||
fn tokenize_colon_preceeded_placeholder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would likely be ideal to use existing logic like tokenize_identifier_or_keyword
or similar. Also heads up in general for the parser that returning an error when input is invalid etc is preferred over panicking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- wrt
tokenize_identifier_or_keyword
... i guess it would make it quite hard to support the$
then as part of the identifier (if enabled byDialect::supports_dollar_placeholder
https://docs.rs/sqlparser/latest/sqlparser/dialect/trait.Dialect.html#method.supports_dollar_placeholder.) i see the placeholder identifiers distinct (ie. with different restrictions) from other SQL identifiers. - thanks for the heads-up; the panic here would be a programming error, as that contract says that the caller must make sure ... but I'm fine with the established convention and would make it return an error instead.
src/dialect/mod.rs
Outdated
/// Returns true if this dialect allow colon placeholders | ||
/// e.g. `SELECT :var` (JPA named parameters) | ||
fn supports_colon_placeholder(&self) -> bool { | ||
false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that make sense, it seems that the handling of placeholders isn't very consistent by the parser as some are done in tokenizer while others later on. I think it could be good to do a more general hopefully improvement if we feel that ideally placeholders are managed by the tokenizer, the we can parse and represent it consistently/accordingly.
Did a quick pass through existing support, and I can imagine something like the following representation to merge the different representations explicitly, wdyt?
enum PlaceholderKind {
/// Example:
/// ```sql
/// SELECT ?, ?foo
/// ```
SingleQuestionMark,
/// Example:
/// ```sql
/// SELECT @foo
/// ```
SingleAt,
/// Example:
/// ```sql
/// SELECT $1, $foo
/// ```
SingleDollar,
/// Example:
/// ```sql
/// SELECT :42, :foo
/// ```
Colon,
/// Example:
/// ```sql
/// SELECT $$, $$abc$$, $abc$, $$$abc$$$
/// ```
DollarEnclosed {
count: usize
}
}
Token::Placeholder {
kind: PlaceholderKind,
value: Option<String>
}
It would also remove the need for this code as well
datafusion-sqlparser-rs/src/parser/mod.rs
Lines 9637 to 9646 in 15d8bfe
// Not calling self.parse_identifier(false)? because only in placeholder we want to check numbers as idfentifies | |
// This because snowflake allows numbers as placeholders | |
let next_token = self.next_token(); | |
let ident = match next_token.token { | |
Token::Word(w) => Ok(w.into_ident(next_token.span)), | |
Token::Number(w, false) => Ok(Ident::new(w)), | |
_ => self.expected("placeholder", next_token), | |
}?; | |
let placeholder = tok.to_string() + &ident.value; | |
ok_value(Value::Placeholder(placeholder)) |
This reverts commit c54bef6.
1. Require placeholders to be whitespace/comment-less 2. Correctly report span covering the whole placeholder
@iffyio would you like to have another look? i changed strategy to the whole problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xitep! The changes look reasonable to me overall, left some questions/comments
@@ -9586,14 +9586,14 @@ impl<'a> Parser<'a> { | |||
tok @ Token::Colon | tok @ Token::AtSign => { | |||
// Not calling self.parse_identifier(false)? because only in placeholder we want to check numbers as idfentifies | |||
// This because snowflake allows numbers as placeholders | |||
let next_token = self.next_token(); | |||
let next_token = self.next_token_no_skip().unwrap_or(&EOF_TOKEN).clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the intent correctly is this saying that e.g. SELECT :foo
is to be parsed as a placeholder but SELECT : foo
should not? If so I think it would be good to mention that in a comment. Also can we add a test case to demonstrate this change?
_ => self.expected("placeholder", next_token), | ||
}?; | ||
let placeholder = tok.to_string() + &ident.value; | ||
ok_value(Value::Placeholder(placeholder)) | ||
Ok(Value::Placeholder(tok.to_string() + &ident.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test case for this to avoid future regressions? we have some span test cases here that we could add to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many thanks for your support @iffyio 🙇 |
:name
as a parameters in queries; initially requested through Parameterized queries #291