-
Notifications
You must be signed in to change notification settings - Fork 567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propagate locations throughout the parser #790
Conversation
Thank you @ankrgyl -- I will try and find time to review this over the weekend |
I am sorry I am a bit behind on reviews / merging in sqlparser-rs. I hope to be able to catch up later this week |
No rush! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for kickstarting this project / push @ankrgyl -- it is really exciting to see. I have some comments on the approach -- let me know what you think
cc @AugustoFKL for your thoughts
@@ -46,12 +47,12 @@ pub enum AlterTableOperation { | |||
/// `DROP CONSTRAINT [ IF EXISTS ] <name>` | |||
DropConstraint { | |||
if_exists: bool, | |||
name: Ident, | |||
name: Located<Ident>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of wrapping this in Located<>
?
Another alternative would be to add a location field to the Ident
struct like
pub struct Ident {
/// The value of the identifier without quotes.
pub value: String,
/// The starting quote if any. Valid quote characters are the single quote,
/// double quote, backtick, and opening square bracket.
pub quote_style: Option<char>,
/// location in the source tree. <----- New Field
location: Location,
}
Both will be breaking changes, but I am thinking to a future where there is location information on all structs -- if all fields all wrapped in Location<..>
it is going to be a lot of replicated Location<>
in the code
If they each have a location
field then the location will only be added by the parser (though the tests will need updating too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question. For some background context, I've been working on a SQL compiler that uses the sqlparser-rs
library as our parser. We added these locations to improve the error messages in the compiler.
We actually started by embedding the location
in Ident
, but there are places where we actually want to store the Ident
itself (e.g. a map from Ident
to some binder state about it), and in those places we don't care about the Location
. There are also places where we want to do things like convert a string
into an Ident
, and if location
is a field on the Ident
, we'll have to make it empty/unknown. This isn't an issue in-and-of-itself, it just means that we won't get the typesystem's help to enforce whether or not an Ident
has a location.
A second benefit to wrapping in Located
is that we can write a bunch of functions that take Located<T>
(e.g. error messages) and use the location (yes, this could be accomplished with a Located
trait too). It's essentially like a smart pointer.
All of that said, I'm happy to consider another approach — I only have perspective from the one project I've been working on and want to ensure that the contributions are aligned with the long term of the community around this repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't an issue in-and-of-itself, it just means that we won't get the typesystem's help to enforce whether or not an Ident has a location.
this makes sense -- I guess I am worried about the implications on downstream crates of changing all fields to Location
-- specifically when used in match
expressions there are lots of patterns like https://github.com/apache/arrow-datafusion/blob/350cb47289a76e579b221fe374e4cf09db332569/datafusion/sql/src/expr/function.rs#L168-L184
Adding Located
will require all such locations to add a new level of Located. If we add
locationas a field then at least where there are matches with default match arms
..` no changes will be required.
match sql {
FunctionArgs::Named {
name,
arg: ...
.. // <-- can be used to ignore location
}
...
with located wouldn't this end up looking something like
match sql {
FunctionArgs::Named {
name: Located { name, ... },
arg: Located { ... }
.. // <-- can be used to ignore location
}
...
In terms of converting String to Ident what about something like an optional location? Like
pub struct Ident {
pub value: String,
pub quote_style: Option<char>,
location: Option<Location>,
}
And then have
impl FromStr for Ident {
...
Ident {
...
location: None.
}
(yes, this could be accomplished with a Located trait too).
I agree this would be a better approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with located wouldn't this end up looking something like
match sql {
FunctionArgs::Named {
name: Located { name, ... },
arg: Located { ... }
.. // <-- can be used to ignore location
}
...
I don't think so — but apologies if I'm missing something. Located
is exactly like a smart pointer (e.g. like Box
), so you can write
match sql {
FunctionArgs::Named {
name,
arg,
}
...
and then "pretend" name
is just an Ident
, except that you can also call .location()
on it.
In terms of converting String to Ident what about something like an optional location? Like
Yes that's true, but the problem is that now you don't have the typesystem's help to know whether or not something has a location. For example, imagine you put these idents into a map where you "don't care" about locations. You'd need Hash
, Eq
, etc. to ignore the location
field. However, there may be a separate use case where you do care about locations — what do you do then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your thoughtful reply @ankrgyl -- I plan to give this a try in the next few days. I am sorry I just don't have very much focus time to devote to sqlparser at the moment
This PR is definitely on my list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I actually open sourced the project I'm working on that uses sqlparser-rs
this weekend: https://github.com/qscl/queryscript. It's not ready for real use yet, but you can see the fork of sqlparser, and if you feel compelled, how we use it in the code (e.g. the VSCode uses Locations to show errors in the exact right locations). Happy to chat further about it offline too if helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for my lack of communication here.
Basically I have (obviously) extremely limited time for maintenance of this crate and for that I apologize.
Given the limited maintenance bandwidth, I am very hesitant to accept a change that causes downstream churn given the possibility of requiring non trivial effort to fix here.
Basically, to merge a PR like this I would like to see what changes are required to an existing project that uses AST matching extensively. Any project that uses sqlparser-rs would do (though I am personally very familiar with https://github.com/apache/arrow-datafusion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So specifically, I want to see a PR to an existing project that pins sqlparser to this branch and shows the tests / CI passing as a demonstration of how much downstream impact it would have)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posted as a top-level comment, whoops! #790 (comment)
Marking back as a draft while we work through feedback |
Here you go for DataFusion: apache/datafusion@main...qscl:arrow-datafusion:sqlparser. I used this branch which is approximately the same as the PR (more recently rebased + fixed support for I verified that on both
Most of the changes in this diff can also probably be avoided with some |
Yes, this change makes sense -- thank you @ankrgyl Basically I still feel very worried about the downstream churn adding I agree that adding a I am not sure how to proceed here. |
Would it be worthwhile to chat live? I know you're tight on time, but may be useful to brainstorm a few ideas. Some things that come to mind for me are: (a) long term, I'd bet that for the library to (continue) to be world class, it should support locations (b) maybe there are some clever Rust tricks we can think about that make it optional (c) maybe we could get some more community feedback on the design/rollout. Feel free to shoot me an email at [email protected] and maybe we can chat? |
Sounds good -- thanks @ankrgyl -- I agree it would be great to find a way forward here |
Could we make the Located strategy a template on the Parser and default to a noop wrapper for backwards compat? PoCuse std::fmt;
// ast
#[derive(Debug, PartialEq)]
struct Number(u32);
impl fmt::Display for Number {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.0)
}
}
// location
#[derive(Debug, PartialEq)]
struct Loc(u32, u32);
impl fmt::Display for Loc {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}:{}", self.0, self.1)
}
}
// location wrapper strategy
trait LocWrapper {
type Item<T>;
fn loc<T>(item: T, beg: Loc, end: Loc) -> Self::Item<T>;
}
// no-op location wrapper
struct IdentityWrapper();
impl LocWrapper for IdentityWrapper {
type Item<T> = T;
fn loc<T>(item: T, _beg: Loc, _end: Loc) -> Self::Item<T> {
item
}
}
// span location wrapper
#[derive(Debug, PartialEq)]
struct Span<T> {
item: T,
beg: Loc,
end: Loc,
}
impl<T: fmt::Display> fmt::Display for Span<T> {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{} {}–{}", self.item, self.beg, self.end)
}
}
struct SpanWrapper;
impl LocWrapper for SpanWrapper {
type Item<T> = Span<T>;
fn loc<T>(item: T, beg: Loc, end: Loc) -> Span<T> {
Span { item, beg, end }
}
}
// parser
struct ParserWithLoc<'a, L: LocWrapper> {
_marker: std::marker::PhantomData<&'a L>,
}
impl<'a, L: LocWrapper> ParserWithLoc<'a, L> {
fn new() -> Self {
Self {
_marker: std::marker::PhantomData::<&'a L> {},
}
}
fn parse(&mut self) -> L::Item<Number> {
L::loc(Number(0), Loc(1, 16), Loc(3, 23))
}
}
type Parser<'a> = ParserWithLoc<'a, IdentityWrapper>; fn main() {
assert_eq!(Parser::new().parse(), Number(0));
assert_eq!(ParserWithLoc::<SpanWrapper>::new().parse(), Span{item: Number(0), beg: Loc(1, 16), end: Loc(3, 23)});
} Would ideally directly use |
Thanks @MartinNowak -- I don't quite follow your example (in how the AST that is produced would be different and how downstream crates would have to change ) |
Updated, maybe a bit clearer. This idea is about optional locations and backwards compatibility.
The key point is that we could employ a strategy to make the AST-wrapping configurable and thus allow library consumers to adopt locations at their own pace. fn main() {
assert_eq!(Parser::new().parse(), Number(0));
assert_eq!(ParserWithLoc::<SpanWrapper>::new().parse(), Span{item: Number(0), beg: Loc(1, 16), end: Loc(3, 23)});
} On the downside a wrapping approach would require to templatize the more complex AST to support locations of nested AST elements. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
This diff propagates location information throughout identifiers in the parser, and includes the machinery (namely, the
Located
struct) that should make it straightforward to do so for other types.CC @alamb