Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate locations throughout the parser #790

Closed
wants to merge 1 commit into from

Conversation

ankrgyl
Copy link
Contributor

@ankrgyl ankrgyl commented Jan 5, 2023

This diff propagates location information throughout identifiers in the parser, and includes the machinery (namely, the Located struct) that should make it straightforward to do so for other types.

CC @alamb

@alamb
Copy link
Contributor

alamb commented Jan 6, 2023

Thank you @ankrgyl -- I will try and find time to review this over the weekend

@alamb
Copy link
Contributor

alamb commented Jan 9, 2023

I am sorry I am a bit behind on reviews / merging in sqlparser-rs. I hope to be able to catch up later this week

@ankrgyl
Copy link
Contributor Author

ankrgyl commented Jan 9, 2023

No rush!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for kickstarting this project / push @ankrgyl -- it is really exciting to see. I have some comments on the approach -- let me know what you think

cc @AugustoFKL for your thoughts

@@ -46,12 +47,12 @@ pub enum AlterTableOperation {
/// `DROP CONSTRAINT [ IF EXISTS ] <name>`
DropConstraint {
if_exists: bool,
name: Ident,
name: Located<Ident>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of wrapping this in Located<>?

Another alternative would be to add a location field to the Ident struct like

pub struct Ident {
    /// The value of the identifier without quotes.
    pub value: String,
    /// The starting quote if any. Valid quote characters are the single quote,
    /// double quote, backtick, and opening square bracket.
    pub quote_style: Option<char>,
    /// location in the source tree. <----- New Field
    location: Location,
}

Both will be breaking changes, but I am thinking to a future where there is location information on all structs -- if all fields all wrapped in Location<..> it is going to be a lot of replicated Location<> in the code

If they each have a location field then the location will only be added by the parser (though the tests will need updating too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. For some background context, I've been working on a SQL compiler that uses the sqlparser-rs library as our parser. We added these locations to improve the error messages in the compiler.

We actually started by embedding the location in Ident, but there are places where we actually want to store the Ident itself (e.g. a map from Ident to some binder state about it), and in those places we don't care about the Location. There are also places where we want to do things like convert a string into an Ident, and if location is a field on the Ident, we'll have to make it empty/unknown. This isn't an issue in-and-of-itself, it just means that we won't get the typesystem's help to enforce whether or not an Ident has a location.

A second benefit to wrapping in Located is that we can write a bunch of functions that take Located<T> (e.g. error messages) and use the location (yes, this could be accomplished with a Located trait too). It's essentially like a smart pointer.

All of that said, I'm happy to consider another approach — I only have perspective from the one project I've been working on and want to ensure that the contributions are aligned with the long term of the community around this repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't an issue in-and-of-itself, it just means that we won't get the typesystem's help to enforce whether or not an Ident has a location.

this makes sense -- I guess I am worried about the implications on downstream crates of changing all fields to Location -- specifically when used in match expressions there are lots of patterns like https://github.com/apache/arrow-datafusion/blob/350cb47289a76e579b221fe374e4cf09db332569/datafusion/sql/src/expr/function.rs#L168-L184

Adding Located will require all such locations to add a new level of Located. If we addlocationas a field then at least where there are matches with default match arms ..` no changes will be required.

match sql {
  FunctionArgs::Named {
    name, 
    arg: ...
    .. // <-- can be used to ignore location
  }
   ...

with located wouldn't this end up looking something like

match sql {
  FunctionArgs::Named {
    name: Located { name, ... }, 
    arg: Located { ... }
    .. // <-- can be used to ignore location
  }
   ...

In terms of converting String to Ident what about something like an optional location? Like

pub struct Ident {
    pub value: String,
    pub quote_style: Option<char>,
    location: Option<Location>,
}

And then have

impl FromStr for Ident { 
...
 Ident { 
 ... 
  location: None.
}

(yes, this could be accomplished with a Located trait too).

I agree this would be a better approach

Copy link
Contributor Author

@ankrgyl ankrgyl Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with located wouldn't this end up looking something like

match sql {
  FunctionArgs::Named {
    name: Located { name, ... }, 
    arg: Located { ... }
    .. // <-- can be used to ignore location
  }
   ...

I don't think so — but apologies if I'm missing something. Located is exactly like a smart pointer (e.g. like Box), so you can write

match sql {
  FunctionArgs::Named {
    name, 
    arg,
}
...

and then "pretend" name is just an Ident, except that you can also call .location() on it.

In terms of converting String to Ident what about something like an optional location? Like

Yes that's true, but the problem is that now you don't have the typesystem's help to know whether or not something has a location. For example, imagine you put these idents into a map where you "don't care" about locations. You'd need Hash, Eq, etc. to ignore the location field. However, there may be a separate use case where you do care about locations — what do you do then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your thoughtful reply @ankrgyl -- I plan to give this a try in the next few days. I am sorry I just don't have very much focus time to devote to sqlparser at the moment

This PR is definitely on my list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I actually open sourced the project I'm working on that uses sqlparser-rs this weekend: https://github.com/qscl/queryscript. It's not ready for real use yet, but you can see the fork of sqlparser, and if you feel compelled, how we use it in the code (e.g. the VSCode uses Locations to show errors in the exact right locations). Happy to chat further about it offline too if helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for my lack of communication here.

Basically I have (obviously) extremely limited time for maintenance of this crate and for that I apologize.

Given the limited maintenance bandwidth, I am very hesitant to accept a change that causes downstream churn given the possibility of requiring non trivial effort to fix here.

Basically, to merge a PR like this I would like to see what changes are required to an existing project that uses AST matching extensively. Any project that uses sqlparser-rs would do (though I am personally very familiar with https://github.com/apache/arrow-datafusion)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So specifically, I want to see a PR to an existing project that pins sqlparser to this branch and shows the tests / CI passing as a demonstration of how much downstream impact it would have)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted as a top-level comment, whoops! #790 (comment)

@alamb
Copy link
Contributor

alamb commented Feb 17, 2023

Marking back as a draft while we work through feedback

@alamb alamb marked this pull request as draft February 17, 2023 19:03
@ankrgyl
Copy link
Contributor Author

ankrgyl commented Feb 17, 2023

Here you go for DataFusion: apache/datafusion@main...qscl:arrow-datafusion:sqlparser. I used this branch which is approximately the same as the PR (more recently rebased + fixed support for Visitor).

I verified that on both main and this branch, cargo test returns the following:

test result: FAILED. 466 passed; 144 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.79s

Most of the changes in this diff can also probably be avoided with some into() methods or changing some of the functions within datafusion to take Located<Ident> instead. Note that the distinction for a function accepting the Located or a raw Ident is quite important, and actually worth distinguishing.

@alamb
Copy link
Contributor

alamb commented Feb 18, 2023

Here you go for DataFusion: apache/datafusion@main...qscl:arrow-datafusion:sqlparser. I used this branch which is approximately the same as the PR (more recently rebased + fixed support for Visitor).

Yes, this change makes sense -- thank you @ankrgyl

Basically I still feel very worried about the downstream churn adding Located will add.

I agree that adding a location: Location would also cause a bunch of churn.

I am not sure how to proceed here.

@ankrgyl
Copy link
Contributor Author

ankrgyl commented Feb 18, 2023

Would it be worthwhile to chat live? I know you're tight on time, but may be useful to brainstorm a few ideas.

Some things that come to mind for me are: (a) long term, I'd bet that for the library to (continue) to be world class, it should support locations (b) maybe there are some clever Rust tricks we can think about that make it optional (c) maybe we could get some more community feedback on the design/rollout.

Feel free to shoot me an email at [email protected] and maybe we can chat?

@alamb
Copy link
Contributor

alamb commented Feb 18, 2023

Sounds good -- thanks @ankrgyl -- I agree it would be great to find a way forward here

@MartinNowak
Copy link
Contributor

MartinNowak commented Aug 27, 2023

Some things that come to mind for me are: (a) long term, I'd bet that
for the library to (continue) to be world class, it should support
locations (b) maybe there are some clever Rust tricks we can think
about that make it optional (c) maybe we could get some more community
feedback on the design/rollout.

Could we make the Located strategy a template on the Parser and default to a noop wrapper for backwards compat?

PoC
use std::fmt;

// ast
#[derive(Debug, PartialEq)]
struct Number(u32);

impl fmt::Display for Number {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

// location
#[derive(Debug, PartialEq)]
struct Loc(u32, u32);

impl fmt::Display for Loc {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}:{}", self.0, self.1)
    }
}

// location wrapper strategy
trait LocWrapper {
    type Item<T>;

    fn loc<T>(item: T, beg: Loc, end: Loc) -> Self::Item<T>;
}

// no-op location wrapper
struct IdentityWrapper();

impl LocWrapper for IdentityWrapper {
    type Item<T> = T;

    fn loc<T>(item: T, _beg: Loc, _end: Loc) -> Self::Item<T> {
        item
    }
}

// span location wrapper
#[derive(Debug, PartialEq)]
struct Span<T> {
    item: T,
    beg: Loc,
    end: Loc,
}

impl<T: fmt::Display> fmt::Display for Span<T> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{} {}–{}", self.item, self.beg, self.end)
    }
}

struct SpanWrapper;

impl LocWrapper for SpanWrapper {
    type Item<T> = Span<T>;

    fn loc<T>(item: T, beg: Loc, end: Loc) -> Span<T> {
        Span { item, beg, end }
    }
}

// parser
struct ParserWithLoc<'a, L: LocWrapper> {
    _marker: std::marker::PhantomData<&'a L>,
}

impl<'a, L: LocWrapper> ParserWithLoc<'a, L> {
    fn new() -> Self {
        Self {
            _marker: std::marker::PhantomData::<&'a L> {},
        }
    }

    fn parse(&mut self) -> L::Item<Number> {
        L::loc(Number(0), Loc(1, 16), Loc(3, 23))
    }
}

type Parser<'a> = ParserWithLoc<'a, IdentityWrapper>;
fn main() {
    assert_eq!(Parser::new().parse(), Number(0));
    assert_eq!(ParserWithLoc::<SpanWrapper>::new().parse(), Span{item: Number(0), beg: Loc(1, 16), end: Loc(3, 23)});
}

Would ideally directly use Identity and Span as template arguments rather than associating them via wrappers (Higher kinded polymorphism, The push for GATs stabilization).

@alamb
Copy link
Contributor

alamb commented Aug 28, 2023

Could we make the Located strategy a template on the Parser and default to a noop wrapper for backwards compat?

Thanks @MartinNowak -- I don't quite follow your example (in how the AST that is produced would be different and how downstream crates would have to change )

@MartinNowak
Copy link
Contributor

MartinNowak commented Aug 28, 2023

Updated, maybe a bit clearer.

This idea is about optional locations and backwards compatibility.

(b) maybe there are some clever Rust tricks we can think
about that make it optional

#790 (comment)

The key point is that we could employ a strategy to make the AST-wrapping configurable and thus allow library consumers to adopt locations at their own pace.

fn main() {
    assert_eq!(Parser::new().parse(), Number(0));
    assert_eq!(ParserWithLoc::<SpanWrapper>::new().parse(), Span{item: Number(0), beg: Loc(1, 16), end: Loc(3, 23)});
}

On the downside a wrapping approach would require to templatize the more complex AST to support locations of nested AST elements.

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants