Implement Spanned to retrieve source locations on AST nodes #1435
@alamb Hey, we've started using this functionality internally with pretty great success so far. It's still a draft for now, because of the missing todo!'s, but I would appreciate feedback on the overall design and on whether you think this can get merged in the foreseeable future once the issues are addressed. 😄
How is this
Alright, I have now gotten rid of all the todo!s and warnings and un-drafted the PR. There are a lot of missing implementations of spans and I have documented those (in hindsight maybe a derive macro would have been the way to go here; someone better at writing those than me is free to take a shot at it 😅 ).
Thanks for tackling this @Nyrox! Took a quick look and left some comments inline, I'll make some time to do another pass
Co-authored-by: Ifeanyi Ubah <[email protected]>
Thanks @Nyrox! The changes look reasonable to me overall given the discussion in the GH issue. Left some comments, one mostly wondering about the equality behavior now that the token location is embedded within the AST
Thank you so much @Nyrox and @iffyio and @lovasoa -- this is epic work.
Also, thank you to @lustefaniak @yuyang-ok for your comments
Many people have tried this feature but none have prevailed. 👏 If we are ever colocated I totally owe you an in person meet up / 🍻 with a beverage of your choice
In terms of next steps:
- I will file a ticket with the current state of the project / spans which can hopefully let us spread out the work for adding span information to the rest of parse tree over time.
- I think it would be good to consider changing the offsets in Location to be u32 rather than usize, which would reduce the memory requirements significantly, I think.
- I have a few ideas to improve the documentation, but I will propose some follow on PRs to do that.
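As a rough illustration of the memory point in the second bullet, here is a minimal sketch comparing two hypothetical location layouts. The struct names `LocationWide` and `LocationNarrow` are made up for this example and are not the crate's actual `Location` definition; the field types are assumptions taken from the comment above.

```rust
use std::mem::size_of;

// Hypothetical location with pointer-sized fields (as described in the comment).
#[allow(dead_code)]
struct LocationWide {
    line: usize,
    column: usize,
}

// Hypothetical location with 32-bit fields (the proposed change).
#[allow(dead_code)]
struct LocationNarrow {
    line: u32,
    column: u32,
}

fn main() {
    // On a typical 64-bit target the wide layout is 16 bytes and the narrow one is 8.
    println!("usize-based: {} bytes", size_of::<LocationWide>());
    println!("u32-based:   {} bytes", size_of::<LocationNarrow>());
}
```

Since a span stores two locations and spans end up on many AST nodes and tokens, halving the location size could add up, though the real saving depends on the actual field types and padding.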
BTW I tried updating DataFusion to use this change here apache/datafusion#13546 and it went quite smoothly
Let's leave this PR open for a few more days to get any more feedback and then plan to merge it in 🚀
I started organizing follow on work here: I also have been going through the code and adding docs / examples. So far I am quite pleased
I plan to merge this PR tomorrow unless there are any other comments
Did we run some benchmarks (e.g. cargo bench)?
No, I did not (it is not clear to me we have such a thing). Let me look
cd sqlparser_bench
cargo bench
git remote add Nyrox https://github.com/Nyrox/sqlparser-rs.git
git fetch Nyrox
git checkout Nyrox/main
cargo bench
Here is the benchmark result (it appears to be about 10%-15% slower according to the benchmark):
Here is the flamegraph for anyone who is interested (you can download it locally to get zoom / etc):
Thanks, looks great. I think 15% degradation is fully worth it (and might be gained back if someone looks at optimizing sqlparser-rs) :)
Yeah, I was looking at the flamegraph and there are a bunch of obvious things to improve performance (like changing
I filed a ticket to discuss improving performance:
I also noticed that there are a bunch of calls to
Found the benchmark problem 🤦 And fix: I will rerun the benchmarks with actually parsing queries
Amusingly, when I ran with the fixed benchmarks the result is basically the same (15% slower)
I am more convinced than ever that we could make a huge performance improvement by doing something like:
🚀 -- thanks again @Nyrox @iffyio @lovasoa @lustefaniak @mkarbo @Dandandan and @yuyang-ok for helping push this along. It is pretty amazing we'll finally get new things cc @ankrgyl it's finally happening
🚀
Amazing!!!
BTW if anyone has time to help review another PR, this one adds a bunch of documentation and examples for this feature:
BTW here is a PR from @davisp that recovers all the performance lost adding tokens (and then some) ❤️
This PR adds a new trait Spanned to retrieve source spans on AST nodes, see #161, by recursively traversing the AST and combining spans. This approach is in contrast to the one taken in #839 and #790 by trying to minimise the amount of breaking changes for downstream users, by avoiding wrapping everything in WithSpan<T>.
Main Changes
The general philosophy of the PR is to be "good enough" without breaking things. As a result, certain expressions will have broken or incorrect spans. I imagine these can be cleaned up in future PRs (which might require breaking changes).
For example, many expressions do not include keywords in their span, i.e.
will have its source span simply reported as <expr>::span, and there are many such cases, some of which are easier to fix than others. For expressions we can't generate spans for, I use Span::EMPTY as a sort of sentinel value to indicate missing information.
With this approach, the only downstream change a user should have to make to upgrade is adding additional fields when matching on AST nodes.
Future Work
ast::value::Value. This seems like a breaking change to me, which would require a WithSpan<T>-like type