Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqlparser] better sql parser? #866

Closed
BohuTANG opened this issue Jun 22, 2021 · 31 comments
Closed

[sqlparser] better sql parser? #866

BohuTANG opened this issue Jun 22, 2021 · 31 comments
Labels
C-feature Category: feature C-want-help Category: want help

Comments

@BohuTANG
Copy link
Member

Summary

sqlparser-rs is not easy to extend and we are looking for a good alternative(if have), comments are welcome

Reference:
https://github.com/datafuselabs/datafuse/blob/36dce2064742a1d64bfcf9a7cd308d0f5a3e4eb5/fusequery/query/src/sql/sql_parser.rs

@BohuTANG BohuTANG added C-feature Category: feature C-want-help Category: want help labels Jun 22, 2021
@BohuTANG
Copy link
Member Author

BohuTANG commented Jun 22, 2021

nom-sql
but no update for a long time

@PsiACE
Copy link
Member

PsiACE commented Jun 22, 2021

It seems that sqlparser lacks good alternatives. If necessary, we may need to maintain one by ourselves.

quaint is not a parser, but

  • An AST for building dynamic SQL queries.
  • Visitors for different databases to generate SQL strings.

toydb-parser is the parser for toydb

@sundy-li
Copy link
Member

How about parsers from yacc rules?

@dantengsky
Copy link
Member

FYI :

lalrpop: an lr(1) parser generator (lexer could be customized)

repo: https://github.com/lalrpop/lalrpop
doc: http://lalrpop.github.io/lalrpop/

  • never exercise it before
  • seems it lacks precedence/association rules declaration (again I am not sure.)

After googling around, I found a SQL parser using it:

https://github.com/hmwill/vervolg/blob/main/src/sql.lalrpop

@leiysky
Copy link
Contributor

leiysky commented Jun 22, 2021

nom-sql
but no update for a long time

nom is a nice parser combinator, with which we can extend SQL syntax easily.

I think we can fork nom-sql or make one from scratch(not a big deal).

Here's a nom Cypher parser I wrote at a Hackathon, which only cost me about few hours 🤣 . FYI.

@BohuTANG
Copy link
Member Author

How about parsers from yacc rules?

https://github.com/lalrpop/lalrpop is rust yacc

@leiysky
Copy link
Contributor

leiysky commented Jun 22, 2021

Based on my experience, working with bison(yacc) is suffering.

Here's the MySQL's yacc file, which has about 17,000 LOC.

https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy

Imagine that coding in a extremely huge plain text file without lint tool(no jump, no high light, no auto format), what a pain.

@PsiACE
Copy link
Member

PsiACE commented Jun 22, 2021

Based on my experience, working with bison(yacc) is suffering.

You are right, I also prefer a handwritten parser

@tlightsky
Copy link
Contributor

Based on my experience, working with bison(yacc) is suffering.

Here's the MySQL's yacc file, which has about 17,000 LOC.

https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy

Imagine that coding in a extremely huge plain text file without lint tool(no jump, no high light, no auto format), what a pain.

yes, hard to make changes to it

@abhikjain360
Copy link

If writing your own, maybe something like pest can be used as well. Tensorbase uses it (see here), and it seems pretty maintainable?

@BohuTANG
Copy link
Member Author

BohuTANG commented Jul 6, 2021

@abhikjain360 Thanks for the information, cool

@zhang2014
Copy link
Member

Maybe our sql parser is support multi dialect

  • When users use the ClickHouse protocol to access DataFuse, we may be ClickHouse style (syntax, function name)
  • When users use the MySQL protocol to access DataFuse, we should use MySQL style(syntax, function name)

@leiysky
Copy link
Contributor

leiysky commented Jul 28, 2021

Maybe our sql parser is support multi dialect

  • When users use the ClickHouse protocol to access DataFuse, we may be ClickHouse style (syntax, function name)
  • When users use the MySQL protocol to access DataFuse, we should use MySQL style(syntax, function name)

@zhang2014 Maybe we can disscuss this in #1218

@g302ge
Copy link

g302ge commented Aug 2, 2021

maybe handwritten parser is a good approach

@BohuTANG
Copy link
Member Author

BohuTANG commented Aug 2, 2021

maybe handwritten parser is a good approach

Hi!
There are 2 main purposes for the parser re-design: Extendibility and Maintainability.
For Maintainability, the handwritten parser is hard to matain(what's clickhouse does), so not recommend here.
For the parser, I would like it based on some rusty parser framework such as pest or others.
Good/New ideas are always welcome~

@leiysky
Copy link
Contributor

leiysky commented Aug 18, 2021

As we discussed in #1492 , the main idea about SQL grammar is based on PostgreSQL. Here's a reference of PostgreSQL grammar https://www.postgresql.org/docs/13/sql-select.html.

So the last problem is technical selection.

@BohuTANG
Copy link
Member Author

Nice, @leiysky which one(or list) do you personally preferred?

@leiysky
Copy link
Contributor

leiysky commented Aug 19, 2021

Nice, @leiysky which one(or list) do you personally preferred?

I prefer parser combinator like nom, the benifits of which are better maintainability and better extensibility.

Thanks to its declaritive style, the parser written with parser combinator follows the grammar specification, which means you can add your own syntax element easily(generally you need to modify the grammar specification first to make sure it's well defined).

@sundy-li
Copy link
Member

nom seems to be a very good parser combinators library, but LocustDB replaced nom parser by sqlparser-rs.

https://togithub.com/cswinter/LocustDB/pull/42/files

@leiysky
Copy link
Contributor

leiysky commented Aug 19, 2021

nom seems to be a very good parser combinators library, but LocustDB replaced nom parser by sqlparser-rs.

https://togithub.com/cswinter/LocustDB/pull/42/files

IMHO, it's because they don't have enough motivation to maintain a SQL parser by themselves.

@BohuTANG
Copy link
Member Author

Ok, make sense to use nom.

@ives9638
Copy link
Contributor

https://github.com/rrevenantt/antlr4rust

@ives9638
Copy link
Contributor

@leiysky
Copy link
Contributor

leiysky commented Aug 19, 2021

https://github.com/rrevenantt/antlr4rust

Antlr is a powerful parser framework as well. I didn't involve it since AFAIK currently there is no official rust target.

Seems this library has been used by some projects, I'll take a look at it.

Thanks very much.

@hustnn
Copy link

hustnn commented Sep 22, 2021

IMHO, it's because they don't have enough motivation to maintain a SQL parser by themselves.

@leiysky What is the major motivation to replace sqlparser-rs with own parser based on nom ? I played with nom recently, it is more like LEGO but still need a lot of effort.

@leiysky
Copy link
Contributor

leiysky commented Sep 22, 2021

IMHO, it's because they don't have enough motivation to maintain a SQL parser by themselves.

@leiysky What is the major motivation to replace sqlparser-rs with own parser based on nom ? I played with nom recently, it is more like LEGO but still need a lot of effort.

Hi, @hustnn.

Actually, we haven't decided which scheme to use yet. nom is one of the candidates.

We'd like to have a more maintainable SQL, which is easy for us to support new SQL syntax.

Do you have any suggestion about this?

@hustnn
Copy link

hustnn commented Oct 8, 2021

I am implementing a demo to show the possibility of the sql parser on top of nom but I feel the effort is still quite high even it already provides some combinators for usage.

The biggest benefit of have a own parser is more flexible like you can define the ASTs structure you want. Also nom did provide some powerful combinators so we can leverage them when building own parser. I feel can try implement own parser based on nom in long term, but I suggest to focus on other high priority tasks first with existing parser.

How do you think? @leiysky

@leiysky
Copy link
Contributor

leiysky commented Dec 1, 2021

@hustnn Sorry for replying so late.

The biggest benefit of have a own parser is more flexible like you can define the ASTs structure you want. Also nom did provide some powerful combinators so we can leverage them when building own parser.

You actually get the point. We do have a set of AST structure defined by ourselves, with which I'm doing some refactor. These AST structures can be directly transformed from sqlparser-rs's AST, so one day we can replace sqlparser-rs with our own parser.

nom is a good choice. We're glad to have you as collaborator on this work. Please let me know if you need any help.

@leiysky
Copy link
Contributor

leiysky commented Jan 11, 2022

https://github.com/rrevenantt/antlr4rust

I've recently taken a look at this project. It's in a very early stage, and the author hasn't put much effort into it.

Besides, antlr relies on java to run the code generator, which may bring unnecessary complexity for our development workflow.

@jiangzhe
Copy link

@hustnn Sorry for replying so late.

The biggest benefit of have a own parser is more flexible like you can define the ASTs structure you want. Also nom did provide some powerful combinators so we can leverage them when building own parser.

You actually get the point. We do have a set of AST structure defined by ourselves, with which I'm doing some refactor. These AST structures can be directly transformed from sqlparser-rs's AST, so one day we can replace sqlparser-rs with our own parser.

nom is a good choice. We're glad to have you as collaborator on this work. Please let me know if you need any help.

I've implemented a simple SQL parser using nom, including complex SELECT(except WINDOW clause) and simple INSERT/DELETE/UPDATE.
The total effort is about 3 weeks in part-time, so not very large, I guess. I didn't count the extra work on DDL.
One note is about expression parsing. The parser combinator is not very good for handling operator precedence. I used pratt parsing technique, which is simple and clear, and works well with combinator.

@andylokandy
Copy link
Contributor

The new parser is now tracking by #1218

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature C-want-help Category: want help
Projects
None yet
Development

No branches or pull requests