Fix BigQuery hyphenated ObjectName with numbers #1598
I think I messed something up while fixing the clippy warnings. I will fix it.
LGTM! Thanks @ayman-sigma!
cc @alamb
Hi @ayman-sigma, @iffyio, I'm concerned about this change. Is Based on the BigQuery documentation, dataset names cannot contain spaces or special characters such as

This PR modifies the tokenizer for numbers, which breaks SQL (it parses to a wrong result, see #1619) like:
SELECT 0. AS c1
This syntax is valid in BigQuery.

Additionally, I'm concerned about the unquoted hyphenated identifiers introduced in #1109 by @jmhain:
SELECT * FROM foo-bar
Generally,
SELECT * FROM `foo-bar`
Is there any documentation that explicitly covers this 🤔? The only relevant information I found was about unquoted identifiers:
I think my example here was bad. I meant to write something like
Sorry about breaking that case. I'm actually surprised we don't have a test case for that.
I see. So you mean the project name in BigQuery. I think it goes back to my other concern about unquoted hyphenated identifiers. I agree I tried a similar case in BigQuery (sorry for the non-English UI, but I think it's easy to tell which one works). Or did I miss something? Can we use it in BigQuery by enabling some configuration?
Never mind. I'm also surprised about it 😢. We should have tests to protect it (I'll add them in #1619). If it's officially valid BigQuery syntax, I'd prefer to limit it to BigQuery-specific behavior in the tokenizer. However, if it's not valid BigQuery syntax but is used by some downstream projects (maybe in your case?), I'd prefer to make it optional behavior for the dialect. Maybe add a method to the Dialect trait to control it. 🤔 WDYT?
I'm not sure about the table name, but I know for sure that BigQuery allows an unquoted hyphenated identifier as the project name.
I believe the original PR from @jmhain was scoped to BigQuery, so we should be fine there. We just need to fix the tokenizer again for your case.
Indeed, I have confirmed it: the project name is allowed to be an unquoted hyphenated identifier. 👍
I see. I think it's a BigQuery-specific behavior, not a common rule for others. I will add a method Thanks for your explanation. 🙇
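For illustration, the dialect-level switch discussed in this thread could be sketched roughly as follows. This is a minimal, self-contained sketch and not the actual sqlparser code: `ToyDialect`, `ToyGeneric`, `ToyBigQuery`, `allows_unquoted_hyphenated_identifiers`, and `split_name` are all hypothetical names invented here.

```rust
/// Hypothetical stand-in for a dialect trait; NOT the real sqlparser `Dialect`.
trait ToyDialect {
    /// Hypothetical switch: may identifiers such as `foo-bar` appear unquoted?
    /// Defaults to `false`, so only dialects that opt in are affected.
    fn allows_unquoted_hyphenated_identifiers(&self) -> bool {
        false
    }
}

/// A dialect that keeps the default (hyphens are not part of identifiers).
struct ToyGeneric;
/// A dialect that opts in, mirroring the BigQuery project-name case.
struct ToyBigQuery;

impl ToyDialect for ToyGeneric {}

impl ToyDialect for ToyBigQuery {
    fn allows_unquoted_hyphenated_identifiers(&self) -> bool {
        true
    }
}

/// Splits an unquoted name into pieces, consulting the dialect switch.
/// With the switch on, the hyphenated name stays in one piece; otherwise
/// each `-` becomes its own piece between the surrounding segments.
fn split_name(dialect: &dyn ToyDialect, input: &str) -> Vec<String> {
    if dialect.allows_unquoted_hyphenated_identifiers() {
        return vec![input.to_string()];
    }
    let mut pieces = Vec::new();
    for (i, segment) in input.split('-').enumerate() {
        if i > 0 {
            pieces.push("-".to_string());
        }
        pieces.push(segment.to_string());
    }
    pieces
}
```

With this sketch, `split_name(&ToyBigQuery, "foo-bar")` keeps `foo-bar` as one piece, while `split_name(&ToyGeneric, "foo-bar")` yields `foo`, `-`, `bar`. A defaulted trait method keeps the behavior opt-in, so other dialects are unaffected.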
This brings in a couple of fixes I made for Databricks and BigQuery:
- apache#1598
- apache#1600
We currently support hyphenated identifiers for BigQuery. The current code expects the number segment to be the last segment, e.g. `foo-123`, and to be followed by whitespace. That is true except when the identifier is part of an ObjectName, e.g. `SELECT * FROM foo-123.bar`.

The issue is that the tokenizer parses the previous string as:
`[Word("foo"), Minus, Number("123."), Word("bar")]`

This PR:
- Fixes the tokenizer to tokenize `foo-123.bar` as `[Word("foo"), Minus, Number("123"), Period, Word("bar")]`
- Fixes the parser to parse `foo-123.bar` as `ObjectName([Ident("foo-123"), Ident("bar")])` instead of erroring out.
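The two steps above can be sketched with a toy tokenizer. This is only an illustration under stated assumptions, not the sqlparser implementation: the `Token` enum, `tokenize`, and `object_name` below are hypothetical, and the rule used (a `.` after digits is left out of the number when an identifier character follows it) is one simple way to get the described token stream while still reading `0.` in `SELECT 0. AS c1` as a number.

```rust
/// Toy token type for this sketch (hypothetical, not sqlparser's `Token`).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Minus,
    Period,
}

/// Toy tokenizer: handles only words, unsigned numbers, `-` and `.`;
/// every other character (including whitespace) is skipped.
fn tokenize(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_ascii_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            tokens.push(Token::Word(chars[start..i].iter().collect()));
        } else if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            // Assumed rule: if the '.' is followed by an identifier character
            // (as in `123.bar`), it separates name parts and is not consumed.
            let dot_starts_name = i + 1 < chars.len()
                && chars[i] == '.'
                && (chars[i + 1].is_ascii_alphabetic() || chars[i + 1] == '_');
            if i < chars.len() && chars[i] == '.' && !dot_starts_name {
                i += 1; // consume the '.' as part of the number, e.g. `0.`
                while i < chars.len() && chars[i].is_ascii_digit() {
                    i += 1;
                }
            }
            tokens.push(Token::Number(chars[start..i].iter().collect()));
        } else if c == '-' {
            tokens.push(Token::Minus);
            i += 1;
        } else if c == '.' {
            tokens.push(Token::Period);
            i += 1;
        } else {
            i += 1;
        }
    }
    tokens
}

/// Joins hyphen-connected segments into identifiers and splits on periods.
fn object_name(tokens: &[Token]) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    for token in tokens {
        match token {
            Token::Word(s) | Token::Number(s) => current.push_str(s),
            Token::Minus => current.push('-'),
            Token::Period => parts.push(std::mem::take(&mut current)),
        }
    }
    parts.push(current);
    parts
}
```

In this sketch, `tokenize("foo-123.bar")` produces the five tokens listed above and `object_name` folds them into the parts `foo-123` and `bar`, while `tokenize("0. AS c1")` still starts with `Number("0.")`.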