Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BigQuery hyphenated ObjectName with numbers #1598

Merged
merged 3 commits into from
Dec 18, 2024

Conversation

ayman-sigma
Copy link
Contributor

We currently support hyphenated identifiers for BigQuery. The current code expects the number segment to be the last segment ex: foo-123 and should be followed by whitespace. That is true except when this identifier is part of an ObjectName. Ex: SELECT * FROM foo-123.bar.
The issue is that tokenizer parse the previous string as: [Word("foo"), Minus, Number("123."), Word("bar")]
This PR:

  1. Fixes the tokenizer to tokenize foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")]
  2. Fixes the parser to parse foo-123.bar as ObjectName([Ident("foo-123"), Ident("bar")]) instead of erroring out.

@ayman-sigma
Copy link
Contributor Author

I think I messed up something while fixing clippy. I will fix.

@ayman-sigma ayman-sigma force-pushed the ayman/fixBigqueryHyphenatedObjectName branch from acd682e to 0c01b33 Compare December 12, 2024 20:35
Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @ayman-sigma!
cc @alamb

@iffyio iffyio merged commit e9ab4d6 into apache:main Dec 18, 2024
8 checks passed
@goldmedal
Copy link
Contributor

Hi @ayman-sigma, @iffyio,

I'm concerned about this change. Is SELECT * FROM foo-123.bar valid SQL or a real use case for BigQuery? 🤔

Based on the BigQuery documentation, dataset names cannot contain spaces or special characters such as -, &, @, or %.

This PR modifies the tokenizer for numbers, which breaks SQL (It parsed to a wrong result, see #1619 ) like:

SELECT 0. AS c1

This syntax is valid in BigQuery.

Additionally, I'm concerned about unquoted hyphenated identifiers, made in #1109 by @jmhain
I can't execute the following SQL on my BigQuery project:

SELECT * FROM foo-bar

Generally, - is not a valid character for identifiers. The table name should be quoted, like this:

SELECT * FROM `foo-bar`

Is there any documentation that explicitly covers this 🤔 ? The only relevant information I found was about unquoted identifiers:

Must begin with a letter or an underscore () character.
Subsequent characters can be letters, numbers, or underscores (
).

also c.c. @alamb who reviewed #1109

@ayman-sigma
Copy link
Contributor Author

Hi @ayman-sigma, @iffyio,

I'm concerned about this change. Is SELECT * FROM foo-123.bar valid SQL or a real use case for BigQuery? 🤔

Based on the BigQuery documentation, dataset names cannot contain spaces or special characters such as -, &, @, or %.

I think my example here was bad. I meant to write something like SELECT * FROM foo-123.mydataset.mytable which is a valid table syntax according to the docs you mentioned above.

This PR modifies the tokenizer for numbers, which breaks SQL (It parsed to a wrong result, see #1619 ) like:

SELECT 0. AS c1

This syntax is valid in BigQuery.

Sorry about breaking that case. I'm actually surprised we don't have a test case for that.
I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

@goldmedal
Copy link
Contributor

I think my example here was bad. I meant to write something like SELECT * FROM foo-123.mydataset.mytable which is a valid table syntax according to the docs you mentioned above.

I see. So you mean the project name in BigQuery. I think it backs to my another concern about unquoted hyphenated identifiers. I agree foo-123.mydatset.mytable is a valid table name but I don't think we can query it without quoting it in BigQuery.

I tried a similar case in BigQuery (sorry for the non-English UI but I think it's easy to know which one works)
Screenshot 2024-12-27 at 9 18 14 AM
and
Screenshot 2024-12-27 at 9 18 29 AM
or
Screenshot 2024-12-27 at 9 21 09 AM

Or did I miss something? Can we use it in BigQuery by enabling some configuration?

Sorry about breaking that case. I'm actually surprised we don't have a test case for that. I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

Never mind. I'm also surprised about it 😢. We should have tests to protect it (I'll do it in #1619)
I think the problem is: should we allow unquoted hyphenated identifiers for BigQuery or any dialect by default?

If it's a BigQuery official valid syntax, I prefer to limit it to be a BigQuery-specific behavior in the tokenizer.
I pretty sure other dialects don't allow this kind of identifier.

However, it's not a valid syntax for BigQUery but it's used by some downstream projects (maybe in your case?). I prefer to let it be an optional behavior for the dialect. Maybe add a method in the Dialect trait to control it. 🤔

WDYT?

@ayman-sigma
Copy link
Contributor Author

I see. So you mean the project name in BigQuery. I think it backs to my another concern about unquoted hyphenated identifiers. I agree foo-123.mydatset.mytable is a valid table name but I don't think we can query it without quoting it in BigQuery.

I'm not sure about the table name, but I know for sure that project name allows unquoted hyphenated identifier for BigQuery.

Sorry about breaking that case. I'm actually surprised we don't have a test case for that. I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

Never mind. I'm also surprised about it 😢. We should have tests to protect it (I'll do it in #1619) I think the problem is: should we allow unquoted hyphenated identifiers for BigQuery or any dialect by default?

If it's a BigQuery official valid syntax, I prefer to limit it to be a BigQuery-specific behavior in the tokenizer. I pretty sure other dialects don't allow this kind of identifier.

However, it's not a valid syntax for BigQUery but it's used by some downstream projects (maybe in your case?). I prefer to let it be an optional behavior for the dialect. Maybe add a method in the Dialect trait to control it. 🤔

WDYT?

I believe the original PR from @jmhain was scoped to BigQuery, so we should be fine there. We just need to fix the tokenizer again for your case.
The tokenizer change in this PR allow parsing foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")] instead of [Word("foo"), Minus, Number("123."), Word("bar")]. I believe the former is more correct than the latter.
You can scope that change to BigQuery, but I prefer that we fix it to allow the 1. and 1.e5, ... cases.

@goldmedal
Copy link
Contributor

I'm not sure about the table name, but I know for sure that project name allows unquoted hyphenated identifier for BigQuery.

Indeed, I have confirmed it: the project name is allowed to be an unquoted hyphenated identifier. 👍
Screenshot 2024-12-27 at 10 44 20 AM

I believe the original PR from @jmhain was scoped to BigQuery, so we should be fine there. We just need to fix the tokenizer again for your case. The tokenizer change in this PR allow parsing foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")] instead of [Word("foo"), Minus, Number("123."), Word("bar")]. I believe the former is more correct than the latter. You can scope that change to BigQuery, but I prefer that we fix it to allow the 1. and 1.e5, ... cases.

I see. I think it's a BigQuery-specific behavior, not a common rule for others. I will add a method Dialect::support_unquoted_hyphenated_identifiers to scope it.

Thanks for your explanation. 🙇

ayman-sigma added a commit to sigmacomputing/sqlparser-rs that referenced this pull request Jan 3, 2025
This brings couple of fixes I put for databricks and bigquery:

- apache#1598
- apache#1600
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants