Fix BigQuery hyphenated ObjectName with numbers #1598

ayman-sigma · 2024-12-12T19:47:46Z

We currently support hyphenated identifiers for BigQuery. The current code expects the number segment to be the last segment ex: foo-123 and should be followed by whitespace. That is true except when this identifier is part of an ObjectName. Ex: SELECT * FROM foo-123.bar.
The issue is that tokenizer parse the previous string as: [Word("foo"), Minus, Number("123."), Word("bar")]
This PR:

Fixes the tokenizer to tokenize foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")]
Fixes the parser to parse foo-123.bar as ObjectName([Ident("foo-123"), Ident("bar")]) instead of erroring out.

ayman-sigma · 2024-12-12T19:56:44Z

I think I messed up something while fixing clippy. I will fix.

iffyio

LGTM! Thanks @ayman-sigma!
cc @alamb

goldmedal · 2024-12-26T17:44:40Z

Hi @ayman-sigma, @iffyio,

I'm concerned about this change. Is SELECT * FROM foo-123.bar valid SQL or a real use case for BigQuery? 🤔

Based on the BigQuery documentation, dataset names cannot contain spaces or special characters such as -, &, @, or %.

This PR modifies the tokenizer for numbers, which breaks SQL (It parsed to a wrong result, see #1619 ) like:

SELECT 0. AS c1

This syntax is valid in BigQuery.

Additionally, I'm concerned about unquoted hyphenated identifiers, made in #1109 by @jmhain
I can't execute the following SQL on my BigQuery project:

SELECT * FROM foo-bar

Generally, - is not a valid character for identifiers. The table name should be quoted, like this:

SELECT * FROM `foo-bar`

Is there any documentation that explicitly covers this 🤔 ? The only relevant information I found was about unquoted identifiers:

Must begin with a letter or an underscore () character.
Subsequent characters can be letters, numbers, or underscores ().

also c.c. @alamb who reviewed #1109

ayman-sigma · 2024-12-26T21:31:03Z

Hi @ayman-sigma, @iffyio,

I'm concerned about this change. Is SELECT * FROM foo-123.bar valid SQL or a real use case for BigQuery? 🤔

Based on the BigQuery documentation, dataset names cannot contain spaces or special characters such as -, &, @, or %.

I think my example here was bad. I meant to write something like SELECT * FROM foo-123.mydataset.mytable which is a valid table syntax according to the docs you mentioned above.

This PR modifies the tokenizer for numbers, which breaks SQL (It parsed to a wrong result, see #1619 ) like:
SELECT 0. AS c1
This syntax is valid in BigQuery.

Sorry about breaking that case. I'm actually surprised we don't have a test case for that.
I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

goldmedal · 2024-12-27T01:33:23Z

I think my example here was bad. I meant to write something like SELECT * FROM foo-123.mydataset.mytable which is a valid table syntax according to the docs you mentioned above.

I see. So you mean the project name in BigQuery. I think it backs to my another concern about unquoted hyphenated identifiers. I agree foo-123.mydatset.mytable is a valid table name but I don't think we can query it without quoting it in BigQuery.

I tried a similar case in BigQuery (sorry for the non-English UI but I think it's easy to know which one works)

and

or

Or did I miss something? Can we use it in BigQuery by enabling some configuration?

Sorry about breaking that case. I'm actually surprised we don't have a test case for that. I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

Never mind. I'm also surprised about it 😢. We should have tests to protect it (I'll do it in #1619)
I think the problem is: should we allow unquoted hyphenated identifiers for BigQuery or any dialect by default?

If it's a BigQuery official valid syntax, I prefer to limit it to be a BigQuery-specific behavior in the tokenizer.
I pretty sure other dialects don't allow this kind of identifier.

However, it's not a valid syntax for BigQUery but it's used by some downstream projects (maybe in your case?). I prefer to let it be an optional behavior for the dialect. Maybe add a method in the Dialect trait to control it. 🤔

WDYT?

ayman-sigma · 2024-12-27T02:22:10Z

I see. So you mean the project name in BigQuery. I think it backs to my another concern about unquoted hyphenated identifiers. I agree foo-123.mydatset.mytable is a valid table name but I don't think we can query it without quoting it in BigQuery.

I'm not sure about the table name, but I know for sure that project name allows unquoted hyphenated identifier for BigQuery.

Sorry about breaking that case. I'm actually surprised we don't have a test case for that. I think it easy to fix the case of SELECT 1. AS c1, but it will be tricky to handle SELECT 1.e5 for example.

Never mind. I'm also surprised about it 😢. We should have tests to protect it (I'll do it in #1619) I think the problem is: should we allow unquoted hyphenated identifiers for BigQuery or any dialect by default?

If it's a BigQuery official valid syntax, I prefer to limit it to be a BigQuery-specific behavior in the tokenizer. I pretty sure other dialects don't allow this kind of identifier.

However, it's not a valid syntax for BigQUery but it's used by some downstream projects (maybe in your case?). I prefer to let it be an optional behavior for the dialect. Maybe add a method in the Dialect trait to control it. 🤔

WDYT?

I believe the original PR from @jmhain was scoped to BigQuery, so we should be fine there. We just need to fix the tokenizer again for your case.
The tokenizer change in this PR allow parsing foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")] instead of [Word("foo"), Minus, Number("123."), Word("bar")]. I believe the former is more correct than the latter.
You can scope that change to BigQuery, but I prefer that we fix it to allow the 1. and 1.e5, ... cases.

goldmedal · 2024-12-27T02:55:58Z

I'm not sure about the table name, but I know for sure that project name allows unquoted hyphenated identifier for BigQuery.

Indeed, I have confirmed it: the project name is allowed to be an unquoted hyphenated identifier. 👍

I believe the original PR from @jmhain was scoped to BigQuery, so we should be fine there. We just need to fix the tokenizer again for your case. The tokenizer change in this PR allow parsing foo-123.bar as [Word('"foo"), Minus, Number("123"), Period, Word("bar")] instead of [Word("foo"), Minus, Number("123."), Word("bar")]. I believe the former is more correct than the latter. You can scope that change to BigQuery, but I prefer that we fix it to allow the 1. and 1.e5, ... cases.

I see. I think it's a BigQuery-specific behavior, not a common rule for others. I will add a method Dialect::support_unquoted_hyphenated_identifiers to scope it.

Thanks for your explanation. 🙇

This brings couple of fixes I put for databricks and bigquery: - apache#1598 - apache#1600

ayman-sigma added 2 commits December 12, 2024 10:59

fix tokenize number as part of ObjectName

63245f6

Fix BigQuery hythenated ObjectName with numbers

4f92baa

fix after clippy

0c01b33

ayman-sigma force-pushed the ayman/fixBigqueryHyphenatedObjectName branch from acd682e to 0c01b33 Compare December 12, 2024 20:35

iffyio approved these changes Dec 17, 2024

View reviewed changes

iffyio merged commit e9ab4d6 into apache:main Dec 18, 2024
8 checks passed

goldmedal mentioned this pull request Dec 26, 2024

Wrong result for parsing special double number value #1619

Closed

ayman-sigma mentioned this pull request Dec 27, 2024

Sync upstream sigmacomputing/sqlparser-rs#31

Merged

goldmedal mentioned this pull request Dec 27, 2024

Fix the parsing result for the special double number #1621

Merged

ayman-sigma added a commit to sigmacomputing/sqlparser-rs that referenced this pull request Jan 3, 2025

Sync upstream (#31)

35d10e0

This brings couple of fixes I put for databricks and bigquery: - apache#1598 - apache#1600

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix BigQuery hyphenated ObjectName with numbers #1598

Fix BigQuery hyphenated ObjectName with numbers #1598

ayman-sigma commented Dec 12, 2024

ayman-sigma commented Dec 12, 2024

iffyio left a comment

goldmedal commented Dec 26, 2024

ayman-sigma commented Dec 26, 2024

goldmedal commented Dec 27, 2024

ayman-sigma commented Dec 27, 2024

goldmedal commented Dec 27, 2024

Fix BigQuery hyphenated ObjectName with numbers #1598

Fix BigQuery hyphenated ObjectName with numbers #1598

Conversation

ayman-sigma commented Dec 12, 2024

ayman-sigma commented Dec 12, 2024

iffyio left a comment

Choose a reason for hiding this comment

goldmedal commented Dec 26, 2024

ayman-sigma commented Dec 26, 2024

goldmedal commented Dec 27, 2024

ayman-sigma commented Dec 27, 2024

goldmedal commented Dec 27, 2024