Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add test file format #680
feat: add test file format #680
Changes from 4 commits
39de59e
d4d69e1
8676a4c
759a755
3a541c0
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we define the specification here and in the Antlr grammar file will we end up changing both places all the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hopefully only when new types are introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the token
string
or isstring
defined somewhere? Canfunction
have a space in it or special characters?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[A-Za-z0-9] [A-Za-z0-9_]*
I will redefine the grammar in antlr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using the definition of string already implemented by the textplan Antlr grammar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options are values passed to the functions but not Substrait literals. Probably best to avoid the confusion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Literal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are escapes handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should a "test runner" only validate the answer within 6 digits? In other words, if one engine returns (float32)
divide(1.0, 3.0)
as0.3333333432674407958984375
and another engine returnsdivide(1.0, 3.0)
as0.3333330452442169189453125
should the test runner consider both engines correct?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does P stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While concise this is a very difficult to read and write. For instance remembering that one of the elements has two letters -- and which one -- is difficult. Take a look at the text format for a more readable form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.wikipedia.org/wiki/ISO_8601#Durations
It starts with P, T is the time designator that precedes all Time elements. All elements have single letter to denote it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are already defined in the text format. You do need a way of specifying these that is amenable to recursive parsing. The Substrait C++ type format took an approach that was more error prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do short forms still need to be parameterized? Can we refer to the list elsewhere? Can long names be interchangeably used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes parametrization is needed. I prefer to not to use long names in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can comments sometimes be recognized as group names? Do we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. Empty groups can be ignored by the testing framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the name the whole line or only up to the semicolon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whole line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't really demonstrate rounding error. If we're sticking with the 6/15 digits of precision thing then it is impossible to demonstrate rounding error. Should we just remove these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could reuse the text format's definition of literals?
https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/README.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would be to stick with :: format generally. It is a cast format consistent with many systems the people looking at these tests will be familiar with and I believe much more intuitive understood than underscore.
I'm similarly inclined to try to find more standard ways to represent other things as opposed to what the text format has done. There are several examples of alternative ways to implement things to what is commonly done in SQL (which has the most consistent way to approach these things across systems). Examples beyond the use of cast syntax versus underscores include. (1) String literals are double quoted and backticked while sql string literals are single quoted and (2) brackets being overloaded to mean any of intervals, lists, maps and structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The :: format is a type subclass operator. As such it requires a mental context switch every time I look at it. The underscore was chosen in the text format to create a visual separation between the literal value and its type thus aiding in readability (and it's also requires less keystrokes).
Braces are used for literal values and angle brackets are used for type composition and parameterization. This helps to reinforce the distinction between value and type. I have yet to be able to determine what the value of any interval literal is in the proposed format. However the text format's use of labels makes it ultra-clear what each part of the interval is.
The text format's string implementation does draw from other sources than SQL but that's mostly because the SQL version has its own warts which make it harder to read. SQL's implicit concatenation is the main SQL string feature that is difficult for humans to parse (it requires the counting of active single quotes).
The text format was reviewed in the community meeting twice (February 1st, 2023 and April 12th, 2023) so presumably there is some community agreement about its design validity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that the text format defined in substrait-cpp isn't valid. It's definitely valid.
I think we come from different perspectives. I'm coming from the perspective that we should try to leverage existing patterns as much as possible. My expectation is that the vast majority of systems that will be tested are SQL systems so leveraging SQL patterns is desirable as much as reasonably possible and comes with the added benefit of being obvious in meaning to a large group of people reviewing the test case. If we look at the set of BFT dialects, for example, we have the following:
So a decent portion are SQL based. Single-quoted literals are the most common way to represent strings in those systems. 3/7 have support for the double colon as meaning cast. Given that, I prefer that over something invented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the short form names as noted above or are both legal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it should be short form. Fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the P here optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it was a typo. Fixed it.