PPL command implementation for appendCol #990
base: main

Conversation
Signed-off-by: Andy Kwok <[email protected]>
```java
static List<Expression> getoverridedlist(LogicalPlan lp, String tableName) {
    // When override option present, extract fields to project from sub-search,
    // then apply a dfDropColumns on main-search to avoid duplicate fields.
    final SparkSession sparkSession = SparkSession.getActiveSession().get();
```
Question to @LantaoJin: is there a more generic way to extract the table's attribute list?
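For reference, a minimal sketch of one generic way to pull a table's attribute names through the active session; this helper is hypothetical and not part of the PR:

```scala
import org.apache.spark.sql.SparkSession

// Resolve the table by name and read the attribute names off the analyzed plan.
def tableAttributes(tableName: String): Seq[String] = {
  val spark = SparkSession.active
  spark.table(tableName).queryExecution.analyzed.output.map(_.name)
}
```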
We have a lot of examples of handling column overrides in the Eval command. @andy-k-improving, could you refer to them? See https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/ppl-eval-command.md#example-2-override-the-existing-field
@LantaoJin Would you mind pointing me to the specific doc for this purpose? I have looked at the implementation and the doc for `eval`, but all the examples seem to rely on the user supplying the names of the columns to override.
However, in this particular case (AppendCol), I need to first figure out the exact fields that the sub-query will project (including resolving `select *` to actual column names), and only then can I take that list and apply a `dropDFColumn()` on the main query.
> However in this particular case (AppendCol), I need to first figure out the exact fields which sub-query will project (Including resolve `select *` to actual columns name).

I see, but IMO we don't need to figure out all column names (`overridedList`) as long as we restrict the `select *` case:

- For cases with fields or aliases, use the fields or aliases directly. For example:
  `source=employees | FIELDS name, dept, age | APPENDCOL OVERRIDE=true [ ... | FIELDS foo, bar, age ];`
  adds `foo`, `bar`, `age` to `fieldsToRemove`.
  `source=employees | FIELDS name, dept, age | APPENDCOL OVERRIDE=true [ stats avg(age) as age, count(name) as cnt ];`
  adds `age`, `cnt` to `fieldsToRemove`. (Note: it is harmless to add a non-conflicting field to `DataFrameDropColumns`.)
- For cases with `allFields`, throw an exception `APPENDCOL should specify the output fields`:
  `source=employees | FIELDS name, dept, age | APPENDCOL OVERRIDE=true [ where age > 0 ];`
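As a side note, here is a minimal DataFrame-level sketch of the drop-then-append idea under discussion, assuming the sub-search's projected names have already been collected into a hypothetical `fieldsToRemove` list (the plan-level equivalent would be a `DataFrameDropColumns` node):

```scala
import org.apache.spark.sql.DataFrame

// Dropping a column that does not exist on the main side is a no-op in the
// DataFrame API, which is why adding a non-conflicting field is harmless.
def dropOverridden(main: DataFrame, fieldsToRemove: Seq[String]): DataFrame =
  main.drop(fieldsToRemove: _*)
```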
Two high-level questions:

The sub-search syntax is defined in opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4, lines 27 to 29 in 957de4e. This PR doesn't seem to follow that sub-search syntax. I prefer to follow the current sub-search syntax, in case we want to combine columns from different tables, for example:

`source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]`

instead of the sub-search form used in this PR. But if this is intentional (appendcol only works for one table), I am okay with the current syntax.

PS: what is the expected result of the query …
```java
LogicalPlan joinedQuery = join(
    mainSearchWithRowNumber, subSearchWithRowNumber,
    Join.JoinType.LEFT,
    Optional.of(new EqualTo(t1Attr, t2Attr)),
    new Join.JoinHint());
```
Now I know why you got NULL in the example. I think this is not what we expected. How about using `inner` and `cross` joins together? For example:

- With the same group-by key, convert it to a join key and use an inner join:
  `source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]`
- Without a group-by key, use a cross join:
  `source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age ]`

The implementation is equivalent to:

```scala
import org.apache.spark.sql.DataFrame

def appendCols(mainSearchDF: DataFrame, subSearchDF: DataFrame, joinKey: Option[String] = None): DataFrame = {
  joinKey match {
    case Some(key) =>
      // If a join key is provided, perform an inner join on that key
      mainSearchDF.join(subSearchDF, Seq(key), "inner")
    case None =>
      // If no join key is provided, assume a global aggregation and use a cross join
      mainSearchDF.crossJoin(subSearchDF)
  }
}
```
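For illustration, a hypothetical invocation of the sketch above; the `employees` DataFrame and its column names are assumptions, not part of this PR:

```scala
import org.apache.spark.sql.functions.{avg, count, lit, sum}

// Shared group-by key ("dept"): the inner join pairs each aggregate with its group.
val totalByDept  = employees.groupBy("dept").agg(sum("salary").as("total_salary"))
val avgAgeByDept = employees.groupBy("dept").agg(avg("age").as("avg_age"))
appendCols(totalByDept, avgAgeByDept, Some("dept"))

// Global aggregate (single row): the cross join repeats it onto every main row.
appendCols(totalByDept, employees.agg(count(lit(1)).as("cnt")), None)
```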
@LantaoJin thanks for the review and feedback! AFAIK `appendcol` is intended for a single index.
@YANG-DB, @LantaoJin this comment seems to contradict what we discussed on #956 and the Splunk doc, so would it be possible to clarify?

Join condition

Referring to the prior discussion on the GitHub issue and the Splunk doc, I don't think that is the case: the user won't be asked to enter either the join column or the join condition, so it's up to the code logic to generate the natural order for the DataFrame and use it to join. Can I confirm this? I'm asking because the comment above seems to indicate the user will provide the name of a column for the join (whether inner or cross), but that is not the case for the method signature or grammar of `appendCol()`.

Expected result

According to the Splunk doc:

> Appends the fields of the [subsearch](https://docs.splunk.com/Splexicon:Subsearch) results with the input search results. All fields of the subsearch are combined into the current results, with the exception of [internal fields](https://docs.splunk.com/Splexicon:Internalfield). For example, the first subsearch result is merged with the first main result, the second subsearch result is merged with the second main result, and so on.

Both the main and the sub search are simply joined row by row from top to bottom, 1:1 without an explicit join condition; when the row counts of the two DataFrames differ, `null` is used to fill the gap. This seems to align with what is described on the GitHub issue:

> Behavior
> The new column(s) would be aligned with the rows of the original dataset based on their order of appearance.
> Each appended column must produce the same number of rows as the base dataset to ensure proper alignment.
> Any discrepancies in row counts could result in null values for mismatched rows.

Hence I don't think either a cross or an inner join will behave the same way, as rows would be truncated in both scenarios. And I reckon this is what makes `appendCol` a standalone implementation; otherwise, couldn't the user simply achieve this with the existing `subQuery` or `LookUp` commands?
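To make the row-by-row semantics concrete, here is a minimal sketch of the natural-order append, assuming hypothetical main/sub DataFrames; it mirrors the LEFT join on a generated `_row_number_` column shown in the snippet above, and is illustrative rather than the PR's actual code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Number both sides by their natural order, then left-join on _row_number_:
// row 1 pairs with row 1, row 2 with row 2, and missing sub rows become null.
// (Ordering a window by monotonically_increasing_id() keeps arrival order but
// funnels data through a single partition, so this is a sketch only.)
def appendColByRowNumber(main: DataFrame, sub: DataFrame): DataFrame = {
  val w = Window.orderBy(monotonically_increasing_id())
  val left  = main.withColumn("_row_number_", row_number().over(w))
  val right = sub.withColumn("_row_number_", row_number().over(w))
  left.join(right, Seq("_row_number_"), "left").drop("_row_number_")
}
```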
Let me clarify it.

> I'm asking this because the above comment seems to indicate the user will provide the name of a column for the join (inner or cross), but that is not the case for the method signature or grammar of appendCol().

No, my suggestion doesn't ask the user to provide a column name for the join. My point is that the result of the current implementation seems incorrect, IMO.

The query

`source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count ]`

should produce

```
name  dept        salary event_count
Lisa  Sales       10000  9
Fred  Engineering 21000  9
Paul  Engineering 29000  9
Evan  Sales       32000  9
Chloe Engineering 23000  9
Tom   Engineering 23000  9
Alex  Sales       30000  9
Jane  Marketing   29000  9
Jeff  Marketing   35000  9
```

rather than

```
name  dept        salary event_count
Lisa  Sales       10000  9
Fred  Engineering 21000  NULL
Paul  Engineering 29000  NULL
Evan  Sales       32000  NULL
Chloe Engineering 23000  NULL
Tom   Engineering 23000  NULL
Alex  Sales       30000  NULL
Jane  Marketing   29000  NULL
Jeff  Marketing   35000  NULL
```

Similarly, here are two examples.

Q1: output row counts are unmatched:

`source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats count() as cnt ]`

should produce

```
dept        total_salary cnt
Sales       72000        9
Engineering 96000        9
Marketing   64000        9
```

rather than

```
dept        total_salary cnt
Sales       72000        9
Engineering 96000        NULL
Marketing   64000        NULL
```

Q2: output row counts are matched:

`source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats count() as cnt by cnt_dept ]`

should produce

```
dept        total_salary cnt_dept
Sales       72000        3
Engineering 96000        4
Marketing   64000        2
```

What are the outputs of the above queries in your implementation?

For Q1, my suggestion is to implement it with a crossJoin (no join key required).
For Q2, my suggestion is to implement it with an innerJoin, using the group-by keys as join keys.
Yep, the sub-search under …
Description

Introduce the new PPL command `appendCol`, which aims to aggregate results from multiple searches into a single comprehensive table for the user to view. This is accomplished by reading both the main-search and the sub-search as plan nodes, transforming them into SQL with an added `_row_number_` column for the dataset's natural order, then joining the main and sub searches on the `_row_number_` column.

Related Issues

Resolves: #956
Check List

- Commits are signed per the DCO using `--signoff`

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Test plan: