[VL] Result mismatch in CollectList when 'sort by' clause is involved #8227
Comments
Thank you for reporting. After going through the issue and the relevant code, I am exploring whether we could rework Velox's CollectListAggregate / CollectSetAggregate so that they map onto a Spark-side TypedImperativeAggregate.
Thank you for your idea. I'm trying to implement this solution, but I am struggling to design the new aggregate: the intermediate data type of Velox's collect functions is not a binary buffer, so it does not line up with the intermediate representation of Spark's TypedImperativeAggregate.
@NEUpanning Thanks for the explanation, very helpful here.
That's right. I think a complete solution should involve changes to Velox to make sure Velox's collect functions use a binary intermediate buffer, which is reasonable given that Velox could then align its Spark functions more precisely with vanilla Spark. Perhaps using a binary buffer is faster as well, at least in the fallback cases.
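(For context, a minimal standalone sketch of the typed-imperative shape under discussion; the class and method names mirror Spark's TypedImperativeAggregate contract but this is a conceptual illustration, not Spark's or Gluten's actual code. Per-group state lives in memory, and the intermediate value the engine exchanges is just a byte array; Java serialization below stands in for whatever binary row SerDe both sides would agree on.)

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Conceptual sketch of a collect_list-style aggregate whose intermediate
// state is exchanged as a binary buffer. Not actual Spark/Gluten code.
final class CollectListSketch {
  // Per-group in-memory state.
  def createAggregationBuffer(): ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Append each input value in arrival order.
  def update(buffer: ArrayBuffer[String], value: String): ArrayBuffer[String] =
    buffer += value

  // Combine two partial states (e.g. after partial aggregation).
  def merge(buffer: ArrayBuffer[String], other: ArrayBuffer[String]): ArrayBuffer[String] =
    buffer ++= other

  // The engine only ever sees Array[Byte] as the intermediate type.
  // Java serialization is a stand-in for a real binary row SerDe.
  def serialize(buffer: ArrayBuffer[String]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(buffer.toList)
    oos.close()
    bos.toByteArray
  }

  def deserialize(bytes: Array[Byte]): ArrayBuffer[String] = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val list = ois.readObject().asInstanceOf[List[String]]
    ArrayBuffer(list: _*)
  }
}
```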
@zhztheplayer
This solution sounds good to me; correct me if I am wrong about what we need to do. Do we need to open an issue in Velox?
And could you provide more details about this? Thanks!
Looking good to me, thanks for summarizing. BTW, I wanted to hear your thoughts here: I was thinking we may still need typed-imperative versions of these collect functions. See the code at lines 52 to 61 in f7f801a: VeloxCollectSet is actually doing distinct on a large array buffer when it's falling back, which wastes memory if there are a lot of duplicated input records.
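(A rough sketch of the memory argument above, using hypothetical names rather than Gluten's real classes: a typed-imperative collect_set can drop duplicates as rows arrive, so its buffer is bounded by the number of distinct values, while an append-only array buffer keeps every duplicate until distinct runs at the end.)

```scala
import scala.collection.mutable

// Illustrative comparison of the two buffering strategies described above.
// Names are hypothetical; this is not Gluten's VeloxCollectSet implementation.
object CollectSetBufferSketch {
  // Typed-imperative style: duplicates never enter the buffer,
  // so memory is bounded by the number of distinct values per group.
  def updateIncremental(buffer: mutable.HashSet[String], value: String): Unit =
    buffer += value

  // Append-only style (the fallback path described above): every input row
  // is retained, and distinct is applied only when producing the result.
  def updateAppendOnly(buffer: mutable.ArrayBuffer[String], value: String): Unit =
    buffer += value

  def evalAppendOnly(buffer: mutable.ArrayBuffer[String]): Seq[String] =
    buffer.distinct.toSeq
}
```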
Currently, Spark's collect_list/collect_set serializes its intermediate buffer to a binary, UnsafeRow-based format.
Aha. Sounds good to me if it's compatible with the Velox row SerDe. Let's remove the collect rewrite rules then. Glad to see we can simplify our code along the way, thanks!
Backend
VL (Velox)
Bug description
Reproducing SQL:
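(A sketch of the kind of query described, consistent with the results below; the table t, the columns id and value, and the SparkSession variable spark are assumptions, not the issue's original snippet.)

```scala
// Hypothetical reproducer sketch: collect_list over a sub-query that uses SORT BY.
// Assumes a table t(id, value) exists and `spark` is a SparkSession in scope.
val df = spark.sql(
  """
    |SELECT id, collect_list(value) AS values_list
    |FROM (SELECT id, value FROM t SORT BY id, value) AS sorted_input
    |GROUP BY id
  """.stripMargin)
df.show(truncate = false)
```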
Results:
The vanilla Spark result is deterministic and values_list is sorted by the value column:
id values_list
1 ["a", "b", "c"]
2 ["d", "e", "f"]
3 ["g", "h", "i"]
The Gluten result is non-deterministic and values_list is not sorted, e.g.:
id values_list
1 ["a", "c", "b"]
3 ["g", "i", "h"]
2 ["f", "e", "d"]
Gluten physical plan:
Vanilla Spark physical plan:
Root cause
CollectList, which is a TypedImperativeAggregate function, is replaced in the logical optimization phase by VeloxCollectList, which is a DeclarativeAggregate. Therefore, Gluten uses SortAggregateExec for VeloxCollectList instead of ObjectHashAggregateExec. The SortOrder of the SortExec required by that aggregate, which corresponds to SortExecTransformer [(1 - id#0)#15 ASC NULLS FIRST] in the Gluten physical plan, differs from that of the SortExec added by the 'sort by' clause, which corresponds to SortExecTransformer [id#0 ASC NULLS FIRST, value#1 ASC NULLS FIRST]. As a result, the output is mismatched with vanilla Spark.
Spark version
None
Gluten version
1.2.0