The performance of using the ColumnarSort operator to sort string type is significantly lower than that of native Spark SortExec #999
@ziyangRen
@zhouyuan I'm not sure if this is what you need. If you need any more information, please feel free to let me know.
@zhouyuan I noticed that we use radix sort to sort int-type fields, std::sort to sort string-type fields, and Timsort to sort multiple keys. Native Spark uses Timsort for string types. Why don't we use Timsort to sort string fields? To improve performance, can we modify the source code to use Timsort for strings? What other problems would this bring?
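(As a rough sketch of the selection logic described above — hypothetical names and enums, not NSE's actual dispatch code — the question is about changing the single-string-key branch:)

```cpp
#include <cstdio>

// Hypothetical sketch of the sorter selection being discussed; the real
// NSE dispatch code differs. "Use Timsort for strings" would change the
// single-string-key branch from StdSort to Timsort.
enum class KeyType { Int, String };
enum class SortKind { Radix, StdSort, Timsort };

SortKind PickSorter(int num_keys, KeyType key_type) {
  if (num_keys > 1) return SortKind::Timsort;            // multiple keys: Timsort
  if (key_type == KeyType::Int) return SortKind::Radix;  // single int key: radix sort
  return SortKind::StdSort;                              // single string key: std::sort today
}

int main() {
  // A single string key currently routes to std::sort in this sketch.
  std::printf("%d\n", static_cast<int>(PickSorter(1, KeyType::String)));
  return 0;
}
```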
Yes, we should be able to use Timsort for sorting a single STRING key — somehow this was missed in this implementation. It looks like your use case is sorting 3 keys(?), which should already use Timsort. I have some ideas on reducing the overhead of std::string by switching to string_view in #1009 — please take a look and see if this can help improve the performance.
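(A minimal standalone illustration of the string_view idea — assuming C++17 and toy data, not the actual change in #1009:)

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

int main() {
  std::vector<std::string> data = {"banana", "apple", "cherry"};

  // Build lightweight views over the existing buffers: a string_view is
  // just a pointer + length, so no character data is copied here.
  std::vector<std::string_view> keys(data.begin(), data.end());

  // Sorting views: comparisons still read the characters, but swaps move
  // small pointer/length pairs instead of whole std::string objects
  // (which may copy or move heap buffers / SSO storage).
  std::sort(keys.begin(), keys.end());

  return 0;
}
```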
@zhouyuan Thank you for your reply. We will verify again tomorrow and report our conclusion as soon as possible.
@zhouyuan Sorry for my late reply. We have done performance verification for several scenarios and modified the single-key logic so that a single key can also use Timsort. The following is the verification conclusion:
@ziyangRen Thanks for reporting back. The performance looks promising - will try to merge these changes soon. |
Describe the bug
The performance of using the ColumnarSort operator to sort strings is significantly lower than that of native Spark SortExec.
Here are my test results (note: I changed how the Spark and NSE execution times are calculated, using nanoseconds as the unit, so the execution times look very long, but I guarantee the measured times are accurate — please ignore this).
The following is the operator execution time in the DAG graph:
NSE Int:
Spark Int:
NSE String:
Spark String:
Analyzing the NSE code, we find that the sort time is divided into the execution time and the finishByIterator time. After measuring each separately, we find that the string sort time itself is not significantly worse than int's; the performance bottleneck seems to come from the finishByIterator method. This may help you locate the problem.
Here is how I measured the time spent in finishByIterator:
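(The original snippet was attached as a screenshot and is not reproduced here; below is only a minimal sketch of timing a call in nanoseconds with std::chrono, using a hypothetical Finish() stand-in for the native work behind finishByIterator:)

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the native work behind finishByIterator;
// the real NSE entry point differs.
void Finish() { /* sorted results materialized here */ }

int main() {
  auto start = std::chrono::steady_clock::now();
  Finish();
  auto end = std::chrono::steady_clock::now();

  // Report in nanoseconds, matching the units used in the test results above.
  auto ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  std::printf("finishByIterator took %lld ns\n", static_cast<long long>(ns));
  return 0;
}
```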
And the result is:
To Reproduce
And here is my SQL (the left join should trigger a sort-merge join, which exercises the sort operator):
spark.sql("select t1.* from test.test_orc t1 left join test.test_orc1 t2 on t1.id=t2.id;").show