Skip to content
This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

Closed
ziyangRen opened this issue Jun 29, 2022 · 7 comments · Fixed by #1017
Labels
bug Something isn't working

Comments

@ziyangRen
Copy link

Describe the bug
The performance of using columnarsort operator to sort strings is significantly lower than that of native spark sortexec

Here are my test results(note:I changed the calculation method of spark and NSE execution time, using NS as the unit, so it seems that the execution time is very long, but I guarantee that the statistical time is accurate. Please ignore this problem)
image

The following is the operator execution time in DAG graph:
NSE Int::
image

Spark Int:
image

NSE String:
image

spark String:
image

Analyzing the NSE code, we find that the sort time is divided into the execution time and the finishByIterator time. After respective statistics, we find that the string sort time is not significantly improved compared with int. the performance bottleneck seems to come from the finishByIterator method. This may help you locate the problem.
here is my way to get the time usinging in finishByIterator:

val beforeSort = System.nanoTime()
  if (processedNumRows > 0) {
       sort_iterator = sorter.finishByIterator();
   }
sort_finish_elapse += System.nanoTime() - beforeSort

And the result is:
image

To Reproduce
and here is my sql:
spark.sql("select t1.* from test.test_orc t1 left join test.test_orc1 t2 on t1.id=t2.id;").show

@ziyangRen ziyangRen added the bug Something isn't working label Jun 29, 2022
@ziyangRen ziyangRen changed the title The performance of using columnarsort operator to sort strings is significantly lower than that of native spark sortexec The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec Jun 29, 2022
@zhouyuan
Copy link
Collaborator

@ziyangRen
thanks for reporting, could you also share the detail schema for for sort operator? I assume there are several extra fields as payloads in this sort, are those all STRING based?

@ziyangRen
Copy link
Author

@zhouyuan
Thank you for your reply. I will provide some details below:
here is my table schema:
image
In order to eliminate the influence of fields, we made a control group to sort the int type, string type and int, string mixed type respectively,then we get the following results:
image
here is the detail schema for sort operator and the corresponding query statement:
int type:

select t1.*
from test.test_orc_sort2 t1 
left join test.test_orc_sort3 t2 
on t1.id=t2.id and t1.id1=t2.id1 and t1.id2=t2.id2;

image
string type

select t1.*
from test.test_orc_sort2 t1 
left join test.test_orc_sort3 t2 
on t1.name=t2.name and t1.name1=t2.name1 and t1.name2=t2.name2;

image

I'm not sure if this is what you need. If you need any more information, please feel free to let me know

@ziyangRen
Copy link
Author

@zhouyuan I noticed that we use radix sort to sort int type fields and std:: sort to sort string type fields. Use Tim sort to sort multiple keys. Spark native uses Tim sort for string types. Why don't we use Tim sort to sort string fields? In order to improve performance, can we modify the source code and use Tim sort to sort strings? What other problems will this bring?

@zhouyuan
Copy link
Collaborator

zhouyuan commented Jul 4, 2022

@ziyangRen

Yes, should be able to use Timsort for sort one STRING key - somehow this is missed in this implementation.

it looks like your use case is sorting 3 keys(?) which should use Timsort already. I have some idea on reducing the overhead of std::string by switching to string_view in #1009 - please take a look if this can help improve the performance

@ziyangRen
Copy link
Author

@zhouyuan Thank you for your reply. We will verify again tomorrow and give a conclusion as soon as possible

@ziyangRen
Copy link
Author

@zhouyuan Sorry for my late reply. We have done performance verification for several scenarios and modified the logic of single key so that single key can also use timport. The following is the verification conclusion:
image


@zhouyuan
Copy link
Collaborator

@ziyangRen Thanks for reporting back. The performance looks promising - will try to merge these changes soon.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants