The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

ziyangRen · 2022-06-29T06:45:04Z

Describe the bug
The performance of using columnarsort operator to sort strings is significantly lower than that of native spark sortexec

Here are my test results（note:I changed the calculation method of spark and NSE execution time, using NS as the unit, so it seems that the execution time is very long, but I guarantee that the statistical time is accurate. Please ignore this problem）

The following is the operator execution time in DAG graph：
NSE Int::

Spark Int:

NSE String:

spark String:

Analyzing the NSE code, we find that the sort time is divided into the execution time and the finishByIterator time. After respective statistics, we find that the string sort time is not significantly improved compared with int. the performance bottleneck seems to come from the finishByIterator method. This may help you locate the problem.
here is my way to get the time usinging in finishByIterator:

val beforeSort = System.nanoTime()
  if (processedNumRows > 0) {
       sort_iterator = sorter.finishByIterator();
   }
sort_finish_elapse += System.nanoTime() - beforeSort

And the result is:

To Reproduce
and here is my sql:
spark.sql("select t1.* from test.test_orc t1 left join test.test_orc1 t2 on t1.id=t2.id;").show

The text was updated successfully, but these errors were encountered:

zhouyuan · 2022-06-30T02:02:06Z

@ziyangRen
thanks for reporting, could you also share the detail schema for for sort operator? I assume there are several extra fields as payloads in this sort, are those all STRING based?

ziyangRen · 2022-06-30T08:34:36Z

@zhouyuan
Thank you for your reply. I will provide some details below：
here is my table schema:

In order to eliminate the influence of fields, we made a control group to sort the int type, string type and int, string mixed type respectively，then we get the following results:

here is the detail schema for sort operator and the corresponding query statement:
int type:

select t1.*
from test.test_orc_sort2 t1 
left join test.test_orc_sort3 t2 
on t1.id=t2.id and t1.id1=t2.id1 and t1.id2=t2.id2;

string type

select t1.*
from test.test_orc_sort2 t1 
left join test.test_orc_sort3 t2 
on t1.name=t2.name and t1.name1=t2.name1 and t1.name2=t2.name2;

I'm not sure if this is what you need. If you need any more information, please feel free to let me know

ziyangRen · 2022-07-04T13:05:45Z

@zhouyuan I noticed that we use radix sort to sort int type fields and std:: sort to sort string type fields. Use Tim sort to sort multiple keys. Spark native uses Tim sort for string types. Why don't we use Tim sort to sort string fields? In order to improve performance, can we modify the source code and use Tim sort to sort strings? What other problems will this bring?

zhouyuan · 2022-07-04T13:53:09Z

@ziyangRen

Yes, should be able to use Timsort for sort one STRING key - somehow this is missed in this implementation.

it looks like your use case is sorting 3 keys(?) which should use Timsort already. I have some idea on reducing the overhead of std::string by switching to string_view in #1009 - please take a look if this can help improve the performance

ziyangRen · 2022-07-04T14:34:04Z

@zhouyuan Thank you for your reply. We will verify again tomorrow and give a conclusion as soon as possible

ziyangRen · 2022-07-12T07:56:59Z

@zhouyuan Sorry for my late reply. We have done performance verification for several scenarios and modified the logic of single key so that single key can also use timport. The following is the verification conclusion:

zhouyuan · 2022-07-13T14:56:02Z

@ziyangRen Thanks for reporting back. The performance looks promising - will try to merge these changes soon.

ziyangRen added the bug Something isn't working label Jun 29, 2022

ziyangRen changed the title ~~The performance of using columnarsort operator to sort strings is significantly lower than that of native spark sortexec~~ The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec Jun 29, 2022

github-actions bot mentioned this issue Jul 4, 2022

[NSE-999] s/string/string_view in sort #1009

Merged

github-actions bot mentioned this issue Jul 15, 2022

[NSE-999] use TimSort for STRING/DECIMAL onekey based sorting #1017

Merged

zhouyuan closed this as completed in #1017 Jul 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

ziyangRen commented Jun 29, 2022

zhouyuan commented Jun 30, 2022

ziyangRen commented Jun 30, 2022

ziyangRen commented Jul 4, 2022

zhouyuan commented Jul 4, 2022

ziyangRen commented Jul 4, 2022

ziyangRen commented Jul 12, 2022

zhouyuan commented Jul 13, 2022

The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec #999

Comments

ziyangRen commented Jun 29, 2022

zhouyuan commented Jun 30, 2022

ziyangRen commented Jun 30, 2022

ziyangRen commented Jul 4, 2022

zhouyuan commented Jul 4, 2022

ziyangRen commented Jul 4, 2022

ziyangRen commented Jul 12, 2022

zhouyuan commented Jul 13, 2022