-
Notifications
You must be signed in to change notification settings - Fork 331
Performance optimization skill:ordered grouping
Generally, the hash method is adopted in grouping operation, i.e., calculate the hash value of the grouping field first, save the records with the same hash value in a small set, then traverse the records in the small set to find those with the same grouping field value and aggregate them into groups. The complexity (number of comparisons) of the grouping depends on the duplicate value rate of the hash function. More specifically, when the hash space is small, the duplicate value rate and the number of comparisons will be high, thus degrading the performance greatly. So to avoid this situation, we need to allocate more memory to store the hash table. Also, the hash calculation will become slow in terms of certain data types (like long strings), which affects the performance badly.
If the grouping fields are ordered, each record only needs to be compared with the previous record when grouping. If the grouping field of the current record is different from that of the previous one, a new group will be created; if the grouping fields are the same, the record will be aggregated into the current group. The complexity of such grouping operation is n (the length of the set to be grouped) without the problem of hash calculation and duplicate value rate, which executes faster than the hash grouping and does not need much memory to store the hash table.
SPL provides such a grouping method, so let's test it with an example and compare it with the hash grouping algorithm in Oracle.
The test server has two intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory, and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.
On the virtual machine, we create the data table orderdetail_1, which consists of three fields, orderid (integers), detailid (integers) and amount (real numbers), respectively. The first two fields are primary keys, generating 80 million rows of records. We import the data table into the Oracle database and use it to generate the esProc SPL composite table for testing.
The data is sorted by orderid field in ascending order and grouped into 50 groups in total by orderid. The test goal is to query the total amount and the number of details of each order.
SQL query statement:
select /*+ parallel(n)*/
orderid, sum(amount) as amount, count(detailid) as details
from orderdetail_1
group by orderid;
/+ parallel(n)/ is used for parallel test and n is the number of parallel.
SPL script:
A | |
---|---|
1 | =now() |
2 | =file("/home/ctx/orderdetail_1.ctx").open().cursor@m(orderid,detailid,amount;;1) |
3 | =A2.groups@o(orderid;sum(amount):amount,count(detailid):details) |
4 | =interval@s(A1,now()) |
The groups function with the @o option is used to only compare with the values of adjacent rows when the grouping field is ordered.
The testing results are as follows (unit: seconds):
Number of parallel | 1 | 2 | 4 | 8 | 16 |
---|---|---|---|---|---|
Oracle | 24 | 19 | 16 | 13 | 13 |
SPL | 11 | 6 | 3 | 2 | 1 |
In the case of 80 million rows of data, the performance of SPL ordered grouping is doubled, and the effect of parallel processing is also very excellent with a linear increase in performance. However, the parallel optimization effect on the hash grouping in Oracle is not obvious.
The degree of performance improvement is related to the amount of data. When the data amount is very small, the time proportion of grouping to the whole query is very small, so the overall performance improvement is not obvious. However, with the increase of data volume, the improvement will become more and more significant.
Now let's take a look at the big data testing.
On the virtual machine, we create the data table orderdetail_2, which consists of three fields, orderid (strings), detailid (integers) and amount (real numbers). The first two fields are primary keys, generating 2.4 billion rows of records. We import the data table into the Oracle database, and use it to generate the esProc SPL composite table for testing.
The data is sorted by orderid in ascending order and grouped into 800 million groups in total by orderid. The test goal is to count the total amount and the number of details of each order. Because it takes Oracle a lot of time to output the big result set, the grouping result needs to be filtered again and only the orders whose total amount is less than 35 yuan will be output. And there are only 12 results left, so the output will hardly take up any time.
SQL query statement:
select * from (
select /*+ parallel(n)*/
orderid, sum(amount) sum_amount, count(detailid) as details
from orderdetail_2
group by orderid
)
where sum_amount<35;
/+ parallel(n)/ is used for parallel test and n is the number of parallel .
Write the SPL script as follows:
A | |
---|---|
1 | =now() |
2 | =file("/home/ctx/orderdetail_2.ctx").open().cursor@m(orderid,detailid,amount;;1) |
3 | =A2.group(orderid;sum(amount):amount,count(detailid):details).select(amount<35).fetch() |
4 | =interval@s(A1,now()) |
Because the result set of grouping is too large to be loaded in memory, the group function is used to group the data in order, return the cursor corresponding to the grouping result set, and then filter the cursor to get the required query result.
The test results are as follows (unit: seconds):
Number of parallel | 1 | 2 | 4 | 8 | 16 |
---|---|---|---|---|---|
Oracle | 2647 | 1345 | 1092 | 806 | 737 |
SPL | 451 | 235 | 119 | 65 | 48 |
In the case of none parallel processing, the performance of SPL ordered grouping is nearly six times higher than that of Oracle. Because SPL ordered grouping is well suited for parallel, the performance improvement gets more effective with the increase of the parallel number.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code