Skip to content

“Top N” Queries on Structured Data

esProcSPL edited this page Aug 12, 2024 · 1 revision

A “top N” query gets the first or bottom N values, the records, or their positions, from a data set. Sometimes we use such a query after a grouping operation to get values, specific information, or positions from a subset. This article will tell you how to handle different types of “top N” scenarios through examples and offers sample scripts in esProc SPL. Looking ${article} for details.

1. The maximum/minimum

Getting the maximum or minimum value can be regarded as a special top N scenario where N is 1. It’s so common that I single it out for special illustration. There are many such computing tasks, such as getting the highest math score in class one, finding the age of the youngest employee, and so on. On certain occasions, instead of the specific values we are more concerned with their positions. Often this happens in inter-row calculations. One example is to find how much the sales amount in the company’s best month goes up compared with the previous month. To get this done we need to first get the sequence number of the record holding the month with the highest sales amount, then the sales amount in the previous month, and compare them. Other times we want to get certain information in the record containing the maximum or minimum value. Examples include getting the name of the student who has the highest math score in class one, finding which department the youngest employee belongs to in a company, etc. Now we’ll look at how to deal with the three scenarios of “getting maximum/minimum”. The following table records information of NASDAQ Composite:

Date Open Close Volume
2019/01/02 6506.910156 6665.939941 2261800000
2019/01/03 6584.77002 6463.5 2607290000
2019/01/04 6567.140137 6738.859863 2579550000
2019/01/07 6757.529785 6823.470215 2507550000
2019/01/08 6893.439941 6897.0 2380290000

1.1 Get the maximum/minimum value

【Example 1】 Get the highest closing price in the NASDAQ in the year of 2019. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019) / Get data of 2019
3 =A2.max(Close) / A.max() function gets the highest closing price

To get the lowest closing price in the NASDAQ in the year of 2019:

A B
3 =A2.min(Close) / A.min()function gets the lowest closing price

1.2 Get the sequence number(s) of record(s) holding the maximum/minimum value

【Example 2】 Calculate the growth rate of the trading date with the highest NASDAQ closing price in 2019 compared with the closing price in the previous day. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019).sort(Date) / Get data of 2019 and sort it by Date
3 =A2.pmax(Close) / A.pmax() gets the sequence number of the record holding the highest closing price
4 =A2.calc(A3,Close/Close[-1]-1) / Calculate the growth rate by dividing the highest closing price by the closing price of the previous day

There could be more than one record having the maximum value. To return the sequence numbers of all eligible records, you can use @a option in A.pmax() function:

A B
3 =A2.pmax@a(Close) / Get sequence numbers of all records holding the highest closing price

You can use @z option in A.pmax() function to locate the record(s) from backwards to forwards:

A B
3 =A2.pmax@z(Close) / Get the sequence number of the record holding the highest closing price

1.3 Get the record(s) holding the maximum/minimum value

【Example 3】 Get the date in the record holding the highest NASDAQ closing price in the year of 2019. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019) / Get data of 2019
3 =A2.maxp(Close) / A.maxp() gets the record having the highest closing price
4 =A3.Date / Get the date from A3’s record

A.minp() function is used to get the record(s) containing the minimum value:

A B
3 =A2.minp(Close) / A.minp() gets the record having the lowest closing price

Both A.maxp() and A.minp() functions can work with @a option and @z option to achieve specific computing goals. Here we won’t give examples.

2. The top/bottom N

There are same three scenarios of getting the top/bottom N. Use the same NASDAQ composite data to illustrate them:

Date Open Close Volume
2019/01/02 6506.910156 6665.939941 2261800000
2019/01/03 6584.77002 6463.5 2607290000
2019/01/04 6567.140137 6738.859863 2579550000
2019/01/07 6757.529785 6823.470215 2507550000
2019/01/08 6893.439941 6897.0 2380290000

2.1 Get the top/bottom N values

【Example 4】 Get the top 3 NASDAQ volumes in the year of 2019. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019) / Get data of 2019
3 =A2.top(-3, Volume) / A.top(n,x) function gets the top 3 volumes

To get the bottom 4 NASDAQ volumes in the year of 2019:

A B
3 =A2.top(4, Volume) / A.top(n,x) function gets the bottom 4 volumes

2.2 Get the sequence numbers of the top/bottom N values

【Example 5】 Calculate the growth rate of each of the trading dates having the 3 highest NASDAQ closing prices in 2019 compared with their closing price in the prior day. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019).sort(Date) / Get data of 2019 and sort it by Date
3 =A2.ptop(-3, Close) /A.ptop(n,x) function gets the sequence numbers of the highest 3 closing prices
4 =A3.run(=A2().Volume/A2(~-1).Volume-1) / Calculate the growth rate circularly by dividing the current volume by volume of the prior day

2.3 Get the records holding the top/bottom N values

【Example 6】 Get the records of trading dates with 5 lowest volumes in NASDAQ 2019. 【SPL script】

A B
1 =file("IXIC.txt").import@t() / Import IXIC data
2 =A1.select(year(Date)==2019).sort(Date) / Get data of 2019
3 =A2.top(5; Close) / A.top(n; x) function gets the records of 5 trading dates with the lowest volumes

3. Top N queries on post-grouping subsets

Getting the maximum/minimum value and the top/bottom N values from each subset after a data set is grouped are also common computing goals, like finding the 5 top-selling items for each month and getting the customers whose total order amounts rank in top 3 for each year. Now let’s look at how to handle top N queries on subsets.

3.1 Get the maximum value in each group

【Example 7】 Based on the score table, get the highest math score in each class. Below is part of the source table:

CLASS STUDENTID SUBJECT SCORE
1 1 English 95
1 1 Math 90
1 1 PE 80
1 2 English 75
1 2 Math 84

【SPL script】

A B
1 =file("Score.txt").import@t() / Import Score table
2 =A1.select(Subject:"Math") / Get records of math subject
3 =A2.groups(Class; max(Score):BestScore) / Group the selected records by classes and use max() function to get the highest math score in each class

3.2 Get top N values/records in each group

A top N query can be regarded as a kind of aggregate operation on each subset after a data set is grouped. Here we look at how to get top N values and records respectively. 【Example 8】 Query the highest two math scores in each class. Below is part of the score table:

CLASS STUDENTID SUBJECT SCORE
1 1 English 95
1 1 Math 90
1 1 PE 80
1 2 English 75
1 2 Math 84

【SPL script】

A B
1 =file("Score.txt").import@t() / Import Score table
2 =A1.select(Subject:"Math") / Get records of math subject
3 =A2.group(Class; ~.top(-2, Score):top2) / Group the selected records by classes and use A.top() function to get the highest two math scores in each class
4 =A3.new(Class, top2(1):First, top2(2):Second) / Create the result set where column 1 contains classes, column 2 holds the highest score and column 3 has the second highest score

【Example 9】 For each subject in each class, query the information of students whose scores rank in top 3. Below is part of the score table:

CLASS STUDENTID SUBJECT SCORE
1 1 English 95
1 1 Math 90
1 1 PE 80
1 2 English 75
1 2 Math 84

【SPL script】

A B
1 =file("Score.txt").import@t() / Import Score table
2 =A1.group(Class,Subject;~.top(-3;Score):top3) / Group records by classes and subjects and get the highest two scores in each group
3 =A2.conj(top3) / Concatenate records with the highest 2 scores from all groups

3.3 Get top N values/records cumulatively

etting top N values or records in a cumulative way won’t generate the subsets after the data set is grouped. This method is used to handle scenarios when data volume is huge. There are same two scenarios here – top N values and top N records. 【Example 10】 Get the hire dates of the two employees in each department who have the longest hire durations. Below is part of the employee table:

EID NAME DEPT EntryDate
1 Rebecca R&D 2005/03/11
2 Ashley Finance 2008/03/16
3 Rachel Sales 2010/12/01
4 Emily HR 2006/08/15
5 Ryan R&D 2004/07/30

【SPL script】

A B
1 =file("Employee.txt").cursor@t() / Create cursor of the *Employee *table
2 =A1.groups(Department; top(2,EntryDate):Top2) / Group the cursor by departments and get the earliest two hire dates in each group
3 =A2.news(Top2;Department, ~:EntryDate) / Create a new table where column 1 is Department and column 2 is EntryDate

【Example 11】 Get the information of employees whose salaries rank in top 3 in each department. Below is part of the employee table:

EID NAME DEPT SALARY
1 Rebecca R&D 7000
2 Ashley Finance 11000
3 Rachel Sales 9000
4 Emily HR 7000
5 Ryan R&D 13000

【SPL script】

A B
1 =file("Employee.txt").cursor@t() / Create cursor of the *Employee *table
2 =A1.groups(Department; top(-3;Salary):Top3) / Group the cursor by departments and get the records having a Salary value that ranks in top 3 in each group
3 =A2.conj(Top3) / Concatenate eligible records from all groups

Find more examples in SPL CookBook.

Clone this wiki locally