# Pandas - Select Rows Matching Several Simple Patterns

## TL;DR

  • Operations over DataFrames are usually more expensive than operations over NumPy arrays. Always try to use the latter, especially when creating criteria to index rows in DataFrames.
  • Some operations, like creating a datetime with Pandas, may seem harmless but take a long time to execute. If such code sits inside a loop, the impact on the total run time can be huge. When coding, consider whether you can move the relatively expensive code outside the loop.
  • Use line_profiler or even %timeit (or %%timeit for multi-line code) to measure the runtime of your code, for example:
```
In [1]: %load_ext line_profiler
In [2]: def f1():
   ...:     ... your function here ...
In [3]: def f2():
   ...:     ... your other function here ...
In [4]: %lprun -f f1 -f f2 [(f1(), f2()) for i in range(1000)]
# run the functions in a loop 1000 times
```

## In depth

The same principle as in Select Rows Matching a Simple Pattern applies when more than one criterion is used over one or several columns. If the DataFrame is a Zeek log, I may be interested in getting all the rows going from a particular client to a particular server, that is, `df[(df["id.orig_h"] == "some IP") & (df["id.resp_h"] == "some other IP")]`. If your dataset contains particle physics information, you may be interested in collisions with energies between two values, say `collisions[(collisions.energy >= 10) & (collisions.energy <= 100)]`. If your dataset contains clients, you may be interested in getting the information of all the clients whose name is either Gabriela or Joaquin and who have spent more than 10000 in your store, that is, `clients[((clients.name == "Gabriela") | (clients.name == "Joaquin")) & (clients.total_spent >= 10000)]`.
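For instance, here is a minimal, self-contained sketch of that last example (the `clients` data is invented for illustration):

```python
import pandas as pd

# Hypothetical clients table, invented for this example.
clients = pd.DataFrame(
    {"name": ["Gabriela", "Joaquin", "Marta", "Gabriela"],
     "total_spent": [12000, 8000, 15000, 9500]}
)

# Each comparison must be parenthesized: & and | bind tighter
# than == and >=, so without the parentheses Python would try to
# evaluate "Gabriela" | clients.name first and raise an error.
mask = (
    ((clients.name == "Gabriela") | (clients.name == "Joaquin"))
    & (clients.total_spent >= 10000)
)
clients[mask]  # selects only the first row: Gabriela, 12000
```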

The examples in the Pandas Cookbook documentation show solutions like the one below. Here I used several data types to show the impact this can have on the runtime of the solutions.

```
In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     {"AAA": ['a', 'b', 'c', 'd'],
   ...:      "BBB": [10, 20, 30, 40],
   ...:      "CCC": pd.date_range(start='2022-01-01', end='2022-01-04')}
   ...: )
   ...: Crit1 = df.AAA == 'b'
   ...: Crit2 = df.BBB >= 20
   ...: Crit3 = df.CCC <= pd.to_datetime('2022-01-03')
   ...: AllCrit = Crit1 | Crit2 & Crit3
   ...: df[AllCrit]
Out[1]:
  AAA  BBB        CCC
1   b   20 2022-01-02
2   c   30 2022-01-03
```

Note that in AllCrit, & binds tighter than |, so the expression selects the rows where Crit1 holds, or where both Crit2 and Crit3 hold; that is why rows 1 and 2 are returned. In order to compare the execution times of these two methods, let's build two functions, f1 and f2, and measure their execution time line by line using the line_profiler capabilities. If you are not familiar with this extension, you can install it with `pip install line-profiler`; it allows us to identify which lines of each function take the most time to execute.

```
In [2]: %load_ext line_profiler

In [3]: def f1():
   ...:     Crit1 = df.AAA == 'b'
   ...:     Crit2 = df.BBB >= 20.0
   ...:     d = pd.to_datetime('2022-01-03')
   ...:     Crit3 = df.CCC <= d
   ...:     AllCrit = Crit1 | Crit2 & Crit3
   ...:     df[AllCrit]

In [4]: def f2():
   ...:     Crit1 = df.AAA.values == 'b'
   ...:     Crit2 = df.BBB.values >= 20.0
   ...:     d = pd.to_datetime('2022-01-03')
   ...:     Crit3 = df.CCC.values <= d
   ...:     AllCrit = Crit1 | Crit2 & Crit3
   ...:     df[AllCrit]

In [5]: %lprun -f f1 -f f2 [(f1(), f2()) for i in range(1000)]
```

In line In [2] the line_profiler extension is loaded. In lines In [3] and In [4] the functions f1() and f2() are defined. Both functions do the same thing, but while f1() builds its criteria as pandas Series, f2() builds them as NumPy arrays via .values. Finally, in line In [5], the line profiler magic is called. The -f argument indicates which functions we want to profile; in this case, both f1 and f2. We then invoke both functions inside a list comprehension that executes each of them 1000 times.
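If you only need the aggregate difference rather than a line-by-line breakdown, %timeit is a quicker first check; given the totals reported below (1.32 s and 0.55 s for 1000 calls), you should see something on this order:

```python
# Quick sanity check with %timeit; exact numbers will vary per
# machine, but the ratio should match the profiles below.
%timeit f1()  # expect roughly 1.3 ms per loop (1.32 s / 1000 calls)
%timeit f2()  # expect roughly 0.55 ms per loop (0.55 s / 1000 calls)
```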

The output of the magic command includes very detailed information about the run of each function. First, the line Timer unit: 1e-06 s indicates the resolution of the rest of the numbers, in this case, microseconds (μs). The rest of the information is reported per function. The f1() function took 1.32 seconds of the total run time, while f2() took 0.55 seconds. The most important differences are in lines 2, 3, and 5 of each function, where each criterion is calculated, and in line 6 of each function, where the criteria are combined. I separated the creation of the datetime on purpose, to show how much it contributes to the total run time and to keep it apart from the comparisons in both cases. Clearly, an improvement in both cases is to declare the d variable outside the function definition, saving around 175 ms of the total time of each function, as sketched after the profiler output below.

```
Timer unit: 1e-06 s

Total time: 1.32061 s
File: <ipython-input-215-f15297257859>
Function: f1 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f1():
     2      1000     187151.0    187.2     14.2      Crit1 = df.AAA == 'b'
     3      1000     164126.0    164.1     12.4      Crit2 = df.BBB >= 20.0
     4      1000     171967.0    172.0     13.0      d = pd.to_datetime('2022-01-03')
     5      1000     198570.0    198.6     15.0      Crit3 = df.CCC <= d
     6      1000     259318.0    259.3     19.6      AllCrit = Crit1 | Crit2 & Crit3
     7      1000     339473.0    339.5     25.7      df[AllCrit]

Total time: 0.554853 s
File: <ipython-input-216-031ea8e9fba2>
Function: f2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f2():
     2      1000      31025.0     31.0      5.6      Crit1 = df.AAA.values == 'b'
     3      1000      21543.0     21.5      3.9      Crit2 = df.BBB.values >= 20.0
     4      1000     175008.0    175.0     31.5      d = pd.to_datetime('2022-01-03')
     5      1000      29356.0     29.4      5.3      Crit3 = df.CCC.values <= d
     6      1000       3823.0      3.8      0.7      AllCrit = Crit1 | Crit2 & Crit3
     7      1000     294098.0    294.1     53.0      df[AllCrit]
```
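As suggested above, the datetime conversion can be hoisted out of the function. A minimal sketch of that variant of f2() (f3 is my name for it, not part of the original comparison):

```python
# Convert the threshold date once, outside the function, so the
# ~175 µs that pd.to_datetime costs per call disappears from the loop.
d = pd.to_datetime('2022-01-03')

def f3():
    Crit1 = df.AAA.values == 'b'
    Crit2 = df.BBB.values >= 20.0
    Crit3 = df.CCC.values <= d  # reuses the precomputed datetime
    AllCrit = Crit1 | Crit2 & Crit3
    df[AllCrit]
```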

The most interesting improvement occurs when the AllCrit variable is calculated, in line 6 of each function. In f1, operating over pandas Series, each execution takes more than 259 μs, while in f2, operating over NumPy arrays, it takes less than 4 μs. That is almost two orders of magnitude faster, and of course, the more code like this is executed, the bigger the impact on the final run time.

Even though it is possible to do a more in-depth analysis, for example to determine whether the execution time grows logarithmically, linearly, or exponentially with respect to the size of the input, I think it is clear that working with NumPy arrays, whenever possible, will reduce the execution time, especially with large DataFrames. A starting point for such an analysis is sketched below.
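A minimal sketch of how one could measure that growth, timing the NumPy-based filter over increasing DataFrame sizes (the sizes and the random data here are arbitrary):

```python
import numpy as np
import pandas as pd
import timeit

# Time the NumPy-based filter for growing inputs; plotting the
# timings against n would reveal the growth trend.
for n in (1_000, 10_000, 100_000, 1_000_000):
    big = pd.DataFrame({"BBB": np.random.randint(0, 100, size=n)})
    t = timeit.timeit(lambda: big[big.BBB.values >= 20], number=100)
    print(f"n={n:>9,}: {t / 100 * 1e6:10.1f} µs per call")
```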