Finding performant implementations of functions to be optimized.
A Python notebook that explores different implementations of functions which are to be optimized by a stochastic optimization algorithm.
When handling large datasets, Data Analysts and Engineers often pick Python, and especially pandas, as their go-to solution.
When optimizing problems and writing solutions back into the `pd.DataFrame`
in an iterative manner[^1], the intuitive implementation is not always the best choice.
This Jupyter Notebook investigates three different implementations of an optimization function, which needs to loop through the DataFrame.
The loops are necessary due to the iterative nature of the dynamic optimization problem: values calculated in the previous row are fed into the calculation of the next row.
Disclaimer: This code is far from perfect. It aims much more to demonstrate the capabilities of different algorithms with regard to their computational performance, without giving up Python's advantage of intuitive programming.
Thus, I encourage readers to optimize the code themselves and to adjust it to their needs where necessary. Or just enjoy scrolling through the files 😊.
The three implementations are:
- An "intuitive" approach looping through the DataFrame and writing into each row separately.
- An approach utilizing the `.apply()` method of DataFrames.
- A more sophisticated approach using `numba` and converting the DataFrame into a `np.array` (see the sketches below).
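To make the comparison concrete, here is a minimal sketch of the first two styles. The column name `inflow`, the recursive if-else rule and the parameters `a` and `b` are invented for illustration and are not the notebook's actual objective; the `numba` variant is sketched after the takeaways further down.

```python
import numpy as np
import pandas as pd

# Stand-in for the obfuscated WWTP measurements (the notebook loads them from the bundled .csv).
df = pd.DataFrame({"inflow": np.random.rand(100) * 50})

def objective_loop(params, data):
    """ "Intuitive" variant: loop over the rows and write every result back into the DataFrame."""
    a, b = params
    out = data.copy()
    out["state"] = 0.0
    for i in range(1, len(out)):
        prev = out.iloc[i - 1]["state"]          # value from the previous row feeds into this row
        x = out.iloc[i]["inflow"]
        # hypothetical if-else rule on the column
        out.iloc[i, out.columns.get_loc("state")] = prev + a * x if x > b else prev - a * x
    return out["state"].abs().sum()              # scalar score for the optimizer

def objective_apply(params, data):
    """Variant using .apply(): the running state is carried through a closure."""
    a, b = params
    state = {"prev": 0.0}

    def step(row):
        nxt = state["prev"] + a * row["inflow"] if row["inflow"] > b else state["prev"] - a * row["inflow"]
        state["prev"] = nxt
        return nxt

    return data.apply(step, axis=1).abs().sum()
```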
They all do the same thing:
- using a sample dataset with 100 (obfuscated) measurements of a Waste Water Treatment Plant (WWTP)
- all functions that are being optimized apply an `if-else` condition on a column (by looping through rows)
- optimizing the function is done via the stochastic `scipy.optimize.dual_annealing()` algorithm (a usage sketch follows this list)
- the results of the optimization are returned for comparison
- the optimized parameters are compared to the original data (spoiler: the parameters do not work well at all).
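For reference, this is roughly how such an objective can be handed to `dual_annealing`, assuming the hypothetical `objective_loop` and `df` from the sketch above; the bounds, `maxiter` and `seed` are arbitrary example values rather than the settings used in the notebook:

```python
from scipy.optimize import dual_annealing

# example bounds for the two hypothetical parameters (a, b) used in the sketch above
bounds = [(0.0, 5.0), (0.0, 50.0)]

result = dual_annealing(objective_loop, bounds=bounds, args=(df,), maxiter=100, seed=42)
print(result.x, result.fun)   # best parameters found and the corresponding objective value
```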
I observed a huge performance difference: on the sample set, the `numba` implementation was about 10 times faster than the next best solution.
On the full data set (35,000 samples), `numba` outplays the regular algorithm even further: while the regular algorithm takes about 41 min to compute, the already jit-compiled `numba` function takes only 0.48 s (!).
One final remark: even when omitting the `@njit` decorator of the function, the plain Python computation on NumPy arrays still took only 34.71 s on the complete dataset.
This indicates that the pandas package has pretty high access times.
- Don't loop through DataFrames (especially writing to the DataFrame seems slow).
- A better (and not even a lot more complicated) solution would be to take the DataFrame's `.values` and compute inside a `np.array` (see the sketch after this list).
- If you want to make the most out of your Python performance while avoiding a switch to compiled languages such as `C`, `Java` or `Julia`, you should definitely try to make your loop compliant with numba's `@njit`.
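To illustrate that recommendation, here is a minimal sketch of the pattern (same hypothetical column name and rule as above, not the notebook's actual code): pull the column out of the DataFrame once, run the loop on a plain NumPy array, and let numba compile it.

```python
import numpy as np
from numba import njit

@njit
def compute_state(a, b, inflow):
    # same hypothetical recursion as in the sketches above, but on a plain NumPy array so numba can compile it
    state = np.zeros(inflow.shape[0])
    for i in range(1, inflow.shape[0]):
        if inflow[i] > b:
            state[i] = state[i - 1] + a * inflow[i]
        else:
            state[i] = state[i - 1] - a * inflow[i]
    return state

inflow = df["inflow"].values                     # note: .values is an attribute, not a method
df["state"] = compute_state(1.0, 25.0, inflow)   # write the result back in one go
```

The objective passed to `dual_annealing` would then just be a thin Python wrapper that calls this compiled kernel and reduces its output to a scalar.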
Use of the repo should be fairly easy (🤞). If you want to read my main learnings, head over to the previous chapter.
If you want to figure it out for yourself, download both files (the `.csv` and the `.ipynb`) into the same directory (or fork the repo to your local machine).
Then just run the `.ipynb` file on a machine with a working Anaconda installation and a running Jupyter server. All dependencies should install during the first run.
[^1]: A simple solution such as adding columns with pandas' `.shift()` method is only more performant when the shifted values are known beforehand. This isn't always possible (especially in dynamic problems).