Python implementation of Gunnar's 1 billion row challenge:
- https://www.morling.dev/blog/one-billion-row-challenge
- https://github.com/gunnarmorling/1brc
First install the Python requirements:
python3 -m pip install -r requirements.txt
The script createMeasurements.py will create the measurement file:
usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (default is "measurements.txt")
  -r RECORDS, --records RECORDS
                        Number of records to create (default is 1_000_000_000)
Example:
% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
100%|█████████████████████████████████████████| 100/100 [01:15<00:00, 1.32it/s]
Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds
Be patient: generating the file can take more than a minute.
Speeding up the generation of the measurements file could be another challenge 🙂
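As a starting point for that side challenge, here is a minimal sketch of a batched writer. It is not how createMeasurements.py works; the function name `write_measurements`, the batch size, and the station names are all hypothetical. It assumes the `station;temperature` line format from the original challenge, with one decimal digit and values between -99.9 and 99.9.

```python
import os
import random
import tempfile


def write_measurements(path, records, stations, seed=None):
    """Write `records` lines of the form '<station>;<temperature>' to `path`."""
    rng = random.Random(seed)
    batch_size = 10_000  # hypothetical value; tune for your machine
    with open(path, "w", encoding="utf-8") as f:
        remaining = records
        while remaining > 0:
            n = min(batch_size, remaining)
            # Build a whole batch in memory and write it with a single call,
            # which avoids one write() call per line.
            lines = [
                f"{rng.choice(stations)};{rng.uniform(-99.9, 99.9):.1f}\n"
                for _ in range(n)
            ]
            f.write("".join(lines))
            remaining -= n


# Hypothetical station names and a small record count, for illustration only.
stations = ["Hamburg", "Lima", "Oslo"]
path = os.path.join(tempfile.gettempdir(), "sample_measurements.txt")
write_measurements(path, 1_000, stations, seed=42)
```

Batching is only one possible direction; whether it helps compared with the current script would need to be measured.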
Interpreter | Script | user | system | cpu | total |
---|---|---|---|---|---|
python3 | calculateAveragePolars.py | 77.84 | 3.64 | 703% | 11.585 |
pypy3 | calculateAveragePypy.py | 135.25 | 2.92 | 735% | 18.782 |
python3 | calculateAverageDuckDB.py | 186.78 | 4.21 | 806% | 23.673 |
pypy3 | calculateAverage.py | 242.89 | 6.28 | 780% | 31.926 |
python3 | calculateAverage.py | 329.20 | 3.77 | 793% | 41.941 |
python3 | calculateAveragePypy.py | 510.93 | 1.88 | 793% | 64.660 |
The script calculateAveragePolars.py was suggested by Taufan on this post.
The script calculateAveragePypy.py was created by donalm, a +2x improved version of the initial script (calculateAverage.py) when running in pypy3, even capable of beating the implementation using DuckDB (calculateAverageDuckDB.py).
Olivier Scalbert made a simple but remarkably effective suggestion that improved performance by an average of 15% (the table above has been updated), thank you 🙂
His suggestion was to change from:
if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1
to:
_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1
Python can be surprising sometimes.
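The likely reason for the gain is that every `result[location]` expression hashes the key and looks it up in the dictionary again, while binding the inner list to a local name does this once per row. The following is a minimal sketch for measuring the effect on your own machine. The two function names and the synthetic rows are hypothetical, and the loops are simplified stand-ins rather than the code in calculateAverage.py.

```python
import timeit


def aggregate_repeated_lookup(rows):
    """Aggregate [min, max, sum, count] per location, indexing the dict every time."""
    result = {}
    for location, measurement in rows:
        if location not in result:
            result[location] = [measurement, measurement, measurement, 1]
            continue
        if measurement < result[location][0]:
            result[location][0] = measurement
        if measurement > result[location][1]:
            result[location][1] = measurement
        result[location][2] += measurement
        result[location][3] += 1
    return result


def aggregate_single_lookup(rows):
    """Same aggregation, but the per-location list is looked up once per row."""
    result = {}
    for location, measurement in rows:
        if location not in result:
            result[location] = [measurement, measurement, measurement, 1]
            continue
        _result = result[location]  # one dict lookup, reused below
        if measurement < _result[0]:
            _result[0] = measurement
        if measurement > _result[1]:
            _result[1] = measurement
        _result[2] += measurement
        _result[3] += 1
    return result


# Synthetic data: 100 hypothetical stations, 100,000 rows.
rows = [(f"station{i % 100}", (i % 400) / 10 - 20) for i in range(100_000)]

# Both variants must produce identical aggregates.
assert aggregate_repeated_lookup(rows) == aggregate_single_lookup(rows)

for fn in (aggregate_repeated_lookup, aggregate_single_lookup):
    seconds = timeit.timeit(lambda: fn(rows), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

The timings depend on the interpreter and hardware, so treat the printed numbers as indicative only.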
Run compare.sh if you want to check that all the scripts produce the same output.
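The same kind of check can be sketched in Python. This is not the content of compare.sh; the helper `outputs_match` is hypothetical, and the demo commands are stand-ins that would need to be replaced with the real interpreter and script invocations.

```python
import subprocess
import sys


def outputs_match(commands):
    """Run each command and report whether all of them print the same stdout."""
    outputs = [
        subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        for cmd in commands
    ]
    return all(out == outputs[0] for out in outputs)


# Stand-in commands so the sketch runs anywhere; in practice each entry would
# be something like ["python3", "calculateAverage.py"].
demo = [
    [sys.executable, "-c", "print('same output')"],
    [sys.executable, "-c", "print('same ' + 'output')"],
]
print(outputs_match(demo))
```

Because `check=True` is set, a script that exits with an error raises `subprocess.CalledProcessError` instead of being silently compared.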