1BRC in Python #62
Replies: 11 comments 14 replies
-
Was you CPU usage at 100% all cores? If no ,might seem dumb but, what if you spawned 2,3 or 4 workers after you have splitted the work into 4,6 or 8 chuncks (considering you split it in 2). I mean, yeah you are kinda doing it already but what if you used different proccesses to do the chuncks, like "other" scripts. |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
Hmmm, very interesting indeed, you also said that you tried to do the nmap so loading into memory might be slower, might be due to fast flash storage. I wonder if since i have a x86 proccessor it would be faster. ARM is efficient af but you have to emulate some functions if i recall correctly. I will try to run it on my machine, maybe try to tweak a bit. Can you upload your generated file in your repo? |
Beta Was this translation helpful? Give feedback.
-
well, testing on my poor laptop (4 cores 8GB) it doesnt want to finish at all, even after 5 minutes. I might need to do something else or wait to go home lol |
Beta Was this translation helpful? Give feedback.
-
Strange. I have roughly the same code (I use defaultdicts and min/max functions), but it runs in +- 3 mins across 8 CPUS on my 24GB m2 mac book air. I suspect I am paying a large price for unicode de/re conversion, but then so is OP. https://github.com/jovlinger/utils/blob/master/misc/1brc/jovlinger.py am I missing some very clever optimization, or is your hardware some 4 times faster than a very recent MacBook Air? |
Beta Was this translation helpful? Give feedback.
-
I was able to squeeze a little bit more speed out of my solution. Optimizations:
Machine: Apple M1, 16GB |
Beta Was this translation helpful? Give feedback.
-
My version of the 1BRC in python.
Machine: Apple M2 MacBook Air, 16GB (8-core CPU, 8-core GPU) |
Beta Was this translation helpful? Give feedback.
-
Just scanning Raghunandan’s version, I think the broad lesson is that
unicode is a massive overhead. We are lucky that the record separator
character (new line) is Unicode safe, allowing the input to be treated as
pure binary.
…On Tue, Jan 23, 2024 at 23:30 Raghunandan Bhat ***@***.***> wrote:
My version
<https://github.com/raghunandanbhat/1brc/blob/main/calculate_avg.py> of
the 1BRC in python.
- Split the files into chunks and process chunks using multiprocessing.
- Memory mapped chunks for each processes gave the best results.
- Using 16 workers instead of default os.cpu_count() gave slightly
better result.
Machine: Apple M2 MacBook Air, 16GB (8-core CPU, 8-core GPU)
Python Version: 3.12
Best time: 64 seconds / 1:04.66
—
Reply to this email directly, view it on GitHub
<#62 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACQTZPIBTAXMYWFOIZOJ4HLYQCE4TAVCNFSM6AAAAABBMAS7M2VHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4DEMRYGEYDK>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
I updated my solution so that it now uses a C-extension to do most of the heavy lifting. Unfortunately, it seems that the Python API is that main thing that slows it down--could probably improve speeds even more by using a C dict datastructure, and then converting at the end. |
Beta Was this translation helpful? Give feedback.
-
Recently I watched the very interesting talks by David Beazley on generators and coroutines in Python (they are a bit old, I think what he called coroutines back then people refer as enhanced generators today, to avoid confusion with coroutines from the later added async module), and I thought the 1BRC challenge would be a good exercise to get familiar with them, so I attempted to write a version fully based on generators. It's quite compact but not really competitive with the other submissions, as I'm not familiar with parallelizing Python code yet, but I think it provides a good baseline for a sequential solution. In my rig (first generation ryzen 1700) it takes about 530s (~8min50sec) using CPython 3.11 and about 420s (7min) using PyPy 7.3 to process the file as generated by ifnesi measurements creator, using a single processor thread. Maybe someone more knowledgeable than me is able to parallelize it:
|
Beta Was this translation helpful? Give feedback.
-
I've made an implementation which I believe is faster than the the one on the main repo (I'm not sure though as my PC only has 4 cores) https://github.com/fastcodewriter/1brc Is it worth submitting a pull request? |
Beta Was this translation helpful? Give feedback.
-
Standard python libs, splitting file into chunks and using multiprocessing to go through them, then put it all together. Tried several other approaches (numpy and pandas, but fastest was with good and old dict), also tried mmap, but was around 15 secs worse off, but I am pretty sure there is a better way.
Around 45s in M1 Pro 32GB (memory footprint less than 100MB).
https://github.com/ifnesi/1brc/blob/main/calculateAverage.py
Beta Was this translation helpful? Give feedback.
All reactions