1BRC in Python #62

ifnesi · 2024-01-03T21:34:25Z

Standard Python libs: splitting the file into chunks, using multiprocessing to go through them, then putting it all together. I tried several other approaches (numpy and pandas), but the fastest was a good old dict. I also tried mmap, which was around 15 secs slower, but I am pretty sure there is a better way.

Around 45s on an M1 Pro 32GB (memory footprint less than 100MB).

https://github.com/ifnesi/1brc/blob/main/calculateAverage.py
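The approach described above (split into chunks, process them in parallel, merge the dicts) can be sketched roughly as follows. This is not the code from the linked repository; the function names `chunk_boundaries`, `process_chunk`, `merge` and `run` are made up for illustration, and the sketch assumes well-formed `station;value` lines.

```python
import multiprocessing as mp
import os


def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges whose ends fall right after a newline."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            if i == n_chunks:
                end = size
            else:
                f.seek(max(start, size * i // n_chunks))
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            if end > start:
                bounds.append((start, end))
            start = end
    return bounds


def process_chunk(task):
    """Aggregate one byte range into {station: [min, max, total, count]}."""
    path, start, end = task
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        for line in f:
            remaining -= len(line)
            station, _, value = line.partition(b";")
            temp = float(value)
            entry = stats.get(station)
            if entry is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                if temp < entry[0]:
                    entry[0] = temp
                elif temp > entry[1]:
                    entry[1] = temp
                entry[2] += temp
                entry[3] += 1
            if remaining <= 0:
                break
    return stats


def merge(partials):
    """Combine the per-chunk dicts into one."""
    total = {}
    for stats in partials:
        for station, (lo, hi, subtotal, count) in stats.items():
            entry = total.get(station)
            if entry is None:
                total[station] = [lo, hi, subtotal, count]
            else:
                if lo < entry[0]:
                    entry[0] = lo
                if hi > entry[1]:
                    entry[1] = hi
                entry[2] += subtotal
                entry[3] += count
    return total


def run(path, workers=None):
    """Process the file with one chunk per worker process."""
    workers = workers or os.cpu_count()
    tasks = [(path, s, e) for s, e in chunk_boundaries(path, workers)]
    with mp.Pool(workers) as pool:
        return merge(pool.map(process_chunk, tasks))
```

Calling `run("measurements.txt")` should happen under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Station names stay as bytes here and would be decoded only when printing.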

chrsBlank · 2024-01-04T07:31:09Z

Was your CPU usage at 100% on all cores? If not, this might seem dumb, but what if you spawned 2, 3 or 4 workers after splitting the work into 4, 6 or 8 chunks (assuming you currently split it in 2)? I mean, yeah, you are kind of doing that already, but what if you used separate processes to handle the chunks, like "other" scripts?

0 replies

ifnesi · 2024-01-04T11:38:28Z

ifnesi
Jan 4, 2024
Author

Hi Chris, thank you. I didn't check the actual cores, but each process was running at >99%. The main problem here, even though the file is huge, is that the job is more CPU-bound than I/O-bound. For example, see the chart below. I ran the script several times with different numbers of processes, executing only part of the work each time, to find out where it spends the most time:

The blue line, "Time (read file)", is just the time it took to read the whole file (line by line) using 1 to 15 processes. As we can see, beyond around 8~10 processes the gains are small (my machine has 12 cores).
The red line, "Time (read + split + float)", is the time to also split each line on ";" and convert the measurement to float; on average it takes 1.88 times as long just to do that part.
The orange line, "Time (everything)", is the full script; on average it takes 3.19 times as long. I have tried several enhancements to see what works best. For example, a plain if statement is much more efficient than min(x, y), and also more efficient than a one-line if/else. I see that some of the Java solutions use min(x, y); I believe they would run noticeably faster with an if instead.

If Python were as efficient as other languages, the time to process the file would be closer to the time to parse it (~14 secs). One thing I can try, if I have time, is to write a C lib that does the split/float/min&max checks plus the sums, just to see whether we can get closer to the ~14 secs limit (on my machine, with my algorithm).
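The claim about `if` versus `min(x, y)` is easy to measure locally. The following is a small sketch, not the author's benchmark: the function names and the random test data are made up, and the timings depend on the interpreter and machine (it says nothing about Java).

```python
import random
import timeit


def with_builtins(values):
    """Track min and max with the built-in functions."""
    lo = hi = values[0]
    for v in values:
        lo = min(lo, v)
        hi = max(hi, v)
    return lo, hi


def with_ternary(values):
    """Track min and max with one-line conditional expressions."""
    lo = hi = values[0]
    for v in values:
        lo = v if v < lo else lo
        hi = v if v > hi else hi
    return lo, hi


def with_if(values):
    """Track min and max with plain if statements."""
    lo = hi = values[0]
    for v in values:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi


def compare(n=200_000, repeat=3):
    """Return the best wall-clock time in seconds for each variant."""
    values = [random.uniform(-99.9, 99.9) for _ in range(n)]
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(values), number=1, repeat=repeat))
        for fn in (with_builtins, with_ternary, with_if)
    }
```

Running `print(compare())` prints one timing per variant; all three return the same (min, max) pair for the same input.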

1 reply

donalm Jan 4, 2024

The pure-python code runs about twice as fast if I use pypy instead of cpython.
On Linux (Ryzen 5 3600, 32GB RAM) it completes in 51s with pypy and 1:44 with vanilla cpython.
On MacOS (M1 Mac Mini, 16GB RAM) it completes in 1:10 with pypy and 2:04 with cpython, however on my Mac I couldn't get either interpreter to use 100% of the CPU.
Also on the Mac Mini, pypy would use only 68% of each core, and cpython used roughly 80%, so with faster storage (or enough RAM to support a RAM disk), better results might be possible.

I tried the DuckDB code on Linux and it completed in 41s, so about 10s faster than Pypy. There's no advantage to running DuckDB with Pypy over cpython, so I didn't test that.

chrsBlank · 2024-01-04T13:46:42Z

Hmmm, very interesting indeed. You also said that you tried mmap, so loading into memory might be slower, possibly because of the fast flash storage. I wonder whether it would be faster on my x86 processor. ARM is efficient af, but you have to emulate some functions, if I recall correctly. I will try to run it on my machine and maybe tweak it a bit. Can you upload your generated file to your repo?

0 replies

chrsBlank · 2024-01-04T14:48:39Z

Well, testing on my poor laptop (4 cores, 8GB) it doesn't want to finish at all, even after 5 minutes. I might need to do something else or wait until I get home lol

9 replies

DomagojKorais Jan 7, 2024

I tried with Polars as well, but instead of using scan_csv I tried using read_csv:

import polars as pl
df = pl.read_csv(
    "measurements.txt",
    separator=";",
    has_header=False,
    new_columns=["city","value"],
    )

grouped = df.group_by("city").agg(
    pl.min("value").alias("min"),
    pl.mean("value").alias("mean"),
    pl.max("value").alias("max"),
).sort('city')

for data in grouped.iter_rows():
    print(f"{data[0]}={data[1]:.1f}/{data[2]:.1f}/{data[3]:.1f},", end=" ")

Interestingly, the results are much worse: I got an execution time of 24s. This is quite a surprise to me, because for this specific task I would have expected the two methods to be roughly equivalent: we need to read the full dataset to perform the aggregations, and there is no filtering.

Also I noticed that your implementation is very much affected by disk caching in ram, on the first run I got a timing of 16s, while on the second run I got 7s. You can try by yourself by running sync && echo 3 | sudo tee /proc/sys/vm/drop_caches before executing the code.

mtaufanr Jan 7, 2024

Thanks for the review @DomagojKorais, I wasn't aware of disk caching, this is new to me. 💡

ritchie46 Jan 17, 2024

If you read csv, you force polars to materialize the whole dataset at once. It is expected to be slower as you don't allow polars to determine how to process the query.

DomagojKorais Jan 17, 2024

thanks for the reply Ritchie. Now I'm wondering, do you know perhaps if there is any use case where read_csv has some advantage over scan_csv?

DomagojKorais Mar 7, 2024

I reimplemented the solution using datafusion as well, because I wanted to compare the results with Polars.
datafusion code:

from datafusion import SessionContext


# Create a DataFusion session
ctx = SessionContext()
ctx.register_csv(name="measurements", path="./measurements.txt", delimiter=';', has_header=False, file_extension=".txt")

#column_1 is city, column_2 is value. I don't care atm about the names let's use the defaults
query = """
SELECT 
    column_1,
    MIN(column_2) AS min,
    AVG(column_2) AS mean,
    MAX(column_2) AS max
FROM measurements
GROUP BY column_1
ORDER BY column_1
"""

df = ctx.sql(query)
# Execute the query and collect the result into a pandas DataFrame
df_pd = df.to_pandas()

# itertuples yields plain tuples, so positional indexing is safe here
for row in df_pd.itertuples(index=False):
    print(f"{row[0]}={row[1]:.1f}/{row[2]:.1f}/{row[3]:.1f},", end=" ")

timings (computed using time):

first read: 0m7,100s
second read: 6,759s
observations: on the first read datafusion is significantly better than polars, while for subsequent reads the results are very close.

Checking disk activity, the read speed is around 2 GiB/s, which is insanely fast.

machine info: 24 cores (AMD Ryzen 9 5900X, nvme disk, 128GB DDR4 ram)

jovlinger · 2024-01-09T20:03:42Z

Strange. I have roughly the same code (I use defaultdicts and the min/max functions), but it runs in about 3 mins across 8 CPUs on my 24GB M2 MacBook Air. I suspect I am paying a large price for unicode de/re conversion, but then so is OP.

https://github.com/jovlinger/utils/blob/master/misc/1brc/jovlinger.py

Am I missing some very clever optimization, or is your hardware some 4 times faster than a very recent MacBook Air?

0 replies

juliusgeo · 2024-01-16T22:18:51Z

I was able to squeeze a little bit more speed out of my solution. Optimizations:

Use mmap with offset and length to avoid having to check whether I've read enough bytes for a given chunk on each iteration (no idea if any of the mmap flags I used helped that much with performance 🤷 , was mainly to avoid the bytes check)
Read the file in bytes, and only convert to a unicode string at the end once all the chunks have been collected to avoid the unicode conversion overhead
Otherwise, very similar to other solutions in this thread.
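The mmap-with-offset idea can be sketched as below. This is not juliusgeo's code, and it uses no special mmap flags; `map_chunk` and `iter_lines` are hypothetical names. One detail that is certain: the `offset` passed to `mmap.mmap` must be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the sketch maps from the aligned offset just below `start` and skips the extra bytes.

```python
import mmap


def map_chunk(fileno, start, end):
    """Map bytes up to `end`, starting at the aligned offset at or below `start`.

    Returns the mapping and the number of leading bytes to skip.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = start - (start % granularity)
    mm = mmap.mmap(fileno, end - aligned, access=mmap.ACCESS_READ, offset=aligned)
    return mm, start - aligned


def iter_lines(path, start, end):
    """Yield the raw byte lines in [start, end) of the file.

    The mapping ends exactly at `end`, so no per-line byte counting is needed:
    readline() returns b"" once the chunk is exhausted.
    """
    with open(path, "rb") as f:
        mm, skip = map_chunk(f.fileno(), start, end)
        with mm:
            mm.seek(skip)
            for line in iter(mm.readline, b""):
                yield line
```

The sketch assumes `start` and `end` already sit on line boundaries and that `end` is greater than `start`.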

Machine: Apple M1, 16GB
Timing for my solution (Python 3.12):
python main.py 752.52s user 38.16s system 610% cpu 2:09.42 total
@ifnesi 's solution also ran on my machine for comparison:
python ifnesi.py 933.71s user 38.18s system 617% cpu 2:37.33 total
@jovlinger 's solution run on my machine for comparison:
python jovlinger.py 1737.53s user 50.15s system 612% cpu 4:51.97 total
@raghunandanbhat 's solution on my machine for comparison:
python3 raghunandanbhat.py 856.52s user 40.64s system 593% cpu 2:31.27 total

0 replies

raghunandanbhat · 2024-01-24T04:29:50Z

My version of the 1BRC in Python: https://github.com/raghunandanbhat/1brc/blob/main/calculate_avg.py

Split the file into chunks and process the chunks using multiprocessing.
Memory-mapped chunks for each process gave the best results.
Using 16 workers instead of the default os.cpu_count() gave a slightly better result.

Machine: Apple M2 MacBook Air, 16GB (8-core CPU, 8-core GPU)
Python Version: 3.12
Best time: 64 seconds / 1:04.66

3 replies

juliusgeo Jan 24, 2024

If you change the try: except blocks to use existence checks, your solution becomes much faster.

if city in ws_dict:
    # min temp
    if ws_dict[city][0]>temp:
        ws_dict[city][0]=temp
    # max temp
    if ws_dict[city][1]<temp:
        ws_dict[city][1]=temp
    ws_dict[city][2]+=temp
    ws_dict[city][3]+=1
else:
    ws_dict[city]=[temp,temp,temp,1]

Speeds it up by nearly 17 seconds on my machine.

raghunandanbhat Jan 24, 2024

Interesting. My previous version had an existence check using in; changing it to a try-except block improved my time.
My reasoning was that an in check runs for all 1 billion rows, whereas try-except hits the exception at most 413 times per chunk.

juliusgeo Jan 24, 2024

Try-except blocks are faster, but only if no exception gets thrown (https://docs.python.org/3.12/faq/design.html#how-fast-are-exceptions). In your code you are already doing an "in" check implicitly every time a line is encountered. The KeyError is only raised once the key turns out not to be "in" the dict, so the code not only does the check for all 1 billion rows but also falls back to a more expensive path whenever the check fails.
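Since the two posters report opposite experiences, the quickest way to settle it for a given machine is to time both styles on the same data. The following is a sketch with made-up names (`tally_in`, `tally_try`, `make_rows`, `time_both`) and synthetic rows; it is not taken from either solution.

```python
import random
import timeit


def update(entry, temp):
    """Fold one measurement into [min, max, total, count]."""
    if temp < entry[0]:
        entry[0] = temp
    if temp > entry[1]:
        entry[1] = temp
    entry[2] += temp
    entry[3] += 1


def tally_in(rows):
    """Aggregate using an explicit membership test."""
    stats = {}
    for city, temp in rows:
        if city in stats:
            update(stats[city], temp)
        else:
            stats[city] = [temp, temp, temp, 1]
    return stats


def tally_try(rows):
    """Aggregate by attempting the lookup and catching KeyError."""
    stats = {}
    for city, temp in rows:
        try:
            update(stats[city], temp)
        except KeyError:
            stats[city] = [temp, temp, temp, 1]
    return stats


def make_rows(n_rows, n_cities, seed=0):
    """Build synthetic (city, temperature) rows for the comparison."""
    rng = random.Random(seed)
    cities = [f"city{i}" for i in range(n_cities)]
    return [(rng.choice(cities), round(rng.uniform(-99.9, 99.9), 1))
            for _ in range(n_rows)]


def time_both(rows, repeat=3):
    """Return the best wall-clock time in seconds for each style."""
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(rows), number=1, repeat=repeat))
        for fn in (tally_in, tally_try)
    }
```

For example, `time_both(make_rows(1_000_000, 413))` gives one timing per style; changing the ratio of cities to rows changes how often the exception path is taken.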

jovlinger · 2024-01-24T11:33:52Z

Just scanning Raghunandan’s version, I think the broad lesson is that unicode is a massive overhead. We are lucky that the record separator character (new line) is Unicode safe, allowing the input to be treated as pure binary.
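To make the point concrete: in UTF-8 the bytes for ";" and newline never occur inside a multi-byte character, so lines can be split and aggregated as raw bytes and only the station names decoded when printing. A minimal sketch, with hypothetical function names and an output format chosen for illustration:

```python
def aggregate_bytes(lines):
    """Aggregate raw byte lines into {station_bytes: [min, max, total, count]}."""
    stats = {}
    for line in lines:
        station, _, value = line.partition(b";")
        temp = float(value)  # float() accepts ASCII bytes, trailing newline included
        entry = stats.get(station)
        if entry is None:
            stats[station] = [temp, temp, temp, 1]
        else:
            if temp < entry[0]:
                entry[0] = temp
            elif temp > entry[1]:
                entry[1] = temp
            entry[2] += temp
            entry[3] += 1
    return stats


def format_result(stats):
    """Decode station names once, at the very end, and build the output string."""
    parts = []
    # Sorting UTF-8 bytes gives the same order as sorting the decoded strings.
    for station in sorted(stats):
        lo, hi, total, count = stats[station]
        name = station.decode("utf-8")
        parts.append(f"{name}={lo:.1f}/{total / count:.1f}/{hi:.1f}")
    return "{" + ", ".join(parts) + "}"
```

With this layout the number of decode calls equals the number of distinct stations rather than the number of rows.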


0 replies

juliusgeo · 2024-01-24T16:09:55Z

I updated my solution so that it now uses a C extension to do most of the heavy lifting. Unfortunately, it seems that the Python API is the main thing that slows it down; I could probably improve speeds even more by using a C dict data structure and converting at the end.
Regardless, shaves off another good bit of time:
python3 main.py 534.04s user 37.06s system 569% cpu 1:40.31 total

0 replies

kniveslessing · 2024-01-28T16:35:09Z

Recently I watched the very interesting talks by David Beazley on generators and coroutines in Python. They are a bit old; I think what he called coroutines back then are what people refer to as enhanced generators today, to avoid confusion with the coroutines of the later-added asyncio module. I thought the 1BRC challenge would be a good exercise to get familiar with them, so I attempted to write a version fully based on generators. It's quite compact but not really competitive with the other submissions, as I'm not familiar with parallelizing Python code yet, but I think it provides a good baseline for a sequential solution. On my rig (first-generation Ryzen 1700) it takes about 530s (~8min50s) using CPython 3.11 and about 420s (7min) using PyPy 7.3 to process the file generated by ifnesi's measurements creator, using a single processor thread. Maybe someone more knowledgeable than me is able to parallelize it:

from functools import cache
from math import inf

@cache
def turbo_float(s):
    '''
    Like built-in float(), but memoized to go faster.
    '''
    return float(s)

def reader(file):
    '''
    Generator that reads file and splits lines.
    '''
    with open(file, 'r', encoding = 'utf-8') as f:
        yield from (line.split(';') for line in f)

def coro_tally(cityname, sink):
    '''
    Enhanced generator that keeps running tallies of the values
    for min, mean, max. When it receives invalid data (non-arithmetic),
    it sends the formatted totals to the printer generator.
    '''
    minimum, maximum, i, S = inf, -inf, 0, 0.0
    while True:
        try:
            value = turbo_float((yield))
            if value < minimum:
                minimum = value
            if value > maximum:
                maximum = value
            S += value
            i += 1
        except TypeError:
            sink.send(''.join((cityname, '=',
                               str(round(minimum, 1)), '/',
                               str(round(S/i, 1)), '/',
                               str(round(maximum, 1)))))

def printer():
    '''
    Receives data from each tally keeper and prints full list on closing.
    '''
    L = []
    try:
        while True:
            L.append((yield))
    except GeneratorExit:
        print(L)

def main():
    d = {}
    source = reader('measurements.txt')
    sink = printer()
    sink.send(None)
        
    for city, temp in source:
        try:
            d[city].send(temp)
        except KeyError:
            d[city] = coro_tally(city, sink)
            d[city].send(None)
            d[city].send(temp)
        
    keys = sorted(d.keys())
    for key in keys:
        d[key].send(None)
    sink.close()

if __name__ == '__main__':
    main()

1 reply

juliusgeo Jan 28, 2024

David Beazley's talks are amazing. I really like this one. I tried a similar approach but using a multiprocessing queue, with each process in the process pool just reading values off the queue as it was populated. This approach was unfortunately slower, and I think there are two main reasons for that. 1. The main process that reads the elements of the file and puts them on the queue cannot run fast enough to keep all of the workers busy. Perhaps this could be improved by using multiple processes to read from the file, but: 2. Using a single queue, and then having multiple processes read off that queue, introduces a lot of coordination overhead because all of the processes are contending for a single resource. When I profiled it, I saw a lot of time being wasted just waiting for locks to be acquired. Additionally, I think coroutines provide speed benefits primarily in code that does a lot of network I/O; in this case, because it's reading from disk, the I/O is much faster than the actual processing of each line.

fastcodewriter · 2024-04-14T16:36:08Z

I've made an implementation which I believe is faster than the one on the main repo (I'm not sure though, as my PC only has 4 cores).

https://github.com/fastcodewriter/1brc

Is it worth submitting a pull request?

0 replies
