
Another microbenchmark compared to Ruby #106

Closed
PeterMozesMerl opened this issue Mar 26, 2015 · 5 comments

Comments

@PeterMozesMerl

A couple of months ago I ran a benchmark comparing the insert performance of the Ruby pg gem, the Go library, and the rust-postgres crate. The ranking was Rust < Go < Ruby, probably because of the well-optimized Ruby gem written in C. Now that benchmark runs at almost the same speed as the Ruby version, and faster than the Go one.

However, I also had a select benchmark against rust-postgres 0.4. It used to run slightly slower than the Ruby one (I still have the old binary and could confirm this). The recent version, however, is orders of magnitude slower. Since I am a beginner with Rust, I can't tell whether this is an upstream issue or not.

The table has tens of thousands of records.

```rust
extern crate postgres;

use postgres::{Connection, SslMode};

fn main() {
    println!("Hello, world!");

    // I could not figure out how to use the unix_socket with v0.7 (and didn't try hard)
    let conn = Connection::connect("postgres://mage@localhost/xxx", &SslMode::None).unwrap();

    let cycles = 10;

    let query = "select id from users order by id limit 10000";
    let stmt = conn.prepare(query).unwrap();

    let mut bar: i32;

    for cycle in 0..cycles {
        for row in stmt.query(&[]).unwrap() {
            bar = row.get(0);
            println!("Done: {}", bar);
        }
        println!("Cycle: {}", cycle);
    }
}
```
```
$ time cargo run --release
...
Done: 10000
Cycle: 9

real    0m21.885s
user    0m18.290s
sys     0m3.544s
```

Compile time is not counted, of course; the binary was already built.

The Ruby version:

```ruby
require 'pg'

conn = PG.connect(dbname: 'xxx', user: 'mage', host: 'localhost')
conn.prepare('hoi', "select id from users order by id limit 10000")

0.upto(9) do |cycle|
    results = conn.exec_prepared("hoi")
    results.each do |row|
        puts "Result: #{row['id']}"
    end
    puts "Done: #{cycle}"
end
```
```
$ time ruby postgres2.rb
Result: 9999
Result: 10000
Done: 9

real    0m1.232s
user    0m0.783s
sys     0m0.349s
```

As for the Rust version, displaying the results from 1 to 10000 is very fast; the couple of seconds are spent between cycles.

The old binary (built against rust-postgres 0.4) doesn't have this pause between cycles. (Note that the old binary uses unix_socket, but I don't think that's the reason.)

```
$ rustc --version
rustc 1.0.0-nightly (27901849e 2015-03-25) (built 2015-03-26)
```

This is the old version (compiled on Dec 29, 2014).

```
$ time ./target/postgres2
Done: 10000
Cycle: 100

real    0m15.154s
user    0m13.401s
sys     0m1.489s
```

`cycles` is set to 100 there, so it would run for about 1.5 seconds with 10 cycles.

@sfackler
Owner

Interesting, I will look into this.

@sfackler
Owner

This line appears to be the offender:

```rust
try!(buf.take(len as u64).read_to_end(&mut data));
```

Flamegraph (from an unoptimized build): https://sfackler.com/static/flamegraph.svg

@sfackler
Owner

I believe this commit should fix the issue: 9cd413d

I'm seeing ~0.66 second runtimes after that change:

```
Done: 10000
Cycle: 9
0.03user 0.34system 0:00.66elapsed 57%CPU (0avgtext+0avgdata 7824maxresident)k
0inputs+0outputs (0major+2621minor)pagefaults 0swaps
```

Thanks for pointing this out! What a weird place to find such a huge performance issue. I've published a new release to crates.io with this fix. I'll leave this issue open until you can confirm that everything looks good on your end as well.

@PeterMozesMerl
Author

I’ve tested on a different machine, the Ruby version runs for 0.82s (avg) and the updated Rust version runs for 0.43s. This is awesome, thank you for the fix. I look forward to using rust-postgres in production.

@sfackler
Owner

Perfect!

bors added a commit to rust-lang/rust that referenced this issue Mar 29, 2015
with_end_to_cap is enormously expensive now that it's initializing memory, since it involves a 64k allocation + memset on every call. This is most noticeable when calling read_to_end on very small readers, where the new version is **4 orders of magnitude** faster.

BufReader also depended on with_end_to_cap, so I've rewritten it in its original form.

As a bonus, converted the buffered IO struct Debug impls to use the
debug builders.

I first came across this in sfackler/rust-postgres#106 where a user reported a 10x performance regression. A call to read_to_end turned out to be the culprit: sfackler/rust-postgres@9cd413d.

The new version differs from the old in a couple of ways. The buffer size used is now adaptive. It starts at 32 bytes and doubles each time EOF hasn't been reached up to a limit of 64k. In addition, the buffer is only truncated when EOF or an error has been reached, rather than after every call to read as was the case for the old implementation.

I wrote up a benchmark to compare the old version and new version: https://gist.github.com/sfackler/e979711b0ee2f2063462

It tests a couple of different cases: a high bandwidth reader, a low bandwidth reader, and a low bandwidth reader that won't return more than 10k per call to `read`. The high bandwidth reader should be analogous to use cases like reading from e.g. a `BufReader` or `Vec`, and the low bandwidth readers should be analogous to reading from something like a `TcpStream`.

Of special note, reads from a high bandwidth reader containing 4 bytes are now *4,495 times faster*.
```
~/foo ❯ cargo bench
   Compiling foo v0.0.1 (file:///home/sfackler/foo)
     Running target/release/foo-7498d7dd7faecf5c

running 13 tests
test test_new ... ignored
test new_delay_4      ... bench:    230768 ns/iter (+/- 14812)
test new_delay_4_cap  ... bench:    231421 ns/iter (+/- 7211)
test new_delay_5m     ... bench:  14495370 ns/iter (+/- 4008648)
test new_delay_5m_cap ... bench:  73127954 ns/iter (+/- 59908587)
test new_nodelay_4    ... bench:        83 ns/iter (+/- 2)
test new_nodelay_5m   ... bench:  12527237 ns/iter (+/- 335243)
test std_delay_4      ... bench:    373095 ns/iter (+/- 12613)
test std_delay_4_cap  ... bench:    374190 ns/iter (+/- 19611)
test std_delay_5m     ... bench:  17356012 ns/iter (+/- 15906588)
test std_delay_5m_cap ... bench: 883555035 ns/iter (+/- 205559857)
test std_nodelay_4    ... bench:    144937 ns/iter (+/- 2448)
test std_nodelay_5m   ... bench:  16095893 ns/iter (+/- 3315116)

test result: ok. 0 passed; 0 failed; 1 ignored; 12 measured
```

r? @alexcrichton