Improving performance of your publicsuffix library? #15
Hi, I've never actually tried to optimise the performance of this library but this is something I plan to do when I get more free time. Having said that, I also noticed a couple of issues with the benchmark code itself. For example:-
You are including the overhead of
Also, println! is slow. Those two are what jump out at me when looking at that code. We would need to profile to know why exactly it's that slow. Lastly, I took a quick look at the documentation of that Python module and it says:-
So |
@rushmorem I tried to address your comments in revision no. 2: https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831/revisions Is there a way to disable the checks in your library in order to get a fairer comparison? |
Well, from their README it sounds to me like the Python code is only doing this part:-

Domain::find_possible_matches(&domain, list)
    .and_then(|res| Domain::find_match(input, &domain, res))

However, both

impl List {
    pub fn public_suffix(&self, domain: &str) -> Result<Domain> {
        Domain::find_possible_matches(domain, self)
            .and_then(|res| Domain::find_match(domain, domain, res))
    }
}

But even that part of the code still does a few allocations, so there is definitely room for improvement. |
I don't quite understand the jump from 74.80 to 103.39. Those changes you made have nothing to do with this library, but still we see such a slowdown. Don't get me wrong though, this library itself needs quite a bit of performance optimisations. For starters, I collect into a Vec in a number of places, which introduces allocations. I've always wanted to get rid of those but I haven't had time to do it yet. |
The jump was because I did a second test on my laptop, sorry for the confusion. |
Ah, I see. No problem. Thanks for the explanation. |
Consider revision 3 - repeated on my PC like rev 1, as opposed to rev 2, which was on my laptop: https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831/revisions Looks like I shaved some time off by using my fork that has some functionality removed: But still the thing is 10x as slow. I looked at your code and, while I'm not good at figuring out allocation schemes in Rust, I think it's actually about data structures. You seem to be using a flat list, while the Python library uses a nested dictionary, split by dot, like this:
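The nested, dot-split layout described above is essentially a trie keyed by labels. A minimal Rust sketch of that idea — illustrative types only, not the Python library's actual internals:

```rust
use std::collections::HashMap;

// Rules are split on '.' and stored in reverse, so a lookup walks one
// label at a time instead of scanning a flat list of rules.
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    is_suffix: bool,
}

fn insert(root: &mut Node, rule: &str) {
    let mut node = root;
    for label in rule.rsplit('.') {
        node = node.children.entry(label.to_string()).or_default();
    }
    node.is_suffix = true;
}

fn main() {
    let mut root = Node::default();
    for rule in ["uk", "ac.uk", "co.uk", "blogspot.co.uk"] {
        insert(&mut root, rule);
    }
    // "co.uk" is reached via root -> "uk" -> "co"
    assert!(root.children["uk"].children["co"].is_suffix);
    println!("ok");
}
```

With this shape, lookup cost is bounded by the number of labels in the domain rather than the number of rules sharing its TLD.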
|
I'm pretty sure it's both, among other things.
Yes, but it's not totally flat. The rules are keyed by their final label; println!("{:#?}", list.rules.get("uk").unwrap()); gives us:
[
Suffix {
rule: "uk",
typ: Icann
},
Suffix {
rule: "ac.uk",
typ: Icann
},
Suffix {
rule: "co.uk",
typ: Icann
},
Suffix {
rule: "gov.uk",
typ: Icann
},
Suffix {
rule: "ltd.uk",
typ: Icann
},
Suffix {
rule: "me.uk",
typ: Icann
},
Suffix {
rule: "net.uk",
typ: Icann
},
Suffix {
rule: "nhs.uk",
typ: Icann
},
Suffix {
rule: "org.uk",
typ: Icann
},
Suffix {
rule: "plc.uk",
typ: Icann
},
Suffix {
rule: "police.uk",
typ: Icann
},
Suffix {
rule: "*.sch.uk",
typ: Icann
},
Suffix {
rule: "service.gov.uk",
typ: Private
},
Suffix {
rule: "homeoffice.gov.uk",
typ: Private
},
Suffix {
rule: "blogspot.co.uk",
typ: Private
},
Suffix {
rule: "glug.org.uk",
typ: Private
},
Suffix {
rule: "lug.org.uk",
typ: Private
},
Suffix {
rule: "lugs.org.uk",
typ: Private
},
Suffix {
rule: "barsy.co.uk",
typ: Private
},
Suffix {
rule: "barsyonline.co.uk",
typ: Private
},
Suffix {
rule: "barsy.uk",
typ: Private
},
Suffix {
rule: "nh-serv.co.uk",
typ: Private
},
Suffix {
rule: "no-ip.co.uk",
typ: Private
},
Suffix {
rule: "wellbeingzone.co.uk",
typ: Private
},
Suffix {
rule: "gwiddle.co.uk",
typ: Private
}
]
So for |
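A layout like the one printed above — one Vec of rules per final label — implies a linear scan over all of those rules on every lookup. A rough sketch of that cost, with hypothetical names:

```rust
// Illustrative sketch (not the crate's actual code): all rules sharing a
// final label live in one slice, so finding the best match for a domain
// means checking every entry and keeping the longest hit.
fn longest_match<'a>(rules: &[&'a str], domain: &str) -> Option<&'a str> {
    let mut best: Option<&'a str> = None;
    for &rule in rules {
        let matches = domain == rule || domain.ends_with(&format!(".{}", rule));
        if matches && best.map_or(true, |b| rule.len() > b.len()) {
            best = Some(rule);
        }
    }
    best
}

fn main() {
    let uk_rules = ["uk", "ac.uk", "co.uk", "blogspot.co.uk"];
    assert_eq!(longest_match(&uk_rules, "example.co.uk"), Some("co.uk"));
    assert_eq!(
        longest_match(&uk_rules, "myblog.blogspot.co.uk"),
        Some("blogspot.co.uk")
    );
    println!("ok");
}
```

For a bucket with 25 "uk" rules this is cheap, but for a bucket the size of ".com" the scan (and the `format!` allocation per comparison) adds up quickly.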
@rushmorem .com has a pretty long list, I can imagine that backfiring. It's still surprisingly slow though. |
@rushmorem btw I might have commercial interest in those optimizations. Contact me over GMail if having this MR sponsored is an option. |
This commit merges both crates, which used to live in different repos. The `psl` crate is now meant to supersede the `publicsuffix` crate. I have redesigned and rewritten both crates to address rushmorem/publicsuffix#15. The `psl` crate is the main crate. It's now very lean and lightweight. It is the lowest level in the suite of crates. See the README for how to use it.
@d33tah Kindly clone https://github.com/rushmorem/psl and run the benchmarks against master and let me know the results. I haven't had a chance to set up benchmarks against the libraries you mentioned yet. Heads up:-
|
If you only care about particular TLDs, you can build with, for example:

PSL_TLDS="com, net, org" cargo build --release

This will not only boost your build times, it will also filter out all the other TLDs, only including the ones you care about in your binary, thereby making your binary smaller too. If you only care about one TLD, you can use
You can also pass in your own list or lists using the environment variables
Lastly, you can derive your own list using:

extern crate psl;
#[macro_use]
extern crate psl_codegen;

// import the trait
use psl::Psl;

#[derive(Psl)]
// You can also use `url` instead of `path`. If you don't include this attribute,
// it will be downloaded from the official site during the build.
#[psl(path = "/path/to/list.dat")]
struct List;

// use the list so |
I have now pushed the changes. After updating your benchmark code to:

extern crate psl;

use std::io::{self, BufRead, Write};
use psl::{Psl, List};

fn main() {
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    let list = List::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let domain_str = line.unwrap();
        let domain = match list.registrable_domain(&domain_str) {
            Some(x) => x,
            None => continue,
        };
        handle.write(domain.as_str().as_bytes()).unwrap();
        handle.write(b"\n").unwrap();
    }
}

and using your Python code as is, here are the results after running your benchmark on my laptop:

~> time ./target/release/psltest < domains.txt | wc -l
2.87user 1.04system 0:03.91elapsed 99%CPU (0avgtext+0avgdata 2932maxresident)k
0inputs+0outputs (0major+142minor)pagefaults 0swaps
1036595

~> time python main.py public_suffix_list.dat < domains.txt | wc -l
5.82user 0.04system 0:05.86elapsed 99%CPU (0avgtext+0avgdata 26176maxresident)k
0inputs+0outputs (0major+9273minor)pagefaults 0swaps
1048576

The Python module spends a noticeable chunk of that time just parsing the list:

~> time python main.py public_suffix_list.dat
0.35user 0.01system 0:00.37elapsed 100%CPU (0avgtext+0avgdata 26072maxresident)k
0inputs+0outputs (0major+9244minor)pagefaults 0swaps

So indeed the |
I was curious to see how a slightly more idiomatic version like:

extern crate psl;

use std::{env, fs};
use psl::{Psl, List};

fn main() {
    let list = List::new();
    let filename = env::args().nth(1).expect("missing arg: filename");
    let domains = fs::read_to_string(filename).expect("file not found");
    for input in domains.lines() {
        if let Some(domain) = list.registrable_domain(input) {
            println!("{}", domain);
        }
    }
}

would perform. The answer was:

~> time ./target/release/psltest domains.txt | wc -l
2.69user 1.12system 0:03.82elapsed 99%CPU (0avgtext+0avgdata 37808maxresident)k
0inputs+0outputs (0major+9040minor)pagefaults 0swaps
1036595 |
Hi @rushmorem! Thanks for working on this! Looks like we came up with similar solutions - I experimented with a similar approach here: https://gitlab.com/d33tah/psl-rewrite I didn't really manage to beat Python though, so it looks like I did something quite wrong. Your implementation has better performance, but it makes me wonder - is there a way to speed up compilation? Would it be expensive to load the list at startup instead? I'm OK trading half a second for that. Also, here's an error I get in the Dockerfile:
|
BTW, have you compared this against PyPy as well? |
Hi @d33tah
It's my pleasure. I really enjoyed working on this. A nice side effect is that the library is now
Awesome! Thanks for the link, I will take a look.
Currently, I think you can only improve subsequent builds by caching the build artifacts. Unfortunately, the current design compiles the list into an actual Rust match expression like:

let mut suffix = Info::Incomplete;
let mut index = 1;
match labels.next() {
    Some(label) => {
        // ... snip
        "com" => {
            suffix = Info::Suffix(index, Type::Icann);
            index += 1;
            match labels.next() {
                // ... snip
            }
        None => Some(suffix)
        // ... snip
    }
    None => None
}

so a lazy static won't cut it. Having said that though, this shouldn't affect your development experience. When developing, you can set
The nice thing about the new design is that the list is now a trait so it supports multiple implementations easily. I will try to implement a runtime list and see if the performance will be acceptable. If it goes well both implementations can live side by side.
Your compiler doesn't have the improved match ergonomics. I have forgotten in which version that RFC landed on stable but this does work on the latest stable Rust as shown by Travis. Is updating to latest stable an option for you? If not, I can update the code to support older compilers. |
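The "improved match ergonomics" mentioned here is RFC 2005, which stabilised in Rust 1.26: patterns can match through references without explicit `ref` or `&`. A small illustrative sketch:

```rust
// Since Rust 1.26 (match ergonomics, RFC 2005), matching on a reference
// no longer requires `ref` / `&` patterns; bindings become references
// automatically. On older compilers this fails to compile.
fn first_label(domain: &Option<String>) -> Option<&str> {
    match domain {
        Some(d) => d.split('.').next(), // `d` is automatically bound as &String
        None => None,
    }
}

fn main() {
    let domain = Some("example.co.uk".to_string());
    assert_eq!(first_label(&domain), Some("example"));
    println!("ok");
}
```

On a pre-1.26 compiler the same match would need to be written `Some(ref d) => ...`, which is exactly the kind of error the Ubuntu-packaged compiler was hitting.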
I hadn't tried it with PyPy yet. Here are the results from PyPy:

~> time pypy main.py public_suffix_list.dat < domains.txt | wc -l
2.24user 0.06system 0:02.32elapsed 99%CPU (0avgtext+0avgdata 93272maxresident)k
0inputs+0outputs (0major+12805minor)pagefaults 0swaps
1048576

So we haven't beaten that one yet, but we are very close. There is also the C-based psl utility:

~> time psl --load-psl-file public_suffix_list.dat --print-unreg-domain < domains.txt | wc -l
1.32user 0.03system 0:01.36elapsed 100%CPU (0avgtext+0avgdata 5264maxresident)k
0inputs+0outputs (0major+362minor)pagefaults 0swaps
1048576

So we still have a bit of work ahead of us if we plan to beat it as well. |
It's the default one in Ubuntu 18.04 and, since it's LTS, I think it might hamper adoption - unless you aim for people who mostly use rustup. If it's a big deal, I would probably not bother. And once again, thanks for working on beating PyPy! It would be amazing to get there. Possibly ignorant question, but perhaps it's a good idea to try to trim the data structures down so that they fit in the CPU cache if possible? |
After removing a couple of unnecessary checks, since we are already assuming that the domain itself is valid, our performance is now up to:-

1.81user 1.02system 0:02.84elapsed 99%CPU (0avgtext+0avgdata 37892maxresident)k
0inputs+0outputs (0major+9038minor)pagefaults 0swaps
1048575

Note that the number of domains we are returning is now much closer to those returned by both Python and C. We are probably performing at least one extra check that the other guys are not doing. I've been looking at your code
Nice! As for why it's slower, both |
What's EDIT: It looks like it's |
I have just pushed version
Are you referring to the giant match statement? I don't know how much LLVM optimises it but I purposefully made it very simple so the compiler can easily understand what it's trying to do. At some point I would also like to try rust-pf. Depending on how the match statement is being optimised, it might give us faster lookups at the expense of more complex lookup code. |
@rushmorem can't figure it out yet, but I'm still getting lower performance on 0.1.1:
The "71224inputs" looks interesting. |
I updated the Gist so you can compare the test code. |
Surprisingly, on the second run it's faster:
|
Updated the Gist again - I suggest you clone it using |
Thanks, I have done that. It's building now.
It's OK, I will use the Dockerfile. Trying to reproduce my results might be a bit tricky because our environments are so different. I run

I have pushed version 0.2 to crates.io. It has a couple of breaking changes to make the API a bit more pleasant and consistent. You can use the following new script:-

extern crate psl;

use std::io::{self, BufRead, Write};
use psl::{Psl, List};

fn main() {
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    let list = List::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let domain_str = line.unwrap();
        if let Some(domain) = list.suffix(&domain_str) {
            handle.write(domain.as_str().as_bytes()).unwrap();
            handle.write(b"\n").unwrap();
        };
    }
}

Also note that I changed the script to get the public suffix rather than the registrable domain name. That behaviour matches what the Python code is doing. As you will see, they will now return the same results. |
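For clarity, the public suffix and the registrable domain measured by these two script variants differ by exactly one label: for "www.example.co.uk" the public suffix is "co.uk" and the registrable domain is "example.co.uk". A sketch with a hypothetical helper (not part of the psl API):

```rust
// split_parts is an illustrative helper, not a psl function: given a domain
// and its already-known public suffix, it recovers both the suffix and the
// registrable domain (suffix plus one more label).
fn split_parts(domain: &str, suffix: &str) -> Option<(String, String)> {
    let prefix = domain.strip_suffix(suffix)?.strip_suffix('.')?;
    let registrable_label = prefix.rsplit('.').next()?;
    Some((suffix.to_string(), format!("{}.{}", registrable_label, suffix)))
}

fn main() {
    let (suffix, registrable) = split_parts("www.example.co.uk", "co.uk").unwrap();
    assert_eq!(suffix, "co.uk");
    assert_eq!(registrable, "example.co.uk");
    println!("ok");
}
```

Note that a domain consisting only of a suffix (e.g. "co.uk" itself) has no registrable domain, which is one reason the suffix-printing and registrable-domain-printing scripts can report different line counts.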
Here are the results of the build for me after running it on a Linode server:-

Removing intermediate container 8f286c7ee49b
---> 15a0df77eaf0
Step 8/15 : ADD ./Cargo.toml .
---> 5d8f6370fa7c
Step 9/15 : ADD ./main.rs .
---> 1e842bfb5ac9
Step 10/15 : RUN time cargo build --release --quiet
---> Running in 050a59353578
1908.31user 10.31system 31:11.92elapsed 102%CPU (0avgtext+0avgdata 1442296maxresident)k
358528inputs+395792outputs (1441major+2422274minor)pagefaults 0swaps
Removing intermediate container 050a59353578
---> 2ed41b06b303
Step 11/15 : ADD ./main.py .
---> 4f5448271e9e
Step 12/15 : ADD ./domains.txt .
---> 648490ff725a
Step 13/15 : RUN cat domains.txt > /dev/null
---> Running in b9afa7033a20
Removing intermediate container b9afa7033a20
---> da9c9acb372d
Step 14/15 : RUN time ./target/release/publicsuffixtest public_suffix_list.dat < domains.txt | wc -l
---> Running in d8cc8a2a6eaa
2.43user 1.31system 0:03.87elapsed 96%CPU (0avgtext+0avgdata 2372maxresident)k
73720inputs+0outputs (9major+94minor)pagefaults 0swaps
1048576
Removing intermediate container d8cc8a2a6eaa
---> 8caa81173ea6
Step 15/15 : RUN time pypy main.py public_suffix_list.dat < domains.txt | wc -l
---> Running in 3c91a81de36d
1048576
2.50user 0.16system 0:02.72elapsed 97%CPU (0avgtext+0avgdata 117560maxresident)k
212664inputs+0outputs (252major+18746minor)pagefaults 0swaps
Removing intermediate container 3c91a81de36d
---> b17756176543
Successfully built b17756176543
Successfully tagged pslbench:latest

Note that I used a local copy of domains.txt because the file wasn't getting downloaded. I didn't limit it to the first 1 million domains, I just used the entire file. |
@rushmorem updated the gist again - this version pulls the latest public rDNS snapshot, since Rapid7 clearly makes URLs expire. Test was on my laptop:
Step 15/16 : RUN time ./target/release/publicsuffixtest public_suffix_list.dat < domains.txt | wc -l |
@d33tah I've just released version 0.2.2 with more optimisations. I've also made some adjustments to your Gist, mainly to enable LTO via |
@d33tah Version

~/bench]# docker run --rm -it pslbench
Running PyPy benchmark
2.20user 0.12system 0:02.38elapsed 97%CPU (0avgtext+0avgdata 116076maxresident)k
143392inputs+0outputs (238major+18498minor)pagefaults 0swaps
1048576
Running Rust benchmark
0.40user 0.95system 0:01.36elapsed 99%CPU (0avgtext+0avgdata 2796maxresident)k
2400inputs+0outputs (10major+133minor)pagefaults 0swaps
1048576
Running C benchmark
1.70user 0.06system 0:01.77elapsed 99%CPU (0avgtext+0avgdata 2884maxresident)k
1752inputs+0outputs (8major+266minor)pagefaults 0swaps
1048576 |
@d33tah I've added back support for By default, the debug versions will now only include suffixes under the |
Thanks! Apologies for not responding - had some work to sort out first. I'll test it soon and let you know, OK? |
OK |
Nice! I changed head to 10M because it was actually too fast:
Can't wait to give it a bigger try! |
Actually, it doesn't look correct - I changed ENTRYPOINT to target/release/pslbench and here's what I got:
|
Which script are you using? This one? |
That is correct. EDIT: if you try the same thing for |
Thanks. Those results are pretty close though. Both the EDIT: I suspect that the extra result that the Python library is returning is totally invalid since it doesn't even try to check that. |
I have taken a look at the Python library's source code. I can see that it only exports that one method, |
In light of the bug I highlighted in my previous comment, I have adjusted the benchmarks on my machine to make both
Note that |
After looking at those results again, I noticed that we were issuing way more system calls than PyPy and C, so I did an |
I took the opportunity to make the
|
So a lot of the overhead was just printing the domains to |
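Buffering writes is the standard fix for syscall-heavy output like this. A sketch using the standard library's BufWriter, which batches many small writes into a few system calls (the `write_domains` helper is illustrative, introduced here so the buffering is testable against any `Write` sink):

```rust
use std::io::{self, BufWriter, Write};

// BufWriter accumulates output in an in-memory buffer and only issues a
// write(2) syscall when the buffer fills or is flushed, instead of one
// syscall per printed line.
fn write_domains<W: Write>(out: W, domains: &[&str]) -> io::Result<()> {
    let mut out = BufWriter::new(out);
    for domain in domains {
        // Each writeln! lands in the buffer, not directly in the OS.
        writeln!(out, "{}", domain)?;
    }
    out.flush() // push everything out in one go
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    write_domains(stdout.lock(), &["example.co.uk", "example.com"])
}
```

Locking stdout once and wrapping the lock in a BufWriter combines both optimisations discussed in this thread: no per-line lock acquisition and no per-line syscall.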
@d33tah I think you will love this 😄 |
I've just rebuilt the docker container. It took about 8 and a half seconds.
|
The latest version now builds in about 1 minute in debug mode. |
@rushmorem nice! Could you also update the gist? Here are my changes:

diff --git a/Cargo.toml b/Cargo.toml
index 6d61810..5c2f337 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -8,7 +8,7 @@ name = "publicsuffixtest"
path = "main.rs"
[dependencies]
-psl = "0.2.2"
+psl = "*"
[profile.release]
-lto = true
\ No newline at end of file
+lto = true
diff --git a/Dockerfile b/Dockerfile
index 672e552..4669d08 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -20,8 +20,15 @@ ADD ./main.rs .
ENV RUSTFLAGS "-Ctarget-cpu=native"
RUN time cargo build --release --quiet
+RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
+RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
+ grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
+RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
+ | pigz -dc | head -n 10M | jq -r .value > domains.txt
+
+
ADD ./main.py .
ADD ./bench.sh .
-ENTRYPOINT ./bench.sh
\ No newline at end of file
+ENTRYPOINT ./bench.sh
diff --git a/bench.sh b/bench.sh
old mode 100644
new mode 100755
diff --git a/main.rs b/main.rs
index 92a3a25..012a1ee 100644
--- a/main.rs
+++ b/main.rs
@@ -12,7 +12,7 @@ fn main() {
for line in stdin.lock().lines() {
let domain_str = line.unwrap();
if let Some(suffix) = list.suffix(&domain_str) {
- handle.write(suffix.as_str().as_bytes()).unwrap();
+ handle.write(suffix.as_bytes()).unwrap();
handle.write(b"\n").unwrap();
};
}

And here's a rather unexpected benchmark:
|
In your updates you didn't include write buffering so the benchmark script itself is issuing a system call for each line printed to stdout. I have updated the Gist to match what's in https://github.com/addr-rs/pslbench. It also prints the registrable domain rather than a suffix as per #15 (comment). Computing a suffix is faster but using a registrable domain makes the benchmark fairer to the Python lib because of the bug it has #15 (comment). |
@rushmorem sorry for not responding lately. I tried this modification to your gist:

diff --git a/Dockerfile b/Dockerfile
index 8794e95..f31b3ac 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,12 +1,6 @@
FROM ubuntu:18.04
-RUN apt-get update && apt-get install -y cargo curl jq libssl-dev pkg-config time pigz && apt-get clean
-
-RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
-RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
- grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
-RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
- | pigz -dc | head -n 1M | jq -r .value > domains.txt
+RUN apt-get update && apt-get install -y cargo curl jq libssl-dev pkg-config time pigz psl && apt-get clean
RUN apt-get update && apt-get -y install pypy && apt-get clean
RUN curl -O https://bootstrap.pypa.io/get-pip.py && pypy get-pip.py
@@ -20,10 +14,14 @@ ADD ./main.rs .
ENV RUSTFLAGS "-Ctarget-cpu=native"
RUN time cargo build --release --quiet
+RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
+RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
+ grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
+RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
+ | pigz -dc | head -n 10M | jq -r .value > domains.txt
+
ADD ./main.py .
ADD ./bench.sh .
-RUN apt-get install -y psl
-
ENTRYPOINT ./bench.sh

And these are the results:
|
No problem. We are all busy with other things. The Gist is using Ubuntu 18.04, which has Rust 1.25 in its repos. Rust 1.25 does not support slice patterns, so we use the much slower string matches there. To take advantage of slice patterns you need at least Rust 1.27. Kindly run the benchmark in https://github.com/addr-rs/pslbench and let me know your results. That one uses Rust 1.27. EDIT: Having said that, I'm glad that we are still beating both the |
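The slice patterns mentioned here let generated lookup code match the tail labels of a domain in a single pattern instead of chained string comparisons. An illustrative sketch with a few hypothetical rules (the subslice `..` syntax as written compiles on modern stable Rust):

```rust
// suffix_len is an illustrative stand-in for the generated lookup: it
// returns how many trailing labels of the domain form a known suffix.
fn suffix_len(labels: &[&str]) -> usize {
    match labels {
        [.., "co", "uk"] => 2, // "co.uk"
        [.., "uk"] => 1,       // "uk"
        [.., "com"] => 1,      // "com"
        _ => 0,
    }
}

fn main() {
    let labels: Vec<&str> = "example.co.uk".split('.').collect();
    assert_eq!(suffix_len(&labels), 2);
    println!("ok");
}
```

The compiler can lower such a match into efficient branching, whereas the string-comparison fallback used on older compilers re-examines the input repeatedly.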
For future reference, |
Hi!
I'd really love to use your publicsuffix library, but it's currently very slow (even compared to a Python implementation, 74s vs 5s per 1M domains). Any chance there are any easy fixes that could improve that?
Take a look at the Docker image I created and those few lines from results.txt:
https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831#file-results-txt-L36
Cheers,
d33tah