-
Notifications
You must be signed in to change notification settings - Fork 557
Threadsafing HLL #39
base: master
Are you sure you want to change the base?
Threadsafing HLL #39
Conversation
Thanks for the patch. We had a similar pull request in the past, #17. After some discussion we agreed that the best option is to maintain a separate HLL per thread to avoid the overhead of the atomic objects. What are your thoughts on this tradeoff? |
I agree with Matt. I don't see the point to introduce this kind of complexity since estimator can be efficiently merged. I just ran my benchmark on your branch (single threaded). Most of the operations are not slowed down but merge is ~3.5x slower than in master. Have you benchmarked the multi-threaded case ? I am quite sure that separate HLL are more efficient due to contention, CPU cache and false sharing. I believe that we can apply the same pattern than in LongAdder if required. |
Agree - remove complexity by calculating separate HLL's on desperate threads and rely on native lossless union. |
While I'd prefer to stick to a single writer pattern, we simply can't avoid it in some of our use cases. My choices are either to threadproof things here or add a bunch of complexity elsewhere to maintain thread locals and then pull them back together at the appropriate instances. In any event, I'd be happy to put in the work to speed up the merge scenario, I think that can be improved back up to where it was. But I'd prefer to only spend that time if the patch is going to be considered. Otherwise we'll simply maintain our own fork. |
Can you just wrap it so that the base implementation doesn't slow down? On Tue, Jul 2, 2013 at 3:21 PM, Cliff Moon [email protected] wrote:
|
Why don't we make RegisterSet an interface and then provide a new constructor that allows for selecting between safe and unsafe versions of the RegisterSet? |
Before doing that it would be good to benchmark the proposed implementation against plain old locks for a realistic multi-threaded use case. Wrapping with locks provides greater flexibility (you can wrap only mutable operations if you don't care about stale reads, you can force happens-before to control the staleness of the reads, you can have mutual exclusion if it is what you need etc.) and is much simpler (easy to understand what is the semantic of complex operation like merge). I don't believe that using lock will be much slower on In addition it will be good to keep the RegisterSet implementation simple and orthogonal to thread safety. Indeed, Hotspot is not yet able to produce SIMD code and a native implementation of RegisterSet leads to 10-20x faster merges. It would be good to keep the door open for a such optimization. I will try to run a simple benchmark to get some numbers. |
Thanks for running the benchmarks and it sounds like Cliff is willing to address any performance issues that are uncovered. So I think if we can get a performance neutral (or even positive) solution than this works for me. SIMD RegisterSets would be great. |
Here are the preliminary results. Please remember these are micro-benchmarks and it is very easy to get them wrong or meaningless. I tried to check the overhead in single thread mode and the scalability up to 4 threads (I don't have access to a bigger machine now). I also implemented a very basic locking strategy. I ran the multi-threaded benchmark on master branch too, even if this is obviously broken. It sets a kind of upper bound. I am using JMH, it is a great micro & macro benchmark framework from the OpenJDK guys. OfferMaster:
With Locks:
Atomic Integer:
As you can see:
CardinalityMaster:
Atomic:
Performance is good and scales well MergeMaster:
Atomic:
Awful performance. Scalability seems correct. I did not have the time to perform a clean investigation. According to hprof/flamegraph it seems that the merge function is not inlined by the JIT with Atomic but it does not explain the whole story. I will try to find time to check the assembler output and CPU cache hit/miss to get a better overview of the issue. BTW: I did not reviewed the code but I also noticed their was a bug in the implementation leading to wrong result when the update is contended ( Conclusion
|
First, very nice work on the benchmarks. Regarding the cost of merge, I don't see a huge cost since the merge cost |
Hey great work on the benches. Glad to see it performing well. I agree the HLL++ will need tests to get threadsafed. I would also be game to take a crack at threadsafing countmin as well. As for the merge performance I'd be willing to bet it's due to the eliding of System.arraycopy from bits(). Will think on how to improve this. |
Also would you mind sharing your benchmarking code? We've got test machines with 24 - 32 cores, would be nice to produce a big old spread of measurements across larger numbers of threads, as well as give the JVM a warmup period for JIT to kick in. |
It depends of your use case. HLL(++) are often used in web analytics. You precompute things like unique visitors according bazillion of criteria (time-range, hierarchy etc.), and store them on disk. This phase can be done is parallel on many machines and scalability is easy. Then you merge the estimator to create the final report(s). It can be done on the fly or in a batch so you want fast response time or good throughput. The performance goal is to be IO bound, not CPU bound. On my machine I achieve 300MB/s with the master branch but "only" 100MB/s with current Cliff's code (LOG2M = 14). So it's a regression if you have fast disks or network and want to keep your code simple and single-threaded. It is not a show-stopper but we should try to avoid this slow down if possible since it impacts real workloads. ICardinality agg = new HyperLogLog(LOG2M);
try (FileInputStream fis = new FileInputStream(file)) {
while (true) {
int s = fis.read(buf, 0, len);
if (s == -1) break;
byteCount += s;
agg = agg.merge(HyperLogLog.Builder.build(buf));
}
}
I did not have the time to investigate further today but I don't think this is the issue. In my private repo I have added a mutable merge operation implemented like this: + public void merge(HyperLogLog other) throws HyperLogLogMergeException {
+ if (other.sizeof() != sizeof())
+ {
+ throw new HyperLogLogMergeException("Cannot merge estimators of different sizes");
+ }
+ registerSet.merge(other.registerSet);
+ } A you can see it bypass the BTW: Adding a mutable merge method would be great anyway.
I will upload it ASAP, most likely next week. It is rudimentary but could be a starting point to try to monitor/improve stream-lib performances. |
Ping on benchmarking code. |
Keep in mind that the paint is wet. Comment, contribution & criticism are welcome. Let me known if you need help. |
Any update on this one? We are considering merging another branch from #38 into master soon. That branch improves the serialization and memory footprint of the HLL Plus considerably but will add more thread safety concerns. I'm happy to wait to merge the results from this issue into the other branch in order to avoid breaking thread safety if you are going to be able to push your changes soon. |
any update on this pull request ? thanks |
Replaced all of the mutating operations of RegisterSet to be atomic so that HLL's can safely be updated by multiple threads. Tests pass on multiple runs, performance impact on the HLL appears to be negligible in the single threaded case.