I'm seeing something surprising - the performance of querying the filter seems to depend somewhat on how many elements are contained within the filter. Is that expected? I would have thought lookups are O(1).
Contains performance with 1M elements in the filter:

           fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ bfuse   3.48 ms       │ 4.601 ms      │ 3.571 ms      │ 3.577 ms      │ 100     │ 100
           287.2 Mitem/s │ 217.3 Mitem/s │ 280 Mitem/s   │ 279.5 Mitem/s │         │

Contains performance with 9M elements in the filter:

           fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ bfuse   12.22 ms      │ 16.63 ms      │ 13.38 ms      │ 13.48 ms      │ 100     │ 100
           81.76 Mitem/s │ 60.1 Mitem/s  │ 74.73 Mitem/s │ 74.17 Mitem/s │         │

Contains performance with 99M elements in the filter:

           fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ bfuse   18.22 ms      │ 24.79 ms      │ 19.63 ms      │ 19.97 ms      │ 100     │ 100
           54.86 Mitem/s │ 40.32 Mitem/s │ 50.92 Mitem/s │ 50.07 Mitem/s │         │
It does seem to level off from there, but that's still a 4x reduction in speed for a 100x increase in filter size. BinaryFuse8 vs BinaryFuse32 doesn't seem to matter. The 1M-element filter is 4 MiB and the 9M-element filter is 40 MiB, while the 100M-element filter is 2 GiB, so maybe this is just an artifact of the smaller microbenchmarks sitting in the cache? Just wanted to double-check my assumptions.
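As a back-of-envelope check on those sizes (the ~1.125 slots-per-key overhead factor and the cache figures in the comments are assumptions, not measurements):

// Back-of-envelope sizing: binary fuse filters allocate roughly 1.125
// slots per key (assumed overhead factor), one fingerprint per slot.
fn approx_bytes(keys: u64, bytes_per_fingerprint: u64) -> u64 {
    keys * 9 / 8 * bytes_per_fingerprint
}

fn main() {
    // BinaryFuse32 stores 4-byte fingerprints. Note the benchmark code below
    // builds from 1_000_000..500_000_000, i.e. ~499M keys, which is where a
    // ~2 GiB footprint would come from.
    for keys in [1_000_000u64, 9_000_000, 499_000_000] {
        println!("{keys} keys -> ~{} MiB", approx_bytes(keys, 4) >> 20);
    }
    // Typical L3 caches are on the order of 8-32 MiB: the 1M-key filter fits,
    // the 9M-key filter mostly spills, and the ~2 GiB one lives in DRAM.
}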
Benchmark code:
// Assumes the xorf crate, which provides BinaryFuse32 and the Filter trait.
use xorf::{BinaryFuse32, Filter};

#[divan::bench]
pub fn bfuse(bencher: divan::Bencher) {
    // Allocation-free counting iterator, so the filter can be built straight
    // from a range without materializing a Vec of keys.
    #[derive(Clone, Copy)]
    struct Counter(u64, u64);

    impl Iterator for Counter {
        type Item = u64;
        fn next(&mut self) -> Option<Self::Item> {
            let next = self.0;
            if next >= self.1 {
                None
            } else {
                self.0 += 1;
                Some(next)
            }
        }
    }

    impl ExactSizeIterator for Counter {
        fn len(&self) -> usize {
            (self.1 - self.0) as usize
        }
    }

    let containment_range = 1_000_000..500_000_000;
    // Half of these keys fall below containment_range, so lookups are roughly
    // an even mix of hits and misses (modulo false positives).
    const fn lookup_range() -> std::ops::Range<u64> {
        500_000..1_500_000
    }

    let filter =
        BinaryFuse32::try_from_iterator(Counter(containment_range.start, containment_range.end))
            .unwrap();

    bencher
        .counter(divan::counter::ItemsCount::new(
            lookup_range().end - lookup_range().start,
        ))
        .bench_local(|| {
            for i in lookup_range() {
                // black_box prevents the lookup from being optimized away.
                std::hint::black_box(filter.contains(&i));
            }
        });
}
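If it really is the cache, hardware counters should confirm it: running the benchmark under something like perf stat -e cache-references,cache-misses (an untested suggestion) should show the miss rate climbing sharply between the 40 MiB and 2 GiB filters.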
I can't say without more investigation, but I wonder if it does indeed have to do with the cache.
It is also the case that filters with a lower false-positive rate have slower query times, so that may be part of it too.
The implementation of lookups is here. You can see that indexing into an array of fingerprints is part of the function, so if the array is large and not in the cache, I can see how that would lead to subpar results.
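Concretely, a binary-fuse lookup has roughly this shape (a simplified sketch, not this crate's actual code: mix below is a splitmix64-style stand-in, and the plain-modulo indexing replaces the real segmented index derivation):

// Illustrative shape of a binary fuse filter lookup; NOT the real
// implementation, which confines each index to its own segment.
fn mix(seed: u64, key: u64) -> u64 {
    // splitmix64-style finalizer as a stand-in hash
    let mut h = key.wrapping_add(seed);
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

fn contains_sketch(fingerprints: &[u32], seed: u64, key: u64) -> bool {
    let h = mix(seed, key);
    let f = (h ^ (h >> 32)) as u32; // fingerprint derived from the hash
    let n = fingerprints.len() as u64;
    // Three data-dependent indices at effectively random positions.
    let i0 = (h % n) as usize;
    let i1 = (h.rotate_left(21) % n) as usize;
    let i2 = (h.rotate_left(42) % n) as usize;
    // A constant number of operations, so O(1) -- but the constant includes
    // three random reads. Once `fingerprints` outgrows the cache, each query
    // pays up to three memory round trips.
    f == fingerprints[i0] ^ fingerprints[i1] ^ fingerprints[i2]
}

So a query is O(1) in instruction count, but not in memory latency. Wider fingerprints lower the false-positive rate (roughly 2^-k for k-bit fingerprints) at the cost of a proportionally larger array, hence more cache pressure.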