Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attempt to multiply with overflow when using i32 #138

Open
hamirmahal opened this issue Nov 11, 2023 · 22 comments
Open

attempt to multiply with overflow when using i32 #138

hamirmahal opened this issue Nov 11, 2023 · 22 comments

Comments

@hamirmahal
Copy link

No description provided.

@hamirmahal
Copy link
Author

https://github.com/hamirmahal/rstar-i32/ has some code that reproduces the issue.

@hamirmahal
Copy link
Author

https://github.com/hamirmahal/rstar-i32/actions/runs/6836250994/job/18590953607#step:3:43 also shows the output on a machine that isn't mine, although you have to be signed in to see it via this link.

@hamirmahal
Copy link
Author

I'm not sure if the error is in my code, or in rstar. Am I using rstar incorrectly? src/main.rs outputs

$  cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/rstar-boilerplate`
thread 'main' panicked at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/arith.rs:346:1:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Rust code that resulted in the above output
use rstar::Point;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Data {
    coordinates: [i32; 18],
}

impl Point for Data {
    type Scalar = i32;
    const DIMENSIONS: usize = 18;

    fn nth(&self, index: usize) -> Self::Scalar {
        self.coordinates[index]
    }

    fn nth_mut(&mut self, index: usize) -> &mut Self::Scalar {
        &mut self.coordinates[index]
    }

    fn generate(mut generator: impl FnMut(usize) -> Self::Scalar) -> Self {
        Data {
            coordinates: [
                generator(0),
                generator(1),
                generator(2),
                generator(3),
                generator(4),
                generator(5),
                generator(6),
                generator(7),
                generator(8),
                generator(9),
                generator(10),
                generator(11),
                generator(12),
                generator(13),
                generator(14),
                generator(15),
                generator(16),
                generator(17),
            ],
        }
    }
}

fn main() {
    let mut tree = rstar::RTree::new();
    tree.insert(Data {
        coordinates: [
            30, 30, 62, 30, 31, 30, 20, 12, 30, 80, 40, 132, 28, 78, 140, 139, 195, 67,
        ],
    });
    tree.insert(Data {
        coordinates: [
            40, 9, 9, 144, 136, 149, 148, 140, 198, 36, 50, 167, 179, 234, 38, 2, 103, 38,
        ],
    });
    tree.insert(Data {
        coordinates: [
            38, 38, 38, 6, 2, 3, 155, 61, 61, 195, 155, 15, 70, 134, 158, 126, 94, 63,
        ],
    });
    tree.insert(Data {
        coordinates: [
            154, 154, 138, 179, 121, 75, 143, 31, 7, 67, 11, 3, 113, 113, 65, 65, 73, 65,
        ],
    });
    tree.insert(Data {
        coordinates: [
            221, 215, 209, 115, 210, 198, 224, 236, 111, 6, 7, 7, 85, 92, 203, 197, 36, 44,
        ],
    });
    tree.insert(Data {
        coordinates: [
            176, 240, 176, 176, 112, 241, 240, 240, 192, 44, 44, 12, 206, 204, 142, 4, 44, 2,
        ],
    });
    tree.insert(Data {
        coordinates: [
            212, 218, 202, 70, 70, 23, 23, 23, 1, 230, 250, 55, 30, 23, 60, 60, 92, 16,
        ],
    });
}

@urschrei
Copy link
Member

urschrei commented Nov 12, 2023

rstar uses a fold to calculate the area of the envelope (an AABB) of the geometry (In your case, the 18-dimensional point you're trying to insert on line 78). Fold uses an accumulator and a multiplication function:

diag.fold(one, |acc, cur| {
    max_inline(cur, zero) * acc
})

During the insertion operation on line 78 of your code, you can see the problem:

Acc: 359131136 Cur: 68.

Multiplying acc with cur overflows i32. That's a panic in debug builds.

@hamirmahal
Copy link
Author

Ah, because the product, 24,420,917,248, is larger than the max i32 value, 2,147,483,647.

If each coordinate is a u8, the max possible value when multiplying them all together is 255^numCoordinates, which is currently 18, but the number of coordinates could easily double, or triple, for my use case.

So the maximum possible product currently is 255^18 = 2.078371831996e43, which I don’t even think the maximum unsigned integer size, u128, can support, since I think that’s approximately 3.402823669209e38.

Is calculating the area of the bounding box required to use an R-tree? I’m wondering if it’s possible to write an implementation that avoids doing this calculation altogether to avoid running the risk of overflow.

Alternatively, would checking for overflow before the multiplication, and setting the product to u128::MAX if there would be an overflow, result in any weird or unexpected behavior? I’m also considering that approach.

During the insertion operation on line 78 of your code, you can see the problem:

Acc: 359131136 Cur: 68.

Multiplying acc with cur overflows i32. That's a panic in debug builds.

Hmm... where does the current value of 68 come from? I don't see any elements in https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc with the value 68.

Also, how were you able to see the current and total values in fold?

I tried adding

        diag.fold(one, |acc, cur| {
            std::println!("acc: {}, cur: {}", acc, cur);
            return max_inline(cur, zero) * acc;
        })

to ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rstar-0.11.0/src/aabb.rs, but that gave me this error.

ailed to resolve: use of undeclared crate or module `std`
use of undeclared crate or module `std`rustc

Thanks for the detailed response.

@michaelkirk
Copy link
Member

Just to clarify: You're using 18-dimensional coordinates, is that right?

@michaelkirk
Copy link
Member

michaelkirk commented Nov 13, 2023

Ok, I see your example:
https://github.com/hamirmahal/rstar-i32/commit/aef70723a73ca76c71387ef93d825b7afcf5a0ea#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc

So you are using 18-D coordinates.

Just to ask a silly question: Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?

@hamirmahal
Copy link
Author

Did you really intend to have a single coordinate with 18 dimensions, or were you trying to have something like 9 2-D (XY) coordinates?

This is correct.

@hamirmahal
Copy link
Author

I'm actually trying to find the element in my dataset with the closest Hamming distance to some input.

The things I'm comparing currently have 18 bytes, hence the 18 coordinates that can each be as large as 255, and it looks like we may need to increase the number of bytes we store to 32, or possibly even more.

Do you think R-trees are a suitable data structure for this purpose?

@michaelkirk
Copy link
Member

michaelkirk commented Nov 14, 2023

Do you think R-trees are a suitable data structure for this purpose?

Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣

But more seriously... I've never worked with hamming distance. The definition I searched up was:

While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.

So, hamming distance is about bitwise distance - not euclidean distance.

In other words:

hamming_distance(0b_1000, 0b_0000) == 1  // Near!
hamming_distance(0b_1000, 0b_0111) == 4  // Far!

But with typical numeric distance we get a very different answer:

numeric_distance(0b_1000, 0b_0000) == 8  // Far!
numeric_distance(0b_1000, 0b_0111) == 1  // Near!

So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.

@urschrei
Copy link
Member

Hmm. Hamming distance satisfies the triangle inequality, so I think it's possible to use R[*]-trees, but I have no idea how we're going to make it work for rstar…

@hamirmahal
Copy link
Author

Do you think R-trees are a suitable data structure for this purpose?

Well, I can say with certainty that this particular R-Tree implementation is apparently not currently suitable for this purpose. 🤣

Ha!

But more seriously... I've never worked with hamming distance. The definition I searched up was:

While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different.

So, hamming distance is about bitwise distance - not euclidean distance.

In other words:

hamming_distance(0b_1000, 0b_0000) == 1  // Near!
hamming_distance(0b_1000, 0b_0111) == 4  // Far!

But with typical numeric distance we get a very different answer:

numeric_distance(0b_1000, 0b_0000) == 8  // Far!
numeric_distance(0b_1000, 0b_0111) == 1  // Near!

That all sounds correct.

So intuitively, I don't understand how you'd use an rtree, which deals with euclidean numeric distance, to efficiently compute hamming distance.

I ended up doing

impl PointDistance for Data {
    fn distance_2(&self, point: &[f32; 18]) -> f32 {
        let hamming_distance: f32 = self
            .hash
            .as_slice()
            .iter()
            .zip(point)
            .map(|(l, r)| {
                // cast l and r to u8, because we know that the values are in the range [0, 255].
                let l = *l as u8;
                let r = *r as u8;
                (l ^ r).count_ones() as f32
            })
            .sum();

        // We must return the squared distance!
        hamming_distance * hamming_distance
    }
}

but it seems to be quite computationally expensive, and not very efficient.

I used f32s instead of i32s, even though I'm only dealing with u8s, because rstar doesn't support u8s out of the box, and because I didn't get attempt to multiply with overflow when using f32s.

@urschrei
Copy link
Member

What's the hash for?

@hamirmahal
Copy link
Author

Oh, so I'm ideally trying to find the closest Hamming distance between &self's hash, which is just 18 bytes, and every other hash in my dataset.

@urschrei
Copy link
Member

Hm. If you're using the default Hash impl, you're using a very high-quality (and expensive) hashing algorithm, which I would imagine is complete overkill in this case.

@hamirmahal
Copy link
Author

Ah, I see… I’ll look into some different approaches.

@urschrei
Copy link
Member

@rmanoka
Copy link
Contributor

rmanoka commented Nov 16, 2023

Do we support bnum ? It supports a u256, which should suffice for above calc?

@adamreichold
Copy link
Member

adamreichold commented Nov 16, 2023

I think before we go ahead and make this crate more difficult to maintain, it should be established whether R* trees are actually suitable to querying Hamming distances.

While it could be innovative cross-pollination between different areas of mathematics, I am somewhat doubtful that the geometric intuitions underpinning the construction of R* trees transfer over to code spaces. Generally, there exist a lot of metric spaces which are decidedly weird compared to the two- or three-dimensional Euclidean space our intuition is still based on even when alternative metrics like Manhattan are used.

So, I think the first step would be to just modify the source of rstar in a local fork to try to hack around the limits of the current design, e.g. use saturating arithmetic if that is helpful or use an high (e.g. bnum) or arbitrary (e.g. num-bigint) precision number library. Then this hacky local fork can be used to figure out if R* trees can be useful for the given problem and based on these experiences we can decide whether and how to extend this library to support this use case.

(But if the changes are too extensive, the answer might still be no as as maintainers, saying no is often what we need to do to keep complexity in check. We have also have represent the interests of our current user base which I guess is mostly GIS, games and then some simulation stuff in order of popularity.)

Personally, my first guess when reading the original problem would be to look into Levenshtein automata which are used for typo resistant search, i.e. to find a matching term with the fewest possible edits away from the query term.

@urschrei
Copy link
Member

Oh, I don't think anyone was actually proposing a change. Certainly when I said "make it work" I didn't mean it in the sense of modification.

@rmanoka
Copy link
Contributor

rmanoka commented Nov 17, 2023

@hamirmahal I think you can get it working by just converting the [u8; 18] slice into a [f64; 18] slice before storing it in the hash variable. f64 or f32, with float should support the range you need. There'll be some loss in precision in that calculation, but I think it's only used for re-balancing of the r-tree that is constructed (and so must still work, with possibly poorer performance).

@hamirmahal
Copy link
Author

@rmanoka I'll give that a look. Thanks for the suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants