ARROW-10561: [Rust] Simplified Buffer's `write` and `write_bytes` and fixed undefined behavior #8645
@alamb @jhorstmann @vertexclique, I think I need your help here. This change is causing a segfault in one of the benchmarks, even though I have not changed any unsafe code and think the change is reasonable. I narrowed it down to the function (IMO we should just make that thing safe...)
My suggestion is to merge #8598, then implement the bit ops over that interface and get rid of bit_util.rs. That would remove all these issues in a single shot.
self.len += len_added;
Ok(())
}
fn write_bytes(&mut self, bytes: &[u8], len_added: usize) {
@jorgecarleitao My guess would be that the issue is related to this `len_added` parameter. In the filter kernel it was used for additional padding, while most other users probably interpreted it as the length of the bytes array. I would suggest removing this parameter, since you already implemented a workaround in the filter kernel.
I agree with @jhorstmann here that we should remove `len_added` (maybe as another PR) -- the length that is actually added is `bytes.len()` rather than `len_added`, so making the caller provide both leaves room for additional latent bugs.
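The hazard described above can be sketched with a toy type. This is only an illustration under stated assumptions: `LenTrackedBuffer` and both of its methods are hypothetical names, not the arrow crate's `MutableBuffer`. The first method trusts a caller-supplied length, the second derives it from the slice.

```rust
/// Hypothetical stand-in for a growable byte buffer (not arrow's `MutableBuffer`).
#[derive(Default)]
struct LenTrackedBuffer {
    data: Vec<u8>,
    /// Length tracked separately from `data`, as in the signature under review.
    len: usize,
}

impl LenTrackedBuffer {
    /// Mirrors the reviewed shape: the caller supplies the length to add.
    fn write_bytes_with_len(&mut self, bytes: &[u8], len_added: usize) {
        self.data.extend_from_slice(bytes);
        // Trusts the caller, so `len` can disagree with the bytes written.
        self.len += len_added;
    }

    /// Alternative: derive the length from the slice so it cannot drift.
    fn write_bytes(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
        self.len += bytes.len();
    }
}
```

With the first method, writing 8 bytes while passing `len_added = 2` leaves `len` at 2 even though 8 bytes were stored. With the second, the two values always agree.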
@@ -525,7 +515,7 @@ impl<T: ArrowPrimitiveType> ArrayBuilder for PrimitiveBuilder<T> {
 let sliced = array.buffers()[0].data();
 // slice into data by factoring (offset and length) * byte width
 self.values_builder
-.write_bytes(&sliced[(offset * mul)..((len + offset) * mul)], len)?;
+.write_bytes(&sliced[(offset * mul)..((len + offset) * mul)], len);
I couldn't reproduce the issue yet, but this line looks a bit suspicious. The first parameter has a larger length (in bytes) than the second `len` parameter indicates (number of `T` elements).
Nice 👁️ -- I agree that removing the `len` parameter entirely would be the best course of action here.
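To make the mismatch concrete, here is a small sketch of the slicing arithmetic from the diff. `byte_range` is a hypothetical helper written for this illustration, and it assumes `mul` is the byte width of one element of type `T`.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Byte range covering `len` elements of type `T`, starting at element `offset`.
/// Hypothetical helper that mirrors the slicing expression in the diff.
fn byte_range<T>(offset: usize, len: usize) -> Range<usize> {
    let mul = size_of::<T>(); // assumed: byte width of one element
    (offset * mul)..((len + offset) * mul)
}
```

For `i32` with `offset = 2` and `len = 3`, the range is `8..20`, which spans 12 bytes, while the second argument passed alongside it is 3. The two only coincide for one-byte element types.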
Ok, I confirm that this error is due to a wrong pointer offset on the
This PR now also removes that function altogether and replaces it with a safe counterpart. I suspect that there will be a performance regression, but I do not know enough about bit operations to fix the function (and it is not clear whether we should, as per @vertexclique's suggestion of using
I am going to take another hard look at #8598 and see if we can get enough consensus to get it merged.
In terms of evidence that there is a problem on master, I ran the arrow test suite under
My interpretation of this report is that The actual command I used is below in case anyone is interested
(If you would be so kind, could you quickly run fd75933, just to test whether this PR addresses the issue?)
@jorgecarleitao -- I did so. Valgrind reported no errors when I ran commit fd75933 (HEAD, jorgecarleitao/buffer_write).
FWIW I ran the code in #8598 under valgrind and it does not appear to fix the issue -- see details in #8598 (comment).
In my mind this PR improves the code and could be merged as is -- it fixes at least one valgrind complaint.
I do think a follow-on PR to remove `len_added`, as suggested by @jhorstmann (https://github.com/apache/arrow/pull/8645/files#r522001562), would make things even better (and maybe fix more bugs).
/// Note this doesn't do any bound checking, for performance reason. The caller is
/// responsible to guarantee that both `start` and `end` are within bounds.
#[inline]
pub unsafe fn set_bits_raw(data: *mut u8, start: usize, end: usize) {
🎉
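For readers who have not seen what a safe counterpart to a raw-pointer bit setter can look like, here is a minimal sketch. It is not the implementation in this PR: the name `set_bits`, the half-open range `start..end`, and the least-significant-bit-first layout within each byte are assumptions made for this illustration. It trades speed for bounds checking by setting one bit at a time.

```rust
/// Sets bits `start..end` in `data`, LSB-first within each byte (assumed layout).
/// Panics instead of writing out of bounds.
fn set_bits(data: &mut [u8], start: usize, end: usize) {
    assert!(
        start <= end && end <= data.len() * 8,
        "bit range out of bounds"
    );
    for i in start..end {
        // Byte index is i / 8; bit position inside that byte is i % 8.
        data[i / 8] |= 1u8 << (i % 8);
    }
}
```

Because the function takes a `&mut [u8]` rather than a `*mut u8`, an incorrect offset becomes a panic with a message rather than undefined behavior. A faster version would handle whole bytes in the middle of the range at once.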
Thanks a lot, @alamb, really useful data points ❤️ For me that is enough of a reason: fix UB with I also agree with @jhorstmann about the
… fixed undefined behavior

This PR addresses a major issue on builders and 3 small issues on `MutableBuffer`:

1. [major] fixes undefined behavior due to incorrect pointer arithmetic in `set_bits_raw`, causing a bench to segfault
2. `write_bytes` is incorrect, as it double-increments `len`: the length is incremented both on `self.write` and also by `write_bytes` itself. This leads to more allocations than necessary.
3. `write` is implemented from the trait `io::Write`. However, this trait is suitable for fallible IO operations. In the case of a write to memory, it isn't really fallible because we can always call `reserve` to allocate more space.
4. `write` and `write_bytes` are really similar.

This PR replaces both `write_bytes` and `write` by `extend_from_slice` (inspired by [`Vec::extend_from_slice`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.extend_from_slice)) that checks the available capacity and reserves more if necessary. This has the same or better performance than `write`, as it performs a single comparison per call.

Closes apache#8645 from jorgecarleitao/buffer_write

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
I would like to propose that we outline and enforce guidelines on the arrow crate implementation with respect to the usage of `unsafe`.

The background of this proposal is PRs #8645 and #8829. In both cases, while addressing an unrelated issue, they hit undefined behavior (UB) due to an incorrect usage of `unsafe` in the code base. This UB was very time-consuming to identify and debug: combined, the two cases accounted for more than 12 hours of my time.

Safety against undefined behavior is the core premise of the Rust language. In many cases, the performance improvements do not justify the maintenance burden (time to find and fix bugs) and the loss of motivation in handling these bugs (they are painful because of how difficult they are to debug). In particular, IMO those 12 hours would have been better spent on other parts of the code had `unsafe` not been used in the first place, which would likely have translated into faster code or more features.

There are situations where `unsafe` is necessary, and the guidelines outline these cases. However, I also see many uses of `unsafe` that are neither necessary nor properly documented.

The goal of these guidelines is to motivate contributors to the Rust implementation to be conscious of the maintenance cost of `unsafe`, and to outline specific necessary conditions for any new `unsafe` to be introduced in the code base.

Closes #8901 from jorgecarleitao/arrow_unsafe

Lead-authored-by: Jorge Leitao <[email protected]>
Co-authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This PR addresses a major issue on builders and 3 small issues on `MutableBuffer`:

1. [major] fixes undefined behavior due to incorrect pointer arithmetic in `set_bits_raw`, causing a bench to segfault
2. `write_bytes` is incorrect, as it double-increments `len`: the length is incremented both on `self.write` and also by `write_bytes` itself. This leads to more allocations than necessary.
3. `write` is implemented from the trait `io::Write`. However, this trait is suitable for fallible IO operations. In the case of a write to memory, it isn't really fallible because we can always call `reserve` to allocate more space.
4. `write` and `write_bytes` are really similar.

This PR replaces both `write_bytes` and `write` by `extend_from_slice` (inspired by `Vec::extend_from_slice`) that checks the available capacity and reserves more if necessary. This has the same or better performance than `write`, as it performs a single comparison per call.
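A rough sketch of the "check capacity, reserve if needed, then copy" shape described above follows. It is an illustration only, not the PR's code: `GrowableBuffer`, `grow_to`, and the doubling growth policy are assumptions, and a zero-filled `Vec<u8>` stands in for the raw allocation so the example stays in safe Rust.

```rust
/// Hypothetical buffer; `storage.len()` plays the role of capacity here.
struct GrowableBuffer {
    storage: Vec<u8>,
    len: usize,
}

impl GrowableBuffer {
    fn new() -> Self {
        Self { storage: Vec::new(), len: 0 }
    }

    /// Grows the backing storage to hold at least `required` bytes
    /// (assumed policy: at least double the current capacity).
    fn grow_to(&mut self, required: usize) {
        let new_capacity = required.max(self.storage.len() * 2);
        self.storage.resize(new_capacity, 0);
    }

    /// Infallible append: the length comes from the slice itself.
    fn extend_from_slice(&mut self, bytes: &[u8]) {
        let new_len = self.len + bytes.len();
        // The single capacity comparison per call.
        if new_len > self.storage.len() {
            self.grow_to(new_len);
        }
        self.storage[self.len..new_len].copy_from_slice(bytes);
        self.len = new_len;
    }

    /// The bytes written so far, excluding spare capacity.
    fn as_slice(&self) -> &[u8] {
        &self.storage[..self.len]
    }
}
```

Since the method returns nothing and takes no separate length argument, there is no `Result` to propagate and no second value for the caller to get wrong.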