Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added COW semantics to Buffer, Bitmap and some arrays #794

Merged
merged 9 commits into from
Feb 2, 2022

Conversation

ritchie46
Copy link
Collaborator

I make this PR to have some context for the discussion. I understand this hits the core of all memory operations so we must ensure its correct.

This allows for a PrimitiveArray to get a MutablePrimitiveArray zero copy. Allowing for mutations, pushes etc.

This conversion from PrimitiveArray to MutablePrimitiveArray can only be done if

  • The Arc<> pointer is not shared. This is checked with Arc::get_mut and the borrow checker
  • The data is allocated in Rust by a Vec and not by an FFI.
  • the data has an offset of 0. (let's keep it simple)
  • Both the validity Bitmap and the values Buffer<T> suffice above requirements.

@jorgecarleitao
Copy link
Owner

cc @sundy-li @Dandandan

@codecov
Copy link

codecov bot commented Jan 30, 2022

Codecov Report

Merging #794 (5ed0c43) into main (601fa08) will increase coverage by 0.38%.
The diff coverage is 81.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
+ Coverage   71.08%   71.46%   +0.38%     
==========================================
  Files         319      320       +1     
  Lines       16672    16731      +59     
==========================================
+ Hits        11851    11957     +106     
+ Misses       4821     4774      -47     
Impacted Files Coverage Δ
src/array/mod.rs 71.05% <ø> (ø)
src/buffer/bytes.rs 51.35% <65.21%> (-0.27%) ⬇️
src/array/primitive/mod.rs 80.23% <81.81%> (+0.54%) ⬆️
src/buffer/foreign.rs 88.88% <88.88%> (ø)
src/buffer/immutable.rs 96.42% <90.90%> (-1.40%) ⬇️
src/bitmap/immutable.rs 87.80% <100.00%> (+1.31%) ⬆️
src/ffi/ffi.rs 83.67% <100.00%> (ø)
src/io/parquet/read/nested_utils.rs 77.45% <0.00%> (-0.99%) ⬇️
src/array/binary/mutable.rs 78.57% <0.00%> (+1.29%) ⬆️
src/io/avro/write/schema.rs 66.66% <0.00%> (+3.03%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 601fa08...5ed0c43. Read the comment docs.

@jorgecarleitao
Copy link
Owner

Thanks!!

Miri passed 🎉 .

I cross-posted it here: https://users.rust-lang.org/t/add-cow-to-an-optional-foreign-pointer-soundness/71091 and will ping dark-arts folks for feedback.

@jorgecarleitao
Copy link
Owner

Could it make sense to add an API for Buffer -> Vec and Bitmap -> MutableBitmap and base the Primitive -> MutablePrimitive on that API?

@ritchie46
Copy link
Collaborator Author

Could it make sense to add an API for Buffer -> Vec and Bitmap -> MutableBitmap and base the Primitive -> MutablePrimitive on that API?

I believe we have that.

for Buffer

    /// Try to get a hold to the inner data as `&mut Vec`.
    /// This will only succeed if the `reference count == 1` and the data
    /// is allocated by this program.
    pub fn get_vec(&mut self) -> Option<&mut Vec<T>> {
        if self.offset != 0 {
            None
        } else {
            Arc::get_mut(&mut self.data).map(|b| b.get_vec()).flatten()
        }
    }

for Bitmap

    /// Try to convert this `Bitmap` to a `MutableBitmap`
    pub fn into_mut(mut self) -> MaybeMut<Self, MutableBitmap> {
        match (
            self.offset,
            Arc::get_mut(&mut self.bytes).map(|b| b.get_vec()).flatten(),
        ) {
            (0, Some(v)) => {
                let data = std::mem::take(v);
                MaybeMut::Mutable(MutableBitmap::from_vec(data, self.length))
            }
            _ => MaybeMut::Immutable(self),
        }
    }

The reason we take self is to consume the container and be able to clean things up properly. However, we already decided to take self before we know we are able to suffice the requirements.

For this reason we return Either a MutableBitmap or a Bitmap, where the latter is the same data structure untouched.

@jorgecarleitao
Copy link
Owner

I went through this and looks great. Great work, @ritchie46 (and Imagine quite fun)!

I want to give this a spin locally before merging, if that is ok for you.

@ritchie46
Copy link
Collaborator Author

I went through this and looks great. Great work, @ritchie46 (and Imagine quite fun)!

I want to give this a spin locally before merging, if that is ok for you.

Awesome! This will open up really cool things! :)

@jorgecarleitao
Copy link
Owner

I went through the changes locally and dropped a PR to your branch.

I would like to propose a further modification (separate from that PR).

While reviewing, I noticed that we currently deref Buffer using vec[offset..offset + length]. However, we do not need to perform those bound checks, since offset + length is guaranteed to be in bounds by the struct's invariant. This check can be removed.

However, if we offer a &mut Vec, a user may .clear or truncate, violating this invariant (and thus not allowing the optimization mentioned above).

One way to approach this to offer an into_mut(mut self) -> MaybeMutable<Vec<T>> instead of a get_mut, which ensures that we do not break the invariant, just like we do with Bitmap -> MutableBitmap.

Also, I would consider renaming MaybeMutable to Either (or better, depend on either, since it is the de-facto enum for this behavior.

What do you think, @ritchie46 ?

@ritchie46
Copy link
Collaborator Author

Thank you! I will take a look.

Regarding

However, if we offer a &mut Vec, a user may .clear or truncate, violating this invariant (and thus not allowing the optimization mentioned above).

This should not matter because we only share a &mut Vec iff

  1. it is allocated by Vec<T>
  2. Arc::get_mut succeeds, thus there is only a single owner

In that case a deallocation is completely safe. Even a reallocation is fine because &mut Vec will update its internal pointer and our invariants that it should be allocated by Vec still hold.

In the case of a foreign allocated pointer you just cannot get a &mut Vec.

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Feb 2, 2022

Sorry, the invariant that I am referring to is that buffer.offset + buffer.length < buffer.data.length(), that we check whenever the user calls buffer.slice. Once we allow a mutable buffer.data, buffer.data.length() may change; in particular it can be smaller than buffer.offset + buffer.length

An alternative behavior is to not allow get_mut when length != self.data.length() (besides checking for .offset != 0)

@ritchie46
Copy link
Collaborator Author

ritchie46 commented Feb 2, 2022

Sorry, the invariant that I am referring to is that buffer.offset + buffer.length < buffer.data.length(), that we check whenever the user calls buffer.slice. Once we allow a mutable buffer.data, buffer.data.length() may change; in particular it can be smaller than buffer.offset + buffer.length

Right.. Then we maybe must return an owned Vec an deallocate the Buffer? Then we can always create a new buffer, need be.

One way to approach this to offer an into_mut(mut self) -> MaybeMutable<Vec> instead of a get_mut, which ensures that we do not break the invariant, just like we do with Bitmap -> MutableBitmap.

Sorry, I now understand your suggestion. Agree! :)

Also, I would consider renaming MaybeMutable to Either (or better, depend on either, since it is the de-facto enum for this behavior.

Yes, I thought about this one. I will modify it and use Either.

@ritchie46
Copy link
Collaborator Author

ritchie46 commented Feb 2, 2022

While reviewing, I noticed that we currently deref Buffer using vec[offset..offset + length]. However, we do not need to perform those bound checks, since offset + length is guaranteed to be in bounds by the struct's invariant. This check can be removed.

Awesome. So we can even remove very hot a check by this. 🎉

@jorgecarleitao jorgecarleitao merged commit 54797de into jorgecarleitao:main Feb 2, 2022
@ritchie46 ritchie46 deleted the get_mutable branch February 2, 2022 20:07
@houqp
Copy link
Collaborator

houqp commented Feb 3, 2022

This is very cool, nice work @ritchie46 . Exactly what we needed in delta-rs :D

@jorgecarleitao jorgecarleitao changed the title PrimitiveArray -> MutableArray Added COW semantics to Buffer, Bitmap and some arrays Mar 6, 2022
@declark1
Copy link

declark1 commented Apr 8, 2022

This is awesome. As a Rust newbie, I'm struggling with fighting the borrow checker when attempting to go from a Box'd or Arc'd Array trait object to a MutableArray. Any idea how I could do this?

e.g.

// This works
let arr = create_primitive_array_with_seed::<i64>(50, 0.0 as f32, 42);
let mut mut_arr = arr.into_mut().unwrap_right();

// This fails
let arr =
    Box::new(create_primitive_array_with_seed::<i64>(50, 0.0 as f32, 42)) as Box<dyn Array>;
let mut mut_arr = arr
    .as_any()
    .downcast_ref::<PrimitiveArray<i64>>()
    .unwrap()
    .into_mut()
    .unwrap_right();
/*
error[E0507]: cannot move out of a shared reference
    |
191 |       let mut mut_arr = arr
    |  _______________________^
192 | |         .as_any()
193 | |         .downcast_ref::<PrimitiveArray<i64>>()
194 | |         .unwrap()
195 | |         .into_mut()
    | |          ---------^
    | |__________|________|
    |            |        move occurs because value has type `arrow2::array::PrimitiveArray<i64>`, which does not implement the `Copy` trait
    |            value moved due to this method call
    |
note: this function takes ownership of the receiver `self`, which moves value
    |
216 |     pub fn into_mut(self) -> Either<Self, MutablePrimitiveArray<T>> {
    |                     ^^^^
*/

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants