Eager/early serialize of components to arrow in Rust and C++ #7245

Closed
10 tasks done
Wumpf opened this issue Aug 20, 2024 · 0 comments · Fixed by #8793
Wumpf commented Aug 20, 2024

As we'll soon introduce tagged components and simple multi-datatype components, it gets harder and harder to represent Archetypes (and concrete ComponentBatches) as collections of concrete types.
Take the example of a generalized rotation component/archetype field which may be represented by various datatypes: we can no longer store concrete types on an archetype and have to type-erase them right away instead.
Note that this way C++ and Rust get much closer to the Python SDK in this regard.

This fits very well with our desire to get rid of concrete component types in the SDK languages, which today almost always take the form of struct ComponentType(pub datatypes::TheDataType) together with a myriad of constructors, trait impls and utilities, i.e. a lot of forwarding code.
Eager serialization allows us to implement component semantics on archetypes instead, with concrete construction methods. E.g. with_quaternion and with_axis_angle would both populate the multi-datatype rotation component, which gets tagged appropriately.
When logging raw component batches/columns this would become more explicit, as you're expected to supply a datatype array/collection together with the appropriate component tag (which will still be provided by the SDK, but in registry fashion rather than as a class/struct per component). This follows the exact same mechanism an archetype uses to construct its internal ComponentBatches.
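
To make the idea concrete, here is a minimal sketch of an archetype whose construction methods serialize eagerly. Everything in it is an assumption for illustration: `SerializedBatch`, `Transform`, the tag strings and the byte layout are hypothetical, and plain bytes stand in for the Arrow arrays the SDK would actually hold.

```rust
/// Hypothetical stand-in for a serialized, tagged component batch.
/// The real SDK would hold an Arrow array; raw bytes keep this sketch dependency-free.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializedBatch {
    /// Hypothetical tag naming the component and the datatype it was built from.
    pub tag: &'static str,
    /// Little-endian `f32` values, serialized at construction time.
    pub bytes: Vec<u8>,
}

/// Eagerly serialize a slice of `f32` into owned bytes.
fn serialize_f32s(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Hypothetical archetype: it stores no concrete rotation type, only serialized data.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Transform {
    pub rotation: Option<SerializedBatch>,
}

impl Transform {
    pub fn new() -> Self {
        Self::default()
    }

    /// Populate the rotation field from a quaternion (x, y, z, w).
    pub fn with_quaternion(mut self, xyzw: [f32; 4]) -> Self {
        self.rotation = Some(SerializedBatch {
            tag: "rotation:quaternion",
            bytes: serialize_f32s(&xyzw),
        });
        self
    }

    /// Populate the same rotation field from an axis and an angle in radians.
    pub fn with_axis_angle(mut self, axis: [f32; 3], angle_rad: f32) -> Self {
        let values = [axis[0], axis[1], axis[2], angle_rad];
        self.rotation = Some(SerializedBatch {
            tag: "rotation:axis_angle",
            bytes: serialize_f32s(&values),
        });
        self
    }
}
```

Both builders write to the same type-erased field, and only the tag records which datatype the bytes came from, so the archetype needs no `enum` or generic parameter over rotation representations.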

A drawback of this approach is that most accesses of archetypes require deserialization back into the source datatypes, which can be cumbersome in some cases. However, this is what we expect to do when a user reads back data from the store, so it may soon become commonplace anyway.
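
A rough sketch of what that read-back cost looks like, under the assumption that positions are stored as little-endian `f32` bytes. `SerializedPositions` and its methods are hypothetical names, not SDK API; the real data would be an Arrow array.

```rust
/// Hypothetical stand-in for an eagerly serialized component (real SDK: an Arrow array).
#[derive(Debug, Clone, PartialEq)]
pub struct SerializedPositions {
    bytes: Vec<u8>,
}

impl SerializedPositions {
    /// Serialize right away; the caller's slice is not retained.
    pub fn from_points(points: &[[f32; 3]]) -> Self {
        let bytes: Vec<u8> = points
            .iter()
            .flatten()
            .flat_map(|v| v.to_le_bytes())
            .collect();
        Self { bytes }
    }

    /// Reading the data back requires a deserialization pass.
    /// Returns `None` if the byte length is not a whole number of points.
    pub fn to_points(&self) -> Option<Vec<[f32; 3]>> {
        if self.bytes.len() % 12 != 0 {
            return None;
        }
        let points = self
            .bytes
            .chunks_exact(12)
            .map(|chunk| {
                let mut p = [0.0f32; 3];
                for (i, c) in chunk.chunks_exact(4).enumerate() {
                    p[i] = f32::from_le_bytes([c[0], c[1], c[2], c[3]]);
                }
                p
            })
            .collect();
        Some(points)
    }
}
```

The accessor has to return an `Option` (or a `Result`) because serialized data can no longer be trusted to have the expected shape, which is part of why direct field access gets more cumbersome.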

Another nice side effect is that the "ephemeral rerun::Collection hazard" goes away as we'd no longer store pointers to user data, making the API a lot safer to use (rerun::Collection becomes a pure pass-through type, as it should be).


This ticket is a meetup discussion outcome of @jleibs and @Wumpf with some additional input by @emilk

Advantages

Related


After #8703 the following types need eager serialization in Rust:

(checked means there's a branch where it's fixed)

  • AnnotationContext
  • AssetVideo
  • Asset3D
  • Boxes2D
  • Boxes3D
  • Image
  • Mesh3D
  • Pinhole
  • ViewCoordinates
  • Tensor
@Wumpf Wumpf added 💬 discussion 🦀 Rust API Rust logging API 🌊 C++ API C/C++ API specific 🪵 Log & send APIs Affects the user-facing API for all languages labels Aug 20, 2024
teh-cmc added a commit that referenced this issue Aug 23, 2024
Remove unused old traits.

Part of a lot of clean up I want to do while we head towards:
* #7245
* #3741
teh-cmc added a commit that referenced this issue Aug 23, 2024
It doesn't make any sense for a `ComponentBatch` to have any say in what
the final `ArrowField` should look like.

An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that
only makes sense during IO/transport/FFI/storage/etc, and which requires
external context that a single `ComponentBatch` on its own has no idea
of.

---

Part of a lot of clean up I want to do while we head towards:
* #7245
* #3741
@emilk emilk changed the title Eagerly serialize components upon Archetype & ComponentBatch serialization in Rust and C++ Eager/early serialize of components to arrow in Rust and C++ Nov 4, 2024
@teh-cmc teh-cmc self-assigned this Jan 10, 2025
teh-cmc added a commit that referenced this issue Jan 13, 2025
This introduces `SerializedComponentBatch`, which will become the main
type we use to carry user data around internally.
```rust
/// The serialized contents of a [`ComponentBatch`] with associated [`ComponentDescriptor`].
///
/// This is what gets logged into Rerun:
/// * See [`ComponentBatch`] to easily serialize component data.
/// * See [`AsComponents`] for logging serialized data.
///
/// [`AsComponents`]: [crate::AsComponents]
#[derive(Debug, Clone)]
pub struct SerializedComponentBatch {
    pub array: arrow::array::ArrayRef,

    // TODO(cmc): Maybe Cow<> this one if it grows bigger. Or intern descriptors altogether, most likely.
    pub descriptor: ComponentDescriptor,
}
```

The goal is to keep the `ComponentBatch` trait isolated at the edge,
where it is used as a means of easily converting any data into arrow
arrays, instead of simultaneously being used as a means of transporting
data around through the internals.
`ComponentBatch` is here to stay, if only for its conversion
capabilities.
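
A minimal sketch of that split, assuming stand-in types: `BatchSketch` plays the role of `ComponentBatch`, `SerializedSketch` the role of `SerializedComponentBatch` (with a `Vec<u8>` in place of `arrow::array::ArrayRef` and a `String` in place of `ComponentDescriptor`). `Radius`, `log_internal` and the descriptor string are hypothetical.

```rust
/// Hypothetical stand-in for `SerializedComponentBatch`.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializedSketch {
    /// Stand-in for the Arrow array.
    pub bytes: Vec<u8>,
    /// Stand-in for the component descriptor.
    pub descriptor: String,
}

/// Hypothetical stand-in for `ComponentBatch`: used only at the edge to convert user data.
pub trait BatchSketch {
    fn descriptor(&self) -> String;
    fn to_bytes(&self) -> Vec<u8>;

    /// Convert once, at the edge; everything downstream sees serialized data only.
    fn serialized(&self) -> SerializedSketch {
        SerializedSketch {
            bytes: self.to_bytes(),
            descriptor: self.descriptor(),
        }
    }
}

/// Hypothetical user-facing component type.
pub struct Radius(pub f32);

impl BatchSketch for Vec<Radius> {
    fn descriptor(&self) -> String {
        "example.Radius".to_owned()
    }

    fn to_bytes(&self) -> Vec<u8> {
        self.iter().flat_map(|r| r.0.to_le_bytes()).collect()
    }
}

/// Internal code path: takes the serialized struct, never the trait.
pub fn log_internal(store: &mut Vec<SerializedSketch>, batch: SerializedSketch) {
    store.push(batch);
}
```

Because `log_internal` accepts a plain struct rather than a trait object or generic, the internals need no knowledge of user types once the conversion has happened.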

This opens a lot of opportunities for improvement in terms of DX, UX and
future features (e.g. generics).

The two code paths will co-exist for the foreseeable future, until all
archetypes have been made eager.

* Part of #7245
teh-cmc added a commit that referenced this issue Jan 13, 2025
This introduces a new `attr.rust.archetype_eager` codegen attribute.
When toggled, the associated archetype will now only carry raw Arrow
data around, and go through the new eager logging APIs.

The attribute has been set on `Points3D`:

![image](https://github.com/user-attachments/assets/cb520e0c-5160-4ff6-b6a3-4bf10b4ac045)

Legacy and eagerly-serialized archetypes can co-exist, making it
possible to migrate everything incrementally.
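
One way such co-existence can work is sketched below, assuming both kinds of archetype are logged through a shared trait. `AsSerialized`, `LegacyPoints`, `EagerPoints` and `log_all` are hypothetical names for illustration and say nothing about what the codegen attribute actually emits.

```rust
/// Hypothetical stand-in for serialized component data (real SDK: Arrow arrays).
pub type Serialized = Vec<u8>;

fn encode(points: &[[f32; 3]]) -> Serialized {
    points.iter().flatten().flat_map(|v| v.to_le_bytes()).collect()
}

/// Hypothetical common interface that the logging path relies on.
pub trait AsSerialized {
    fn as_serialized(&self) -> Vec<Serialized>;
}

/// Legacy style: keeps concrete types and serializes only when logged.
pub struct LegacyPoints {
    pub positions: Vec<[f32; 3]>,
}

impl AsSerialized for LegacyPoints {
    fn as_serialized(&self) -> Vec<Serialized> {
        vec![encode(&self.positions)]
    }
}

/// Eager style: serializes on construction and carries only raw data.
pub struct EagerPoints {
    positions: Serialized,
}

impl EagerPoints {
    pub fn new(positions: &[[f32; 3]]) -> Self {
        Self { positions: encode(positions) }
    }
}

impl AsSerialized for EagerPoints {
    fn as_serialized(&self) -> Vec<Serialized> {
        vec![self.positions.clone()]
    }
}

/// The logging path does not care which style produced the data.
pub fn log_all(archetypes: &[&dyn AsSerialized]) -> Vec<Serialized> {
    archetypes.iter().flat_map(|a| a.as_serialized()).collect()
}
```

Since both styles yield identical serialized output, archetypes can be switched over one at a time without touching the logging path.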

* DNM: requires #8644 
* Part of #7245
teh-cmc added a commit that referenced this issue Jan 14, 2025
### Related

* Part of #7245

### What

Use new eager serialization & update API for transforms.

The only breaking change here is that `Transform3D` is no longer `Copy`;
otherwise it's fully compatible.

---------

Co-authored-by: Clement Rey <[email protected]>
teh-cmc pushed a commit that referenced this issue Jan 16, 2025
#8697)

### Related

* Part of #7245

### What

What it says on the tin!
Commit by commit - first commit does all the easy ones, followed by the
trickier ones (just two)


Tested by...
* mess with tensor
* mess with time series in plots example
* run `docs/snippets/all/views/timeseries.py` snippet (uses explicit
time series)
* [x] full check passed
teh-cmc pushed a commit that referenced this issue Jan 24, 2025
### Related

* sister PR to..
	* #8789
	* #8785
	* #8793
* missed piece of #7245

### What

Ports the Tensor archetype in Rust to the new eagerly serialized
interface.
Unfortunately this meant I had to remove some direct access methods of
the underlying tensor data. Curiously, this didn't affect any of our
test/snippet/example code.
While doing so I also fixed some wording issues in the (very similar)
C++ implementation of `with_dim_names`.