
DataProvider routing, lazy deserialization, caching, and overlays #1246

Closed
2 tasks done
sffc opened this issue Oct 31, 2021 · 6 comments · Fixed by #2914
Assignees
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality T-docs-tests Type: Code change outside core library

Comments

@sffc
Member

sffc commented Oct 31, 2021

I wanted to put together an updated, comprehensive model of how different types of data providers interact with one another.

I. Routing

A "routing data provider" or "data router" is one that sends a data request to one or more downstream data providers.

Multi-Blob Data Provider

The multi-blob data provider (#1107) is a specific case. Its data model can be a set of ZeroMaps, plus perhaps some metadata to help determine which ZeroMap to query for a particular key and locale.

struct MultiBlobDataProvider {
    blobs: Vec<Yoke<ZeroMap<str, [u8]>, Rc<[u8]>>>,
    // optional metadata, e.g. which keys/locales each blob covers
}
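The lookup logic could be sketched as follows. This is a hedged, self-contained illustration only: `HashMap` stands in for `ZeroMap`, the `Yoke`/`Rc` layer is omitted, and `locales_per_blob` is one hypothetical shape the optional metadata could take.

```rust
use std::collections::HashMap;

// Simplified stand-in for the multi-blob provider: the metadata tells us
// which blob covers a given locale, so we avoid scanning every blob.
struct MultiBlob {
    blobs: Vec<HashMap<String, Vec<u8>>>,
    // Hypothetical metadata: the locales each blob covers, by index.
    locales_per_blob: Vec<Vec<String>>,
}

impl MultiBlob {
    fn load(&self, key: &str, locale: &str) -> Option<&[u8]> {
        let full_key = format!("{}/{}", key, locale);
        // Use the metadata to pick the blob, then query only that blob.
        self.locales_per_blob
            .iter()
            .position(|locales| locales.iter().any(|l| l.as_str() == locale))
            .and_then(|i| self.blobs[i].get(&full_key))
            .map(|v| v.as_slice())
    }
}

fn main() {
    let blob_en = HashMap::from([("decimal/symbols/en".to_string(), b"en-data".to_vec())]);
    let blob_fr = HashMap::from([("decimal/symbols/fr".to_string(), b"fr-data".to_vec())]);
    let provider = MultiBlob {
        blobs: vec![blob_en, blob_fr],
        locales_per_blob: vec![vec!["en".to_string()], vec!["fr".to_string()]],
    };
    assert_eq!(provider.load("decimal/symbols", "fr"), Some(&b"fr-data"[..]));
    assert!(provider.load("decimal/symbols", "de").is_none());
}
```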

General-Purpose Data Router

The more general case requires using dyn Any as an intermediate. We already have ErasedDataStruct for this purpose. Note that this ErasedDataStruct is a different module, with a different purpose, from the one that uses erased_serde.

struct DataRouter {
    providers: Vec<Box<dyn DataProvider<ErasedDataStruct>>>
}
impl DataProvider<ErasedDataStruct> for DataRouter { /* ... */ }
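The routing behavior itself could look like the sketch below. Everything here is a hypothetical stand-in for the real icu_provider types (`ErasedProvider`, `MapProvider`, and the string-keyed request are illustrative only); the point is just that the router forwards a request to each downstream provider in order and returns the first success.

```rust
use std::collections::HashMap;

// Simplified stand-in for a type-erased DataResponse.
#[derive(Clone, Debug, PartialEq)]
struct DataResponse(String);

// Simplified stand-in for DataProvider<ErasedDataStruct>.
trait ErasedProvider {
    fn load(&self, key: &str) -> Option<DataResponse>;
}

// A downstream provider backed by an in-memory map.
struct MapProvider(HashMap<String, String>);

impl ErasedProvider for MapProvider {
    fn load(&self, key: &str) -> Option<DataResponse> {
        self.0.get(key).map(|v| DataResponse(v.clone()))
    }
}

// The router tries each downstream provider in order and returns the
// first response it gets.
struct DataRouter {
    providers: Vec<Box<dyn ErasedProvider>>,
}

impl ErasedProvider for DataRouter {
    fn load(&self, key: &str) -> Option<DataResponse> {
        self.providers.iter().find_map(|p| p.load(key))
    }
}

fn main() {
    let a = MapProvider(HashMap::from([("greeting/en".to_string(), "Hello".to_string())]));
    let b = MapProvider(HashMap::from([("greeting/fr".to_string(), "Bonjour".to_string())]));
    let router = DataRouter { providers: vec![Box::new(a), Box::new(b)] };
    assert_eq!(router.load("greeting/fr"), Some(DataResponse("Bonjour".to_string())));
    assert!(router.load("greeting/xx").is_none());
}
```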

In order to convert from ErasedDataStruct to a concrete type, we need lazy deserialization.

II. Lazy Deserialization

In #837, I suggest making a data provider that converts from u8 buffers to concrete data structs. Something like:

struct DataDeserializer<P: DataProvider<BufferMarker>> {
    provider: P
}
impl<M, P: DataProvider<BufferMarker>> DataProvider<M> for DataDeserializer<P> where M::Yokeable::Output: Deserialize {
    // ...
}

where BufferMarker is a data struct that has not been parsed yet. BlobDataProvider, FsDataProvider, MultiBlobDataProvider, etc., would all produce BufferMarker.

To go one step further, DataDeserializer could work on ErasedDataStruct as well. It would first attempt to downcast the data struct to the concrete type; if that fails, it then attempts to downcast to a BufferMarker and deserializes it. (It is unexpected for both downcasts to fail; in such a case, we would return an error result.)
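The downcast-then-deserialize fallback can be sketched with plain `dyn Any` standing in for ErasedDataStruct. `BufferMarker`, `HelloWorld`, and the trivial UTF-8 "deserialization" are hypothetical stand-ins; only the control flow mirrors the description above.

```rust
use std::any::Any;

// Hypothetical stand-in for an unparsed byte buffer.
struct BufferMarker(Vec<u8>);

// Hypothetical concrete data struct.
#[derive(Debug, PartialEq)]
struct HelloWorld {
    message: String,
}

fn resolve(erased: Box<dyn Any>) -> Result<HelloWorld, &'static str> {
    // First, try to downcast directly to the concrete type.
    let erased = match erased.downcast::<HelloWorld>() {
        Ok(concrete) => return Ok(*concrete),
        Err(erased) => erased,
    };
    // Otherwise, expect an unparsed buffer and "deserialize" it
    // (here, trivially, as UTF-8).
    match erased.downcast::<BufferMarker>() {
        Ok(buffer) => String::from_utf8(buffer.0)
            .map(|message| HelloWorld { message })
            .map_err(|_| "deserialization failed"),
        // Both downcasts failing is unexpected: return an error.
        Err(_) => Err("neither concrete type nor buffer"),
    }
}

fn main() {
    let concrete: Box<dyn Any> = Box::new(HelloWorld { message: "hi".to_string() });
    assert_eq!(resolve(concrete).unwrap().message, "hi");
    let buffer: Box<dyn Any> = Box::new(BufferMarker(b"bonjour".to_vec()));
    assert_eq!(resolve(buffer).unwrap().message, "bonjour");
    assert!(resolve(Box::new(42u32)).is_err());
}
```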

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on? The code we currently use is here, where we essentially have Cargo features that turn on or off the different deserializers. We want to avoid using erased_serde in the general case, because of the impact on code size that we discovered. The cargo feature might be the best option for now, because apps should hopefully know which deserializers they need to use at compile time. We could add an option for erased_serde later for apps that don't care as much about code size but want to dynamically load new deserializers at runtime.
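For illustration, the Cargo-feature approach could look something like the following sketch. The feature and dependency names here are hypothetical, not the actual icu_provider configuration; the idea is simply that each deserializer sits behind an optional dependency gated by a feature.

```toml
# Hypothetical feature layout; names are illustrative only.
[features]
default = ["deserialize_json"]
deserialize_json = ["serde_json"]
deserialize_bincode = ["bincode"]
deserialize_postcard = ["postcard"]

[dependencies]
serde_json = { version = "1", optional = true }
bincode = { version = "1", optional = true }
postcard = { version = "0.7", optional = true }
```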

III. Caching

The rule of thumb is that there is no such thing as a one-size-fits-all caching solution. Clients have different use cases and resource constraints, which may favor heavy caching, light caching, or no caching at all.

A basic cache would look something like this:

struct LruDataCache<P: DataProvider<ErasedDataStruct>> {
    provider: P,
    data: LruCache<DataRequest, DataResponse<ErasedDataStruct>>
}
impl<P> LruDataCache<P> {
    fn new(max_size: usize, provider: P) -> Self { /* ... */ }
}

Note that we load from a DataProvider but cache a DataResponse.

Depending on whether the cache is inserted before or after the deserializer, the cache could track raw buffers or resolved data. In general, the intent would be that the cache is inserted after the deserializer, such that we keep track of resolved data structs that the app has previously requested.

Open Question: The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe. The alternative would be to make DataProvider work on mutable references instead of shared references.
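The interior-mutability option can be sketched as follows: the cache mutates its map behind a Mutex, so `load` can keep taking `&self` the way DataProvider does. Strings stand in for DataRequest/DataResponse, and the wrapped provider is just a closure; this is a shape sketch, not the real API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// A caching wrapper around some inner provider (here, a closure).
struct CachingProvider<F: Fn(&str) -> String> {
    inner: F,
    // Mutation happens inside the Mutex, so load() can take &self.
    cache: Mutex<HashMap<String, String>>,
}

impl<F: Fn(&str) -> String> CachingProvider<F> {
    fn new(inner: F) -> Self {
        Self { inner, cache: Mutex::new(HashMap::new()) }
    }

    // Note the shared reference, matching the DataProvider trait.
    fn load(&self, req: &str) -> String {
        let mut cache = self.cache.lock().unwrap();
        cache
            .entry(req.to_string())
            .or_insert_with(|| (self.inner)(req))
            .clone()
    }
}

fn main() {
    let provider = CachingProvider::new(|req: &str| format!("data for {}", req));
    assert_eq!(provider.load("en/decimal"), "data for en/decimal");
    // A second load for the same request is served from the cache.
    assert_eq!(provider.load("en/decimal"), "data for en/decimal");
}
```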

IV. Overlays

One of the main use cases for chained data providers has been the idea of data overlays.

Until we have specialization, data overlays probably need to operate through the dyn Any code path like caches and general-purpose routers. A data overlay would likely take the following form:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = self.provider.load_payload(req);
        if (/* data overlay conditional */) {
            let mut payload: DataPayload<ConcreteType> = res.payload.downcast()?;
            // mutate the payload as desired
        // question: is there such a thing as downcast_mut() ?
            res.payload = payload.upcast();
        }
        res
    }
}

Seeking feedback from:

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design S-large Size: A few weeks (larger feature, major refactoring) needs-approval One or more stakeholders need to approve proposal labels Oct 31, 2021
@Manishearth
Member

I like this overall plan.

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on?

I think cargo feature is the right call here.

The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe

The standard pattern in Rust for caches is interior mutability (mutex or refcell). We can use things like Weak/Rc as well to build caches.

@zbraniecki
Member

That looks good!

I think we should use a mutex-like abstraction to make this thread-safe

I'd go for Mutex.

I'm a bit concerned about your snippet at the end with overlays. The design you propose requires loading the payload, modifying it, and then returning it.

This differs from what I see as the most important use case, which I'd show as:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        if (/* data overlay conditional */) {
            load_local_payload(req);
        } else {
            self.provider.load_payload(req)
        }
    }
}

and:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = load_local_payload(req);
        if (!res.contains(something)) {
            res.extend_with(self.provider.load_payload(req));
        }
        res
    }
}

@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Nov 18, 2021
@sffc sffc added this to the 2021 Q4 0.5 Sprint C milestone Nov 18, 2021
@sffc sffc self-assigned this Nov 24, 2021
@sffc
Member Author

sffc commented Dec 9, 2021

#1369 implements much of the infrastructure for this design to work.

I consider the remaining deliverable for this issue to be tests/examples for the remaining constructions in the OP.

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library and removed S-large Size: A few weeks (larger feature, major refactoring) labels Dec 9, 2021
@sffc
Member Author

sffc commented Jan 11, 2022

Given CrabBake and the fact that the erased data provider needs a more prominent role, and based on further experience with FFI, here is my updated trait structure.

BufferProvider

A data provider that provides blobs.

Function Signature: fn load_buffer(req: &DataRequest) -> Result<DataResponse<BufferMarker>>

Features:

  • FFI friendly (trait object safe)
  • Supports deserialization and reading data from a broad spectrum of data sources

Status: Implemented.

AnyProvider

A data provider that provides Rust objects in memory as dyn Any trait objects.

Function Signature: fn load_any(req: &DataRequest) -> Result<AnyResponse>

Features:

  • FFI friendly (trait object safe)
  • Supports CrabBake, StructProvider, and InvariantDataProvider/UndProvider

Status: Tracked by #1479 and #1494

KeyProvider<M>

A data provider that provides Rust objects for specific data keys.

Function Signature: fn load_key(options: &ResourceOptions) -> Result<DataResponse<M>>

Features:

  • DataMarker is in the trait signature
  • Works for data transformers
  • Can be put in sequence with an AnyProvider for override support

Status: Depends on #570

DataProvider

The core data provider trait that supports all data keys.

Function Signature: fn load_payload<M>(req: &DataRequest) -> Result<DataResponse<M>>

Features:

  • This is the trait taken by all try_new constructors in Rust
  • Auto-implemented on BufferProvider and AnyProvider (or on a wrapper struct)
  • Supports the caching (lazy deserialization) use case
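The "auto-implemented on a wrapper struct" idea can be sketched in miniature. `AnyProvider`, `TypedProvider`, and `AsTyped` below are simplified stand-ins for the real icu_provider traits: a wrapper makes any type-erased provider usable through a typed trait by downcasting the result.

```rust
use std::any::Any;

// Simplified stand-in for AnyProvider.
trait AnyProvider {
    fn load_any(&self, key: &str) -> Option<Box<dyn Any>>;
}

// Simplified stand-in for the typed DataProvider<M>.
trait TypedProvider<T: 'static> {
    fn load(&self, key: &str) -> Option<T>;
}

// Wrapper struct: anything implementing AnyProvider gets the typed
// trait for free, by downcasting the erased result.
struct AsTyped<P>(P);

impl<P: AnyProvider, T: 'static> TypedProvider<T> for AsTyped<P> {
    fn load(&self, key: &str) -> Option<T> {
        self.0
            .load_any(key)
            .and_then(|any| any.downcast::<T>().ok())
            .map(|boxed| *boxed)
    }
}

// A trivial erased provider for demonstration.
struct Hello;
impl AnyProvider for Hello {
    fn load_any(&self, _key: &str) -> Option<Box<dyn Any>> {
        Some(Box::new(String::from("hello")))
    }
}

fn main() {
    let provider = AsTyped(Hello);
    let s: Option<String> = TypedProvider::<String>::load(&provider, "any-key");
    assert_eq!(s.as_deref(), Some("hello"));
}
```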

@sffc sffc modified the milestones: 2021 Q4 0.5 Sprint E, ICU4X 0.6 Mar 24, 2022
@sffc sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Polish) May 25, 2022
@sffc
Member Author

sffc commented Jul 28, 2022

To-do: make sure everything here is well documented.

@sffc
Member Author

sffc commented Sep 26, 2022

Document the following in the data provider tutorial:

  1. Design is such that caching is not needed, but could be added based on specific client needs
  2. Examples of how to do overlays/overrides
