
DataProvider routing, lazy deserialization, caching, and overlays #1246

Closed
2 tasks done
sffc opened this issue Oct 31, 2021 · 6 comments · Fixed by #2914
Assignees
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality T-docs-tests Type: Code change outside core library

Comments

@sffc
Member

sffc commented Oct 31, 2021

I wanted to put together an updated, comprehensive model of how different types of data providers interact with one another.

I. Routing

A "routing data provider" or "data router" is one that sends a data request to one or more downstream data providers.

Multi-Blob Data Provider

The multi-blob data provider (#1107) is a specific case. Its data model can be a set of ZeroMaps, plus perhaps some metadata to help determine which ZeroMap to query for a particular key and locale.

struct MultiBlobDataProvider {
    blobs: Vec<Yoke<ZeroMap<str, [u8]>, Rc<[u8]>>>,
    // optional metadata, e.g. which keys/locales each blob covers
}
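The lookup logic could be sketched as follows. This is a hedged, self-contained illustration only: `HashMap` stands in for `ZeroMap`, the `Yoke`/`Rc` layer is omitted, and `locales_per_blob` is one hypothetical shape the optional metadata could take.

```rust
use std::collections::HashMap;

// Simplified stand-in for the multi-blob provider: the metadata tells us
// which blob covers a given locale, so we avoid scanning every blob.
struct MultiBlob {
    blobs: Vec<HashMap<String, Vec<u8>>>,
    // Hypothetical metadata: the locales each blob covers, by index.
    locales_per_blob: Vec<Vec<String>>,
}

impl MultiBlob {
    fn load(&self, key: &str, locale: &str) -> Option<&[u8]> {
        let full_key = format!("{}/{}", key, locale);
        // Use the metadata to pick the blob, then query only that blob.
        self.locales_per_blob
            .iter()
            .position(|locales| locales.iter().any(|l| l.as_str() == locale))
            .and_then(|i| self.blobs[i].get(&full_key))
            .map(|v| v.as_slice())
    }
}

fn main() {
    let blob_en = HashMap::from([("decimal/symbols/en".to_string(), b"en-data".to_vec())]);
    let blob_fr = HashMap::from([("decimal/symbols/fr".to_string(), b"fr-data".to_vec())]);
    let provider = MultiBlob {
        blobs: vec![blob_en, blob_fr],
        locales_per_blob: vec![vec!["en".to_string()], vec!["fr".to_string()]],
    };
    assert_eq!(provider.load("decimal/symbols", "fr"), Some(&b"fr-data"[..]));
    assert!(provider.load("decimal/symbols", "de").is_none());
}
```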

General-Purpose Data Router

The more general case requires using dyn Any as an intermediate. We already have ErasedDataStruct for this purpose. Note that this ErasedDataStruct is a different module, with a different purpose, from the one that uses erased_serde.

struct DataRouter {
    providers: Vec<Box<dyn DataProvider<ErasedDataStruct>>>
}
impl DataProvider<ErasedDataStruct> for DataRouter { /* ... */ }
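The routing behavior itself could look like the sketch below. Everything here is a hypothetical stand-in for the real icu_provider types (`ErasedProvider`, `MapProvider`, and the string-keyed request are illustrative only); the point is just that the router forwards a request to each downstream provider in order and returns the first success.

```rust
use std::collections::HashMap;

// Simplified stand-in for a type-erased DataResponse.
#[derive(Clone, Debug, PartialEq)]
struct DataResponse(String);

// Simplified stand-in for DataProvider<ErasedDataStruct>.
trait ErasedProvider {
    fn load(&self, key: &str) -> Option<DataResponse>;
}

// A downstream provider backed by an in-memory map.
struct MapProvider(HashMap<String, String>);

impl ErasedProvider for MapProvider {
    fn load(&self, key: &str) -> Option<DataResponse> {
        self.0.get(key).map(|v| DataResponse(v.clone()))
    }
}

// The router tries each downstream provider in order and returns the
// first response it gets.
struct DataRouter {
    providers: Vec<Box<dyn ErasedProvider>>,
}

impl ErasedProvider for DataRouter {
    fn load(&self, key: &str) -> Option<DataResponse> {
        self.providers.iter().find_map(|p| p.load(key))
    }
}

fn main() {
    let a = MapProvider(HashMap::from([("greeting/en".to_string(), "Hello".to_string())]));
    let b = MapProvider(HashMap::from([("greeting/fr".to_string(), "Bonjour".to_string())]));
    let router = DataRouter { providers: vec![Box::new(a), Box::new(b)] };
    assert_eq!(router.load("greeting/fr"), Some(DataResponse("Bonjour".to_string())));
    assert!(router.load("greeting/xx").is_none());
}
```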

In order to convert from ErasedDataStruct to a concrete type, we need lazy deserialization.

II. Lazy Deserialization

In #837, I suggest making a data provider that converts from u8 buffers to concrete data structs. Something like:

struct DataDeserializer<P: DataProvider<BufferMarker>> {
    provider: P
}
impl<M, P: DataProvider<BufferMarker>> DataProvider<M> for DataDeserializer<P> where M::Yokeable::Output: Deserialize {
    // ...
}

where BufferMarker is a data struct that has not been parsed yet. BlobDataProvider, FsDataProvider, MultiBlobDataProvider, etc., would all produce BufferMarker.

To go one step further, DataDeserializer could work on ErasedDataStruct as well. It would first attempt to downcast the data struct to the concrete type; if that fails, it then attempts to downcast to a BufferMarker and deserializes it. (It is unexpected for both downcasts to fail; in such a case, we would return an error result.)
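The downcast-then-deserialize fallback can be sketched with plain `dyn Any` standing in for ErasedDataStruct. `BufferMarker`, `HelloWorld`, and the trivial UTF-8 "deserialization" are hypothetical stand-ins; only the control flow mirrors the description above.

```rust
use std::any::Any;

// Hypothetical stand-in for an unparsed byte buffer.
struct BufferMarker(Vec<u8>);

// Hypothetical concrete data struct.
#[derive(Debug, PartialEq)]
struct HelloWorld {
    message: String,
}

fn resolve(erased: Box<dyn Any>) -> Result<HelloWorld, &'static str> {
    // First, try to downcast directly to the concrete type.
    let erased = match erased.downcast::<HelloWorld>() {
        Ok(concrete) => return Ok(*concrete),
        Err(erased) => erased,
    };
    // Otherwise, expect an unparsed buffer and "deserialize" it
    // (here, trivially, as UTF-8).
    match erased.downcast::<BufferMarker>() {
        Ok(buffer) => String::from_utf8(buffer.0)
            .map(|message| HelloWorld { message })
            .map_err(|_| "deserialization failed"),
        // Both downcasts failing is unexpected: return an error.
        Err(_) => Err("neither concrete type nor buffer"),
    }
}

fn main() {
    let concrete: Box<dyn Any> = Box::new(HelloWorld { message: "hi".to_string() });
    assert_eq!(resolve(concrete).unwrap().message, "hi");
    let buffer: Box<dyn Any> = Box::new(BufferMarker(b"bonjour".to_vec()));
    assert_eq!(resolve(buffer).unwrap().message, "bonjour");
    assert!(resolve(Box::new(42u32)).is_err());
}
```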

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on? The code we currently use is here, where we essentially have Cargo features that turn on or off the different deserializers. We want to avoid using erased_serde in the general case, because of the impact on code size that we discovered. The cargo feature might be the best option for now, because apps should hopefully know which deserializers they need to use at compile time. We could add an option for erased_serde later for apps that don't care as much about code size but want to dynamically load new deserializers at runtime.
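For illustration, the Cargo-feature approach could look something like the following sketch. The feature and dependency names here are hypothetical, not the actual icu_provider configuration; the idea is simply that each deserializer sits behind an optional dependency gated by a feature.

```toml
# Hypothetical feature layout; names are illustrative only.
[features]
default = ["deserialize_json"]
deserialize_json = ["serde_json"]
deserialize_bincode = ["bincode"]
deserialize_postcard = ["postcard"]

[dependencies]
serde_json = { version = "1", optional = true }
bincode = { version = "1", optional = true }
postcard = { version = "0.7", optional = true }
```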

III. Caching

The rule of thumb is that there is no such thing as a one-size-fits-all caching solution. Clients have different use cases and resource constraints, which may favor heavy caching, light caching, or no caching at all.

A basic cache would look something like this:

struct LruDataCache<P: DataProvider<ErasedDataStruct>> {
    provider: P,
    data: LruCache<DataRequest, DataResponse<ErasedDataStruct>>
}
impl<P> LruDataCache<P> {
    fn new(max_size: usize, provider: P) -> Self { /* ... */ }
}

Note that we load from a DataProvider but cache a DataResponse.

Depending on whether the cache is inserted before or after the deserializer, the cache could track raw buffers or resolved data. In general, the intent would be that the cache is inserted after the deserializer, such that we keep track of resolved data structs that the app has previously requested.

Open Question: The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe. The alternative would be to make DataProvider work on mutable references instead of shared references.
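The interior-mutability option can be sketched as follows: the cache mutates its map behind a Mutex, so `load` can keep taking `&self` the way DataProvider does. Strings stand in for DataRequest/DataResponse, and the wrapped provider is just a closure; this is a shape sketch, not the real API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// A caching wrapper around some inner provider (here, a closure).
struct CachingProvider<F: Fn(&str) -> String> {
    inner: F,
    // Mutation happens inside the Mutex, so load() can take &self.
    cache: Mutex<HashMap<String, String>>,
}

impl<F: Fn(&str) -> String> CachingProvider<F> {
    fn new(inner: F) -> Self {
        Self { inner, cache: Mutex::new(HashMap::new()) }
    }

    // Note the shared reference, matching the DataProvider trait.
    fn load(&self, req: &str) -> String {
        let mut cache = self.cache.lock().unwrap();
        cache
            .entry(req.to_string())
            .or_insert_with(|| (self.inner)(req))
            .clone()
    }
}

fn main() {
    let provider = CachingProvider::new(|req: &str| format!("data for {}", req));
    assert_eq!(provider.load("en/decimal"), "data for en/decimal");
    // A second load for the same request is served from the cache.
    assert_eq!(provider.load("en/decimal"), "data for en/decimal");
}
```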

IV. Overlays

One of the main use cases for chained data providers has been the idea of data overlays.

Until we have specialization, data overlays probably need to operate through the dyn Any code path like caches and general-purpose routers. A data overlay would likely take the following form:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = self.provider.load_payload(req);
        if (/* data overlay conditional */) {
            let mut payload: DataPayload<ConcreteType> = res.payload.downcast()?;
            // mutate the payload as desired
        // question: is there such a thing as downcast_mut() ?
            res.payload = payload.upcast();
        }
        res
    }
}

Seeking feedback from:

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design S-large Size: A few weeks (larger feature, major refactoring) needs-approval One or more stakeholders need to approve proposal labels Oct 31, 2021
@Manishearth
Member

I like this overall plan.

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on?

I think cargo feature is the right call here.

The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe

The standard pattern in Rust for caches is interior mutability (mutex or refcell). We can use things like Weak/Rc as well to build caches.

@zbraniecki
Member

That looks good!

I think we should use a mutex-like abstraction to make this thread-safe

I'd go for Mutex.

I'm a bit concerned about your snippet at the end with overlays. The design you propose requires loading the payload, modifying it, and then returning it.

This differs from what I see as the most important use case, which I'd show as:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        if (/* data overlay conditional */) {
            load_local_payload(req);
        } else {
            self.provider.load_payload(req)
        }
    }
}

and:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = load_local_payload(req);
        if (!res.contains(something)) {
            res.extend_with(self.provider.load_payload(req));
        }
        res
    }
}

@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Nov 18, 2021
@sffc sffc added this to the 2021 Q4 0.5 Sprint C milestone Nov 18, 2021
@sffc sffc self-assigned this Nov 24, 2021
@sffc
Member Author

sffc commented Dec 9, 2021

#1369 implements much of the infrastructure for this design to work.

I consider the remaining deliverable for this issue to be tests/examples for the remaining constructions in the OP.

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library and removed S-large Size: A few weeks (larger feature, major refactoring) labels Dec 9, 2021
@sffc
Member Author

sffc commented Jan 11, 2022

Given CrabBake and the fact that the erased data provider needs a more prominent role, and based on further experience with FFI, here is my updated trait structure.

BufferProvider

A data provider that provides blobs.

Function Signature: fn load_buffer(req: &DataRequest) -> Result<DataResponse<BufferMarker>>

Features:

  • FFI friendly (trait object safe)
  • Supports deserialization and reading data from a broad spectrum of data sources

Status: Implemented.

AnyProvider

A data provider that provides Rust objects in memory as dyn Any trait objects.

Function Signature: fn load_any(req: &DataRequest) -> Result<AnyResponse>

Features:

  • FFI friendly (trait object safe)
  • Supports CrabBake, StructProvider, and InvariantDataProvider/UndProvider

Status: Tracked by #1479 and #1494

KeyProvider<M>

A data provider that provides Rust objects for specific data keys.

Function Signature: fn load_key(options: &ResourceOptions) -> Result<DataResponse<M>>

Features:

  • DataMarker is in the trait signature
  • Works for data transformers
  • Can be put in sequence with an AnyProvider for override support

Status: Depends on #570

DataProvider

The core data provider trait that supports all data keys.

Function Signature: fn load_payload<M>(req: &DataRequest) -> Result<DataResponse<M>>

Features:

  • This is the trait taken by all try_new constructors in Rust
  • Auto-implemented on BufferProvider and AnyProvider (or on a wrapper struct)
  • Supports the caching (lazy deserialization) use case
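The "auto-implemented on a wrapper struct" idea can be sketched in miniature. `AnyProvider`, `TypedProvider`, and `AsTyped` below are simplified stand-ins for the real icu_provider traits: a wrapper makes any type-erased provider usable through a typed trait by downcasting the result.

```rust
use std::any::Any;

// Simplified stand-in for AnyProvider.
trait AnyProvider {
    fn load_any(&self, key: &str) -> Option<Box<dyn Any>>;
}

// Simplified stand-in for the typed DataProvider<M>.
trait TypedProvider<T: 'static> {
    fn load(&self, key: &str) -> Option<T>;
}

// Wrapper struct: anything implementing AnyProvider gets the typed
// trait for free, by downcasting the erased result.
struct AsTyped<P>(P);

impl<P: AnyProvider, T: 'static> TypedProvider<T> for AsTyped<P> {
    fn load(&self, key: &str) -> Option<T> {
        self.0
            .load_any(key)
            .and_then(|any| any.downcast::<T>().ok())
            .map(|boxed| *boxed)
    }
}

// A trivial erased provider for demonstration.
struct Hello;
impl AnyProvider for Hello {
    fn load_any(&self, _key: &str) -> Option<Box<dyn Any>> {
        Some(Box::new(String::from("hello")))
    }
}

fn main() {
    let provider = AsTyped(Hello);
    let s: Option<String> = TypedProvider::<String>::load(&provider, "any-key");
    assert_eq!(s.as_deref(), Some("hello"));
}
```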

@sffc sffc modified the milestones: 2021 Q4 0.5 Sprint E, ICU4X 0.6 Mar 24, 2022
@sffc sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Polish) May 25, 2022
@sffc
Member Author

sffc commented Jul 28, 2022

To-do: make sure everything here is well documented.

@sffc
Member Author

sffc commented Sep 26, 2022

Document the following in the data provider tutorial:

  1. Design is such that caching is not needed, but could be added based on specific client needs
  2. Examples of how to do overlays/overrides
