
Fast path for trusted data #1290

Closed · sffc opened this issue Nov 11, 2021 · 15 comments · Fixed by #1855
Labels: A-design (Area: Architecture or design) · C-meta (Component: Relating to ICU4X as a whole) · S-epic (Size: Major project (create smaller child issues)) · T-enhancement (Type: Nice-to-have but not required)

Comments

@sffc (Member) commented Nov 11, 2021

By default, we assume that data is untrusted: we make no guarantees about the bytes read by the data provider at runtime. We assume that the bytes could be malformed or malicious. This is important for dynamically loaded data. For further discussion, see #1183.

However, it is a common case that we have fully trusted data as well. For example, data built statically into the binary, or downloaded and verified with a cryptographic signature, does not need to be validated at runtime, because it is not possible to manipulate it between icu4x-datagen and the code reading the bytes from memory.

This issue is to track design and implementation around a fast path for data loading that is built with different assumptions (trusted data).

@sffc sffc added T-enhancement Type: Nice-to-have but not required C-meta Component: Relating to ICU4X as a whole A-design Area: Architecture or design S-epic Size: Major project (create smaller child issues) labels Nov 11, 2021
@sffc (Member Author) commented Nov 13, 2021

Another possible path to take here is to use compile-time code generation of the data provider structures: basically reaping the benefits of a build.rs codegen approach while keeping the flexibility of the Serde-based data provider.
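
As a concrete illustration (hypothetical struct and names, not any actual ICU4X codegen output), baked-in data could look like a `static` emitted by a datagen step, so that loading it involves no deserialization or validation at all:

```rust
/// Hypothetical data struct for illustration only.
pub struct ExampleSymbolsV1<'a> {
    pub decimal_separator: &'a str,
    pub grouping_separator: &'a str,
}

// A build.rs / datagen step could emit something like this; the program
// then reads the struct directly from static memory at runtime.
pub static EXAMPLE_SYMBOLS_EN: ExampleSymbolsV1<'static> = ExampleSymbolsV1 {
    decimal_separator: ".",
    grouping_separator: ",",
};
```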

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Dec 9, 2021
@sffc (Member Author) commented Dec 10, 2021

We had a 2.5-hour discussion on this subject today: https://docs.google.com/document/d/1P2JDD06ERrVADYG9a8-iTHRMlaPpErqZnk6maTO1FOE/edit#

Principles where I think we agree:

  1. Loading static, trusted data is a use case that the ICU4X project should support
  2. Loading dynamic, untrusted data is a use case that the ICU4X project should support
  3. Loading untrusted data requires some additional validation steps
  4. Loading trusted data should ideally skip those validation steps

Principles where I think we do not agree:

  1. Prioritization: Which use case is more important
    • Position A: Dynamic data is a key value proposition of the project, and one of the main differentiators between ICU4C and ICU4X. We should guarantee that ICU4X works fully for that use case as a first-class citizen.
    • Position B: Static data is the more common use case; it is reasonable for us to make decisions to benefit the static data case that we might not have made if ICU4X were purely targeted at dynamic data users
  2. Safety Definition: What is the definition of "safety" when loading untrusted data
    • Position A: 3-point definition: ICU4X should not panic, not hit unsafe code, and always terminate
    • Position B: Memory safety is a non-negotiable requirement for ICU4X. Panicking should be avoided where possible, but may be an option in cases where it represents a clearly identifiable programming error
  3. GIGO: Whether garbage-in-garbage-out (GIGO) is a viable strategy
    • Position A: We can restructure panicky code to be GIGO, and it is a productive use of engineering time, since it helps us achieve the safety goals stated above
    • Position B: It is more difficult to reason about the correctness of code that has been restructured to be GIGO, and it produces surprising behavior, so we should avoid it
  4. Validation Step: Whether adding an additional, optional validate function is a viable strategy
    • Position A: The loading of data is internal to ICU4X; validation should be handled internally by us; the place where a client could express whether data is trusted is in the data provider constructor
    • Position B: Clients consuming untrusted data have unique requirements, and are likely to be more sophisticated; they may have their own strategies for establishing trust; it is harder to write a one-size-fits-all implementation of untrusted data
  5. Overhead: What degree of overhead is acceptable for loading trusted data
    • Position A: We should accept overhead that is low-cost and linear-time, like we already do for UTF-8 strings
    • Position B: Running validation functions on trusted data, even if they are cheap, is pointless and should be avoided

Position A is generally held by Shane, and Position B is generally held by Iain.

Options for how to proceed:

Option 1: Uniformly enforce invariants or use GIGO

In Option 1, all data coming from the Serde deserialize function should either be validated to enforce the safety requirements (3-point: no panic, memory safety, termination), or the data should relax its invariants and instead use GIGO. This decision can be made on a case-by-case basis; GIGO should be preferred if the validation function is expensive (superlinear).
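
To make the two strategies concrete, here is a minimal sketch (hypothetical struct, assuming serde with the `derive` feature) of validation vs. GIGO for a single invariant:

```rust
use serde::Deserialize;

/// Hypothetical struct: `index` is supposed to point into `values`.
#[derive(Deserialize)]
struct LookupV1 {
    values: Vec<u32>,
    index: usize,
}

impl LookupV1 {
    /// Strategy A: enforce the invariant once, right after deserialization.
    fn validate(&self) -> Result<(), &'static str> {
        if self.index < self.values.len() {
            Ok(())
        } else {
            Err("index out of range")
        }
    }

    /// Strategy B: GIGO — every read is total, so garbage data yields a
    /// well-defined (if meaningless) result instead of a panic.
    fn lookup(&self) -> u32 {
        self.values.get(self.index).copied().unwrap_or(0)
    }
}
```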

Pros:

  • Satisfies all requirements for Position A
  • Keeps a single, consistent code path for all users

Cons:

  • Carries nonzero overhead for the static data use case that Position B would like to avoid
  • GIGO makes code correctness hard to reason about

Option 2: Use Serde for untrusted data, and a novel solution for trusted data

Option 2 is an extension of Option 1 that adds an additional code path, separate from Serde, to be used for static, trusted data. The new code path can be hyper-optimized for the static data use case, allowing the Serde code path to be hyper-optimized for the dynamic data use case. One possibility here is to use Rust codegen for the static data. We will accept more expensive validation functions instead of GIGO.

Pros:

  • Satisfies all requirements for Position B
  • Allows static data to be hyper-optimized

Cons:

  • Introduces two code paths, which increases maintenance and testing burden
  • Carries additional overhead for the dynamic data use case relative to Option 1

Option 3: Use Serde with a ValidationNeeded trait

Option 3 keeps the status quo of Serde, but allows validation code to be only run conditionally by introducing a trait to mark certain data structures as requiring additional validation. In this solution, Serde deserialization would guarantee memory safety, and the trait would be used for panic and termination safety.

Pros:

  • Relatively easy to implement
  • Avoids GIGO
  • Compromise that is acceptable for both Position A and Position B

Cons:

  • Most likely only a minimal performance improvement for trusted data relative to Option 1*
  • Possible performance regression for untrusted data relative to Option 1
  • More complicated than Option 1, since it requires adding an additional trait that we need to reason about

* I do not have data to specifically back this up, but we do have data that Serde deserialization is already the most expensive part of ICU4X in terms of both code size and cycles, and my full expectation is that an efficient validation function contributes only a small percentage to the overall runtime.

Option 4: Use Serde and add a separate validation function

Option 4 is what @iainireland suggested in #1183. Serde would produce panicky structures on untrusted data, and the user would be required to call an additional validation function to prevent panics. This is similar to Option 3, but it moves the validation into userland rather than as part of the data provider.

Pros:

  • Satisfies Position B

Cons:

  • The lack of a centralized validation trait makes it hard for dynamic data clients to reason about safety
  • Other cons inherited from Option 3

Option 5: Introduce an Unsafe mode for Postcard/Serde

Option 5 goes further than Option 3 by engineering Postcard/Serde to use a fast path for trusted data that skips UTF-8 and ULE validation. We would likely introduce a DeserializeUnsafe trait that is similar to Deserialize but is able to run unsafe code.
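
A rough sketch of the trait shape (entirely hypothetical; no such trait exists in Serde today):

```rust
/// Hypothetical trait: deserialization that may skip UTF-8/ULE validation
/// because the caller vouches for the bytes. Marking the method `unsafe`
/// moves the burden of proof to the caller.
trait DeserializeUnsafe<'de>: Sized {
    /// # Safety
    /// `bytes` must have been produced by a trusted serializer (e.g.
    /// icu4x-datagen) and must not have been modified since.
    unsafe fn deserialize_unsafe(bytes: &'de [u8]) -> Self;
}
```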

Pros:

  • Allows for hyper-optimized code for trusted data, while still fitting into the overall Serde-based data provider architecture
  • Safety is easily achieved by using the standard, off-the-shelf Postcard/Serde impls

Cons:

  • Requires work outside of ICU4X in order to add support for this in Serde
  • Two code paths may be harder to reason about and maintain

Please let me know whether everything I stated above is accurate and whether we can agree on the options we have on the table.

@sffc (Member Author) commented Dec 10, 2021

I tried to keep the above post as neutral as possible, though I'm sure some of my biases slipped in. Here is my opinion on the options:

  • Option 1 is my preference. I am highly unconvinced of the argument that validation is expensive (it's a small fraction of the overall deserialization cost) and that GIGO is hard (we did it in CPT and it was easy).
  • Option 3 is acceptable to me, but it seems like pointless complexity relative to Option 1.
  • I dislike Option 4 because it makes untrusted data a strongly second-class citizen.
  • Options 2 and 5 are both acceptable, but are more work; I would like to have concrete numbers to justify that effort before embarking on them.

@iainireland (Contributor) commented Dec 11, 2021

1. **Prioritization:** Which use case is more important
   
   * Position A: Dynamic data is a key value proposition of the project, and one of the main differentiators between ICU4C and ICU4X
   * Position B: Static data is the most common use case and the only one we currently have clients for; performance wins are the key differentiator

I disagree with the phrasing here. I've repeatedly agreed that dynamic data is an important value proposition, and I don't think performance wins are the only differentiator. My position is simply that dynamic data is not the only value proposition for ICU4X. Our support for dynamic data in ICU4X means that it's not optimally targeted towards users of static data (because of eg ULE overhead), and that is a reasonable design choice. In the same way, it is reasonable for us to make decisions to benefit the static data case that we might not have made if ICU4X were purely targeted at dynamic data users. Neither value proposition should overrule the other; we should make this sort of decision on a case by case basis.

2. **Safety Definition:** What is the definition of "safety" when loading untrusted data
   
   * Position A: 3-point definition: ICU4X should not panic, not hit unsafe code, and always terminate
   * Position B: We should only require that we not hit unsafe code; panicking is OK

I would prefer something along the lines of "Memory safety is a non-negotiable requirement for ICU4X. Panicking should be avoided where possible, but may be an option in cases where it represents a clearly identifiable programming error." (Compare "The panic! macro signals that your program is in a state it can’t handle and lets you tell the process to stop instead of trying to proceed with invalid or incorrect values.", from the Rust book.)

(I don't have strong feelings about termination: if pressed, I would probably lean towards the idea that a clean panic is better than non-termination in the case of garbage data. I'm not aware of any other cases where non-termination is a real concern.)

3. **GIGO:** Whether garbage-in-garbage-out (GIGO) is a viable strategy
   
   * Position A: We can restructure panicky code to be GIGO, and it is a productive use of engineering time, since it helps us achieve the safety goals stated above
   * Position B: Restructuring code to be GIGO is difficult to reason about and produces surprising behavior, so we should avoid it

I would reword "Restructuring code to be GIGO is difficult to reason about" as "It is more difficult to reason about the correctness of code that has been restructured to be GIGO...", but otherwise this seems accurate to me.

I'll also add: as a browser developer, code dealing with untrusted data is exactly where I would have the strongest preference for panicking over GIGO. If I'm accepting data from a potentially adversarial source, the last thing I want is unpredictability. Security bugs often live in the space between your mental model of the code and the actual implementation, because "how does this code behave when it's wedged into an unexpected state?" is not a well-tested code path. Panics are a bad user experience, but in security-sensitive code like browsers, a panic is way better than an exploitable vulnerability. As ICU4X developers, we can't know what assumptions our users are making about our output, so it's hard to say for certain that GIGO can't introduce security risks.

4. **Validation Step:** Whether adding an additional, optional validate function is a viable strategy
   
   * Position A: The loading of data is internal to ICU4X; validation should be handled internally by us; the place where a client could express whether data is trusted is in the data provider constructor
   * Position B: We should expect clients consuming untrusted data to have to jump through additional hoops

My long-winded version of position B: The majority of clients are not likely to consume untrusted data. Clients consuming untrusted data have unique requirements, and are likely to be more sophisticated. They may have their own strategies for establishing trust. It is harder to write a one-size-fits-all implementation of untrusted data, so the ideal API for consuming untrusted data will likely be more involved / less automatic than the API for consuming static data. We should probably be guiding unsophisticated users towards the simpler static APIs. We should still make untrusted data as performant and ergonomic as we can, but we shouldn't bake in assumptions about how clients will use untrusted data in ICU4X before we have any such clients. For example, as Henri has mentioned, clients may choose to use a cryptographic signature to validate data bundles, in which case they could very reasonably choose not to do any additional load-time validation. Giving them that flexibility is beneficial.

5. **Overhead:** What degree of overhead is acceptable for loading trusted data
   
   * Position A: We should accept overhead that is low-cost and linear-time, like we already do for UTF-8 strings
   * Position B: Running validation functions on trusted data, even if they are cheap, is pointless and should be avoided

No arguments here, so long as it's clear that I don't think the current serde overhead is a pressing issue, and my main concern is to avoid adding additional overhead that isn't required by the architecture of ICU4X.

I also think Henri's framing during the meeting was good: "where reasonable, we should not make users pay a performance cost for features they don't use."

Options for how to proceed:
...

This seems mostly reasonable to me. A few comments:

  1. I would add "GIGO makes code correctness hard to reason about" as a con for option 1, and "Avoids GIGO" as a pro for option 3.
  2. I proposed option 4 as a simple approach that we could do right now with no additional work, and option 3 as a possible follow-up building on 4. Note that once you have 4, I think getting to 3 is just a matter of adding an additional trait method in the right place with an empty default implementation, hooking up the validation code from option 4 to the trait implementation, and then calling that method in the appropriate place in the data provider. It's probably too late now, but it might have made more sense to put 3 and 4 in the other order.
  3. I think we should do 4 right now, and probably 3 later (unless we come up with something better). Note that we don't currently have any clients using untrusted data, so it's not a catastrophe if we don't implement the trait immediately.
  4. Options 2 and 5 are interesting, but I don't think they're short-term priorities.

@sffc (Member Author) commented Dec 11, 2021

Thanks; I updated my post to reflect your suggestions.

@sffc (Member Author) commented Dec 11, 2021

A few responses to new points you raised:

I'll also add: as a browser developer, code dealing with untrusted data is exactly where I would have the strongest preference for panicking over GIGO. If I'm accepting data from a potentially adversarial source, the last thing I want is unpredictability. Security bugs often live in the space between your mental model of the code and the actual implementation, because "how does this code behave when it's wedged into an unexpected state?" is not a well-tested code path. Panics are a bad user experience, but in security-sensitive code like browsers, a panic is way better than an exploitable vulnerability. As ICU4X developers, we can't know what assumptions our users are making about our output, so it's hard to say for certain that GIGO can't introduce security risks.

My response: Preventing panics/crashes removes an attack vector. There are plenty of CVEs that are based on crashing programs. If data is coming from an adversarial source, the attacker can just as easily give you data that passes the validator but produces malicious results. They can exploit this whether or not we have a validator. So GIGO cannot introduce new security risks.

we shouldn't bake in assumptions about how clients will use untrusted data in ICU4X before we have any such clients. For example, as Henri has mentioned, clients may choose to use a cryptographic signature to validate data bundles, in which case they could very reasonably choose not to do any additional load-time validation.

My response: I see two modes: trusted data and untrusted data. Cryptographically signed data goes into the trusted mode. Untrusted means you got a Postcard or JSON blob from somewhere you may not trust and you want to parse it with 3-point safety.

I proposed option 4 as a simple approach that we could do right now with no additional work, and option 3 as a possible follow-up building on 4.

My response: I proposed option 1 as a simple approach that we could do right now with no additional work, and option 3 (or 2 or 5) as a possible follow-up if there is a perf benefit.

On Option 4: It puts unsafe data as a second-class citizen. It removes data validation from the data provider, which means that clients can no longer rely on ICU4X having 3-point safety out-of-the-box. I see this as the opposite of good API design. It changes the contract of ICU4X from "robust algorithms converting data to localized results" to "algorithms that normally produce localized results but could crash your app unless you either check for data integrity or read a tutorial on how to manually validate at runtime".

@sffc (Member Author) commented Dec 11, 2021

To be clear: my preferred path forward is to run the data validator when deserializing unstructured data blobs (or do GIGO, which you don't want to do, which is okay), put it all together into an end-to-end test suite, and measure the performance difference with the validator turned on and off. If there is a significant enough perf improvement, then we can prioritize designing option 2, 3, or 5.

@sffc (Member Author) commented Dec 11, 2021

One more point I'll raise. This problem is coming up now only because we are starting to add more unstructured data blobs to ICU4X. I have said before, and my position remains, that we should generally express data with as much structure as possible. The more structured our data structs are, the less validation we need to perform: the more we can just hand off to Serde. Running a data validator in serde::Deserialize is simply your way of stating that you are deserializing data that Serde doesn't know how to deserialize properly.

@Manishearth (Member) commented:

Option 3 keeps the status quo of Serde, but allows validation code to be only run conditionally by introducing a trait to mark certain data structures as requiring additional validation. In this solution, Serde deserialization would guarantee memory safety, and the trait would be used for panic and termination safety.

Can you sketch out what this would look like? I struggle to see a design for this that doesn't rely on specialization.

@Manishearth (Member) commented:

Option 4 is what iainireland suggested in #1183. Serde would produce panicky structures on untrusted data, and the user would be required to call an additional validation function to prevent panics. This is similar to Option 3, but it moves the validation into userland rather than as part of the data provider.

While #1183 is more tightly scoped, note that this option would require a lot more work on zerovec, because for zerovec, data validation is a safety guarantee, especially around stuff like VarZeroVec and nested VarZeroVecs. We would have to introduce panicky unvalidated variants of ZV/VZV/ZS/VZS.

That said I'm not sure if this is actually a nontrivial amount of runtime cost.

@Manishearth (Member) commented:

Note: I whipped up a design doc which gives us a truly zero-cost Option 2. This is great because the drawback of Option 2, where mixing dynamic and static data loading leads to more code bloat, is no longer a problem: there is no runtime or even code-size cost.

@iainireland (Contributor) commented:

Two senses of "validate" are being conflated here, and I think it's confusing the discussion somewhat. The first kind of validation is what currently happens in serde. Our choice to use serde+zerovec bakes in a certain amount of overhead that is necessary to do safe deserialization (in the Rust sense of memory safety, which we all agree is non-negotiable). It would certainly be nice for users of static data if we could eliminate this overhead, but right now it looks like we'd have to go all galaxy-brain to do so, and I'm not totally convinced that it's the best use of time / our complexity budget.

The second sense of "validate" is "make sure that the internal invariants of a data structure are upheld to prevent panics". The original example of where this sense of validation is helpful is CodePointTrie, where one array stores indices into another array, and we would like to validate that those indices are in-bounds. Within Rust's type system, I am aware of no more structured way to represent that than our existing code.
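
A simplified sketch of that invariant check (hypothetical field layout, not the real CodePointTrie):

```rust
/// One array (`index`) stores offsets into another (`data`).
struct TrieV1 {
    index: Vec<u16>,
    data: Vec<u32>,
}

impl TrieV1 {
    /// Linear-time validation: run once at load time so that subsequent
    /// reads cannot go out of bounds.
    fn validate(&self) -> Result<(), &'static str> {
        if self.index.iter().all(|&i| (i as usize) < self.data.len()) {
            Ok(())
        } else {
            Err("trie index out of bounds")
        }
    }
}
```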

The other example I know well is case mapping exceptions, where a code point trie stores indices into a packed array of variable-length data structures. There are ways to add structure to this, but after spending some time looking at them, I couldn't find a design that didn't add additional indirections or increase memory usage. Because this is core case-mapping code and is likely to be called in hot loops, I decided to go with the performant option. If we reconsidered that decision, we would still be faced with the problem that, for reasons of space, most character mapping information is stored as a signed delta from the original character to the mapped character, and it's necessary to check that the resulting mapped character is valid. Storing the mapped character directly would be a major memory regression: right now we can map A-Z to a-z using a single trie value with delta 32, where we'd need to store 26 separate mappings without the delta representation.
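
A sketch of the delta representation just described (simplified to a single free function), showing why the mapped result must be re-checked:

```rust
/// Apply a signed delta to a code point. A single trie value with delta 32
/// maps 'A'..='Z' to 'a'..='z', but garbage data can push the result out
/// of the valid scalar-value range, so the result must be re-validated:
/// `char::from_u32` rejects surrogates and out-of-range values.
fn map_char(c: char, delta: i32) -> Option<char> {
    let mapped = (c as u32).wrapping_add(delta as u32);
    char::from_u32(mapped)
}

// map_char('A', 32) == Some('a')
```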

In short, there are good reasons for us to have internal invariants in our data structures, and I don't think it's a sign that we've made a design error, or that Serde doesn't know how to serialize our data. I reiterate that none of these invariants affect safety in the Rust sense: the kind of safety validation for VarZeroVec that Manish mentions is distinct. Even if we did no validation at all, we can safely panic in the face of malformed or malicious data. (That is, it's safe in the Rust sense. It's not "3-point safe" in Shane's terminology, but I'm not convinced that conflating panics with memory unsafety is useful.)

In the immediate term, there is a question about how we should address the second kind of validation. Avoiding the overhead of the first would be nice, but it's more urgent to reach consensus on how we should handle bad data in code people are currently trying to land. In short: what are the preconditions before we allow ourselves an expect in cases that can only be reached with bad data?

We all agree that it makes sense to validate data in the transformer when we create a data structure with internal invariants. If you are using static data, then this is enough validation to ensure that you will never panic at runtime. Depending on confidence about data integrity, this may also be sufficient for some users of dynamic data. (For example, if you are shipping cryptographically signed opaque blobs to phones, and they were validated server-side when you created them, then there will be no panics.) If you are loading untrusted dynamic data, though, you likely want to validate that data.

Option 4 boils down to "we write the internal-invariants validation code that we already want for the transformer, and make it public for anybody who wants it". This covers the static case and the trusted dynamic case with no effort on their part, and requires users of untrusted dynamic data to do some additional work.

Shane's stance is that asking users of untrusted dynamic data to call a validate method on data structures they receive from a data provider is an undue burden. In particular, he raises the very reasonable point that this has to be done in a variety of places, and it would be less error-prone if it could be done in a central location.

My response to this is option 3: okay, so let's add a validate_internal_invariants() trait method in an appropriate place (maybe DataMarker?), with an empty default implementation. Data structures with internal invariants add an implementation that just calls the validation code we already wrote for option 4. It's then easy for us to make data providers that do or don't call validate_internal_invariants on the payload before returning it. This lets us avoid completely unnecessary overhead in the case of trusted data providers, while still providing panic-safety for untrusted data users who want it.
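
A minimal sketch of that shape (names hypothetical, per the comment above; where the method actually lives is an open question):

```rust
/// Hypothetical error type for the sketch.
struct ValidationError;

trait ValidateInternalInvariants {
    /// Empty default implementation: most payloads have no extra invariants.
    fn validate_internal_invariants(&self) -> Result<(), ValidationError> {
        Ok(())
    }
}

// A data struct with internal invariants overrides the default and simply
// calls the validation code already written for Option 4. An untrusted
// data provider calls `validate_internal_invariants()` on the payload
// before returning it; a trusted provider skips the call entirely.
```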

We will need 3 (or some improved alternative) before 1.0. Right now, especially for code in experimental, option 4 should be sufficient. We should support untrusted dynamic data as a fully equal use case, but that doesn't imply we should deliberately make deserialization slower for our existing static data users to avoid a short-term ergonomic burden on users of untrusted dynamic data who aren't even using our code yet.

@sffc (Member Author) commented Dec 14, 2021

I'll post a full reply soon, but to start, here are some figures on the breakdown of cost in work_log.rs:

  • If I remove everything from work_log so that it is only the main function: 246487 Ir
  • If I add back icu_testdata::get_static_provider(): 556406 Ir
  • If I add back the DTF constructor that reads the data: 713981 Ir
  • If I add back everything: 760019 Ir

In other words, the bulk of time in work_log is spent in data loading, which is mostly deserialization. Zero-copy helped a bunch, but if we can use CrabBake, it will go significantly lower.

@sffc (Member Author) commented Dec 14, 2021

Two senses of "validate" are being conflated here, and I think it's confusing the discussion somewhat.

Serde's internal validation is ensuring that the bytes it is given conform to the invariants it knows about: everything that can be represented as structured data. When a data struct is fully structured, Serde covers 100% of the validation that needs to occur.
But when a data struct contains unstructured data that relies on invariants that can't be represented in the type system, the core Serde library is no longer able to help us; the requirement is transferred to the serde::Deserialize impl.

Therefore, I see the two types of validation ("automatic" for structured data and "manual" for unstructured data) as being equal for the purposes of the correctness of serde::Deserialize.

A corollary is that structs using #[derive(serde::Deserialize)] should have all public fields, just to clearly delineate the limits of "automatic" or "type 1" validation. If you have private fields with invariants, it means that you need to perform "manual" or "type 2" validation.
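
For illustration (hypothetical structs), the distinction looks like this:

```rust
use serde::Deserialize;

/// Fully structured: all fields public; the derive ("type 1" validation)
/// covers every invariant the type has.
#[derive(Deserialize)]
pub struct StructuredV1 {
    pub names: Vec<String>,
}

/// A private field with an invariant the type system cannot express
/// (here: `packed` must parse as length-prefixed records). The derive
/// alone no longer suffices; a hand-written Deserialize impl must perform
/// the "type 2" validation before the value is allowed to exist.
pub struct UnstructuredV1 {
    packed: Vec<u8>,
}
```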

I want to emphasize though that "automatic" validation exists and will continue happening so long as we keep using Serde for data loading, even if you don't see it when you write ICU4X code.

Because this is core case-mapping code and is likely to be called in hot loops, I decided to go with the performant option. ... In short, there are good reasons for us to have internal invariants in our data structures, and I don't think it's a sign that we've made a design error, or that Serde doesn't know how to serialize our data.

This is a totally valid reason to use unstructured data. I trust your judgement to make that call.

I'm just trying to say that if you choose to use unstructured data, the burden is on you to write the code to do the work that Serde would normally be doing if you were able to express your data with more structure. In other words, unstructured data means you are making "type 1" validation weaker, and you need to make "type 2" validation stronger to compensate. Using unstructured data does not mean you can abdicate the responsibility to enforce all invariants when deserializing.

I reiterate that none of these invariants affect safety in the Rust sense: the kind of safety validation for VarZeroVec that Manish mentions is distinct. Even if we did no validation at all, we can safely panic in the face of malformed or malicious data. (That is, it's safe in the Rust sense. It's not "3-point safe" in Shane's terminology, but I'm not convinced that conflating panics with memory unsafety is useful.) ... In short: what are the preconditions before we allow ourselves an expect in cases that can only be reached with bad data?

The style guide for ICU4X has said since the beginning that we do not panic in the core library. This is the practice we have applied for all library code so far.

For example, when accessing a weekday name from the data struct in DateTimeFormat, we return an error if the weekday name is not there (if the vector is too short). In some sense, this is GIGO, except we take advantage of the fact that we are already inside of a fallible API in order to return an Err.

We could alternatively choose to impose an invariant on the length of the array. To do this, we could make an exotic type called WeekdayNames with a private field and write a custom serde::Deserialize impl for it. This would count as "type 2" validation.

Both of these approaches are valid, since they both avoid a panic. The only invalid thing would be to panic when trying to access a weekday name that isn't there. In other words, serde::Deserialize does lots of validation to guarantee that it gives us type-safe structured and unstructured data. Then, when consuming that data, the burden is on us to not panic.
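
A sketch of the fallible access described above (hypothetical types and error variant):

```rust
enum Error {
    MissingWeekdayName,
}

/// Returns an Err instead of panicking when the vector is too short.
/// The alternative is an exotic `WeekdayNames` type whose custom
/// serde::Deserialize impl enforces the length invariant up front.
fn weekday_name(names: &[String], weekday: usize) -> Result<&str, Error> {
    names
        .get(weekday)
        .map(String::as_str)
        .ok_or(Error::MissingWeekdayName)
}
```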

It sounds like the world you are trying to propose, and correct me if I'm wrong, is that we basically change the style guide to allow an exception for panics when reading invalid data. If this is your position, that is a very concrete proposal we could make in front of the larger group.

Right now, especially for code in experimental, option 4 should be sufficient.

One of the key constraints we relax for experimental code is that it doesn't yet need to conform to the ICU4X style guide. So, yes, I agree that "option 4" is satisfactory for experimental code.

We will need 3 (or some improved alternative) before 1.0. ... We should support untrusted dynamic data as a fully equal use case, but that doesn't imply we should deliberately make deserialization slower for our existing static data users to avoid a short-term ergonomic burden on users of untrusted dynamic data who aren't even using our code yet.

If Manish's CrabBake proposal made it into 1.0, then would you be happy adding the "type 2" validation to any type that needs it?

@sffc (Member Author) commented Mar 31, 2022

2022-03-31:

  • CrabBake is on the roadmap for 1.0; @Manishearth and @robertbastian are working on it. All data will be supported except maybe DateTime skeletons. There may be paths forward to support more keys.
  • In a world with CrabBake, we can let Serde-based data run validation by default, since CrabBake data is pre-validated at compile time. LGTM: @sffc, @iainireland
  • @robertbastian - I wonder if we still want to have a trusted Serde path.
  • @sffc - I prefer to keep just the two paths (trusted CrabBake, untrusted Serde).

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 31, 2022
@sffc sffc added this to the ICU4X 1.0 milestone Mar 31, 2022
@robertbastian robertbastian removed this from the ICU4X 1.0 milestone Mar 31, 2022