Fast path for trusted data #1290
Another possible path to take here is to use compile-time code generation of the data provider structures; basically reaping the benefits of a
We had a 2.5-hour discussion on this subject today: https://docs.google.com/document/d/1P2JDD06ERrVADYG9a8-iTHRMlaPpErqZnk6maTO1FOE/edit#

Principles where I think we agree:
Principles where I think we do not agree:
Position A is generally held by Shane, and Position B is generally held by Iain.

Options for how to proceed:

Option 1: Uniformly enforce invariants or use GIGO

In Option 1, all data coming from the Serde deserialize function should either be validated to enforce the safety requirements (3-point: no panic, memory safety, termination), or the data struct should relax its invariants and instead use GIGO. This decision can be made on a case-by-case basis; GIGO should be preferred if the validation function is expensive (superlinear). Both shapes are sketched below, after the cons list. Pros:
Cons:
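To make the two shapes in Option 1 concrete, here is a minimal, hypothetical sketch (struct and field names are invented, not actual ICU4X types): the Deserialize impl enforces the invariant up front, while the accessor shows the GIGO alternative of tolerating bad indices at use time.

```rust
use serde::{de::Error, Deserialize, Deserializer};

/// Invariant: every entry in `index` is a valid index into `values`.
#[derive(Debug)]
pub struct LookupTable {
    index: Vec<u16>,
    values: Vec<String>,
}

#[derive(Deserialize)]
struct LookupTableUnchecked {
    index: Vec<u16>,
    values: Vec<String>,
}

impl<'de> Deserialize<'de> for LookupTable {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        let raw = LookupTableUnchecked::deserialize(d)?;
        // Validation path: reject data that violates the invariant (linear cost).
        if raw.index.iter().any(|&i| usize::from(i) >= raw.values.len()) {
            return Err(D::Error::custom("index out of range"));
        }
        Ok(LookupTable { index: raw.index, values: raw.values })
    }
}

impl LookupTable {
    /// GIGO alternative: skip the check above and tolerate bad indices here,
    /// returning None (arbitrary but safe output) instead of panicking.
    pub fn get(&self, i: usize) -> Option<&str> {
        let vi = usize::from(*self.index.get(i)?);
        self.values.get(vi).map(String::as_str)
    }
}
```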
Option 2: Use Serde for untrusted data, and a novel solution for trusted data

Option 2 is an extension of Option 1 that adds an additional code path, separate from Serde, to be used for static, trusted data. The new code path can be hyper-optimized for the static data use case, allowing the Serde code path to be hyper-optimized for the dynamic data use case. One possibility here is to use Rust codegen for the static data; a sketch follows the cons list. We will accept more expensive validation functions instead of GIGO. Pros:
Cons:
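For illustration only, this is roughly the shape the trusted static path could take under Option 2: a datagen-style tool emits Rust source containing already-validated constants, so loading trusted data costs nothing at runtime. The type, constant names, and field set below are invented for the sketch, not the real ICU4X data schema.

```rust
// Hypothetical output of a codegen step: the generator enforced all invariants
// at build time, so there is no deserialization or validation at runtime.
pub struct DecimalSymbols<'a> {
    pub decimal_separator: &'a str,
    pub grouping_separator: &'a str,
}

// This constant would be *generated* from validated CLDR data by a tool such
// as icu4x-datagen, then compiled directly into the binary.
pub const DECIMAL_SYMBOLS_EN: DecimalSymbols<'static> = DecimalSymbols {
    decimal_separator: ".",
    grouping_separator: ",",
};

fn main() {
    // "Loading" trusted static data is just a reference to a const: zero cost.
    let symbols = &DECIMAL_SYMBOLS_EN;
    println!("1{}234{}5", symbols.grouping_separator, symbols.decimal_separator);
}
```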
Option 3: Use Serde with a ValidationNeeded trait

Option 3 keeps the status quo of Serde, but allows validation code to be run only conditionally by introducing a trait to mark certain data structures as requiring additional validation (a minimal sketch follows the cons list). In this solution, Serde deserialization would guarantee memory safety, and the trait would be used for panic and termination safety. Pros:
Cons:
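A minimal sketch of what the marker trait in Option 3 could look like. The trait name comes from the option above; the method, error type, and example struct are invented for illustration.

```rust
/// Structs whose Serde deserialization alone does not guarantee panic and
/// termination safety implement this trait to expose the extra check. A data
/// provider would call it only when the data source is untrusted.
pub trait ValidationNeeded {
    type Error;
    /// Verify internal invariants (e.g., indices in bounds).
    fn validate(&self) -> Result<(), Self::Error>;
}

/// Invented stand-in for a struct with internal invariants, loosely modeled on
/// the CodePointTrie example discussed later in this thread.
pub struct CodePointTrieData {
    index: Vec<u16>,
    data: Vec<u32>,
}

impl ValidationNeeded for CodePointTrieData {
    type Error = &'static str;
    fn validate(&self) -> Result<(), Self::Error> {
        if self.index.iter().any(|&i| usize::from(i) >= self.data.len()) {
            return Err("index out of bounds");
        }
        Ok(())
    }
}
```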
* I do not have data to specifically back this up, but we do have data that Serde deserialization is already the most expensive part of ICU4X in terms of both code size and cycles, and my full expectation is that an efficient validation function contributes only a small percentage to the overall runtime.

Option 4: Use Serde and add a separate validation function

Option 4 is what @iainireland suggested in #1183. Serde would produce panicky structures on untrusted data, and the user would be required to call an additional validation function to prevent panics (a sketch of this flow follows the cons list). This is similar to Option 3, but it moves the validation into userland rather than keeping it as part of the data provider. Pros:
Cons:
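A rough sketch of the userland flow Option 4 implies, assuming a Postcard blob: deserialization stays cheap and may yield panicky structures, and callers handling untrusted data are responsible for the extra validation call. All names and the example invariant below are invented.

```rust
#[derive(serde::Deserialize)]
pub struct CaseMappingData {
    deltas: Vec<i32>,
}

impl CaseMappingData {
    /// Public validation entry point that users of untrusted data must call.
    pub fn validate(&self) -> Result<(), String> {
        for &d in &self.deltas {
            // Example invariant (invented): no delta may exceed the size of
            // the Unicode code point range.
            if d.unsigned_abs() > 0x10FFFF {
                return Err(format!("delta {d} is out of range"));
            }
        }
        Ok(())
    }
}

fn load_untrusted(bytes: &[u8]) -> Result<CaseMappingData, String> {
    let data: CaseMappingData =
        postcard::from_bytes(bytes).map_err(|e| e.to_string())?;
    data.validate()?; // the extra step Option 4 pushes into userland
    Ok(data)
}
```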
Option 5: Introduce an Unsafe mode for Postcard/Serde

Option 5 goes further than Option 3 by engineering Postcard/Serde to use a fast path for trusted data that skips UTF-8 and ULE validation. We would likely introduce a DeserializeUnsafe trait that is similar to Deserialize but is able to run unsafe code (roughly sketched after the cons list). Pros:
Cons:
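A very rough sketch of the kind of fast path Option 5 implies, assuming string data whose UTF-8 validity is the invariant being skipped. The trait shape and all type names are invented for illustration; the real design would hook into Postcard/Serde rather than taking a raw byte slice.

```rust
/// Hypothetical trusted-data counterpart to Deserialize.
pub trait DeserializeUnsafe<'de>: Sized {
    /// # Safety
    /// `bytes` must be exactly what the corresponding serializer produced.
    unsafe fn deserialize_trusted(bytes: &'de [u8]) -> Self;
}

pub struct PatternString<'de>(&'de str);

impl<'de> DeserializeUnsafe<'de> for PatternString<'de> {
    unsafe fn deserialize_trusted(bytes: &'de [u8]) -> Self {
        // Trusted path: skip the UTF-8 scan that str::from_utf8 would perform.
        PatternString(unsafe { std::str::from_utf8_unchecked(bytes) })
    }
}

pub fn deserialize_untrusted(bytes: &[u8]) -> Result<PatternString<'_>, std::str::Utf8Error> {
    // Untrusted path: pay for validation to preserve memory safety.
    Ok(PatternString(std::str::from_utf8(bytes)?))
}
```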
Please let me know whether everything I stated above is accurate and whether we can agree that these are the options we have on the table.
I tried to keep the above post as neutral as possible, though I'm sure some of my biases slipped in. Here is my opinion on the options:
I disagree with the phrasing here. I've repeatedly agreed that dynamic data is an important value proposition, and I don't think performance wins are the only differentiator. My position is simply that dynamic data is not the only value proposition for ICU4X. Our support for dynamic data in ICU4X means that it's not optimally targeted towards users of static data (because of, e.g., ULE overhead), and that is a reasonable design choice. In the same way, it is reasonable for us to make decisions to benefit the static data case that we might not have made if ICU4X were purely targeted at dynamic data users. Neither value proposition should overrule the other; we should make this sort of decision on a case-by-case basis.
I would prefer something along the lines of "Memory safety is a non-negotiable requirement for ICU4X. Panicking should be avoided where possible, but may be an option in cases where it represents a clearly identifiable programming error." (Compare "The panic! macro signals that your program is in a state it can’t handle and lets you tell the process to stop instead of trying to proceed with invalid or incorrect values.", from the Rust book.) (I don't have strong feelings about termination: if pressed, I would probably lean towards the idea that a clean panic is better than non-termination in the case of garbage data. I'm not aware of any other cases where non-termination is a real concern.)
I would reword "Restructuring code to be GIGO is difficult to reason about" as "It is more difficult to reason about the correctness of code that has been restructured to be GIGO...", but otherwise this seems accurate to me. I'll also add: as a browser developer, code dealing with untrusted data is exactly where I would have the strongest preference for panicking over GIGO. If I'm accepting data from a potentially adversarial source, the last thing I want is unpredictability. Security bugs often live in the space between your mental model of the code and the actual implementation, because "how does this code behave when it's wedged into an unexpected state?" is not a well-tested code path. Panics are a bad user experience, but in security-sensitive code like browsers, a panic is way better than an exploitable vulnerability. As ICU4X developers, we can't know what assumptions our users are making about our output, so it's hard to say for certain that GIGO can't introduce security risks.
My long-winded version of position B: The majority of clients are not likely to consume untrusted data. Clients consuming untrusted data have unique requirements, and are likely to be more sophisticated. They may have their own strategies for establishing trust. It is harder to write a one-size-fits-all implementation for untrusted data, so the ideal API for consuming untrusted data will likely be more involved / less automatic than the API for consuming static data. We should probably be guiding unsophisticated users towards the simpler static APIs. We should still make untrusted data as performant and ergonomic as we can, but we shouldn't bake in assumptions about how clients will use untrusted data in ICU4X before we have any such clients. For example, as Henri has mentioned, clients may choose to use a cryptographic signature to validate data bundles, in which case they could very reasonably choose not to do any additional load-time validation. Giving them that flexibility is beneficial.
No arguments here, so long as it's clear that I don't think the current serde overhead is a pressing issue, and my main concern is to avoid adding additional overhead that isn't required by the architecture of ICU4X. I also think Henri's framing during the meeting was good: "where reasonable, we should not make users pay a performance cost for features they don't use."
This seems mostly reasonable to me. A few comments:
Thanks; I updated my post to reflect your suggestions.
A few responses to new points you raised:
My response: Preventing panics/crashes removes an attack vector. There are plenty of CVEs that are based on crashing programs. If data is coming from an adversarial source, the attacker can just as easily give you data that passes the validator but produces malicious results. They can exploit this whether or not we have a validator. So GIGO cannot introduce new security risks.
My response: I see two modes: trusted data and untrusted data. Cryptographically signed data goes into the trusted mode. Untrusted means you got a Postcard or JSON blob from somewhere you may not trust and you want to parse it with 3-point safety.
My response: I proposed option 1 as a simple approach that we could do right now with no additional work, and option 3 (or 2 or 5) as a possible follow-up if there is a perf benefit.

On Option 4: It treats unsafe data as a second-class citizen. It removes data validation from the data provider, which means that clients can no longer rely on ICU4X having 3-point safety out-of-the-box. I see this as the opposite of good API design. It changes the contract of ICU4X from "robust algorithms converting data to localized results" to "algorithms that normally produce localized results but could crash your app unless you either check for data integrity or read a tutorial on how to manually validate at runtime".
To be clear: my preferred path forward is to run the data validator when deserializing unstructured data blobs (or do GIGO, which you don't want to do, which is okay), put it all together into an end-to-end test suite, and measure the performance difference with the validator turned on and off. If there is a significant enough perf improvement, then we can prioritize designing option 2, 3, or 5.
One more point I'll raise. This problem is coming up now only because we are starting to add more unstructured data blobs to ICU4X. I have said before, and my position remains, that we should generally express data with as much structure as possible. The more structured our data structs are, the less validation we need to perform: the more we can just hand off to Serde. Running a data validator in serde::Deserialize is simply your way of stating that you are deserializing data that Serde doesn't know how to deserialize properly.
Can you sketch out what this would look like? I struggle to see a design for this that doesn't rely on specialization.
While #1183 is more tightly scoped, note that this option would require a lot more work on zerovec, because for zerovec data validation is a safety guarantee, especially around types like VarZeroVec. That said, I'm not sure if this is actually a nontrivial amount of runtime cost.
Note: I whipped up a design doc which gives us a truly zero-cost Option 2 (which is great because the drawback of Option 2, where mixing dynamic and static data loading leads to more code bloat, is no longer a problem, since there is no runtime or even code size cost to this).
Two senses of "validate" are being conflated here, and I think it's confusing the discussion somewhat.

The first kind of validation is what currently happens in serde. Our choice to use serde+zerovec bakes in a certain amount of overhead that is necessary to do safe deserialization (in the Rust sense of memory safety, which we all agree is non-negotiable). It would certainly be nice for users of static data if we could eliminate this overhead, but right now it looks like we'd have to go all galaxy-brain to do so, and I'm not totally convinced that it's the best use of time / our complexity budget.

The second sense of "validate" is "make sure that the internal invariants of a data structure are upheld to prevent panics". The original example of where this sense of validation is helpful is CodePointTrie, where one array stores indices into another array, and we would like to validate that those indices are in-bounds. Within Rust's type system, I am aware of no more structured way to represent that than our existing code. The other example I know well is case mapping exceptions, where a code point trie stores indices into a packed array of variable-length data structures. There are ways to add structure to this, but after spending some time looking at them, I couldn't find a design that didn't add additional indirections or increase memory usage. Because this is core case-mapping code and is likely to be called in hot loops, I decided to go with the performant option. If we reconsidered that decision, we would still be faced with the problem that, for reasons of space, most character mapping information is stored as a signed delta from the original character to the mapped character, and it's necessary to check that the resulting mapped character is valid (a small sketch of this check appears at the end of this comment). Storing the mapped character directly would be a major memory regression: right now we can map A-Z to a-z using a single trie value with delta 32, where we'd need to store 26 separate mappings without the delta representation.

In short, there are good reasons for us to have internal invariants in our data structures, and I don't think it's a sign that we've made a design error, or that Serde doesn't know how to deserialize our data. I reiterate that none of these invariants affect safety in the Rust sense: the kind of safety validation for VarZeroVec that Manish mentions is distinct. Even if we did no validation at all, we can safely panic in the face of malformed or malicious data. (That is, it's safe in the Rust sense. It's not "3-point safe" in Shane's terminology, but I'm not convinced that conflating panics with memory unsafety is useful.)

In the immediate term, there is a question about how we should address the second kind of validation. Avoiding the overhead of the first would be nice, but it's more urgent to reach consensus on how we should handle bad data in code people are currently trying to land. In short: what are the preconditions before we allow ourselves an operation that can panic?

We all agree that it makes sense to validate data in the transformer when we create a data structure with internal invariants. If you are using static data, then this is enough validation to ensure that you will never panic at runtime. Depending on confidence about data integrity, this may also be sufficient for some users of dynamic data. (For example, if you are shipping cryptographically signed opaque blobs to phones, and they were validated server-side when you created them, then there will be no panics.)
If you are loading untrusted dynamic data, though, you likely want to validate that data. Option 4 boils down to "we write the internal-invariants validation code that we already want for the transformer, and make it public for anybody who wants it". This covers the static case and the trusted dynamic case with no effort on their part, and requires users of untrusted dynamic data to do some additional work. Shane's stance is that asking users of untrusted dynamic data to call a separate validation function themselves is bad API design.

My response to this is option 3: okay, so let's add a ValidationNeeded trait so the data provider can run that validation on their behalf.

We will need 3 (or some improved alternative) before 1.0. Right now, especially for code in
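The case-mapping delta check mentioned above can be illustrated with a few lines of standalone Rust (simplified; the real code stores deltas in a code point trie, and the names here are invented). char::from_u32 returns None for values outside the Unicode scalar value range, so the check can be written without panicking.

```rust
/// Apply a signed delta to a code point, returning None for garbage data
/// instead of panicking (the GIGO-friendly shape of the check).
fn map_with_delta(c: char, delta: i32) -> Option<char> {
    let mapped = (c as i32).checked_add(delta)?;
    char::from_u32(u32::try_from(mapped).ok()?)
}

fn main() {
    // A single delta of 32 maps the whole range A-Z to a-z.
    assert_eq!(map_with_delta('A', 32), Some('a'));
    assert_eq!(map_with_delta('Z', 32), Some('z'));
    // Malformed data (a nonsense delta) yields None rather than a panic.
    assert_eq!(map_with_delta('A', i32::MAX), None);
}
```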
I'll post a full reply soon, but to start, here are some figures on the breakdown of cost in work_log.rs:
In other words, the bulk of time in work_log is spent in data loading, which is mostly deserialization. Zero-copy helped a bunch, but if we can use CrabBake, it will go significantly lower.
Serde's internal validation ensures that the bytes it is given conform to the invariants it knows about: everything that can be represented as structured data. When a data struct is fully structured, Serde covers 100% of the validation that needs to occur. Therefore, I see the two types of validation ("automatic" for structured data and "manual" for unstructured data) as being equal for the purposes of correctness.

A corollary is that structs using unstructured data need the manual validation to cover what the automatic validation cannot.

I want to emphasize, though, that "automatic" validation exists and will continue happening so long as we keep using Serde for data loading, even if you don't see it when you write ICU4X code.
This is a totally valid reason to use unstructured data. I trust your judgement to make that call. I'm just trying to say that if you choose to use unstructured data, the burden is on you to write the code to do the work that Serde would normally be doing if you were able to express your data with more structure. In other words, unstructured data means you are making "type 1" validation weaker, and you need to make "type 2" validation stronger to compensate. Using unstructured data does not mean you can abdicate the responsibility to enforce all invariants when deserializing.
The style guide for ICU4X has said since the beginning that we do not panic in the core library. This is the practice we have applied for all library code so far. For example, when accessing a weekday name from the data struct in DateTimeFormat, we return an error if the weekday name is not there (if the vector is too short). In some sense, this is GIGO, except we take advantage of the fact that we are already inside of a fallible API in order to return an error instead.

We could alternatively choose to impose an invariant on the length of the array. To do this, we could make an exotic type that enforces the length invariant at construction time. Both of these approaches are valid, since they both avoid a panic. The only invalid thing would be to panic when trying to access a weekday name that isn't there. (Both styles are sketched below.)

It sounds like the world you are trying to propose, and correct me if I'm wrong, is that we basically change the style guide to allow an exception for panics when reading invalid data. If this is your position, that is a very concrete proposal we could make in front of the larger group.
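A minimal sketch of the two non-panicking styles described above, using an invented WeekdaySymbols type and error enum rather than the real DateTimeFormat data struct. Both styles avoid the indexing operation that could panic on a too-short vector.

```rust
pub struct WeekdaySymbols {
    names: Vec<String>,
}

#[derive(Debug)]
pub enum DateTimeError {
    MissingWeekdayName(usize),
}

impl WeekdaySymbols {
    /// Style 2: impose the length invariant at construction time, so later
    /// accesses cannot fail (the "exotic type" approach).
    pub fn try_new(names: Vec<String>) -> Result<Self, DateTimeError> {
        if names.len() < 7 {
            return Err(DateTimeError::MissingWeekdayName(names.len()));
        }
        Ok(WeekdaySymbols { names })
    }

    /// Style 1: we are already inside a fallible API, so a too-short vector
    /// surfaces as an Err instead of a panic.
    pub fn weekday_name(&self, weekday: usize) -> Result<&str, DateTimeError> {
        self.names
            .get(weekday)
            .map(String::as_str)
            .ok_or(DateTimeError::MissingWeekdayName(weekday))
    }
}
```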
One of the key constraints we relax for
If Manish's CrabBake proposal made it into 1.0, would you be happy adding the "type 2" validation to any type that needs it?
2022-03-31:
By default, we assume that data is untrusted: we don't guarantee anything about the bytes read by the data provider at runtime. We assume that the bytes could be malformed or malicious. This is important for dynamically loaded data. For further discussion, see #1183.
However, it is a common case that we have fully trusted data as well. For example, data built statically into the binary, or downloaded and verified with a cryptographic signature, does not need to be validated at runtime, because it is not possible to manipulate it between icu4x-datagen and the code reading the bytes from memory.

This issue is to track design and implementation around a fast path for data loading that is built with different assumptions (trusted data).