-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String size of mangled function names could be exponentially large #106
Comments
Is there any concern about DoS? I'm in particular imagining something like a component which parses user-controlled JSON, expecting it to correlate to one of a set of wit types. And in the case of a mismatch, logs the generated type name. Or some other pathway to user-controlled generation of these, if they can be exponentially large. |
I'd also like to echo Alex's concerns with this mangling scheme. Per my limited understanding, these strings exist only to facilitate the "componentizer" (today While certainly we can rename the imports and exports in the inner module to reduce the size of these strings in the output component (since they are really an implementation detail), I'd really prefer not having to touch the original module that's being wrapped as a component, other than perhaps stripping a custom section or two. Unless there's another use case for the mangling other than conveying information to the "next step" in producing a component, I'd also favor sticking type definitions (or just the original |
I think whether or not DoS is a concern depends on what's being implemented and where. For example I suspect a lot of the interface-to-mangled-name translation and back is probably developer-tooling-related where DoS is less of a security issue and more of a "well my builds are just slow" issue - which while still a problem is a bit less so. That being said as you mention it's tough I think to predict when/where this will be used and it's definitely a fact that any tool trying to performing the mapping will need to have limits baked into it which limits the size of the input that it can process or output it can produce (and from fuzzing experience it's very easy to forget these limits) One of the major contraints for the overall operation here is that the output of LLVM/LLD as-is today needs to be able to produce a core module that can be "componentized". For example one idea I had was to possibly just have an actual comopnent type section as a custom section along with a custom import/export section mapping the core wasm's imports/exports into the component types, but I don't think that this format would be easy to produce with LLVM/LLD as-is. Even including the |
Agreed that, since the canonical ABI is a mostly toolchain-internal detail (that can be rewritten to short names in the final Independently, we had talked about finding a way to faithfully represent Wit type aliases in component binaries so that they can be reproduced from just the binary; if we had that ability, then it would be fairly natural to reference type aliases by name, reusing types the same way as Wit. But the problem is that there would be no direct way to map from a use of a type index (in a function signature) to its associated import/export name (that wouldn't be heuristic). This might be related to an alternative solution to #102 (although #102 is more concerned with resource types); I'll look into whether there might be a joint fix. |
Yeah, I feel some tension between
|
If we do get the direct mapping between export names and type indices (which, based after some initial discussions today, does seem like a tentatively coherent idea), then there should be a unique representation, such that you wouldn't even call this a "compression scheme" but, rather, just a "faithful representation of the Wit". 3 is an important concern, and I think it would be addressed by subtyping, which would say that it's always compatible to (1) export a new type alias and (2) go back and forth between uses of type aliases and equivalent inline structural types. It would be an incompatible change to remove an exported type alias, but that sounds right, since it could break code using the type alias. |
This is additionally compounded by the fact that mangled name strings are repeated (potentially many times) internally in components, both in instantiation argument names and in export aliasing. This mangling scheme really doesn't sit well with me and is unnecessarily complicating the tools all to avoid having to feed wit files into a tool like I personally think name mangling should not be part of the component model proposal. |
Personally I think that having a translation from a core wasm to component, in isolation, can be useful for a few reasons:
I may be forgetting a few things here, too. Otherwise though I don't know if it's possible to solve this issue of exponential size and additionally keep these benefits I'm thinking of. Personally I don't think we should cast this aside just yet, but I do think it's more urgent than other issues to figure this out. If there's no great solution to this then I think it should be removed from the component model -- but this requires recalibrating plans for other tooling such as |
I should perhaps amend my statement that this iteration of name mangling should not be included in the proposal. I don't think it's valuable to implement tooling around something we know is very far from a reasonable scheme; it's just wasting effort. The bottom line is I don't think we can adequately and efficiently encode the type information in import and export name strings alone. We need to come up with a better solution before moving forward on an implementation of a tool that consumes this information. |
Just to give a bit more concrete details of what I was thinking about above, let's say we make these changes to the Component Model (semantics and Canonical ABI):
With those changes to the Component Model, then Wit-to-Component compilation could preserve name information by emitting uses of imported/exported type indices. I still can't claim to have thought and talked through this in detail, but does that make sense? |
Honestly, that sounds much more complicated for the tooling than sticking everything in a custom section (or multiple) with an expressive format — one not inherently limited to core wasm semantics — that can ultimately be discarded by the componentizer when it's done. |
One important problem that I think this issue highlights is that this isn't really a detail of the Canonical ABI but, rather, a lack of expressivity in component-level types as well. In particular, if we have a goal of being able to recover a Wit file from a component binary, then I think we need step 1 to "wire up" the names as they were originally in the Wit. Given that, steps 2 and 3 seem like "business as usual" with the Canonical ABI (such that, if we wanted to use custom sections, we should do it for everything, not just types). |
Can you elaborate on what you think is missing from the binary format today that prevents us from synthesizing a wit definition from a component file? We have tooling that does exactly this already implemented, including emitting type aliases based on type export names. And apologies if I'm coming off a bit agitated here, but currently The tool isn't currently validating the expected mangled imports/exports names nor does it flow the managed names through in instantiation arguments or aliasing (lots o' potentially long strings being repeated). While I'll get it fixed, it's rather annoying given the tool doesn't even need this information to operate. I do agree that ultimately componetization shouldn't depend on the wit definitions, but I would like the proposal, and by extension the tooling, to encode this intermediary information as efficiently as possible both as in the intermediary core module, but also in the final component (ideally being completely absent; i.e. stripped from the inner core module). |
The challenge I think is if I have a component: (component
(type $T (list string))
(export "a" (type $T))
(export "b" (type $T))
(func (export "f") (param "x" $T) ... )
) since there is no explicit relation between the type-index |
The tooling is expecting a unique type index for an aliased type in the wit, for example: (component
(type $T (list string))
(alias outer 0 $T (type $A1))
(alias outer 0 $T (type $A2))
(export "a" (type $A1))
(export "b" (type $A2))
(core module $m
(memory (export "mem") 0)
(func (export "x") (param i32 i32) unreachable)
(func (export "realloc") (param i32 i32 i32 i32) (result i32) unreachable)
)
(core instance $i (instantiate $m))
(alias core export $i "x" (core func $x))
(alias core export $i "realloc" (core func $realloc))
(alias core export $i "mem" (core memory $mem))
(func (export "foo") (param "x" $T) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
(func (export "bar") (param "x" $A1) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
(func (export "baz") (param "x" $A2) (canon lift (core func $x) (realloc $realloc) (memory $mem)))
) That should be enough for the tooling to print:
Is that not sufficient to disambiguate? The tooling doesn't do so right now, but I suspect that's a bug (it's also out of date with the wit it prints). |
That's a good way to encode Wit into component types, but if the goal is for Wit to be isomorphic to component types (such that we can ask for the |
It seems to me that there's a myriad of problems of having full isomorphism with respect to For example, you could define a record type in a component and refer to the record type by type index in a function type without ever exporting the record type with a name. As records (any many other One might argue that one could not have produced such a component with a Currently our tooling errors when it encounters this as the wit-based componentization tool should have exported any record types in the output component. Personally, I think To me, at least, the text and binary formats are the isomorphic representations of types in the component model; At any rate, this discussion is a bit off in the weeds now and probably belongs in another forum, sorry about that 😞. |
I think it's a valid question of whether having an isomorphism is a goal and a good one to discuss here so no worries about that. From my perspective, I think it would be pretty valuable to the ecosystem, since it would mean that Wit "covered" the whole surface area of the component model and vice versa. It would mean that there would never be a need to drag along a separate For the cases you brought up, it seems like they could be addressed by expanding Wit to be a bit more uniform in its treatment of the different kinds of types:
|
This was discussed today in the wit-bindgen meeting and to write down the conclusions (but correct me if I'm misremembering):
|
After discussing this more 1:1 with @pchickey, he pointed out a problem we'll have with the approach of "declaring reusable types as part of mangled name" around imports: the linker may discard an import if it isn't called. Thus we won't know which linker attribute applied to an import's declaration in the original source will survive through to the core module and this makes it impossible to know where a type declaration should go at the point we're expecting to be mangling these names. We again discussed the custom section approach, where the custom section would basically be an "interface-describing" component containing a component type section (for all the component model representation of the wit-declared types), an alias section (for type aliases), and an export section (for naming the types). Essentially a full-fidelity representation of the original wit definition, but using the component model binary format. The "mangled" name could then be as simple as As we previously discussed, this felt like a non-starter because we would not be able to include this information, at least initially, for languages like C where it isn't possible to annotate the source with a splat of bytes intended for a custom section. However, what if we had a componetizer that initially supported both ways of componetizing a core module: either by ingesting a fully self-describing core module or by being explicitly informed of the interface definitions via parsing wit files? The intention would be to eventually support only the self-describing core module as this would be what the component model proposal itself would describe, but for languages that can't yet, there's tooling around using the external wit definitions for componetization. We could then perhaps work on getting support for custom section annotations in more compiler front-ends (e.g. clang?), eventually deprecating the need to support componentization via side-channel wit definitions. I realize there's no obvious stand-out solution to this "exploding name mangling" problem, so I do want to keep this discussion going until we reach a consensus. |
@alexcrichton @peterhuene and I talked a bunch about how to solve this today: We decided that mangled strings with declared types are not viable for import functions, because a library like wasi-libc would have to declare the From there, we decided that the best possible path is to use a custom section in core wasm. Peter has already created a tool which gives a wit file an isomorphic binary encoding, over for the Wit-bindgen can take that binary format and put it directly into Rust source using Then, |
That sounds good to me. Reusing the binary format (and supporting machinery) of the component model to encode the destination type directly does seem like a more direct solution to the immediate problem than hacking more information into the Canonical ABI. If bindings are ultimately being generated from a single target Independently of solving the immediate type-explosion problem, I still think we should discuss the goal of maintaining an isomorphism between Wit and component types, but that's less pressing and I expect is going to come up out of necessity as part of nailing down resource type imports and exports. |
A world is more specific than a library (like wasi-libc) or crate (like |
I experimented with this and I believe that we can support C. While clang doesn't have suppot for a source-level definition that becomes a custom section in the final wasm binary, what we can do is something that looks like:
This is an example of using Otherwise wanted to also say I think this is a great idea @peterhuene and @pchickey, I'm glad y'all discovered the but-what-import-is-actually-linked issue! |
Assuming this solution sticks and thinking about how to close out this issue: we could simply remove the name mangling from the final section of CanonicalABI.md however, in the same way that the goal of the original name mangling scheme was to define a core-wasm-compilation-target that could be independently targeted across core-wasm-producer toolchains, allowing tools like If so, returning to my last question above ("could the custom section define a single target component type?"): If the intention is to use |
…codealliance#322)" This reverts commit 6aadac3. We chose not to use this approach to smuggling component type information in core modules. See WebAssembly/component-model#106 (comment) for more details.
…" (#329) This reverts commit 6aadac3. We chose not to use this approach to smuggling component type information in core modules. See WebAssembly/component-model#106 (comment) for more details.
Just to follow up on my previous comment to capture the counterargument explained in the last component tooling meeting: while there is a single up-front Another point that came up that sparked some enlightenment for me was the idea that each of those |
…codealliance#322)" (bytecodealliance#329) This reverts commit 126f8e8. We chose not to use this approach to smuggling component type information in core modules. See WebAssembly/component-model#106 (comment) for more details.
Is there anything left keeping this issue open at this point? |
Ah no, I think this is all handled. |
…" (#329) This reverts commit 5e38ee0. We chose not to use this approach to smuggling component type information in core modules. See WebAssembly/component-model#106 (comment) for more details.
I haven't been following the name mangling discussions much but was reviewing bytecodealliance/wit-bindgen#309 when I saw a change that got me thinking about the name mangling given a diff like this:
Two consequences that come to mind for this name mangling scheme are:
Primarily the stringify-the-whole-type expansion means that the output string can have an exponential size with respect to the input type. This has come up quite a bit when fuzzing the component model in Wasmtime where any tree-based (in a sense) representation like this quickly runs out of memory while fuzzing, even for reasonable (ish) inputs. For example in the diff above the
enum
is printed twice, once on each use. Especially with variants it's been seen to get reasonably large while reusing types a lot. To be clear though I don't have a specific use case in mind when I'm worried about there being exponential size, this is more of a future-compatibility concern where anything using the name mangling, especially coupled with fuzzing, will have to severly limit its input size to ensure that the string doesn't become too large.Otherwise more practically this means that for WASI I'd worry about first-impressions and the size of import strings and modules. In WASI today basically all functions return
result<T, errno>
whereerrno
is a large enumeration (~75 I think as of now) which means that every WASI function would be imported with a nearly 1kb long string. While not really a technical limitation per se this seems like it's going to raise a lot of eyebrows when a "minimal component" will have so many large strings embedded within it.The first issue could hypothetically be fixed with an improved mangling scheme that implicitly defines type references. For example the above function could be mangled to something like
roundtrip-enum: func(a: enum { a, b, c }) -> t0
where. This be a significant increase in complexity, however. The second issue I'm not sure how to fix other than trying to maintain state across demangling an entire module. Although perhaps one idea would be to not use import/export names to contain name mangling for the core module -> component translation but instead have a separate custom section which looks more like the component type section which is used to assign component model types to the core wasm imports/exports. (the translation would then largely "just" remove the custom section and wrap it in a component with the same contents)The text was updated successfully, but these errors were encountered: