We have been discussing this in a couple of recent Wasmtime meetings and on Zulip, and I figured it was time to centralize discussion in a tracking issue.
What does handling OOM mean in this case? It means turning allocation failure into an `Err(...)` return and ultimately propagating that up to the Wasmtime embedder. It may even involve poisoning various data structures, possibly up to a whole store, if necessary, but we haven't fleshed out the details completely yet. That will happen in discussions on this issue and in various PRs during implementation.
Various, unordered sketches of things that will be involved:
- Replace `anyhow::Error` with a custom `wasmtime::Error`. At its most bare-bones, with all cargo features disabled, we will want this to basically be an `enum` without any data payloads in its variants. As we enable more cargo features, we can start adding support for formatting and error context and ultimately get to something like `anyhow::Error` with all features enabled.
- Create a `wasmtime-collections` crate that exposes fallible `Vec`, `HashMap`, etc... This is probably just going to be newtypes over the types we already use today, but `wasmtime_collections::Vec::push` will return a `Result` and be implemented via something like `self.0.try_reserve(1)?; self.0.push(item); Ok(())`, for example.
- We will need custom `serde::Deserialize` implementations that handle OOM failure for the `wasmtime-collections` types we use in our metadata that gets serialized into ELF sections in our compiled code.
- We would ideally like to statically analyze our code and make sure that we aren't allocating infallibly in the relevant code paths. It seems like we can probably use clippy for this, or at least for a 95% solution to this that is Good Enough in practice.
- We need a way to dynamically test/fuzz our OOM handling to make sure we are actually getting it right in practice.
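To make the first two bullets concrete, here is a minimal sketch of what a payload-free error enum and a fallible `Vec` newtype could look like. The names `Error::OutOfMemory`, `FallibleVec`, and `as_slice` are hypothetical, chosen only for illustration; nothing here is the actual `wasmtime::Error` or `wasmtime-collections` API, and the sketch uses `std` for brevity where a no-std build would pull the same types from `alloc`.

```rust
use std::collections::TryReserveError;

/// Hypothetical bare-bones error type: no data payloads in its variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Error {
    OutOfMemory,
}

impl From<TryReserveError> for Error {
    fn from(_: TryReserveError) -> Self {
        // Simplification (assumption): capacity overflow is also reported
        // as out-of-memory, since the enum carries no payload.
        Error::OutOfMemory
    }
}

/// Hypothetical newtype over the standard `Vec` whose `push` is fallible.
#[derive(Debug)]
pub struct FallibleVec<T>(Vec<T>);

impl<T> FallibleVec<T> {
    pub fn new() -> Self {
        // `Vec::new` does not allocate, so this cannot fail.
        FallibleVec(Vec::new())
    }

    /// Reserve space first so allocation failure surfaces as an `Err`
    /// instead of aborting the process.
    pub fn push(&mut self, item: T) -> Result<(), Error> {
        self.0.try_reserve(1)?;
        // Capacity is already reserved, so this push will not allocate.
        self.0.push(item);
        Ok(())
    }

    pub fn as_slice(&self) -> &[T] {
        &self.0
    }
}
```

Callers then propagate failure with `?`, for example `vec.push(item)?;`, which is the shape that would let an allocation failure travel up to the embedder.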
We will initially focus on supporting the following code paths:
- Creating a `Config`
- Creating an `Engine`
- Creating a `Linker`
- Creating `InstancePre`s
- Deserializing pre-compiled `Module`s and `Component`s (not compiling new ones!)
- Creating `Store`s
- Creating `Instance`s
- Creating `Memory`s, `Table`s, `Global`s, etc...
- Running Wasm
Basically, everything that is supported in our no-std/pulley builds now: a basic runtime without the compiler, that can only run pre-compiled Wasm. We will not initially support async or the pooling allocator either, for example. I have vague ideas about how we might be able to refactor the pooling allocator for greater flexibility and enable its use in no-std / no-virtual-memory environments, but that is a bit orthogonal.
Eventually we will want to support async Wasm, yielding on out-of-fuel, ..., and the component model's async functionality. That is going to be a larger project on top of this already large project, so I'm going to delay talking about how we will cross that bridge until we get closer to it.
In practice, I expect that we will start with the OOM testing/fuzzing, create something very simple that fails immediately, and land that as "expected to fail". Then we can get that passing, which will be quite a bit of work for this first iteration. Then we can remove the failure expectation. Then we can exercise a little bit more inside the OOM testing/fuzzing to reveal new places we need to fix, and then we can fix those. We can continue this process until things start to look more and more complete. At some point we will add the clippy lints, initially to smaller modules and eventually to bigger regions of code. But the testing can be the forcing function for which area of code we add OOM handling to at each step of the way.
The best way to dynamically test/fuzz OOM handling that I know of is the approach taken by SpiderMonkey's oomTest() helper: run a piece of code (potentially written by humans or generated by a fuzzer) with a special allocator that will return null on the first allocation made and check that the code didn't fail to handle the OOM, then run that code again but failing on the second allocation, then the third, etc... up to your time/compute budget. Starting by building this infrastructure is my rough plan. I've done a little bit of digging for other approaches to ensuring that your OOM-handling is correct, and I haven't really found anything, just people arguing about whether you should even check for null returns from malloc or not, which is not very helpful. That is to say, if anyone has any other ideas or knows of any other prior art here, I'd love to hear about it!
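The iteration scheme described above can be sketched roughly as follows. This is not SpiderMonkey's implementation and not the planned Wasmtime infrastructure: instead of a special allocator, it passes an explicit failure counter into the code under test, which is a simplifying assumption that keeps the example self-contained. `FailAfter`, `OutOfMemory`, `oom_test`, and `build_squares` are all hypothetical names.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq, Eq)]
pub struct OutOfMemory;

/// Hypothetical stand-in for an instrumented allocator: allows `n`
/// allocations to succeed, then fails every later one.
pub struct FailAfter {
    remaining: Cell<usize>,
    failed: Cell<bool>,
}

impl FailAfter {
    pub fn new(n: usize) -> Self {
        FailAfter { remaining: Cell::new(n), failed: Cell::new(false) }
    }

    /// Called at each point where real code would ask the allocator for memory.
    pub fn check(&self) -> Result<(), OutOfMemory> {
        if self.remaining.get() == 0 {
            self.failed.set(true);
            return Err(OutOfMemory);
        }
        self.remaining.set(self.remaining.get() - 1);
        Ok(())
    }

    /// Whether a failure was actually injected during this run.
    pub fn failed(&self) -> bool {
        self.failed.get()
    }
}

/// Run `run` while failing the 1st allocation, then the 2nd, and so on, up to
/// `max_iters` attempts. Returns `Some(n)` if a run completed with no injected
/// failure (meaning the code makes `n` allocations), or `None` if the budget
/// ran out before every failure point was covered.
pub fn oom_test<T>(
    max_iters: usize,
    mut run: impl FnMut(&FailAfter) -> Result<T, OutOfMemory>,
) -> Option<usize> {
    for n in 0..max_iters {
        let alloc = FailAfter::new(n);
        match run(&alloc) {
            Ok(_) if !alloc.failed() => return Some(n),
            // This sketch treats a swallowed failure as a bug.
            Ok(_) => panic!("injected OOM after {n} allocations was not propagated"),
            Err(OutOfMemory) => assert!(alloc.failed(), "spurious OOM error"),
        }
    }
    None
}

/// Example code under test: one allocation point per pushed element.
pub fn build_squares(alloc: &FailAfter, len: u32) -> Result<Vec<u32>, OutOfMemory> {
    let mut v = Vec::new();
    for i in 0..len {
        alloc.check()?;
        v.try_reserve(1).map_err(|_| OutOfMemory)?;
        v.push(i * i);
    }
    Ok(v)
}
```

For example, `oom_test(10, |a| build_squares(a, 3))` exercises a failure at each of the three allocation points and then observes one clean run. A real harness would hook the allocator itself so that every allocation in the code under test counts as a failure point, not only the ones that are explicitly instrumented.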