Handle OOM in the runtime #12069

@fitzgen

We have been discussing this in a couple of recent Wasmtime meetings [1] and on Zulip, and I figured it was time to centralize discussion in a tracking issue.

What does handling OOM mean in this case? It means turning allocation failure into an Err(...) return and ultimately propagating that up to the Wasmtime embedder. If necessary, it may even involve poisoning various data structures, possibly up to a whole store, but we haven't fleshed out the details completely yet. That will happen in discussions on this issue and in various PRs during implementation.

Various, unordered sketches of things that will be involved:

  • Replace anyhow::Error with a custom wasmtime::Error. At its most bare-bones, with all cargo features disabled, we will want this to basically be an enum without any data payloads in its variants. As we enable more cargo features, we can start adding support for formatting and error context and ultimately get to something like anyhow::Error with all features enabled.
  • Create a wasmtime-collections crate that exposes fallible versions of Vec, HashMap, etc. This is probably just going to be newtypes over the types we already use today, but wasmtime_collections::Vec::push will return a Result and be implemented via something like self.0.try_reserve(1)?; self.0.push(item); Ok(()), for example.
  • We will need custom serde::Deserialize implementations that handle OOM failure for the wasmtime-collections types we use in the metadata that gets serialized into ELF sections in our compiled code.
  • We would ideally like to statically analyze our code and make sure that we aren't allocating infallibly in the relevant code paths. It seems like we can probably use clippy for this, or at least for a 95% solution to this that is Good Enough in practice.
  • We need a way to dynamically test/fuzz our OOM handling to make sure we are actually getting it right in practice.
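To make the first two bullets concrete, here is a minimal sketch of a payload-free error enum and a fallible newtype over the standard Vec. The names Error::OutOfMemory, FallibleVec, and with_capacity are hypothetical and chosen only for illustration; they are not the actual wasmtime::Error or wasmtime-collections API, which has not been designed yet. Only std's Vec::try_reserve and TryReserveError are real.

```rust
use std::collections::TryReserveError;

/// Hypothetical bare-bones error: an enum whose variants carry no payloads,
/// so constructing it never needs to allocate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Error {
    OutOfMemory,
}

impl From<TryReserveError> for Error {
    fn from(_: TryReserveError) -> Self {
        // Both allocator failure and capacity overflow are reported as OOM here.
        Error::OutOfMemory
    }
}

/// Hypothetical newtype over `std::vec::Vec` whose growing methods are fallible.
pub struct FallibleVec<T>(Vec<T>);

impl<T> FallibleVec<T> {
    /// Creating an empty `Vec` does not allocate, so this cannot fail.
    pub fn new() -> Self {
        FallibleVec(Vec::new())
    }

    /// Reserve space for `cap` elements up front, reporting failure as an `Err`.
    pub fn with_capacity(cap: usize) -> Result<Self, Error> {
        let mut inner = Vec::new();
        inner.try_reserve(cap)?;
        Ok(FallibleVec(inner))
    }

    /// Reserve fallibly first; the following `push` then has spare capacity
    /// and does not need to grow the buffer.
    pub fn push(&mut self, item: T) -> Result<(), Error> {
        self.0.try_reserve(1)?;
        self.0.push(item);
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}
```

Callers then propagate failure with the ? operator, for example v.push(x)?, rather than aborting the process on allocation failure.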

We will initially focus on supporting the following code paths:

  • Creating a Config
  • Creating an Engine
  • Creating a Linker
  • Creating InstancePres
  • Deserializing pre-compiled Modules and Components (not compiling new ones!)
  • Creating Stores
  • Creating Instances
  • Creating Memorys, Tables, Globals, etc...
  • Running Wasm

Basically, everything that is supported in our no-std/pulley builds now: a basic runtime without the compiler, that can only run pre-compiled Wasm. We will not initially support async or the pooling allocator either, for example. I have vague ideas about how we might be able to refactor the pooling allocator for greater flexibility and enable its use in no-std / no-virtual-memory environments, but that is a bit orthogonal.

Eventually we will want to support async Wasm, yielding on out-of-fuel, ..., and the component model's async functionality. That is going to be a larger project on top of this already large project, so I'm going to delay talking about how we will cross that bridge until we get closer to it.

In practice, I expect that we will start with the OOM testing/fuzzing, create something very simple that fails immediately, and land that as "expected to fail". Getting that passing will be quite a bit of work for this first iteration, after which we can remove the failure expectation. From there we can exercise a little more inside the OOM testing/fuzzing, reveal new places we need to fix, and fix those, repeating this process until things look more and more complete. At some point we will add the clippy lints, initially to smaller modules and eventually to bigger regions of code. But the testing can be the forcing function for which area of code we add OOM handling to each step of the way.

The best way to dynamically test/fuzz OOM handling that I know of is the approach taken by SpiderMonkey's oomTest() helper: run a piece of code (potentially written by humans or generated by a fuzzer) with a special allocator that will return null on the first allocation made and check that the code didn't fail to handle the OOM, then run that code again but failing on the second allocation, then the third, etc... up to your time/compute budget. Starting by building this infrastructure is my rough plan. I've done a little bit of digging for other approaches to ensuring that your OOM-handling is correct, and I haven't really found anything, just people arguing about whether you should even check for null returns from malloc or not, which is not very helpful. That is to say, if anyone has any other ideas or knows of any other prior art here, I'd love to hear about it!
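The iteration scheme described above can be sketched as follows. This is a simplified illustration, not SpiderMonkey's implementation or a planned Wasmtime API: instead of installing a real failure-injecting allocator, it passes a hypothetical FailingAlloc handle to the code under test, and every name here (FailingAlloc, try_alloc, oom_test, OutOfMemory) is an assumption made up for the example. It also assumes one particular policy, namely that an injected failure must surface as an Err.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq, Eq)]
pub struct OutOfMemory;

/// Hypothetical stand-in for a failure-injecting allocator: the allocation
/// whose zero-based index equals `fail_at` fails, all others succeed.
pub struct FailingAlloc {
    fail_at: usize,
    count: Cell<usize>,
    failed: Cell<bool>,
}

impl FailingAlloc {
    pub fn new(fail_at: usize) -> Self {
        FailingAlloc {
            fail_at,
            count: Cell::new(0),
            failed: Cell::new(false),
        }
    }

    /// Simulates a single allocation request.
    pub fn try_alloc(&self) -> Result<(), OutOfMemory> {
        let n = self.count.get();
        self.count.set(n + 1);
        if n == self.fail_at {
            self.failed.set(true);
            Err(OutOfMemory)
        } else {
            Ok(())
        }
    }
}

/// Runs `f` repeatedly, failing allocation 0, then 1, then 2, and so on,
/// until a run finishes without hitting the injected failure or the
/// `max_iters` budget is exhausted. Returns how many runs had a failure
/// injected.
pub fn oom_test<F>(max_iters: usize, mut f: F) -> usize
where
    F: FnMut(&FailingAlloc) -> Result<(), OutOfMemory>,
{
    for i in 0..max_iters {
        let alloc = FailingAlloc::new(i);
        let result = f(&alloc);
        if !alloc.failed.get() {
            // `f` made at most `i` allocations, so every allocation it
            // makes has already been failed once in an earlier run.
            assert!(result.is_ok(), "error returned without an injected OOM");
            return i;
        }
        // Policy assumed by this sketch: an injected OOM must be propagated.
        assert_eq!(result, Err(OutOfMemory), "injected OOM was swallowed");
    }
    max_iters
}
```

A real harness would hook the process allocator instead, so that allocations made by code that does not cooperate with the handle are also counted and failed.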

Footnotes

  1. See https://github.com/bytecodealliance/meetings/blob/main/wasmtime/2025/wasmtime-10-23.md and https://github.com/bytecodealliance/meetings/blob/main/wasmtime/2025/wasmtime-11-20.md
