@energystoryhhl
This patch series fixes a slab address alignment issue in the OCKL device
memory allocator that causes GPU page faults and crashes during memory
deallocation.

Problem

The deallocation code uses address masking to locate slab metadata:
saddr = addr & ~0x1fffffUL

This assumes all slabs are 2MB (2^21 bytes) aligned. However, the allocator
never validated this assumption when obtaining slabs from either:

  1. Pre-allocated initial_slabs pool (__ockl_dm_init_v1)
  2. Dynamic allocation via __ockl_devmem_request (obtain_new_slab)

When a misaligned slab is used (e.g., 0x6fd301ade000 instead of 0x6fd301a00000),
deallocation computes the wrong slab address, reads garbage metadata, and
causes array out-of-bounds access leading to GPU page faults.
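
The failure can be reproduced on the host with plain integer arithmetic. The following is a minimal sketch, not the allocator's code: the helper names `slab_base_from_addr` and `is_slab_aligned` and the `SLAB_BYTES`/`SLAB_MASK` macros are hypothetical, and only the mask value and the two example addresses come from the description above.

```c
#include <stdint.h>

/* Assumed slab size: 2MB, matching the 0x1fffff mask quoted above. */
#define SLAB_BYTES UINT64_C(0x200000)
#define SLAB_MASK  (SLAB_BYTES - 1)

/* Same masking step as the deallocation path: clear the low 21 bits,
 * which rounds the address down to a 2MB boundary. */
static uint64_t slab_base_from_addr(uint64_t addr)
{
    return addr & ~SLAB_MASK;
}

/* Nonzero when addr sits exactly on a 2MB boundary. */
static int is_slab_aligned(uint64_t addr)
{
    return (addr & SLAB_MASK) == 0;
}
```

For a pointer 0x100 bytes into a slab that really starts at 0x6fd301ade000, `slab_base_from_addr` yields 0x6fd301a00000, which is 0xde000 bytes before the real slab header, so whatever is read there is not slab metadata.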

Solution

Patch 1/2: Add alignment validation in __ockl_dm_init_v1

  • Verify initial_slabs base address is 2MB aligned
  • If misaligned, disable pre-allocated pool to force devmem_request fallback
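
A rough host-side sketch of the check in Patch 1/2, under stated assumptions: `struct heap_state` and `validate_initial_slabs` are hypothetical stand-ins for the allocator's real heap structure and init code, and only the field names `initial_slabs` and `initial_slabs_end` are taken from the patch description.

```c
#include <stdint.h>

#define SLAB_ALIGN_MASK UINT64_C(0x1fffff)  /* low 21 bits, i.e. 2MB - 1 */

/* Hypothetical stand-in for the allocator's heap state; only the two
 * fields named in the patch description are modelled here. */
struct heap_state {
    uint64_t initial_slabs;      /* next unused pre-allocated slab */
    uint64_t initial_slabs_end;  /* one past the end of the pool   */
};

/* Returns 1 if the pre-allocated pool is usable, 0 if it was disabled.
 * Disabling means making the pool look empty, so later slab requests
 * take the dynamic-allocation path instead. */
static int validate_initial_slabs(struct heap_state *hs)
{
    if ((hs->initial_slabs & SLAB_ALIGN_MASK) != 0) {
        hs->initial_slabs = hs->initial_slabs_end;
        return 0;
    }
    return 1;
}
```

Marking the pool as empty, rather than rounding the base address up, keeps the change small: no slab is ever handed out from a pool whose layout the allocator cannot trust.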

Patch 2/2: Add alignment validation in obtain_new_slab

  • Check pre-allocated slab addresses before use
  • Verify dynamically allocated slab addresses from devmem_request
  • Return NULL on alignment failure instead of causing silent corruption
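
The control flow of Patch 2/2 might look roughly like the sketch below. It is a single-threaded host-side illustration, not the device code: `struct slab_pool`, `obtain_slab_checked`, the function-pointer types, and the `stub_*` helpers are all hypothetical, with the function pointers standing in for `__ockl_devmem_request` and its release counterpart. Whether the real patch falls back to dynamic allocation after a misaligned pre-allocated slab is not stated above; this sketch simply returns 0 in that case.

```c
#include <stdint.h>

#define SLAB_SIZE_2MB UINT64_C(0x200000)
#define SLAB_LOW_BITS (SLAB_SIZE_2MB - 1)

/* Hypothetical view of the pre-allocated pool. */
struct slab_pool {
    uint64_t next;  /* next unused pre-allocated slab */
    uint64_t end;   /* one past the end of the pool   */
};

/* Hypothetical stand-ins for the dynamic request/release calls. */
typedef uint64_t (*request_fn)(uint64_t size);
typedef void (*release_fn)(uint64_t addr);

/* Returns a 2MB-aligned slab address, or 0 on failure. */
static uint64_t obtain_slab_checked(struct slab_pool *pool,
                                    request_fn request, release_fn release)
{
    uint64_t addr;

    if (pool->next < pool->end) {
        /* Pre-allocated path: validate before handing the slab out. */
        addr = pool->next;
        if ((addr & SLAB_LOW_BITS) != 0)
            return 0;
        pool->next += SLAB_SIZE_2MB;
        return addr;
    }

    /* Dynamic path: validate what the request returned. */
    addr = request(SLAB_SIZE_2MB);
    if (addr == 0)
        return 0;
    if ((addr & SLAB_LOW_BITS) != 0) {
        release(addr);  /* give the unusable slab back */
        return 0;
    }
    return addr;
}

/* Stubs for exercising the sketch on the host. */
static uint64_t last_released = 0;
static uint64_t stub_request_aligned(uint64_t size)
{
    (void)size;
    return UINT64_C(0x6fd301a00000);
}
static uint64_t stub_request_misaligned(uint64_t size)
{
    (void)size;
    return UINT64_C(0x6fd301ade000);
}
static void stub_release(uint64_t addr)
{
    last_released = addr;
}
```

With an empty pool and `stub_request_misaligned`, the function releases the slab and returns 0, so the caller sees an ordinary allocation failure instead of a slab whose metadata the deallocation path could never locate.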

Impact

These fixes prevent memory corruption and GPU crashes. Allocations will either
succeed with properly aligned slabs or fail safely with NULL returns, allowing
proper error handling instead of catastrophic failures.

The device memory allocator assumes all slabs are 2MB aligned because
__ockl_dm_dealloc uses address masking (addr & ~0x1fffffUL) to find
slab metadata. If initial_slabs are misaligned, deallocation will
compute wrong slab addresses, read garbage metadata, and cause GPU
page faults.

This patch adds alignment validation in __ockl_dm_init_v1. If the
initial slabs base address is not 2MB aligned, the function disables
the pre-allocated pool by setting initial_slabs equal to
initial_slabs_end, forcing allocation to fall back to devmem_request.

Signed-off-by: Honglei Huang <[email protected]>

The obtain_new_slab function retrieves 2MB slabs from either the
pre-allocated pool or via dynamic allocation through devmem_request.
However, it did not validate that returned addresses are 2MB aligned.

Since __ockl_dm_dealloc uses address masking (addr & ~0x1fffffUL) to
locate slab metadata, misaligned slabs cause incorrect address
calculations, leading to reading garbage metadata and GPU page faults.

This patch adds alignment validation for both pre-allocated and
dynamically allocated slabs. If a slab is misaligned, the function
releases it (for dynamic allocations) and returns 0 to fail safely
rather than causing silent memory corruption.

Signed-off-by: Honglei Huang <[email protected]>