Conversation

@BrendanGraham14 BrendanGraham14 commented Sep 17, 2025

Description

This PR adds forward and backward pass Pallas kernels for fused cross-entropy loss, with blockwise parallelism over the batch, sequence, and vocab dimensions.

A few things to note:

  • I haven't plumbed the batch_block_size and seq_block_size params up to LmConfig.
  • I haven't run the kernels on TPU/GPU, only CPU (with interpret=True).
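
For reference, here is a minimal sketch of the blocking structure (illustrative only, not the actual kernel code in this PR): a forward-pass kernel blocked over batch and sequence, with the vocab axis kept whole and the backward pass omitted for brevity. All names and block sizes are placeholders, and it assumes the newer BlockSpec(block_shape, index_map) signature.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def _xent_fwd_kernel(logits_ref, labels_ref, loss_ref):
    # One (batch_block, seq_block, vocab) tile; compute in float32.
    logits = logits_ref[...].astype(jnp.float32)
    labels = labels_ref[...]
    # Numerically stable log-sum-exp over the vocab axis.
    m = jnp.max(logits, axis=-1, keepdims=True)
    lse = jnp.log(jnp.sum(jnp.exp(logits - m), axis=-1)) + m[..., 0]
    target = jnp.take_along_axis(logits, labels[..., None], axis=-1)[..., 0]
    loss_ref[...] = lse - target


def fused_xent(logits, labels, *, batch_block=1, seq_block=128):
    b, s, v = logits.shape  # assumes b and s divide evenly by the block sizes
    grid = (b // batch_block, s // seq_block)
    return pl.pallas_call(
        _xent_fwd_kernel,
        grid=grid,
        in_specs=[
            pl.BlockSpec((batch_block, seq_block, v), lambda i, j: (i, j, 0)),
            pl.BlockSpec((batch_block, seq_block), lambda i, j: (i, j)),
        ],
        out_specs=pl.BlockSpec((batch_block, seq_block), lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((b, s), jnp.float32),
        interpret=True,  # CPU-only so far, matching the note above
    )(logits, labels)
```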

Unit test coverage

Unit tests are in test_loss.py. I also added coverage for logit_soft_cap and for batch+seq+vocab parallelism.
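
As a rough illustration of what such a check looks like (names here are placeholders, not the actual contents of test_loss.py), the kernel output can be compared against a plain-JAX reference; the reference below includes the logit_soft_cap transform, which the illustrative fused_xent sketch above does not take as an argument:

```python
import jax
import jax.numpy as jnp
import numpy as np


def reference_xent(logits, labels, logit_soft_cap=None):
    logits = logits.astype(jnp.float32)
    if logit_soft_cap is not None:
        # Soft-cap the logits before the softmax.
        logits = logit_soft_cap * jnp.tanh(logits / logit_soft_cap)
    logprobs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.take_along_axis(logprobs, labels[..., None], axis=-1)[..., 0]


def test_fused_matches_reference():
    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    logits = jax.random.normal(k1, (2, 256, 512), dtype=jnp.float32)
    labels = jax.random.randint(k2, (2, 256), 0, 512)
    expected = reference_xent(logits, labels)
    actual = fused_xent(logits, labels)  # the illustrative sketch above
    np.testing.assert_allclose(np.asarray(actual), np.asarray(expected), rtol=1e-5, atol=1e-5)
```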

Member

@dlwh dlwh left a comment


Nice! I'll try to run it and see what the speed looks like for a Llama 3 model!

Member

dlwh commented Sep 21, 2025

OK, I tried it out. There are a bunch of small problems and a larger problem.

The smaller problems are:

  1. The dtypes of the intermediates were bfloat16, and Pallas won't let you store bfloat16 values in float32 refs without an explicit cast. Those computations should be done in float32 anyway, so I fixed that.
  2. TPU requires that the last two block dimensions be multiples of 8 and 128, but the last block was the vocab block, and for Llama 3 with block size 512 it came out to size 63. This can be fixed by moving the vocab block to the front (see the sketch after this list).
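
A hedged sketch of what both fixes amount to (shapes and names are illustrative, not the exact code in the PR, and it assumes the logits tile is laid out with the vocab axis leading):

```python
import jax.numpy as jnp
from jax.experimental import pallas as pl


def _partial_sum_kernel(logits_ref, partial_ref):
    # The incoming tile may be bfloat16; Pallas refuses to store bfloat16
    # values into a float32 ref without an explicit cast, and the reduction
    # should be accumulated in float32 anyway.
    block = logits_ref[...].astype(jnp.float32)
    partial_ref[...] = jnp.sum(jnp.exp(block), axis=0)


# With the vocab block moved to the front of the block shape, the trailing two
# block dims are batch and seq, which can be chosen as multiples of 8 and 128
# to satisfy the TPU tiling requirement on the last two dimensions.
logits_spec = pl.BlockSpec((512, 8, 128), lambda v, b, s: (v, b, s))
```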

The larger problem is that TPU is raising a NotImplementedError from something, possibly the masked loads. I haven't investigated yet.
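
One possible workaround worth checking (an untested guess, not a confirmed fix): avoid the masked load entirely by loading the full block and masking out-of-range vocab entries after the load, e.g.:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def _masked_vocab_block(logits_ref, *, vocab_size, vocab_block):
    # Assumes vocab blocks run along the leading grid axis, as in the sketch above.
    v_idx = pl.program_id(0)
    block = logits_ref[...].astype(jnp.float32)
    # Global vocab index of each row of this tile; broadcasted_iota is used
    # because TPU kernels do not support 1D iota.
    row = v_idx * vocab_block + jax.lax.broadcasted_iota(
        jnp.int32, block.shape, dimension=0
    )
    # Rows past the real vocab size become -inf, so exp() turns them into
    # zeros in a subsequent reduction, with no masked load needed.
    return jnp.where(row < vocab_size, block, -jnp.inf)
```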

Member

@dlwh dlwh left a comment


(dismissing my approval)

@BrendanGraham14
Author

I see - thanks for trying it out. Will get my hands on a TPU to debug.
