Commit 13ccd2f

ch-wanshifangx authored and committed
Fix tensor descriptions in buffer.py
1 parent b6ceecd commit 13ccd2f

File tree

1 file changed (+2, −2 lines)

deep_ep/buffer.py

Lines changed: 2 additions & 2 deletions
@@ -684,9 +684,9 @@ def low_latency_dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor,
     `[num_local_experts, num_max_dispatch_tokens_per_rank * num_ranks, hidden // 512]` with type `torch.int`.
     Notice that, the last-two-dimension of the scaling tensors are in column-major for TMA compatibility.
     with `use_nvfp4=True`: the first element is a `torch.Tensor` shaped as
-    `[num_local_experts, hidden // 2, num_max_dispatch_tokens_per_rank * num_ranks]` with `torch.uint8`.
+    `[num_max_dispatch_tokens_per_rank * num_ranks, hidden // 2, num_local_experts]` with `torch.uint8`.
     The second tensor is the corresponding scales for the first element with shape
-    `[32, 4, num_max_dispatch_tokens_per_rank * num_ranks // 128, 4, hidden // 64, num_local_experts]` with `torch.uint8`.
+    `[32, 4, num_max_dispatch_tokens_per_rank * num_ranks // 128, 4, hidden // 64, num_local_experts]` with `torch.float8_e4m3fn`.
     With `use_fp8=False and use_nvfp4=False`, the result would be a tensor shaped as
     `[num_local_experts, num_max_dispatch_tokens_per_rank * num_ranks, hidden]` with `torch.bfloat16`.
     Moreover, not all tokens are valid, only some of the `num_max_dispatch_tokens_per_rank * num_ranks` are,
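For reference, here is a minimal sketch (not DeepEP code) that allocates dummy tensors matching the shapes and dtypes the corrected docstring describes for each mode. All sizes below are illustrative placeholders, and the two-FP4-values-per-byte packing (hence `hidden // 2`) is an assumption inferred from the `torch.uint8` payload dtype:

```python
import torch

# Illustrative placeholder sizes (not from the commit).
num_local_experts = 4
num_max_dispatch_tokens_per_rank = 128
num_ranks = 2
hidden = 1024
num_tokens = num_max_dispatch_tokens_per_rank * num_ranks  # 256

# use_nvfp4=True: packed payload, assumed two 4-bit values per uint8 byte,
# shaped per the corrected description.
nvfp4_packed = torch.empty(num_tokens, hidden // 2, num_local_experts,
                           dtype=torch.uint8)

# Corresponding scaling factors, now described as torch.float8_e4m3fn.
nvfp4_scales = torch.empty(32, 4, num_tokens // 128, 4, hidden // 64,
                           num_local_experts, dtype=torch.float8_e4m3fn)

# use_fp8=False and use_nvfp4=False: plain BF16 output.
bf16_out = torch.empty(num_local_experts, num_tokens, hidden,
                       dtype=torch.bfloat16)

print(nvfp4_packed.shape)   # torch.Size([256, 512, 4])
print(nvfp4_scales.dtype)   # torch.float8_e4m3fn
print(bf16_out.shape)       # torch.Size([4, 256, 1024])
```

Note that these dummy allocations only mirror the documented shapes; the actual buffers returned by `low_latency_dispatch` are produced internally, and per the docstring the last two dimensions of the scaling tensors are laid out column-major for TMA compatibility.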
