Description
Prerequisites
- I have read the MoE-Infinity documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
GPU: NVIDIA GeForce RTX 3080
NVIDIA Driver Version: 560.35.03
CUDA Toolkit Version (from nvcc -V): 12.1
PyTorch Version: torch 2.5.1+cu121
Python Version: 3.9
Installation Method: Built moe-infinity from source in a clean conda environment, as suggested in the README.
Problem Description
I am consistently encountering a CUDA error: invalid device ordinal when loading a Mixtral model on a single-GPU system, even after ensuring a matched environment (PyTorch built for CUDA 12.1 and system CUDA Toolkit 12.1). The error appears to originate from the low-level Archer C++ backend during model initialization. Standard debugging steps such as setting CUDA_VISIBLE_DEVICES=0 do not resolve the issue.
The model is a locally cached copy of Mixtral, the base checkpoint used by the "mixtral-offloading" demo.
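A minimal visibility check of this kind (a sketch of the sort of check used while debugging, not part of the failing run) confirms that PyTorch itself sees the single RTX 3080, consistent with the "Device count 1" line in the log below, which is why the invalid ordinal looks like it is requested inside the backend rather than caused by a hidden GPU:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # same setting used in the reproduction script

import torch

# With a single RTX 3080 visible, device_count() should report 1 and ordinal 0 should be valid.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA build:", torch.version.cuda)    # expected: 12.1
print("Device count:", torch.cuda.device_count())   # expected: 1
print("Device 0:", torch.cuda.get_device_name(0))   # expected: NVIDIA GeForce RTX 3080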
The bug log is as follows:
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
✅ Using checkpoint: /home/mlabszw/2025_paper_1/AdapMoE/Mixtral-8x7B-Instruct-v0.1-offloading-demo
✅ Using cache path: /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching
🔄 Loading tokenizer...
Tokenizer loaded successfully.
🔄 Loading model with moe-infinity engine...
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /home/mlabszw/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /home/mlabszw/.cache/torch_extensions/py39_cu121/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.34490704536438 seconds
[20251008 22:24:55.852284Z ][540102 ][INFO ]Create ArcherAioThread for thread: 0 - archer_aio_thread.cpp:12
[20251008 22:24:55.852391Z ][540102 ][INFO ]Index file /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching/archer_index does not exist, creating - archer_tensor_handle.cpp:48
[20251008 22:24:55.852395Z ][540102 ][INFO ]Index file size 0 - archer_tensor_handle.cpp:50
[20251008 22:24:55.852507Z ][540102 ][INFO ]Device count 1 - archer_prefetch_handle.cpp:40
[20251008 22:24:55.852511Z ][540102 ][INFO ]Enabled peer access for all devices - archer_prefetch_handle.cpp:63
Creating model from scratch ...
Loading checkpoint files: 0%| | 0/257 [00:00<?, ?it/s]
❌ Error during model loading: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.
ArcherTaskPool destructor
Steps to Reproduce
Create a clean conda environment with Python 3.9.
Install PyTorch for CUDA 12.1: pip install torch --index-url https://download.pytorch.org/whl/cu121
Install dependencies: pip install transformers accelerate sentencepiece
Clone the MoE-Infinity repository and build it from source by running pip install . in the repository root.
Run the following Python script:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer
from moe_infinity import MoE


def run_mixtral_inference():
    """
    Main function to load a Mixtral model using moe-infinity and run inference.
    """
    checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
    cache_path = "model_caching"
    os.makedirs(cache_path, exist_ok=True)
    print(f"✅ Using checkpoint: {checkpoint}")
    print(f"✅ Using cache path: {cache_path}")

    print("\n🔄 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    print("Tokenizer loaded successfully.")

    moe_config = {
        "offload_path": cache_path,      # offload/cache directory (the archer_index is created here)
        "device_memory_ratio": 0.25,     # cap on GPU memory usage for the engine
    }

    print("\n🔄 Loading model with moe-infinity engine...")
    try:
        model = MoE(checkpoint, moe_config)
        print("✅ Model loaded successfully onto device:", model.model.device)
    except Exception as e:
        print(f"❌ Error during model loading: {e}")
        print("\n💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.")
        return

    messages = [
        {"role": "user", "content": "What are the main challenges in developing Mixture-of-Experts models?"},
    ]
    print("\n🔄 Preparing inputs with chat template...")
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # return a mapping so **inputs and inputs["input_ids"] below work
    ).to(model.model.device)

    print("🚀 Generating response...")
    with torch.no_grad():  # disable gradient calculation for inference
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7)
    print("Generation complete.")

    response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print("\n" + "=" * 50)
    print("💬 Model Response:")
    print("=" * 50)
    print(response_text)
    print("=" * 50)


if __name__ == "__main__":
    run_mixtral_inference()
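If it helps with debugging, the run can be repeated with synchronous CUDA error reporting enabled, as the error message itself suggests (a minimal sketch; the environment variables must be set before torch or moe_infinity are imported):
import os

# Force synchronous CUDA error reporting so the stack trace points at the
# actual failing call instead of a later API call (per the error message in the log).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch                 # imported only after the environment variables are set
from moe_infinity import MoE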
Expected Behavior
No response
Additional Context
No response
Usage Statistics (Optional)
No response