
[BUG] CUDA Error: Invalid Device Ordinal on Single GPU Setup (NVIDIA RTX 3080) #67

@ZiweiSong96

Description


System Information

GPU: NVIDIA GeForce RTX 3080

NVIDIA Driver Version: 560.35.03

CUDA Toolkit Version (from nvcc -V): 12.1

PyTorch Version: torch 2.5.1+cu121

Python Version: 3.9

Installation Method: Built moe-infinity from source in a clean conda environment, as suggested in the README (a quick version check from inside this environment is sketched below).
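
For completeness, the version pairing above can be confirmed from inside the conda environment with a short check like the following (a sketch; the exact output is not reproduced here):

```python
# Sketch: confirm the PyTorch / CUDA pairing reported above from inside the environment.
import torch

print("torch version:      ", torch.__version__)             # expected: 2.5.1+cu121
print("torch CUDA version: ", torch.version.cuda)             # expected: 12.1
print("cuDNN version:      ", torch.backends.cudnn.version())
print("GPU name:           ", torch.cuda.get_device_name(0))  # expected: NVIDIA GeForce RTX 3080
```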

Problem Description

I am consistently encountering a CUDA error: invalid device ordinal when loading a Mixtral model on a single-GPU system, even though the environment is matched (PyTorch built for CUDA 12.1 and system CUDA Toolkit 12.1). The error appears to originate in the low-level Archer C++ backend during model initialization, and standard debugging steps such as setting CUDA_VISIBLE_DEVICES=0 do not resolve it.
I am using a locally cached Mixtral model, the base checkpoint from the "mixtral-offloading" project.
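
As a sanity check (a diagnostic sketch, not part of the repro script), plain PyTorch on this machine should only ever see device ordinal 0:

```python
# Diagnostic sketch: check which device ordinals PyTorch can see, independent of moe-infinity.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # same setting used in the repro script

import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:  ", torch.cuda.device_count())     # 1 on a single-GPU machine
print("Current device:", torch.cuda.current_device())   # 0
print("Device name:   ", torch.cuda.get_device_name(0))

# A simple allocation on cuda:0 should succeed if the driver/toolkit pairing is healthy.
x = torch.ones(8, device="cuda:0")
print("Allocated test tensor on:", x.device)
```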

The full log from the failing run is as follows:
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
✅ Using checkpoint: /home/mlabszw/2025_paper_1/AdapMoE/Mixtral-8x7B-Instruct-v0.1-offloading-demo
✅ Using cache path: /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching

🔄 Loading tokenizer...
Tokenizer loaded successfully.

🔄 Loading model with moe-infinity engine...
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /home/mlabszw/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /home/mlabszw/.cache/torch_extensions/py39_cu121/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.34490704536438 seconds
[20251008 22:24:55.852284Z ][540102 ][INFO ]Create ArcherAioThread for thread: 0 - archer_aio_thread.cpp:12
[20251008 22:24:55.852391Z ][540102 ][INFO ]Index file /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching/archer_index does not exist, creating - archer_tensor_handle.cpp:48
[20251008 22:24:55.852395Z ][540102 ][INFO ]Index file size 0 - archer_tensor_handle.cpp:50
[20251008 22:24:55.852507Z ][540102 ][INFO ]Device count 1 - archer_prefetch_handle.cpp:40
[20251008 22:24:55.852511Z ][540102 ][INFO ]Enabled peer access for all devices - archer_prefetch_handle.cpp:63
Creating model from scratch ...
Loading checkpoint files: 0%| | 0/257 [00:00<?, ?it/s]
❌ Error during model loading: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.
ArcherTaskPool destructor
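
Following the suggestion in the error message, the failing step can be rerun in isolation with synchronous kernel launches so the stack trace points at the actual failing call. This is a minimal sketch using the same checkpoint and config keys as the full script below:

```python
# Sketch: rerun only the failing MoE load with synchronous CUDA launches.
# Both environment variables must be set before torch / moe_infinity initialize CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from moe_infinity import MoE

checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"  # locally cached checkpoint
moe_config = {
    "offload_path": "model_caching",
    "device_memory_ratio": 0.25,
}

# The "invalid device ordinal" error is raised somewhere inside this call.
model = MoE(checkpoint, moe_config)
```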

Steps to Reproduce

Create a clean conda environment with Python 3.9.

Install PyTorch for CUDA 12.1: pip install torch --index-url https://download.pytorch.org/whl/cu121

Install dependencies: pip install transformers accelerate sentencepiece

Clone the MoE-Infinity repository and build it from source by running pip install . from the repository root.

Run the following Python script:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer
from moe_infinity import MoE

def run_mixtral_inference():
    """
    Main function to load a Mixtral model using moe-infinity and run inference.
    """
    checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

    cache_path = "model_caching"
    os.makedirs(cache_path, exist_ok=True)

    print(f"✅ Using checkpoint: {checkpoint}")
    print(f"✅ Using cache path: {cache_path}")

    print("\n🔄 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    print("Tokenizer loaded successfully.")

    moe_config = {
        "offload_path": cache_path,
        "device_memory_ratio": 0.25,
    }

    print("\n🔄 Loading model with moe-infinity engine...")

    try:
        model = MoE(checkpoint, moe_config)
        print("✅ Model loaded successfully onto device:", model.model.device)
    except Exception as e:
        print(f"❌ Error during model loading: {e}")
        print("\n💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.")
        return

    messages = [
        {"role": "user", "content": "What are the main challenges in developing Mixture-of-Experts models?"},
    ]

    print("\n🔄 Preparing inputs with chat template...")

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,  # return a dict so **inputs and inputs["input_ids"] work below
        return_tensors="pt",
    ).to(model.model.device)

    print("🚀 Generating response...")
    with torch.no_grad():  # disable gradient tracking for inference
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7)
    print("Generation complete.")

    # Decode only the newly generated tokens (skip the prompt).
    response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

    print("\n" + "="*50)
    print("💬 Model Response:")
    print("="*50)
    print(response_text)
    print("="*50)

if __name__ == "__main__":
    run_mixtral_inference()

Expected Behavior

No response

Additional Context

No response

Usage Statistics (Optional)

No response
