Description
Prerequisites
- I have read the MoE-Infinity documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
GPU: NVIDIA GeForce RTX 3080
NVIDIA Driver Version: 560.35.03
CUDA Toolkit Version (from nvcc -V): 12.1
PyTorch Version: torch 2.5.1+cu121
Python Version: 3.9
Installation Method: Built moe-infinity from source in a clean conda environment, as suggested in the README.
Problem Description
I am consistently encountering a CUDA error: invalid device ordinal when loading a Mixtral model on a single-GPU system, even after ensuring a matched environment (PyTorch built for CUDA 12.1 and system CUDA Toolkit 12.1). The error appears to originate from the low-level Archer C++ backend during model initialization. Standard debugging steps such as setting CUDA_VISIBLE_DEVICES=0 do not resolve the issue.
The model is a locally cached copy of Mixtral, the base checkpoint used by the "mixtral-offloading" demo.
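A minimal visibility check of this kind (a sketch of the sort of check used while debugging, not part of the failing run) confirms that PyTorch itself sees the single RTX 3080, consistent with the "Device count 1" line in the log below, which is why the invalid ordinal looks like it is requested inside the backend rather than caused by a hidden GPU:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # same setting used in the reproduction script

import torch

# With a single RTX 3080 visible, device_count() should report 1 and ordinal 0 should be valid.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA build:", torch.version.cuda)    # expected: 12.1
print("Device count:", torch.cuda.device_count())   # expected: 1
print("Device 0:", torch.cuda.get_device_name(0))   # expected: NVIDIA GeForce RTX 3080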
The bug log is as follows:
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
✅ Using checkpoint: /home/mlabszw/2025_paper_1/AdapMoE/Mixtral-8x7B-Instruct-v0.1-offloading-demo
✅ Using cache path: /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching
🔄 Loading tokenizer...
Tokenizer loaded successfully.
🔄 Loading model with moe-infinity engine...
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /home/mlabszw/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /home/mlabszw/.cache/torch_extensions/py39_cu121/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.34490704536438 seconds
[20251008 22:24:55.852284Z ][540102 ][INFO ]Create ArcherAioThread for thread: 0 - archer_aio_thread.cpp:12
[20251008 22:24:55.852391Z ][540102 ][INFO ]Index file /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching/archer_index does not exist, creating - archer_tensor_handle.cpp:48
[20251008 22:24:55.852395Z ][540102 ][INFO ]Index file size 0 - archer_tensor_handle.cpp:50
[20251008 22:24:55.852507Z ][540102 ][INFO ]Device count 1 - archer_prefetch_handle.cpp:40
[20251008 22:24:55.852511Z ][540102 ][INFO ]Enabled peer access for all devices - archer_prefetch_handle.cpp:63
Creating model from scratch ...
Loading checkpoint files: 0%| | 0/257 [00:00<?, ?it/s]
❌ Error during model loading: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.
ArcherTaskPool destructor
Steps to Reproduce
Create a clean conda environment with Python 3.9.
Install PyTorch for CUDA 12.1: pip install torch --index-url https://download.pytorch.org/whl/cu121
Install dependencies: pip install transformers accelerate sentencepiece
Clone the MoE-Infinity repository and build it from source by running pip install . in the repository root.
Run the following Python script:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer
from moe_infinity import MoE


def run_mixtral_inference():
    """
    Main function to load a Mixtral model using moe-infinity and run inference.
    """
    checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
    cache_path = "model_caching"
    os.makedirs(cache_path, exist_ok=True)
    print(f"✅ Using checkpoint: {checkpoint}")
    print(f"✅ Using cache path: {cache_path}")

    print("\n🔄 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    print("Tokenizer loaded successfully.")

    moe_config = {
        "offload_path": cache_path,      # offload/cache directory (the archer_index is created here)
        "device_memory_ratio": 0.25,     # cap on GPU memory usage for the engine
    }

    print("\n🔄 Loading model with moe-infinity engine...")
    try:
        model = MoE(checkpoint, moe_config)
        print("✅ Model loaded successfully onto device:", model.model.device)
    except Exception as e:
        print(f"❌ Error during model loading: {e}")
        print("\n💡 Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.")
        return

    messages = [
        {"role": "user", "content": "What are the main challenges in developing Mixture-of-Experts models?"},
    ]
    print("\n🔄 Preparing inputs with chat template...")
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # return a mapping so **inputs and inputs["input_ids"] below work
    ).to(model.model.device)

    print("🚀 Generating response...")
    with torch.no_grad():  # disable gradient calculation for inference
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7)
    print("Generation complete.")

    response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print("\n" + "=" * 50)
    print("💬 Model Response:")
    print("=" * 50)
    print(response_text)
    print("=" * 50)


if __name__ == "__main__":
    run_mixtral_inference()
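If it helps with debugging, the run can be repeated with synchronous CUDA error reporting enabled, as the error message itself suggests (a minimal sketch; the environment variables must be set before torch or moe_infinity are imported):
import os

# Force synchronous CUDA error reporting so the stack trace points at the
# actual failing call instead of a later API call (per the error message in the log).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch                 # imported only after the environment variables are set
from moe_infinity import MoE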
Expected Behavior
No response
Additional Context
No response
Usage Statistics (Optional)
No response