[Bug] vLLM Import Error when Running Llama-3.2-1B in FP32 #14314

@aflah02

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Launching meta-llama/Llama-3.2-1B with --dtype float32 fails with ModuleNotFoundError: No module named 'vllm' (full trace in the Reproduction section below).

Reproduction

When I run the command python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.2-1B" --host 0.0.0.0 --dtype float32, I get a vLLM import error.
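
For reference, a quick check that vLLM really is absent from the container (plain Python, nothing sglang-specific):

import importlib.util
print(importlib.util.find_spec("vllm"))  # prints None when vllm is not installed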

Here is the full error trace -

root@158457f6b080:/sgl-workspace/sglang# python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.2-1B" --host 0.0.0.0 --dtype float32
[2025-12-02 19:21:23] INFO model_config.py:881: Upcasting torch.bfloat16 to torch.float32.
[2025-12-02 19:21:23] WARNING server_args.py:1213: Attention backend not explicitly specified. Use fa3 backend by default.
[2025-12-02 19:21:24] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2025-12-02 19:21:24] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B', tokenizer_path='meta-llama/Llama-3.2-1B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='float32', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.86, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=38650796, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='meta-llama/Llama-3.2-1B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, 
enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-12-02 19:21:24] Upcasting torch.bfloat16 to torch.float32.
[2025-12-02 19:21:25] No chat template found, defaulting to 'string' content format
[2025-12-02 19:21:30] Upcasting torch.bfloat16 to torch.float32.
[2025-12-02 19:21:31] Upcasting torch.bfloat16 to torch.float32.
[2025-12-02 19:21:31] Init torch distributed begin.
[W1202 19:21:32.334798467 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W1202 19:21:32.343275226 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-02 19:21:32] Init torch distributed ends. mem usage=0.00 GB
[2025-12-02 19:21:32] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-02 19:21:33] Load weight begin. avail mem=5.97 GB
[2025-12-02 19:21:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2712, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 312, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 324, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 410, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 767, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 594, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 425, in __init__
    self.model = self._init_model(config, quant_config, add_prefix("model", prefix))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 457, in _init_model
    return LlamaModel(config, quant_config=quant_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 300, in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
                                                    ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 577, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 579, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 302, in <lambda>
    lambda idx, prefix: LlamaDecoderLayer(
                        ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 227, in __init__
    self.self_attn = LlamaAttention(
                     ^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 170, in __init__
    self.rotary_emb = get_rope(
                      ^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/rotary_embedding.py", line 2541, in get_rope
    rotary_emb = Llama3RotaryEmbedding(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/rotary_embedding.py", line 932, in __init__
    super().__init__(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/rotary_embedding.py", line 121, in __init__
    from vllm._custom_ops import rotary_embedding
ModuleNotFoundError: No module named 'vllm'

[2025-12-02 19:21:33] Received sigquit from a child process. It usually means the child failed.
Killed
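
The traceback bottoms out in sglang/srt/layers/rotary_embedding.py line 121, which does an unconditional from vllm._custom_ops import rotary_embedding. A guarded import with a fallback would avoid the hard failure; a hypothetical sketch (not sglang's actual code, and the fallback flag is made up):

try:
    from vllm._custom_ops import rotary_embedding  # optional fused kernel from vllm
    _HAS_VLLM_ROPE = True
except ImportError:
    rotary_embedding = None  # fall back to the pure-PyTorch rotary implementation
    _HAS_VLLM_ROPE = False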

I cannot figure out which vLLM version is the right one to install. If I install the latest via pip install -U vllm, I get dependency errors during the installation -

root@158457f6b080:/sgl-workspace/sglang# pip install -U vllm
Collecting vllm
  Downloading vllm-0.11.2-cp38-abi3-manylinux1_x86_64.whl.metadata (18 kB)
Requirement already satisfied: regex in /usr/local/lib/python3.12/dist-packages (from vllm) (2025.11.3)
Collecting cachetools (from vllm)
  Downloading cachetools-6.2.2-py3-none-any.whl.metadata (5.6 kB)
Requirement already satisfied: psutil in /usr/local/lib/python3.12/dist-packages (from vllm) (7.1.3)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.12/dist-packages (from vllm) (0.2.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from vllm) (2.3.5)
Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.32.5)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from vllm) (4.67.1)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting py-cpuinfo (from vllm)
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Requirement already satisfied: transformers<5,>=4.56.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (4.57.1)
Requirement already satisfied: tokenizers>=0.21.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.22.1)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/dist-packages (from vllm) (6.33.1)
Requirement already satisfied: fastapi>=0.115.0 in /usr/local/lib/python3.12/dist-packages (from fastapi[standard]>=0.115.0->vllm) (0.121.2)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.12/dist-packages (from vllm) (3.13.2)
Requirement already satisfied: openai>=1.99.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.6.1)
Requirement already satisfied: pydantic>=2.12.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.12.4)
Requirement already satisfied: prometheus_client>=0.18.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.23.1)
Requirement already satisfied: pillow in /usr/local/lib/python3.12/dist-packages (from vllm) (12.0.0)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: tiktoken>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.12.0)
Collecting lm-format-enforcer==0.11.3 (from vllm)
  Downloading lm_format_enforcer-0.11.3-py3-none-any.whl.metadata (17 kB)
Collecting llguidance<1.4.0,>=1.3.0 (from vllm)
  Downloading llguidance-1.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting outlines_core==0.2.11 (from vllm)
  Downloading outlines_core-0.2.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Requirement already satisfied: diskcache==5.6.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (5.6.3)
Collecting lark==1.2.2 (from vllm)
  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)
Requirement already satisfied: xgrammar==0.1.25 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.1.25)
Requirement already satisfied: typing_extensions>=4.10 in /usr/local/lib/python3.12/dist-packages (from vllm) (4.15.0)
Requirement already satisfied: filelock>=3.16.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (3.20.0)
Requirement already satisfied: partial-json-parser in /usr/local/lib/python3.12/dist-packages (from vllm) (0.2.1.1.post6)
Requirement already satisfied: pyzmq>=25.0.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (27.1.0)
Requirement already satisfied: msgspec in /usr/local/lib/python3.12/dist-packages (from vllm) (0.19.0)
Requirement already satisfied: gguf>=0.13.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.17.1)
Collecting mistral_common>=1.8.5 (from mistral_common[image]>=1.8.5->vllm)
  Downloading mistral_common-1.8.6-py3-none-any.whl.metadata (5.3 kB)
Collecting opencv-python-headless>=4.11.0 (from vllm)
  Downloading opencv_python_headless-4.12.0.88-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (19 kB)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.12/dist-packages (from vllm) (6.0.3)
Requirement already satisfied: six>=1.16.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.17.0)
Requirement already satisfied: setuptools<81.0.0,>=77.0.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (80.9.0)
Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from vllm) (0.8.1)
Requirement already satisfied: compressed-tensors==0.12.2 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.12.2)
Collecting depyf==0.20.0 (from vllm)
  Downloading depyf-0.20.0-py3-none-any.whl.metadata (7.3 kB)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.12/dist-packages (from vllm) (3.1.2)
Collecting watchfiles (from vllm)
  Downloading watchfiles-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Collecting python-json-logger (from vllm)
  Downloading python_json_logger-4.0.0-py3-none-any.whl.metadata (4.0 kB)
Requirement already satisfied: scipy in /usr/local/lib/python3.12/dist-packages (from vllm) (1.16.3)
Requirement already satisfied: ninja in /usr/local/lib/python3.12/dist-packages (from vllm) (1.13.0)
Requirement already satisfied: pybase64 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.4.2)
Collecting cbor2 (from vllm)
  Downloading cbor2-5.7.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Requirement already satisfied: setproctitle in /usr/local/lib/python3.12/dist-packages (from vllm) (1.3.7)
Requirement already satisfied: openai-harmony>=0.0.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.0.4)
Collecting anthropic==0.71.0 (from vllm)
  Downloading anthropic-0.71.0-py3-none-any.whl.metadata (28 kB)
Collecting model-hosting-container-standards<1.0.0 (from vllm)
  Downloading model_hosting_container_standards-0.1.9-py3-none-any.whl.metadata (24 kB)
Collecting numba==0.61.2 (from vllm)
  Downloading numba-0.61.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.8 kB)
Collecting ray>=2.48.0 (from ray[cgraph]>=2.48.0->vllm)
  Downloading ray-2.52.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting torch==2.9.0 (from vllm)
  Downloading torch-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting torchaudio==2.9.0 (from vllm)
  Downloading torchaudio-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.9 kB)
Collecting torchvision==0.24.0 (from vllm)
  Downloading torchvision-0.24.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (5.9 kB)
Collecting xformers==0.0.33.post1 (from vllm)
  Downloading xformers-0.0.33.post1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Requirement already satisfied: flashinfer-python==0.5.2 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.5.2)
Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (4.11.0)
Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from anthropic==0.71.0->vllm) (1.7.0)
Requirement already satisfied: docstring-parser<1,>=0.15 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.17.0)
Requirement already satisfied: httpx<1,>=0.25.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.28.1)
Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.12.0)
Requirement already satisfied: sniffio in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (1.3.1)
Requirement already satisfied: loguru in /usr/local/lib/python3.12/dist-packages (from compressed-tensors==0.12.2->vllm) (0.7.3)
Collecting astor (from depyf==0.20.0->vllm)
  Downloading astor-0.8.1-py2.py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: dill in /usr/local/lib/python3.12/dist-packages (from depyf==0.20.0->vllm) (0.4.0)
Requirement already satisfied: apache-tvm-ffi<0.2,>=0.1 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (0.1.2)
Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (8.3.1)
Requirement already satisfied: nvidia-cudnn-frontend>=1.13.0 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (1.16.0)
Requirement already satisfied: nvidia-cutlass-dsl>=4.2.1 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (4.3.0.dev0)
Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (13.580.82)
Requirement already satisfied: packaging>=24.2 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (25.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.2->vllm) (0.9.0)
Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.12/dist-packages (from lm-format-enforcer==0.11.3->vllm) (0.3.3)
Collecting llvmlite<0.45,>=0.44.0dev0 (from numba==0.61.2->vllm)
  Downloading llvmlite-0.44.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.0 kB)
Collecting numpy (from vllm)
  Downloading numpy-2.2.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (3.5)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (3.1.6)
Requirement already satisfied: fsspec>=0.8.5 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (2025.10.0)
Collecting nvidia-cuda-nvrtc-cu12==12.8.93 (from torch==2.9.0->vllm)
  Downloading nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-runtime-cu12==12.8.90 (from torch==2.9.0->vllm)
  Downloading nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-cupti-cu12==12.8.90 (from torch==2.9.0->vllm)
  Downloading nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (9.10.2.21)
Collecting nvidia-cublas-cu12==12.8.4.1 (from torch==2.9.0->vllm)
  Downloading nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cufft-cu12==11.3.3.83 (from torch==2.9.0->vllm)
  Downloading nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-curand-cu12==10.3.9.90 (from torch==2.9.0->vllm)
  Downloading nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cusolver-cu12==11.7.3.90 (from torch==2.9.0->vllm)
  Downloading nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-cusparse-cu12==12.5.8.93 (from torch==2.9.0->vllm)
  Downloading nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.0->vllm) (0.7.1)
Collecting nvidia-nccl-cu12==2.27.5 (from torch==2.9.0->vllm)
  Downloading nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
Collecting nvidia-nvshmem-cu12==3.3.20 (from torch==2.9.0->vllm)
  Downloading nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.1 kB)
Collecting nvidia-nvtx-cu12==12.8.90 (from torch==2.9.0->vllm)
  Downloading nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvjitlink-cu12==12.8.93 (from torch==2.9.0->vllm)
  Downloading nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cufile-cu12==1.13.1.3 (from torch==2.9.0->vllm)
  Downloading nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting triton==3.5.0 (from torch==2.9.0->vllm)
  Downloading triton-3.5.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.12/dist-packages (from anyio<5,>=3.5.0->anthropic==0.71.0->vllm) (3.11)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (2025.11.12)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (1.0.9)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (0.16.0)
Collecting jmespath (from model-hosting-container-standards<1.0.0->vllm)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Requirement already satisfied: starlette>=0.49.1 in /usr/local/lib/python3.12/dist-packages (from model-hosting-container-standards<1.0.0->vllm) (0.49.3)
Collecting supervisor>=4.2.0 (from model-hosting-container-standards<1.0.0->vllm)
  Downloading supervisor-4.3.0-py2.py3-none-any.whl.metadata (87 kB)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (0.7.0)
Requirement already satisfied: pydantic-core==2.41.5 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (2.41.5)
Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (0.4.2)
Requirement already satisfied: huggingface-hub<1.0,>=0.34.0 in /usr/local/lib/python3.12/dist-packages (from transformers<5,>=4.56.0->vllm) (0.36.0)
Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers<5,>=4.56.0->vllm) (0.6.2)
Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.34.0->transformers<5,>=4.56.0->vllm) (1.2.0)
Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from fastapi>=0.115.0->fastapi[standard]>=0.115.0->vllm) (0.0.4)
Collecting fastapi-cli>=0.0.8 (from fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading fastapi_cli-0.0.16-py3-none-any.whl.metadata (6.4 kB)
Requirement already satisfied: python-multipart>=0.0.18 in /usr/local/lib/python3.12/dist-packages (from fastapi[standard]>=0.115.0->vllm) (0.0.20)
Collecting email-validator>=2.0.0 (from fastapi[standard]>=0.115.0->vllm)
  Downloading email_validator-2.3.0-py3-none-any.whl.metadata (26 kB)
Requirement already satisfied: uvicorn>=0.12.0 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.38.0)
Collecting dnspython>=2.0.0 (from email-validator>=2.0.0->fastapi[standard]>=0.115.0->vllm)
  Downloading dnspython-2.8.0-py3-none-any.whl.metadata (5.7 kB)
Collecting typer>=0.15.1 (from fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading typer-0.20.0-py3-none-any.whl.metadata (16 kB)
Collecting rich-toolkit>=0.14.8 (from fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading rich_toolkit-0.17.0-py3-none-any.whl.metadata (1.0 kB)
Collecting fastapi-cloud-cli>=0.1.1 (from fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading fastapi_cloud_cli-0.5.2-py3-none-any.whl.metadata (3.3 kB)
Collecting rignore>=0.5.1 (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading rignore-0.7.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting sentry-sdk>=2.20.0 (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading sentry_sdk-2.46.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting fastar>=0.5.0 (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading fastar-0.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch==2.9.0->vllm) (3.0.3)
Requirement already satisfied: jsonschema>=4.21.1 in /usr/local/lib/python3.12/dist-packages (from mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (4.25.1)
Collecting pydantic-extra-types>=2.10.5 (from pydantic-extra-types[pycountry]>=2.10.5->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm)
  Downloading pydantic_extra_types-2.10.6-py3-none-any.whl.metadata (4.0 kB)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (25.4.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (0.37.0)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (0.29.0)
Requirement already satisfied: cuda-python>=12.8 in /usr/local/lib/python3.12/dist-packages (from nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.2->vllm) (13.0.3)
Requirement already satisfied: cuda-bindings~=13.0.3 in /usr/local/lib/python3.12/dist-packages (from cuda-python>=12.8->nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.2->vllm) (13.0.3)
Requirement already satisfied: cuda-pathfinder~=1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-python>=12.8->nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.2->vllm) (1.3.2)
Requirement already satisfied: pycountry>=23 in /usr/local/lib/python3.12/dist-packages (from pydantic-extra-types[pycountry]>=2.10.5->mistral_common>=1.8.5->mistral_common[image]>=1.8.5->vllm) (24.6.1)
Collecting click (from flashinfer-python==0.5.2->vllm)
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Collecting msgpack<2.0.0,>=1.0.0 (from ray>=2.48.0->ray[cgraph]>=2.48.0->vllm)
  Downloading msgpack-1.1.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (8.1 kB)
Collecting cupy-cuda12x (from ray[cgraph]>=2.48.0->vllm)
  Downloading cupy_cuda12x-13.6.0-cp312-cp312-manylinux2014_x86_64.whl.metadata (2.4 kB)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests>=2.26.0->vllm) (3.4.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests>=2.26.0->vllm) (2.5.0)
Collecting rich>=13.7.1 (from rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading rich-14.2.0-py3-none-any.whl.metadata (18 kB)
Collecting markdown-it-py>=2.2.0 (from rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading markdown_it_py-4.0.0-py3-none-any.whl.metadata (7.3 kB)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (2.19.2)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch==2.9.0->vllm) (1.3.0)
Collecting shellingham>=1.3.0 (from typer>=0.15.1->fastapi-cli>=0.0.8->fastapi-cli[standard]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading shellingham-1.5.4-py2.py3-none-any.whl.metadata (3.5 kB)
Collecting httptools>=0.6.3 (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading httptools-0.7.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (3.5 kB)
Collecting python-dotenv>=0.13 (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Requirement already satisfied: uvloop>=0.15.1 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.21.0)
Collecting websockets>=10.4 (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm)
  Downloading websockets-15.0.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (2.6.1)
Requirement already satisfied: aiosignal>=1.4.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.4.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.8.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (6.7.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (0.4.1)
Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.22.0)
Collecting fastrlock>=0.5 (from cupy-cuda12x->ray[cgraph]>=2.48.0->vllm)
  Downloading fastrlock-0.8.3-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.metadata (7.7 kB)
Downloading vllm-0.11.2-cp38-abi3-manylinux1_x86_64.whl (370.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 370.3/370.3 MB 62.8 MB/s  0:00:06
Downloading anthropic-0.71.0-py3-none-any.whl (355 kB)
Downloading depyf-0.20.0-py3-none-any.whl (39 kB)
Downloading lark-1.2.2-py3-none-any.whl (111 kB)
Downloading lm_format_enforcer-0.11.3-py3-none-any.whl (45 kB)
Downloading numba-0.61.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 59.1 MB/s  0:00:00
Downloading outlines_core-0.2.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 42.4 MB/s  0:00:00
Downloading torch-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl (899.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 899.7/899.7 MB 40.3 MB/s  0:00:13
Downloading nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl (594.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 594.3/594.3 MB 62.8 MB/s  0:00:07
Downloading nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 64.7 MB/s  0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 80.3 MB/s  0:00:01
Downloading nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 51.3 MB/s  0:00:00
Downloading nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 72.2 MB/s  0:00:02
Downloading nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 51.9 MB/s  0:00:00
Downloading nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 79.6 MB/s  0:00:00
Downloading nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl (267.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 267.5/267.5 MB 74.0 MB/s  0:00:03
Downloading nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (288.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.2/288.2 MB 59.1 MB/s  0:00:04
Downloading nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (322.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.3/322.3 MB 70.3 MB/s  0:00:05
Downloading nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.3/39.3 MB 95.5 MB/s  0:00:00
Downloading nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (124.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.7/124.7 MB 81.4 MB/s  0:00:01
Downloading nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
Downloading torchaudio-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 81.4 MB/s  0:00:00
Downloading torchvision-0.24.0-cp312-cp312-manylinux_2_28_x86_64.whl (8.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 97.9 MB/s  0:00:00
Downloading triton-3.5.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (170.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.5/170.5 MB 58.0 MB/s  0:00:02
Downloading xformers-0.0.33.post1-cp39-abi3-manylinux_2_28_x86_64.whl (122.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 122.9/122.9 MB 64.0 MB/s  0:00:01
Downloading llguidance-1.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 13.0 MB/s  0:00:00
Downloading llvmlite-0.44.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.4/42.4 MB 28.6 MB/s  0:00:01
Downloading model_hosting_container_standards-0.1.9-py3-none-any.whl (102 kB)
Downloading numpy-2.2.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.5/16.5 MB 76.7 MB/s  0:00:00
Downloading email_validator-2.3.0-py3-none-any.whl (35 kB)
Downloading dnspython-2.8.0-py3-none-any.whl (331 kB)
Downloading fastapi_cli-0.0.16-py3-none-any.whl (12 kB)
Downloading fastapi_cloud_cli-0.5.2-py3-none-any.whl (23 kB)
Downloading fastar-0.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (821 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 821.6/821.6 kB 39.6 MB/s  0:00:00
Downloading mistral_common-1.8.6-py3-none-any.whl (6.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 56.1 MB/s  0:00:00
Downloading opencv_python_headless-4.12.0.88-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (54.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 MB 98.8 MB/s  0:00:00
Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl (19 kB)
Downloading pydantic_extra_types-2.10.6-py3-none-any.whl (40 kB)
Downloading ray-2.52.1-cp312-cp312-manylinux2014_x86_64.whl (72.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.3/72.3 MB 68.2 MB/s  0:00:01
Downloading msgpack-1.1.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (427 kB)
Downloading click-8.2.1-py3-none-any.whl (102 kB)
Downloading rich_toolkit-0.17.0-py3-none-any.whl (31 kB)
Downloading rich-14.2.0-py3-none-any.whl (243 kB)
Downloading markdown_it_py-4.0.0-py3-none-any.whl (87 kB)
Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Downloading rignore-0.7.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (959 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 959.8/959.8 kB 47.9 MB/s  0:00:00
Downloading sentry_sdk-2.46.0-py2.py3-none-any.whl (406 kB)
Downloading supervisor-4.3.0-py2.py3-none-any.whl (320 kB)
Downloading typer-0.20.0-py3-none-any.whl (47 kB)
Downloading shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)
Downloading httptools-0.7.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (517 kB)
Downloading python_dotenv-1.2.1-py3-none-any.whl (21 kB)
Downloading watchfiles-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (456 kB)
Downloading websockets-15.0.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (182 kB)
Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Downloading blake3-1.0.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (388 kB)
Downloading cachetools-6.2.2-py3-none-any.whl (11 kB)
Downloading cbor2-5.7.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (285 kB)
Downloading cupy_cuda12x-13.6.0-cp312-cp312-manylinux2014_x86_64.whl (112.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.9/112.9 MB 70.7 MB/s  0:00:01
Downloading fastrlock-0.8.3-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (53 kB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)
Downloading python_json_logger-4.0.0-py3-none-any.whl (15 kB)
WARNING: Error parsing dependencies of devscripts: Invalid version: '2.22.1ubuntu1'
Installing collected packages: supervisor, py-cpuinfo, fastrlock, websockets, triton, shellingham, sentry-sdk, rignore, python-json-logger, python-dotenv, outlines_core, nvidia-nvtx-cu12, nvidia-nvshmem-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, msgpack, mdurl, llvmlite, llguidance, lark, jmespath, httptools, fastar, dnspython, click, cbor2, cachetools, blake3, astor, watchfiles, opencv-python-headless, nvidia-cusparse-cu12, nvidia-cufft-cu12, numba, markdown-it-py, email-validator, depyf, cupy-cuda12x, rich, pydantic-extra-types, prometheus-fastapi-instrumentator, nvidia-cusolver-cu12, lm-format-enforcer, anthropic, typer, torch, rich-toolkit, ray, model-hosting-container-standards, xformers, torchvision, torchaudio, mistral_common, fastapi-cloud-cli, fastapi-cli, vllm
  Attempting uninstall: triton
    Found existing installation: triton 3.4.0
    Uninstalling triton-3.4.0:
      Successfully uninstalled triton-3.4.0
  Attempting uninstall: outlines_core
    Found existing installation: outlines_core 0.1.26
    Uninstalling outlines_core-0.1.26:
      Successfully uninstalled outlines_core-0.1.26
  Attempting uninstall: nvidia-nvtx-cu12
    Found existing installation: nvidia-nvtx-cu12 12.9.79
    Uninstalling nvidia-nvtx-cu12-12.9.79:
      Successfully uninstalled nvidia-nvtx-cu12-12.9.79
  Attempting uninstall: nvidia-nvjitlink-cu12
    Found existing installation: nvidia-nvjitlink-cu12 12.9.86
    Uninstalling nvidia-nvjitlink-cu12-12.9.86:
      Successfully uninstalled nvidia-nvjitlink-cu12-12.9.86
  Attempting uninstall: nvidia-nccl-cu12
    Found existing installation: nvidia-nccl-cu12 2.27.3
    Uninstalling nvidia-nccl-cu12-2.27.3:
      Successfully uninstalled nvidia-nccl-cu12-2.27.3
  Attempting uninstall: nvidia-curand-cu12
    Found existing installation: nvidia-curand-cu12 10.3.10.19
    Uninstalling nvidia-curand-cu12-10.3.10.19:
      Successfully uninstalled nvidia-curand-cu12-10.3.10.19
  Attempting uninstall: nvidia-cufile-cu12
    Found existing installation: nvidia-cufile-cu12 1.14.1.1
    Uninstalling nvidia-cufile-cu12-1.14.1.1:
      Successfully uninstalled nvidia-cufile-cu12-1.14.1.1
  Attempting uninstall: nvidia-cuda-runtime-cu12
    Found existing installation: nvidia-cuda-runtime-cu12 12.9.79
    Uninstalling nvidia-cuda-runtime-cu12-12.9.79:
      Successfully uninstalled nvidia-cuda-runtime-cu12-12.9.79
  Attempting uninstall: nvidia-cuda-nvrtc-cu12
    Found existing installation: nvidia-cuda-nvrtc-cu12 12.9.86
    Uninstalling nvidia-cuda-nvrtc-cu12-12.9.86:
      Successfully uninstalled nvidia-cuda-nvrtc-cu12-12.9.86
  Attempting uninstall: nvidia-cuda-cupti-cu12
    Found existing installation: nvidia-cuda-cupti-cu12 12.9.79
    Uninstalling nvidia-cuda-cupti-cu12-12.9.79:
      Successfully uninstalled nvidia-cuda-cupti-cu12-12.9.79
  Attempting uninstall: nvidia-cublas-cu12
    Found existing installation: nvidia-cublas-cu12 12.9.1.4
    Uninstalling nvidia-cublas-cu12-12.9.1.4:
      Successfully uninstalled nvidia-cublas-cu12-12.9.1.4
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.5
    Uninstalling numpy-2.3.5:
      Successfully uninstalled numpy-2.3.5
  Attempting uninstall: llguidance
    Found existing installation: llguidance 0.7.30
    Uninstalling llguidance-0.7.30:
      Successfully uninstalled llguidance-0.7.30
  Attempting uninstall: lark
    Found existing installation: lark 1.3.1
    Uninstalling lark-1.3.1:
      Successfully uninstalled lark-1.3.1
  Attempting uninstall: click
    Found existing installation: click 8.3.1
    Uninstalling click-8.3.1:
      Successfully uninstalled click-8.3.1
  Attempting uninstall: nvidia-cusparse-cu12
    Found existing installation: nvidia-cusparse-cu12 12.5.10.65
    Uninstalling nvidia-cusparse-cu12-12.5.10.65:
      Successfully uninstalled nvidia-cusparse-cu12-12.5.10.65
  Attempting uninstall: nvidia-cufft-cu12
    Found existing installation: nvidia-cufft-cu12 11.4.1.4
    Uninstalling nvidia-cufft-cu12-11.4.1.4:
      Successfully uninstalled nvidia-cufft-cu12-11.4.1.4
  Attempting uninstall: nvidia-cusolver-cu12
    Found existing installation: nvidia-cusolver-cu12 11.7.5.82
    Uninstalling nvidia-cusolver-cu12-11.7.5.82:
      Successfully uninstalled nvidia-cusolver-cu12-11.7.5.82
  Attempting uninstall: anthropic
    Found existing installation: anthropic 0.73.0
    Uninstalling anthropic-0.73.0:
      Successfully uninstalled anthropic-0.73.0
  Attempting uninstall: torch
    Found existing installation: torch 2.8.0+cu129
    Uninstalling torch-2.8.0+cu129:
      Successfully uninstalled torch-2.8.0+cu129
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.23.0+cu129
    Uninstalling torchvision-0.23.0+cu129:
      Successfully uninstalled torchvision-0.23.0+cu129
  Attempting uninstall: torchaudio
    Found existing installation: torchaudio 2.8.0+cu129
    Uninstalling torchaudio-2.8.0+cu129:
      Successfully uninstalled torchaudio-2.8.0+cu129
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
outlines 0.1.11 requires outlines_core==0.1.26, but you have outlines-core 0.2.11 which is incompatible.
sglang 0.5.5.post3 requires llguidance<0.8.0,>=0.7.11, but you have llguidance 1.3.0 which is incompatible.
sglang 0.5.5.post3 requires nvidia-cutlass-dsl==4.2.1, but you have nvidia-cutlass-dsl 4.3.0.dev0 which is incompatible.
sglang 0.5.5.post3 requires torch==2.8.0, but you have torch 2.9.0 which is incompatible.
sglang 0.5.5.post3 requires torchaudio==2.8.0, but you have torchaudio 2.9.0 which is incompatible.
Successfully installed anthropic-0.71.0 astor-0.8.1 blake3-1.0.8 cachetools-6.2.2 cbor2-5.7.1 click-8.2.1 cupy-cuda12x-13.6.0 depyf-0.20.0 dnspython-2.8.0 email-validator-2.3.0 fastapi-cli-0.0.16 fastapi-cloud-cli-0.5.2 fastar-0.8.0 fastrlock-0.8.3 httptools-0.7.1 jmespath-1.0.1 lark-1.2.2 llguidance-1.3.0 llvmlite-0.44.0 lm-format-enforcer-0.11.3 markdown-it-py-4.0.0 mdurl-0.1.2 mistral_common-1.8.6 model-hosting-container-standards-0.1.9 msgpack-1.1.2 numba-0.61.2 numpy-2.2.6 nvidia-cublas-cu12-12.8.4.1 nvidia-cuda-cupti-cu12-12.8.90 nvidia-cuda-nvrtc-cu12-12.8.93 nvidia-cuda-runtime-cu12-12.8.90 nvidia-cufft-cu12-11.3.3.83 nvidia-cufile-cu12-1.13.1.3 nvidia-curand-cu12-10.3.9.90 nvidia-cusolver-cu12-11.7.3.90 nvidia-cusparse-cu12-12.5.8.93 nvidia-nccl-cu12-2.27.5 nvidia-nvjitlink-cu12-12.8.93 nvidia-nvshmem-cu12-3.3.20 nvidia-nvtx-cu12-12.8.90 opencv-python-headless-4.12.0.88 outlines_core-0.2.11 prometheus-fastapi-instrumentator-7.1.0 py-cpuinfo-9.0.0 pydantic-extra-types-2.10.6 python-dotenv-1.2.1 python-json-logger-4.0.0 ray-2.52.1 rich-14.2.0 rich-toolkit-0.17.0 rignore-0.7.6 sentry-sdk-2.46.0 shellingham-1.5.4 supervisor-4.3.0 torch-2.9.0 torchaudio-2.9.0 torchvision-0.24.0 triton-3.5.0 typer-0.20.0 vllm-0.11.2 watchfiles-1.1.1 websockets-15.0.1 xformers-0.0.33.post1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
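
From the resolver output above, vllm 0.11.2 pins torch 2.9.0 (plus matching torchaudio/torchvision), which is what drags in the upgrade. A small way to read those pins back after the install (a sketch, assuming the pins are recorded in vllm's wheel metadata):

import importlib.metadata as md
reqs = md.requires("vllm") or []
print([r for r in reqs if r.lower().startswith("torch")])  # torch / torchaudio / torchvision pins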

If I then rerun the launch command after installing vLLM as above, I get this error -

root@158457f6b080:/sgl-workspace/sglang# python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.2-1B" --host 0.0.0.0 --dtype float32
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 24, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 4106, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3707, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 281, in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 596, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 845, in _handle_gpu_memory_settings
    model_config = self.get_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3728, in get_model_config
    from sglang.srt.configs.model_config import ModelConfig
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 26, in <module>
    from sglang.srt.layers.quantization import QUANTIZATION_METHODS
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/__init__.py", line 19, in <module>
    from sglang.srt.layers.quantization.auto_round import AutoRoundConfig
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/auto_round.py", line 12, in <module>
    from sglang.srt.layers.quantization.utils import get_scalar_types
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/utils.py", line 13, in <module>
    from sglang.srt.layers.quantization.fp8_kernel import scaled_fp8_quant
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8_kernel.py", line 46, in <module>
    from sgl_kernel import sgl_per_tensor_quant_fp8, sgl_per_token_quant_fp8
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/__init__.py", line 5, in <module>
    common_ops = _load_architecture_specific_ops()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/load_utils.py", line 188, in _load_architecture_specific_ops
    raise ImportError(error_msg)
ImportError: 
[sgl_kernel] CRITICAL: Could not load any common_ops library!

Attempted locations:
1. Architecture-specific pattern: /usr/local/lib/python3.12/dist-packages/sgl_kernel/sm90/common_ops.* - found files: ['/usr/local/lib/python3.12/dist-packages/sgl_kernel/sm90/common_ops.abi3.so']
2. Fallback pattern: /usr/local/lib/python3.12/dist-packages/sgl_kernel/common_ops.* - found files: []
3. Standard Python import: common_ops - failed

GPU Info:
- Compute capability: 90
- Expected variant: SM90 (Hopper/H100 with fast math optimization)

Please ensure sgl_kernel is properly installed with:
pip install --upgrade sgl_kernel

Error details from previous import attempts:
- ImportError: /usr/local/lib/python3.12/dist-packages/sgl_kernel/sm90/common_ops.abi3.so: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_
- ModuleNotFoundError: No module named 'common_ops'

If vLLM is a required dependency, I think it should be included in the image. Since it is not, could you please share which vLLM version I should install?
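
For completeness, the undefined symbol in common_ops.abi3.so looks like a torch ABI mismatch: the image's sgl_kernel was presumably built against the original torch 2.8.0, which the vllm install replaced with 2.9.0. A minimal snippet to capture the versions in play (the distribution names are my assumption):

import importlib.metadata as md
for pkg in ("sglang", "sgl-kernel", "torch", "torchvision", "torchaudio", "vllm"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")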

Environment

I am using the docker image via this command -

docker run --gpus all -it \
    --shm-size 32g \
    --env "HF_TOKEN=TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    bash

The exact image digest is -

$ docker image inspect --format='{{.RepoDigests}}' lmsysorg/sglang:latest
[lmsysorg/sglang@sha256:97fe3876fd7f0d27c72c79f612b024e08e9ac4ffdc52b5e4f81b7b53e1f3e819]
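
For completeness, sglang's environment checker can also be run inside the container (assuming the image ships this helper):

python3 -m sglang.check_env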
