[BUG] Port overflow error with high data parallelism in vLLM server #652

@HsiaoTsan

Description

Bug Description

When using high data parallelism with vLLM (e.g., allocation_mode=vllm:d12t1), the vLLM server launcher crashes with a port overflow error on multi-node setups.

Error Message

  OverflowError: bind(): port must be 0-65535.

  Full traceback:
  File "/home/workspace/AReaL/areal/launcher/vllm_server.py", line 125, in run
      server_port, dist_init_port = find_free_ports(2, port_range)
  File "/home/workspace/AReaL/areal/utils/network.py", line 61, in find_free_ports
      if is_port_free(port):
  File "/home/workspace/AReaL/areal/utils/network.py", line 88, in is_port_free
      sock.bind(("", port))
  OverflowError: bind(): port must be 0-65535.
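
The OverflowError is raised by CPython itself: socket.bind rejects any port outside 0-65535 before making a system call. A minimal standalone repro (the port value 70000 is just an example; anything above 65535 behaves the same):

  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  try:
      sock.bind(("", 70000))  # rejected before the OS is consulted
  except OverflowError as e:
      print(e)  # same message as in the traceback above
  finally:
      sock.close()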

Environment

  • AReaL version: main branch (latest)
  • Python version: 3.12
  • Allocation mode: vllm:d12p1t1+d4p1t1
  • Cluster configuration:
    • 2 nodes
    • 8 GPUs per node (16 total)
    • 64GB GPU memory per GPU

Steps to Reproduce

  1. Configure a cluster with 2 nodes, 8 GPUs per node
  2. Set allocation mode with high data parallelism:
    allocation_mode=vllm:d12p1t1+d4p1t1
  3. Launch the training job
  4. Observe the error when launching the 12th vLLM server (on Node 1)

Commit ID
8b54f78

Expected Behavior

All 12 vLLM servers should launch successfully with port allocations within the valid range [0-65535] across all nodes.

Actual Behavior

The 12th vLLM server (on Node 1) fails because the calculated port range exceeds 65535:

  • Node 0: servers 0-7 use ports [10000-50000] ✅
  • Node 1: servers 8-11 attempt to use ports [50000-70000] ❌ (server 11's range, (65000, 70000), crosses 65535)

Root Cause

The server_idx_offset calculation in areal/launcher/vllm_server.py:195 uses global GPU indices instead of node-local indices:

  # Current buggy code:
  server_idx_offset = min(list(map(int, visible))) // gpus_per_server
  # For Node 1 with GPUs 8-15: offset = 8 (global index)

This causes port calculations to overflow on subsequent nodes:

  port_range = (11 * 5000 + 10000, 12 * 5000 + 10000)
             = (65000, 70000)  # ❌ Exceeds 65535!
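
A short standalone sketch (hard-coding the 5000-port stride and 10000 base port from the arithmetic above) shows exactly where the global indexing breaks:

  # Port ranges under the buggy global indexing.
  PORTS_PER_SERVER, BASE_PORT, MAX_PORT = 5000, 10000, 65535

  for server_idx in range(12):  # 8 servers on Node 0, 4 on Node 1
      lo = server_idx * PORTS_PER_SERVER + BASE_PORT
      hi = (server_idx + 1) * PORTS_PER_SERVER + BASE_PORT
      status = "overflow!" if hi > MAX_PORT else "ok"
      print(f"server {server_idx:2d}: ({lo}, {hi}) {status}")
  # server 10: (60000, 65000) ok
  # server 11: (65000, 70000) overflow!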

Proposed Solution

Add a modulo operation to ensure server_idx_offset is node-local (0 to n_servers_per_node - 1):

  # Fixed code:
  server_idx_offset = (min(list(map(int, visible))) // gpus_per_server) % n_servers_per_node
  # For Node 1 with GPUs 8-15: offset = 0 (node-local)

This keeps port ranges within bounds on all nodes:

  port_range = (3 * 5000 + 10000, 4 * 5000 + 10000)
             = (25000, 30000)  # ✅ Valid!
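
Under the same assumptions (8 GPUs per node, one GPU per server, so n_servers_per_node = 8), a quick check confirms every range stays in bounds once the offset is node-local:

  # Port ranges with the node-local offset applied.
  PORTS_PER_SERVER, BASE_PORT, MAX_PORT = 5000, 10000, 65535
  N_SERVERS_PER_NODE = 8  # 8 GPUs per node, 1 GPU per server

  for global_idx in range(12):
      local_idx = global_idx % N_SERVERS_PER_NODE  # node-local server index
      lo = local_idx * PORTS_PER_SERVER + BASE_PORT
      hi = (local_idx + 1) * PORTS_PER_SERVER + BASE_PORT
      assert hi <= MAX_PORT, f"server {global_idx} would overflow"
      print(f"server {global_idx:2d} -> local {local_idx}: ({lo}, {hi})")
  # server 11 -> local 3: (25000, 30000)

Note that Node 0 and Node 1 end up reusing the same port ranges, which is safe: ports are a per-host resource, so only servers sharing a node need disjoint ranges.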

Impact

This bug affects any configuration with:

  • High data parallelism (more single-GPU servers than fit on one node, e.g., d9+ on 8-GPU nodes)
  • Multi-node setups (2+ nodes)
  • Small tensor parallelism (t1 or t2)

Affected allocation modes:

  • vllm:d12t1, vllm:d16t1, etc. (vllm:d10t1 also computes global rather than node-local ranges on Node 1, though those happen to stay under 65535)
  • Any mode with 12 or more servers in total: server index 11 is the first whose 5000-port range crosses 65535, as the sketch below confirms
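
The 12-server ceiling falls out of the stride arithmetic: server index i gets the range (i * 5000 + 10000, (i + 1) * 5000 + 10000), and the largest index whose upper bound still fits is 10:

  # Largest global server index whose full 5000-port range fits in [0, 65535]:
  # (idx + 1) * 5000 + 10000 <= 65535  =>  idx <= 10
  print((65535 - 10000) // 5000 - 1)  # -> 10, so the 12th server (index 11) crashes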

Workarounds

Until fixed, users can:

  1. Reduce data parallelism: vllm:d8t1 instead of vllm:d12t1
  2. Increase tensor parallelism: vllm:d6t2 instead of vllm:d12t1 (fewer servers)
  3. Use single-node setups (if possible)

Additional Context

I have a fix ready with:

  • ✅ Bug fix (modulo operation for node-local indexing)
  • ✅ 20 comprehensive unit tests (all passing)
  • ✅ Full backward compatibility
  • ✅ Validation for d12, d16, and various tensor parallelism configs

I'll submit a pull request shortly.

Related Files

  • areal/launcher/vllm_server.py (line 195)
  • areal/launcher/sglang_server.py (its launcher may share the same offset logic and therefore the same issue)
  • areal/utils/network.py (port allocation utilities)
