Bug Description
When using high data parallelism with vLLM (e.g., allocation_mode=vllm:d12t1), the vLLM server launcher crashes with a port overflow error on multi-node setups.
Error Message
OverflowError: bind(): port must be 0-65535.
Full traceback:
File "/home/workspace/AReaL/areal/launcher/vllm_server.py", line 125, in run
server_port, dist_init_port = find_free_ports(2, port_range)
File "/home/workspace/AReaL/areal/utils/network.py", line 61, in find_free_ports
if is_port_free(port):
File "/home/workspace/AReaL/areal/utils/network.py", line 88, in is_port_free
sock.bind(("", port))
OverflowError: bind(): port must be 0-65535.
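For reference, the OverflowError is raised by CPython's socket layer itself, which rejects out-of-range ports before making any OS call. A minimal standalone reproduction, independent of AReaL:

import socket

# CPython validates the port argument up front, so binding to any port
# above 65535 raises OverflowError immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("", 70000))  # a port from the overflowing range described below
except OverflowError as exc:
    print(exc)  # same message as in the traceback
finally:
    sock.close()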
Environment
- AReaL version: main branch (latest)
- Python version: 3.12
- Allocation mode: vllm:d12p1t1+d4p1t1
- Cluster configuration:
- 2 nodes
- 8 GPUs per node (16 total)
- 64GB GPU memory per GPU
Steps to Reproduce
- Configure a cluster with 2 nodes, 8 GPUs per node
- Set the allocation mode with high data parallelism: allocation_mode=vllm:d12p1t1+d4p1t1
- Launch the training job
- Observe the error when launching the 12th vLLM server (on Node 1)
Commit ID
8b54f78
Expected Behavior
All 12 vLLM servers should launch successfully with port allocations within the valid range [0-65535] across all nodes.
Actual Behavior
The 12th vLLM server (on Node 1) fails because the calculated port range exceeds 65535:
- Node 0: servers 0-7 use ports [10000-50000] ✅
- Node 1: servers 8-11 attempt to use ports [50000-70000] ❌ (server 11's range [65000-70000] overflows 65535)
Root Cause
The server_idx_offset calculation in areal/launcher/vllm_server.py:195 uses global GPU indices instead of node-local indices:
# Current buggy code:
server_idx_offset = min(list(map(int, visible))) // gpus_per_server
# For Node 1 with GPUs 8-15: offset = 8 (global index)
This causes the port calculation to overflow on later nodes; for global server index 11 (the 12th server, on Node 1):
port_range = (11 * 5000 + 10000, 12 * 5000 + 10000)
           = (65000, 70000)  # ❌ Exceeds 65535!
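For concreteness, a minimal sketch of the arithmetic implied by these numbers (the 10000 base port and the 5000-port stride per server are taken from this report, not read from the launcher source):

def port_range_for(global_idx, ports_per_server=5000, base=10000):
    # Each server gets a contiguous window of ports keyed by its index
    return (global_idx * ports_per_server + base,
            (global_idx + 1) * ports_per_server + base)

# Node 1's servers under the buggy global offset (indices 8-11):
for global_idx in range(8, 12):
    print(global_idx, port_range_for(global_idx))
# 8 (50000, 55000) ... 11 (65000, 70000)  <- crosses 65535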
Proposed Solution
Add a modulo operation so that server_idx_offset is node-local (0 to n_servers_per_node - 1):
# Fixed code:
server_idx_offset = (min(list(map(int, visible))) // gpus_per_server) % n_servers_per_node
# For Node 1 with GPUs 8-15: offset = 0 (node-local)
This keeps port ranges within bounds on all nodes; for Node 1's last server (local index 3):
port_range = (3 * 5000 + 10000, 4 * 5000 + 10000)
= (25000, 30000) # ✅ Valid!
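As a sanity check, a sketch of the corrected offset on Node 1; variable names follow the snippet above, and the CUDA_VISIBLE_DEVICES-style values are illustrative:

visible = [str(i) for i in range(8, 16)]  # Node 1's GPUs 8-15, per the comment above
gpus_per_server = 1
n_servers_per_node = 8  # 8 GPUs per node / 1 GPU per server

server_idx_offset = (min(map(int, visible)) // gpus_per_server) % n_servers_per_node
print(server_idx_offset)  # 0 (node-local) instead of 8 (global)

# Node 1's four servers now reuse the low end of the port space:
for local_idx in range(4):
    idx = server_idx_offset + local_idx
    print((idx * 5000 + 10000, (idx + 1) * 5000 + 10000))
# (10000, 15000) ... (25000, 30000), all well below 65535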
Impact
This bug affects any configuration with:
- High data parallelism (12 or more single-GPU servers, so that a global server index reaches 11)
- Multi-node setups (2+ nodes)
- Small tensor parallelism (t1 or t2)
Affected allocation modes:
- vllm:d12t1, vllm:d16t1, etc.
- Any mode where a server's global index reaches 11 or higher (e.g., 12+ total vLLM servers spread across nodes); a quick check is sketched below
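A pre-flight check for which configurations hit the limit under this scheme (hypothetical helper, same assumed 10000 base + 5000-per-server layout as above):

def overflows(n_servers, ports_per_server=5000, base=10000, max_port=65535):
    # The last server's window ends at n_servers * stride + base
    return n_servers * ports_per_server + base > max_port

for dp in (8, 10, 11, 12, 16):
    print(f"d{dp}t1:", "overflows" if overflows(dp) else "ok")
# d8t1, d10t1, d11t1 ok; d12t1 and d16t1 overflow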
Workarounds
Until fixed, users can:
- Reduce data parallelism: vllm:d8t1 instead of vllm:d12t1
- Increase tensor parallelism: vllm:d6t2 instead of vllm:d12t1 (fewer servers)
- Use single-node setups (if possible)
Additional Context
I have a fix ready with:
- ✅ Bug fix (modulo operation for node-local indexing)
- ✅ 20 comprehensive unit tests (all passing)
- ✅ Full backward compatibility
- ✅ Validation for d12, d16, and various tensor parallelism configs
I'll submit a pull request shortly.
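For illustration, one plausible shape for such a test; compute_server_idx_offset is a hypothetical helper extracted for testability, not the project's actual API:

def compute_server_idx_offset(visible, gpus_per_server, n_servers_per_node):
    return (min(map(int, visible)) // gpus_per_server) % n_servers_per_node

def test_offset_is_node_local_on_second_node():
    # Node 1 of a 2-node, 8-GPU-per-node cluster, one GPU per server
    assert compute_server_idx_offset([str(i) for i in range(8, 16)], 1, 8) == 0

def test_offset_unchanged_on_node_zero():
    # Backward compatibility: Node 0 behaves exactly as before the fix
    assert compute_server_idx_offset([str(i) for i in range(8)], 1, 8) == 0

def test_offset_with_tensor_parallelism():
    # t2 on Node 1: four 2-GPU servers, offset must still be node-local
    assert compute_server_idx_offset([str(i) for i in range(8, 16)], 2, 4) == 0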
Related Files
- areal/launcher/vllm_server.py (line 195)
- areal/launcher/sglang_server.py (same issue may exist)
- areal/utils/network.py (port allocation utilities)