Bug Description
When using high data parallelism with vLLM (e.g., allocation_mode=vllm:d12t1), the vLLM server launcher crashes with a port overflow error on multi-node setups.
Error Message
OverflowError: bind(): port must be 0-65535.
Full traceback:
File "/home/workspace/AReaL/areal/launcher/vllm_server.py", line 125, in run
server_port, dist_init_port = find_free_ports(2, port_range)
File "/home/workspace/AReaL/areal/utils/network.py", line 61, in find_free_ports
if is_port_free(port):
File "/home/workspace/AReaL/areal/utils/network.py", line 88, in is_port_free
sock.bind(("", port))
OverflowError: bind(): port must be 0-65535.
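For reference, the OverflowError is raised by CPython's socket layer itself, which rejects out-of-range ports before making any OS call. A minimal standalone reproduction, independent of AReaL:

import socket

# CPython validates the port argument up front, so binding to any port
# above 65535 raises OverflowError immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("", 70000))  # a port from the overflowing range described below
except OverflowError as exc:
    print(exc)  # same message as in the traceback
finally:
    sock.close()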
Environment
- AReaL version: main branch (latest)
- Python version: 3.12
- Allocation mode: vllm:d12p1t1+d4p1t1
- Cluster configuration:
- 2 nodes
- 8 GPUs per node (16 total)
- 64GB GPU memory per GPU
Steps to Reproduce
- Configure a cluster with 2 nodes, 8 GPUs per node
- Set the allocation mode with high data parallelism: allocation_mode=vllm:d12p1t1+d4p1t1
- Launch the training job
- Observe the error when launching the 12th vLLM server (on Node 1)
Commit ID
8b54f78
Expected Behavior
All 12 vLLM servers should launch successfully with port allocations within the valid range [0-65535] across all nodes.
Actual Behavior
The 12th vLLM server (on Node 1) fails because the calculated port range exceeds 65535:
- Node 0: servers 0-7 use ports [10000-50000] ✅
- Node 1: servers 8-11 attempt to use ports [50000-70000] ❌ (server 11's range [65000-70000] overflows 65535)
Root Cause
The server_idx_offset calculation in areal/launcher/vllm_server.py:195 uses global GPU indices instead of node-local indices:
# Current buggy code:
server_idx_offset = min(list(map(int, visible))) // gpus_per_server
# For Node 1 with GPUs 8-15: offset = 8 (global index)
This causes the port calculation to overflow on later nodes; for global server index 11 (the 12th server, on Node 1):
port_range = (11 * 5000 + 10000, 12 * 5000 + 10000)
           = (65000, 70000)  # ❌ Exceeds 65535!
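For concreteness, a minimal sketch of the arithmetic implied by these numbers (the 10000 base port and the 5000-port stride per server are taken from this report, not read from the launcher source):

def port_range_for(global_idx, ports_per_server=5000, base=10000):
    # Each server gets a contiguous window of ports keyed by its index
    return (global_idx * ports_per_server + base,
            (global_idx + 1) * ports_per_server + base)

# Node 1's servers under the buggy global offset (indices 8-11):
for global_idx in range(8, 12):
    print(global_idx, port_range_for(global_idx))
# 8 (50000, 55000) ... 11 (65000, 70000)  <- crosses 65535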
Proposed Solution
Add a modulo operation so that server_idx_offset is node-local (0 to n_servers_per_node - 1):
# Fixed code:
server_idx_offset = (min(list(map(int, visible))) // gpus_per_server) % n_servers_per_node
# For Node 1 with GPUs 8-15: offset = 0 (node-local)
This keeps port ranges within bounds on all nodes; for Node 1's last server (local index 3):
port_range = (3 * 5000 + 10000, 4 * 5000 + 10000)
= (25000, 30000) # ✅ Valid!
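As a sanity check, a sketch of the corrected offset on Node 1; variable names follow the snippet above, and the CUDA_VISIBLE_DEVICES-style values are illustrative:

visible = [str(i) for i in range(8, 16)]  # Node 1's GPUs 8-15, per the comment above
gpus_per_server = 1
n_servers_per_node = 8  # 8 GPUs per node / 1 GPU per server

server_idx_offset = (min(map(int, visible)) // gpus_per_server) % n_servers_per_node
print(server_idx_offset)  # 0 (node-local) instead of 8 (global)

# Node 1's four servers now reuse the low end of the port space:
for local_idx in range(4):
    idx = server_idx_offset + local_idx
    print((idx * 5000 + 10000, (idx + 1) * 5000 + 10000))
# (10000, 15000) ... (25000, 30000), all well below 65535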
Impact
This bug affects any configuration with:
- High data parallelism (12 or more single-GPU servers, so that a global server index reaches 11)
- Multi-node setups (2+ nodes)
- Small tensor parallelism (t1 or t2)
Affected allocation modes:
- vllm:d12t1, vllm:d16t1, etc.
- Any mode where a server's global index reaches 11 or higher (e.g., 12+ total vLLM servers spread across nodes); a quick check is sketched below
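A pre-flight check for which configurations hit the limit under this scheme (hypothetical helper, same assumed 10000 base + 5000-per-server layout as above):

def overflows(n_servers, ports_per_server=5000, base=10000, max_port=65535):
    # The last server's window ends at n_servers * stride + base
    return n_servers * ports_per_server + base > max_port

for dp in (8, 10, 11, 12, 16):
    print(f"d{dp}t1:", "overflows" if overflows(dp) else "ok")
# d8t1, d10t1, d11t1 ok; d12t1 and d16t1 overflow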
Workarounds
Until fixed, users can:
- Reduce data parallelism: vllm:d8t1 instead of vllm:d12t1
- Increase tensor parallelism: vllm:d6t2 instead of vllm:d12t1 (fewer servers)
- Use single-node setups (if possible)
Additional Context
I have a fix ready with:
- ✅ Bug fix (modulo operation for node-local indexing)
- ✅ 20 comprehensive unit tests (all passing)
- ✅ Full backward compatibility
- ✅ Validation for d12, d16, and various tensor parallelism configs
I'll submit a pull request shortly.
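For illustration, one plausible shape for such a test; compute_server_idx_offset is a hypothetical helper extracted for testability, not the project's actual API:

def compute_server_idx_offset(visible, gpus_per_server, n_servers_per_node):
    return (min(map(int, visible)) // gpus_per_server) % n_servers_per_node

def test_offset_is_node_local_on_second_node():
    # Node 1 of a 2-node, 8-GPU-per-node cluster, one GPU per server
    assert compute_server_idx_offset([str(i) for i in range(8, 16)], 1, 8) == 0

def test_offset_unchanged_on_node_zero():
    # Backward compatibility: Node 0 behaves exactly as before the fix
    assert compute_server_idx_offset([str(i) for i in range(8)], 1, 8) == 0

def test_offset_with_tensor_parallelism():
    # t2 on Node 1: four 2-GPU servers, offset must still be node-local
    assert compute_server_idx_offset([str(i) for i in range(8, 16)], 2, 4) == 0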
Related Files
- areal/launcher/vllm_server.py (line 195)
- areal/launcher/sglang_server.py (same issue may exist)
- areal/utils/network.py (port allocation utilities)