-
Notifications
You must be signed in to change notification settings - Fork 60
LLM pools. #2015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
LLM pools. #2015
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR replaces the Ray actor-based worker pool implementation from Zephyr with a new cluster-agnostic worker pool system for Fray. The new design uses job-based workers and distributed queues to enable LLM inference scaling across different cluster backends.
Key changes:
- Introduced a Queue abstraction with lease semantics for reliable distributed task processing
- Implemented multiple queue backends (Ray actor-based, multiprocessing-based, file-based)
- Created a new WorkerPool that uses cluster jobs instead of Ray actors for better backend flexibility
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/zephyr/src/zephyr/worker_pool.py | Removed Ray actor-based worker pool implementation |
| lib/fray/src/fray/worker_pool.py | Added new cluster-agnostic worker pool with autoscaling support |
| lib/fray/src/fray/cluster/queue.py | Introduced Queue protocol and Lease dataclass for distributed task management |
| lib/fray/src/fray/cluster/base.py | Added create_queue abstract method to Cluster interface |
| lib/fray/src/fray/cluster/ray/queue.py | Implemented Ray actor-based queue backend |
| lib/fray/src/fray/cluster/local/queue.py | Implemented multiprocessing-based queue for local clusters |
| lib/fray/src/fray/cluster/local/file_queue.py | Implemented file-based queue for subprocess communication |
| lib/fray/src/fray/examples/fake_llm_worker.py | Added example worker simulating LLM inference for testing |
| lib/fray/tests/test_worker_pool.py | Added comprehensive tests for worker pool functionality |
| lib/fray/tests/cluster/test_*.py | Added tests for queue implementations |
| lib/fray/examples/inference_pool.py | Added demo script showcasing worker pool usage |
lib/fray/tests/test_queue.py
Outdated
|
|
||
| def test_push_and_pop(queue: SimpleQueue): | ||
| """Test pushing and popping items.""" | ||
| assert queue.pop() is None |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 'assert' statement contains an expression which may have side effects.
| assert queue.pop() is None | |
| result = queue.pop() | |
| assert result is None |
lib/fray/tests/test_queue.py
Outdated
| def test_empty_queue_operations(queue: SimpleQueue): | ||
| """Test operations on an empty queue.""" | ||
| assert queue.peek() is None | ||
| assert queue.pop() is None |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 'assert' statement contains an expression which may have side effects.
| assert queue.pop() is None | |
| result = queue.pop() | |
| assert result is None |
lib/fray/tests/test_ray_queue.py
Outdated
|
|
||
| def test_push_and_pop(queue: RayQueue): | ||
| """Test pushing and popping items.""" | ||
| assert queue.pop() is None |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 'assert' statement contains an expression which may have side effects.
| assert queue.pop() is None | |
| result = queue.pop() | |
| assert result is None |
lib/fray/tests/test_ray_queue.py
Outdated
| def test_empty_queue_operations(queue: RayQueue): | ||
| """Test operations on an empty queue.""" | ||
| assert queue.peek() is None | ||
| assert queue.pop() is None |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 'assert' statement contains an expression which may have side effects.
| assert queue.pop() is None | |
| result = queue.pop() | |
| assert result is None |
26dd5bb to
c4a24a9
Compare
6b88bb6 to
9b45fd5
Compare
3eaae02 to
473cc12
Compare
| import socket | ||
|
|
||
| with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: | ||
| s.bind(("", 0)) |
Check warning
Code scanning / CodeQL
Binding a socket to all network interfaces Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 2 days ago
To fix the problem, update the socket binding in the find_free_port() function from s.bind(("", 0)) to s.bind(("127.0.0.1", 0)). This ensures the port discovery socket is only available on the loopback interface, preventing any potential network exposure during its short lifetime. Only lines in the find_free_port function need to be changed, and no imports or additional method definitions are required since '127.0.0.1' is available by default.
-
Copy modified line R37
| @@ -34,7 +34,7 @@ | ||
| import socket | ||
|
|
||
| with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: | ||
| s.bind(("", 0)) | ||
| s.bind(("127.0.0.1", 0)) | ||
| s.listen(1) | ||
| port = s.getsockname()[1] | ||
| return port |
1290c74 to
5a12d3c
Compare
It hurts.
5a12d3c to
33a9b8d
Compare
No description provided.