Description
What happened?
Summary
When processing requests with mixed priority field types (int, tuple, list, None), LiteLLM's scheduler fails with a TypeError: '<' not supported between instances of 'tuple' and 'list' during heap operations in the priority queue.
Environment
- LiteLLM Version: 1.76.1
- Python Version: 3.12
- Error Location: scheduler.py:53 in `heapq.heappush()`
Error Details
Stack Trace
```
TypeError: '<' not supported between instances of 'tuple' and 'list'
  File "/app/.venv/lib/python3.12/site-packages/litellm/scheduler.py", line 53, in add_request
    heapq.heappush(queue, (request.priority, request.request_id))
  File "/app/.venv/lib/python3.12/site-packages/litellm/router.py", line 1879, in _schedule_factory
    await self.scheduler.add_request(request=item)
  File "/app/.venv/lib/python3.12/site-packages/litellm/router.py", line 2479, in atext_completion
    return await self._schedule_factory(...)
```
Root Cause Analysis
The issue stems from Redis cache deserialization corrupting the data types stored in the priority queue:

1. Redis serialization issue: when the scheduler queue is stored in Redis and retrieved, deserialization can corrupt data types:
   - `json.loads()` fails to parse certain data structures
   - The code falls back to `ast.literal_eval()`, which can convert strings back into tuples/lists
   - This corrupts the queue data structure
2. Heap comparison failure: `heapq.heappush()` expects mutually comparable elements, but receives mixed types:
   - `(int, str)` from fresh requests
   - `(tuple, str)` or `(list, str)` from corrupted Redis cache data
3. Cache corruption: the issue occurs when:
   - The queue is stored in Redis with certain data structures
   - Redis retrieval uses the `ast.literal_eval()` fallback
   - Corrupted data types are mixed with fresh data in the heap
Code Analysis
Problematic Code Path
File: redis_cache.py:707-711 (Root Cause)

```python
try:
    cached_response = json.loads(cached_response)  # Convert string to dictionary
except Exception:
    cached_response = ast.literal_eval(cached_response)  # 👈 CORRUPTS DATA TYPES
```

- `ast.literal_eval()` can convert strings back to tuples/lists
- This corrupts the queue data structure when it is retrieved from Redis
File: scheduler.py:53 (Error Location)

```python
heapq.heappush(queue, (request.priority, request.request_id))
```

- Fails when the queue contains mixed types from Redis cache corruption
- Fresh requests: `(int, str)`
- Corrupted cache: `(tuple, str)` or `(list, str)`
File: scheduler.py:116-127 (Cache Retrieval)

```python
async def get_queue(self, model_name: str) -> list:
    if self.cache is not None:
        response = await self.cache.async_get_cache(key=_cache_key)
        if response is None or not isinstance(response, list):
            return []
        elif isinstance(response, list):
            return response  # 👈 RETURNS CORRUPTED DATA
```

- Returns corrupted data from the Redis cache
- No validation of the data types in the retrieved queue
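Note that the `isinstance(response, list)` guard only inspects the outer container, never the items, so corrupted entries pass straight through (a standalone illustration):

```python
# The guard checks only the outer container, not the items,
# so a queue holding a corrupted (list, str) entry still passes.
corrupted = [([1, 2], "req-1"), (5, "req-2")]
assert isinstance(corrupted, list)  # guard passes
# ...and the corrupted first entry flows into heapq unchecked.
print(corrupted[0])  # ([1, 2], 'req-1')
```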
Steps to Reproduce
1. Send multiple requests with different priority field types simultaneously:

```python
# Request 1: Valid priority
{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
    "priority": 5
}

# Request 2: Invalid priority (tuple/list)
{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
    "priority": [1, 2, 3]  # or (1, 2, 3)
}

# Request 3: No priority
{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}]
}
```

2. Process these requests concurrently through the LiteLLM router.
3. The error occurs in the scheduler during heap operations.
Expected Behavior
- The priority field should be normalized to a consistent type (`int`) before being passed to the scheduler
- Invalid priority values should be handled gracefully (fall back to the default priority)
- The service should remain available even when requests carry malformed priority values

Actual Behavior
- `TypeError` when the heap tries to compare different types
- The service becomes unavailable
- No graceful handling of malformed priority values
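One possible shape for that normalization (a sketch; `normalize_priority` and the default value are hypothetical names, not part of LiteLLM):

```python
DEFAULT_PRIORITY = 0  # hypothetical default; LiteLLM may use another value

def normalize_priority(value, default=DEFAULT_PRIORITY) -> int:
    """Coerce an incoming priority to int, falling back to a default."""
    if isinstance(value, bool):  # bool subclasses int; treat as invalid
        return default
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return default
    return default  # None, lists, tuples, floats, ...

print(normalize_priority(5))          # 5
print(normalize_priority("7"))        # 7
print(normalize_priority([1, 2, 3]))  # 0
print(normalize_priority(None))       # 0
```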
Proposed Fix
Primary Fix: Improve Redis cache deserialization in redis_cache.py:707-711:

```python
try:
    cached_response = json.loads(cached_response)
except Exception:
    # Instead of trusting ast.literal_eval blindly, validate its output
    try:
        cached_response = ast.literal_eval(cached_response)
        # Validate that the result is a list of 2-element tuples/lists
        if isinstance(cached_response, list):
            for i, item in enumerate(cached_response):
                if not isinstance(item, (tuple, list)) or len(item) != 2:
                    raise ValueError(f"Invalid queue item at index {i}: {item}")
                # Ensure priority is int and request_id is str
                if not isinstance(item[0], int) or not isinstance(item[1], str):
                    # Coerce to the expected types
                    cached_response[i] = (int(item[0]), str(item[1]))
    except Exception as e:
        # If deserialization fails completely, return an empty list
        logging.warning(f"Failed to deserialize queue from Redis: {e}")
        cached_response = []
```

Secondary Fix: Add type validation in scheduler queue retrieval:
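Extracted into a standalone helper, the proposed fallback can be sanity-checked directly (`safe_load_queue` is a hypothetical wrapper for illustration; the body mirrors the logic above):

```python
import ast
import json
import logging

def safe_load_queue(cached_response):
    """Hypothetical wrapper around the proposed fallback logic."""
    try:
        return json.loads(cached_response)
    except Exception:
        try:
            result = ast.literal_eval(cached_response)
            if isinstance(result, list):
                for i, item in enumerate(result):
                    if not isinstance(item, (tuple, list)) or len(item) != 2:
                        raise ValueError(f"Invalid queue item at index {i}: {item}")
                    if not isinstance(item[0], int) or not isinstance(item[1], str):
                        result[i] = (int(item[0]), str(item[1]))
            return result
        except Exception as e:
            logging.warning(f"Failed to deserialize queue from Redis: {e}")
            return []

print(safe_load_queue("[(5, 'req-1')]"))      # [(5, 'req-1')]  tuples restored
print(safe_load_queue("[('3', 'req-2')]"))    # [(3, 'req-2')]  priority coerced
print(safe_load_queue("not a queue at all"))  # []              unparsable input
```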
```python
# In scheduler.py get_queue method
async def get_queue(self, model_name: str) -> list:
    if self.cache is not None:
        response = await self.cache.async_get_cache(key=_cache_key)
        if response is None or not isinstance(response, list):
            return []
        # Validate that queue items have the correct types
        validated_queue = []
        for item in response:
            if isinstance(item, (tuple, list)) and len(item) == 2:
                try:
                    validated_queue.append((int(item[0]), str(item[1])))
                except (ValueError, TypeError):
                    continue  # Skip invalid items
        return validated_queue
    return self.queue
```

Impact
- Severity: High (service becomes unavailable)
- Affected Features: priority-based request scheduling
- Workaround: client-side validation of priority field types
- Affected Users: any service using LiteLLM with priority-based scheduling when request payloads contain malformed priority values
Additional Context
This issue was discovered in a production environment where:
- Multiple clients send requests with different priority field formats
- Some clients send priority as strings, others as numbers, others as arrays
- The service processes these requests concurrently
- The error occurs intermittently based on request timing and priority field types
Files Involved
- scheduler.py:53 - Error location
- router.py:1039 - Priority extraction (needs fix)
- router.py:1049-1050 - Priority validation (needs improvement)
- router.py:1802 - Schedule completion method signature
Relevant log output
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
1.76.1
Twitter / LinkedIn details
No response