[Bug]: TypeError in Scheduler When Priority Field Has Inconsistent Types #14817

@TheCodeWrangler

Description

What happened?

Summary

When processing requests with mixed priority field types (int, tuple, list, None), LiteLLM's scheduler fails with a TypeError: '<' not supported between instances of 'tuple' and 'list' during heap operations in the priority queue.

Environment

  • LiteLLM Version: 1.76.1
  • Python Version: 3.12
  • Error Location: scheduler.py:53 in heapq.heappush()

Error Details

Stack Trace

TypeError: '<' not supported between instances of 'tuple' and 'list'
  File "/app/.venv/lib/python3.12/site-packages/litellm/scheduler.py", line 53, in add_request
    heapq.heappush(queue, (request.priority, request.request_id))
  File "/app/.venv/lib/python3.12/site-packages/litellm/router.py", line 1879, in _schedule_factory
    await self.scheduler.add_request(request=item)
  File "/app/.venv/lib/python3.12/site-packages/litellm/router.py", line 2479, in atext_completion
    return await self._schedule_factory(...)
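The failure is reproducible outside LiteLLM with just two mixed-type queue entries: heapq compares whole entries element by element, so the first elements (a list and a tuple) are compared with `<` directly and raise. A minimal standalone sketch (not LiteLLM's actual scheduler code):

```python
import heapq

# Two queue entries whose priorities have mismatched types, as described
# in the root cause analysis below. The second push must sift the new
# entry against the existing one, triggering the comparison.
queue = []
heapq.heappush(queue, ([1, 2], "req-1"))      # corrupted entry: (list, str)
try:
    heapq.heappush(queue, ((1, 2), "req-2"))  # corrupted entry: (tuple, str)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'tuple' and 'list'
```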

Root Cause Analysis

The issue occurs due to Redis cache deserialization corrupting data types in the priority queue:

  1. Redis Serialization Issue: When the scheduler queue is stored in Redis and retrieved, the deserialization process can corrupt data types:

    • json.loads() fails on Python-repr strings (single quotes and tuple syntax are not valid JSON)
    • The code falls back to ast.literal_eval(), which revives native Python tuples and lists
    • The revived mixed types corrupt the queue data structure
  2. Heap Comparison Failure: heapq.heappush() expects comparable elements, but receives mixed types:

    • (int, str) from fresh requests
    • (tuple, str) or (list, str) from corrupted Redis cache data
  3. Cache Corruption: The issue occurs when:

    • Queue is stored in Redis with certain data structures
    • Redis retrieval uses ast.literal_eval() fallback
    • Corrupted data types are mixed with fresh data in the heap
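The fallback path in step 1 can be demonstrated in isolation. A Python-repr string (the cached payload below is hypothetical) is not valid JSON, so json.loads() fails and ast.literal_eval() revives native tuples and lists that JSON deserialization could never have produced:

```python
import ast
import json

# Hypothetical cached payload: a Python repr, not JSON.
raw = "[(5, 'req-1'), ([1, 2], 'req-2')]"
try:
    data = json.loads(raw)  # fails: single quotes and tuple syntax
except json.JSONDecodeError:
    data = ast.literal_eval(raw)  # succeeds, reviving tuples/lists

print(type(data[0][0]).__name__)  # int
print(type(data[1][0]).__name__)  # list
```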

Code Analysis

Problematic Code Path

File: redis_cache.py:707-711 (Root Cause)

try:
    cached_response = json.loads(cached_response)  # Convert string to dictionary
except Exception:
    cached_response = ast.literal_eval(cached_response)  # 👈 CORRUPTS DATA TYPES
  • ast.literal_eval() can convert strings back to tuples/lists
  • This corrupts the queue data structure when retrieved from Redis

File: scheduler.py:53 (Error Location)

heapq.heappush(queue, (request.priority, request.request_id))
  • Fails when queue contains mixed types from Redis cache corruption
  • Fresh requests: (int, str)
  • Corrupted cache: (tuple, str) or (list, str)

File: scheduler.py:116-127 (Cache Retrieval)

async def get_queue(self, model_name: str) -> list:
    if self.cache is not None:
        response = await self.cache.async_get_cache(key=_cache_key)
        if response is None or not isinstance(response, list):
            return []
        elif isinstance(response, list):
            return response  # 👈 RETURNS CORRUPTED DATA
  • Returns corrupted data from Redis cache
  • No validation of data types in retrieved queue

Steps to Reproduce

  1. Send multiple requests with different priority field types simultaneously:
# Request 1: Valid priority
{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
    "priority": 5
}

# Request 2: Invalid priority (tuple/list)
{
    "model": "gpt-4", 
    "messages": [{"role": "user", "content": "Hello"}],
    "priority": [1, 2, 3]  # or (1, 2, 3)
}

# Request 3: No priority
{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}]
}
  2. Process these requests concurrently through the LiteLLM router
  3. The error occurs in the scheduler during heap operations

Expected Behavior

  • The priority field should be normalized to a consistent type (int) before being passed to the scheduler
  • Invalid priority values should be handled gracefully (falling back to a default priority)
  • The service should remain available even when requests contain malformed priority values
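The normalization described above could look like the following sketch. The function name, default value, and accepted inputs are assumptions for illustration, not LiteLLM's API:

```python
DEFAULT_PRIORITY = 0  # assumed fallback value

def normalize_priority(priority) -> int:
    """Coerce an incoming priority to int, falling back on malformed input."""
    if isinstance(priority, bool):  # bool subclasses int; treat as invalid
        return DEFAULT_PRIORITY
    if isinstance(priority, int):
        return priority
    if isinstance(priority, str):
        try:
            return int(priority)
        except ValueError:
            return DEFAULT_PRIORITY
    return DEFAULT_PRIORITY  # tuple, list, None, float, etc.

print(normalize_priority(5))          # 5
print(normalize_priority("7"))        # 7
print(normalize_priority([1, 2, 3]))  # 0
print(normalize_priority(None))       # 0
```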

Actual Behavior

  • TypeError when heap tries to compare different types
  • Service becomes unavailable
  • No graceful handling of malformed priority values

Proposed Fix

Primary Fix: Improve Redis cache deserialization in redis_cache.py:707-711:

try:
    cached_response = json.loads(cached_response)
except Exception:
    # Keep the ast.literal_eval fallback, but validate and repair the result
    try:
        cached_response = ast.literal_eval(cached_response)
        # Validate that the result is a list of tuples with correct types
        if isinstance(cached_response, list):
            for i, item in enumerate(cached_response):
                if not isinstance(item, (tuple, list)) or len(item) != 2:
                    raise ValueError(f"Invalid queue item at index {i}: {item}")
                # Ensure priority is int and request_id is str
                if not isinstance(item[0], int) or not isinstance(item[1], str):
                    # Fix the data type
                    cached_response[i] = (int(item[0]), str(item[1]))
    except Exception as e:
        # If deserialization fails completely, return empty list
        logging.warning(f"Failed to deserialize queue from Redis: {e}")
        cached_response = []

Secondary Fix: Add type validation in scheduler queue retrieval:

# In scheduler.py get_queue method
async def get_queue(self, model_name: str) -> list:
    if self.cache is not None:
        response = await self.cache.async_get_cache(key=_cache_key)
        if response is None or not isinstance(response, list):
            return []
        # Validate queue items have correct types
        validated_queue = []
        for item in response:
            if isinstance(item, (tuple, list)) and len(item) == 2:
                try:
                    validated_queue.append((int(item[0]), str(item[1])))
                except (ValueError, TypeError):
                    continue  # Skip invalid items
        return validated_queue
    return self.queue
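Run standalone against a mix of fresh and corrupted entries, the validation loop keeps coercible items and silently drops the rest (the sample queue below is illustrative):

```python
# Sample queue mixing fresh entries, Redis-corrupted entries, and junk.
response = [(5, "req-1"), ([1, 2], "req-2"), ("3", "req-3"), "garbage"]

validated_queue = []
for item in response:
    if isinstance(item, (tuple, list)) and len(item) == 2:
        try:
            validated_queue.append((int(item[0]), str(item[1])))
        except (ValueError, TypeError):
            continue  # skip entries whose priority cannot be coerced to int

print(validated_queue)  # [(5, 'req-1'), (3, 'req-3')]
```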

Impact

  • Severity: High - Service becomes unavailable
  • Affected Features: Priority-based request scheduling
  • Workaround: Client-side validation of priority field types
  • Affected Users: Any service using LiteLLM with priority-based scheduling when request payloads contain malformed priority values

Additional Context

This issue was discovered in a production environment where:

  • Multiple clients send requests with different priority field formats
  • Some clients send priority as strings, others as numbers, others as arrays
  • The service processes these requests concurrently
  • The error occurs intermittently based on request timing and priority field types

Files Involved

  • scheduler.py:53 - Error location
  • router.py:1039 - Priority extraction (needs fix)
  • router.py:1049-1050 - Priority validation (needs improvement)
  • router.py:1802 - Schedule completion method signature

Relevant log output

Are you a ML Ops Team?

No

What LiteLLM version are you on ?

1.76.1

Twitter / LinkedIn details

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
