-
-
Notifications
You must be signed in to change notification settings - Fork 4.7k
fix: Use file content hash for audio transcription cache keys #16462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add calculate_audio_file_hash() function to generate SHA256 hash of audio file content - Update file_checksum calculation in utils.py to use content hash instead of filename - Enhance _get_file_param_value() in caching.py with hash calculation fallback - Add comprehensive tests for transcription cache key generation Fixes issue where different audio files with same name would share cache keys, causing incorrect transcriptions to be returned from cache.
|
@Otoru is attempting to deploy a commit to the CLERKIEAI Team on Vercel. A member of the Team first needs to authorize it. |
When tests pass MagicMock() as file parameter, hash calculation may fail. Added try-except to fallback to filename if hash calculation fails, ensuring the flow continues and client creation is not blocked.
…DIO_ANALYZER_API_BASE The tests test_presidio_sets_guardrail_information_in_request_data and test_request_data_flows_to_apply_guardrail were failing because they didn't use mock_testing=True, causing Presidio to try to initialize and require the PRESIDIO_ANALYZER_API_BASE environment variable. Adding mock_testing=True allows these tests to run without external dependencies, consistent with other Presidio tests in the file.
- Remove unused 'traceback' import - Separate 'status' import into its own line for clarity - Replace traceback.format_exc() with verbose_proxy_logger.exception() which is the proper way to log exceptions
Replace manual exception handling with handle_exception_on_proxy() helper function from litellm.proxy.utils, which is the standard approach used throughout the codebase. This simplifies the code and ensures consistent error handling.
…ion_on_proxy Maintain the verbose_proxy_logger.error() call before handle_exception_on_proxy() to ensure proper error logging, as per the main branch version.
The status import is no longer needed after using handle_exception_on_proxy() helper function which handles status codes internally.
- Add check for None file object before processing - Add nested try-except for more robust fallback handling - Improve error message in calculate_audio_file_hash to preserve original exception - This should prevent 500 errors when file processing fails in edge cases
- Remove unused exception variable 'e' in audio file hash calculation - Fix test_base_email.py to properly add enterprise directory to sys.path - This ensures litellm_enterprise modules can be imported correctly
Change FileTypes to Optional[FileTypes] since kwargs.get('file') can return None.
This fixes the mypy type error on line 785.
litellm/caching/caching.py
Outdated
|
|
||
| file_checksum = calculate_audio_file_hash(audio_file=file) | ||
| # Store in metadata for future use | ||
| if "metadata" in kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle metadata being none
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also why check this again, we already have a metadata var above (line 361)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like this?
litellm/caching/caching.py
Outdated
|
|
||
| file_checksum = calculate_audio_file_hash(audio_file=file) | ||
| # Store in metadata for future use | ||
| if "metadata" in kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also why check this again, we already have a metadata var above (line 361)
- Change metadata.get() to handle None case with 'or {}'
- Remove redundant 'if metadata in kwargs' check and use existing metadata variable
- Simplify code by directly updating metadata variable and kwargs
- Change import from litellm_enterprise.types to enterprise.litellm_enterprise.types - Ensures test uses local enterprise code instead of installed package - Fixes ImportError for SendKeyRotatedEmailEvent in CI
- Add enterprise/litellm_enterprise directory to sys.path before enterprise - Allows base_email.py to find litellm_enterprise.types from local code - Fixes ImportError when base_email.py imports from litellm_enterprise.types
- Remove installed litellm_enterprise modules before importing to force use of local code - Ensures SendKeyRotatedEmailEvent is found from local enterprise code - Fixes ImportError when installed package doesn't have latest types - All 18 tests passing
Problem
The audio transcription endpoint was not considering the actual audio file content when generating cache keys. Instead, it was only using the filename, which caused incorrect cache hits when:
This resulted in the same transcription being returned for different audio files, causing data integrity issues.
Solution
This PR fixes the cache key generation for audio transcriptions by:
Added
calculate_audio_file_hash()function (litellm/litellm_core_utils/audio_utils/utils.py):Updated file_checksum calculation (
litellm/utils.py):get_audio_file_name()tocalculate_audio_file_hash()Enhanced cache key generation (
litellm/caching/caching.py):_get_file_param_value()with proper fallback logicfile_checksumin metadataAdded comprehensive tests (
tests/test_litellm/caching/test_transcription_cache.py):Changes Made
litellm/litellm_core_utils/audio_utils/utils.py: Addedcalculate_audio_file_hash()functionlitellm/utils.py: Updated to use content hash instead of filenamelitellm/caching/caching.py: Enhanced cache key generation with hash fallbacktests/test_litellm/caching/test_transcription_cache.py: Added comprehensive test suiteTesting
All new tests pass and verify:
Impact
Related Issues
Fixes the issue where audio transcription cache keys were based on filename instead of file content, causing incorrect cache hits for different audio files.
Checklist