UN-3008 [FIX] Word level confidence issue fixes #1681
Conversation
Add a word-level confidence feature that extends the existing highlight functionality, allowing confidence scores to be tracked at the word level during extraction.

Key changes:
- Add `enable_word_confidence` field to the `CustomTool` model
- Add `word_confidence_postamble` support for custom prompts
- Pass the word-confidence flag through the extraction and indexing pipelines
- Update the SDK to preserve original text for post-processing
- Add a dependency check to ensure word confidence requires highlight to be enabled (see the sketch below)
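A minimal sketch of that dependency check, assuming a Django-style `CustomTool` model (the `clean()` hook, field defaults, and error message are illustrative assumptions, not the repo's actual code):

```python
from django.core.exceptions import ValidationError
from django.db import models


class CustomTool(models.Model):
    """Hypothetical excerpt showing the highlight/word-confidence flags."""

    enable_highlight = models.BooleanField(default=False)
    # New flag from this PR: track confidence scores at the word level
    enable_word_confidence = models.BooleanField(default=False)

    def clean(self):
        # Word confidence is derived from highlight data, so it cannot
        # be enabled on its own.
        if self.enable_word_confidence and not self.enable_highlight:
            raise ValidationError(
                "Word-level confidence requires highlighting to be enabled."
            )
```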
…ext parameter
- Updated `post_process_fn` type signature from `Callable[[LLMResponseCompat, bool], ...]` to `Callable[[LLMResponseCompat, bool, str], ...]` to match the actual call at lines 500-502
- Addresses review comment: #1672 (comment)
- The highlight_data plugin's `run()` method already accepts the third parameter (`original_text: str`)
- Removed `WORD_CONFIDENCE_DATA` constant as it's not used in the main repo's workflow manager
- The constant is only needed in the unstract-cloud repo, which has the rule engine
- Prompt Studio code uses the string directly, which is appropriate
- Addresses review comment: #1672 (comment)
Summary by CodeRabbit
Walkthrough

The changes refactor confidence data handling across the frontend, backend, and worker layers. Loop-scoped alias variables prevent cross-prompt data mutations during output processing. Confidence computation adopts a multi-tier fallback strategy (word-level → highlight-derived → explicit), with new LLMWhisperer configuration support for line-level confidence extraction.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant OMH as Output Manager<br/>(Backend)
    participant DC as Destination<br/>Connector (Worker)
    participant Frontend as Frontend<br/>Component
    OMH->>OMH: Extract loop-scoped<br/>variables per prompt
    OMH->>OMH: Serialize context<br/>& challenge data
    OMH->>DC: Call with isolated<br/>prompt data
    DC->>DC: Extract confidence<br/>from highlight data
    DC->>DC: Apply 3-tier fallback:<br/>word → highlight → explicit
    DC->>DC: Populate metadata with<br/>all confidence variants
    DC->>Frontend: Return wrapped result
    Frontend->>Frontend: Receive confidence
    Frontend->>Frontend: Apply nested word<br/>confidence lookup
    Frontend->>Frontend: Fallback to highlight<br/>or explicit confidence
    Frontend->>Frontend: Pass to<br/>handleSelectHighlight
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
workers/shared/workflow/destination_connector.py (1)
224-266: Refactor to reduce cognitive complexity and add recursion protection.

SonarCloud flags a cognitive complexity of 28 (allowed: 15). The nested function with multiple conditionals is hard to follow. Additionally, deeply nested highlight data could cause a stack overflow.
Consider refactoring to reduce nesting and add a depth limit:
```diff
 @staticmethod
-def _extract_confidence_from_highlight_data(data: Any) -> float | None:
+def _extract_confidence_from_highlight_data(
+    data: Any, max_depth: int = 50
+) -> float | None:
     """Extract confidence from 5th element of highlight data coordinate arrays.
-
-    Recursively searches through nested arrays/objects to find coordinate arrays
-    with 5 elements where the 5th element (index 4) is the confidence score.
-
-    Args:
-        data: Highlight data structure (can be nested arrays/dicts)
-
-    Returns:
-        Average confidence score if found, None otherwise
     """
     if not data:
         return None
     confidence_values = []

-    def extract_from_array(arr):
-        if isinstance(arr, list):
-            for item in arr:
-                if isinstance(item, list):
-                    # Check if this is a coordinate array with 5 elements
-                    if len(item) >= 5 and isinstance(item[4], (int, float)):
-                        confidence_values.append(float(item[4]))
-                    else:
-                        # Recursively check nested arrays
-                        extract_from_array(item)
-                elif isinstance(item, dict):
-                    # Recursively check objects
-                    for val in item.values():
-                        extract_from_array(val)
-        elif isinstance(arr, dict):
-            for val in arr.values():
-                extract_from_array(val)
+    def extract_recursive(arr, depth: int = 0):
+        if depth > max_depth:
+            return
+        if isinstance(arr, list):
+            # Check if this is a coordinate array with 5 elements
+            if len(arr) >= 5 and isinstance(arr[4], (int, float)):
+                confidence_values.append(float(arr[4]))
+                return
+            for item in arr:
+                extract_recursive(item, depth + 1)
+        elif isinstance(arr, dict):
+            for val in arr.values():
+                extract_recursive(val, depth + 1)

-    extract_from_array(data)
+    extract_recursive(data)
     if confidence_values:
         return sum(confidence_values) / len(confidence_values)
-    return None
```

frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx (2)
138-138: Prefer `replaceAll()` for clarity.

While `replace()` with the global flag works, `replaceAll()` more explicitly conveys the intent to replace all occurrences. Apply this diff:

```diff
-      const normalized = path.replace(/\[(\d+)\]/g, ".$1");
+      const normalized = path.replaceAll(/\[(\d+)\]/g, ".$1");
```
147-166: Consider adding comments to clarify the multi-tier fallback strategy.

The confidence computation uses a three-tier fallback (word-level → highlight-derived → explicit), but this strategy isn't immediately apparent from the code. Adding inline comments would improve maintainability.
Consider adding clarifying comments:
```diff
       let confidence;
+      // Tier 1: Try word-level confidence (most granular)
       if (shouldUseWordConfidence && wordConfidenceData) {
         const wordConfidence = getNestedValue(wordConfidenceData, key);
         if (wordConfidence && typeof wordConfidence === "object") {
           const values = Object.values(wordConfidence).filter(
             (v) => typeof v === "number"
           );
           if (values.length > 0) {
             const sum = values.reduce((acc, val) => acc + val, 0);
             confidence = sum / values.length;
           }
         }
       }
+      // Tier 2: Extract from highlight data coordinate arrays (5th element)
+      // Tier 3: Fall back to explicit confidence data
       if (confidence === undefined) {
         const extractedConfidence = extractConfidenceFromHighlightData(
           highlightData[key]
         );
         confidence = extractedConfidence ?? confidenceData?.[key];
       }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (7)
- backend/prompt_studio/prompt_studio_output_manager_v2/output_manager_helper.py (2 hunks)
- frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx (2 hunks)
- frontend/src/components/custom-tools/prompt-card/PromptCard.jsx (1 hunks)
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/constants.py (1 hunks)
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/helper.py (1 hunks)
- unstract/sdk1/src/unstract/sdk1/llm.py (1 hunks)
- workers/shared/workflow/destination_connector.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/helper.py (1)
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/constants.py (1)
WhispererConfig (50-77)
frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx (1)
frontend/src/components/custom-tools/document-manager/DocumentManager.jsx (1)
highlightData (108-108)
unstract/sdk1/src/unstract/sdk1/llm.py (1)
unstract/sdk1/src/unstract/sdk1/utils/common.py (1)
LLMResponseCompat (120-138)
workers/shared/workflow/destination_connector.py (2)
backend/adapter_processor_v2/models.py (1)
metadata (161-171)
backend/api_v2/api_deployment_views.py (1)
get (139-189)
🪛 GitHub Check: SonarCloud Code Analysis
frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx
[failure] 89-89: Refactor this function to reduce its Cognitive Complexity from 22 to the 15 allowed.
[warning] 138-138: Prefer String#replaceAll() over String#replace().
workers/shared/workflow/destination_connector.py
[failure] 225-225: Refactor this function to reduce its Cognitive Complexity from 28 to the 15 allowed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/constants.py (1)
75-75: LGTM!

The new configuration key follows the existing naming conventions and integrates well with the other WhispererConfig constants.
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/helper.py (1)
205-206: LGTM!

Coupling `INCLUDE_LINE_CONFIDENCE` with `enable_highlight` is consistent with `ADD_LINE_NOS` and ensures line-level confidence data is available when highlighting features are enabled downstream.
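As a rough sketch of that coupling (only the constant names come from the review; the constant values and the helper function are illustrative assumptions):

```python
class WhispererConfig:
    # Constant names as referenced in the review; values are assumed.
    ADD_LINE_NOS = "add_line_nos"
    INCLUDE_LINE_CONFIDENCE = "include_line_confidence"


def build_whisperer_params(base: dict, enable_highlight: bool) -> dict:
    """Hypothetical helper: highlight-dependent options travel together."""
    params = dict(base)
    if enable_highlight:
        # Line numbers and line-level confidence are requested as a pair,
        # so downstream highlight features always have confidence data.
        params[WhispererConfig.ADD_LINE_NOS] = True
        params[WhispererConfig.INCLUDE_LINE_CONFIDENCE] = True
    return params
```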
backend/prompt_studio/prompt_studio_output_manager_v2/output_manager_helper.py (1)

147-169: Good refactor: Loop-scoped variables prevent cross-prompt data mutation.

This correctly isolates per-prompt data extraction, ensuring that modifications (like extracting nested keys or adding `file_name`) don't affect subsequent iterations.
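A toy illustration of the bug class this refactor avoids (data and key names are made up):

```python
outputs = {"prompt_a": {"answer": "x"}, "prompt_b": {"answer": "y"}}

# Anti-pattern: one dict reused across iterations lets per-prompt
# additions (like file_name) bleed into later prompts.
shared = {}
for name, data in outputs.items():
    shared.update(data)
    shared["file_name"] = f"{name}.txt"  # last write wins for everyone

# Loop-scoped copy: each iteration works on its own isolated alias.
isolated = {}
for name, data in outputs.items():
    per_prompt = dict(data)  # fresh copy, scoped to this iteration
    per_prompt["file_name"] = f"{name}.txt"
    isolated[name] = per_prompt
```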
workers/shared/workflow/destination_connector.py (1)

1494-1534: LGTM: 3-tier confidence fallback hierarchy is well-implemented.

The fallback logic correctly prioritizes:
1. `word_confidence_data` (most granular)
2. Extracted average from `highlight_data` (derived)
3. Original `confidence_data` (last resort)

The enriched metadata in `wrapped_result` now provides comprehensive confidence data for downstream HITL rule evaluation.
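Reduced to a sketch, the worker-side resolution looks roughly like this (key names mirror the review; the exact data shapes are assumptions):

```python
def resolve_confidence(
    word_confidence_data: dict | None,
    highlight_data: list | None,
    confidence_data: float | None,
) -> float | None:
    # Tier 1: word-level scores (most granular) -- average them.
    if word_confidence_data:
        scores = [
            v for v in word_confidence_data.values()
            if isinstance(v, (int, float))
        ]
        if scores:
            return sum(scores) / len(scores)

    # Tier 2: derive an average from highlight coordinate arrays,
    # where the 5th element of each entry is a confidence score.
    if highlight_data:
        scores = [
            float(item[4]) for item in highlight_data
            if isinstance(item, list) and len(item) >= 5
            and isinstance(item[4], (int, float))
        ]
        if scores:
            return sum(scores) / len(scores)

    # Tier 3: fall back to the explicit confidence value, if any.
    return confidence_data
```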
unstract/sdk1/src/unstract/sdk1/llm.py (1)

477-478: Verify breaking change impact on external callers.

The signature change for `post_process_fn` from the original 2-parameter to the new 3-parameter callable is confirmed and real. The callback now requires `(LLMResponseCompat, bool, str)` parameters and is invoked at lines 501-502 with three arguments: `response_compat`, `extract_json`, and `original_text`.

However, no callers of `complete()` passing `post_process_fn` were found in the codebase search. Since this parameter is passed via `**kwargs` and appears undocumented, determine whether:
- This is a new feature with no existing users (safe to change)
- Users exist in external code outside this repository (breaking change)
- Users exist in code paths not covered by the search
Confirm there are no external dependencies relying on the old 2-parameter signature before merging.
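For illustration, the new callback shape is roughly the following (a sketch only; the import path matches the code-graph entry above, and the callback body is a placeholder):

```python
from typing import Any

from unstract.sdk1.utils.common import LLMResponseCompat


# Old shape: Callable[[LLMResponseCompat, bool], Any]
# New shape: Callable[[LLMResponseCompat, bool, str], Any]
def my_post_process(
    response_compat: LLMResponseCompat,
    extract_json: bool,
    original_text: str,  # newly added third argument
) -> Any:
    # Placeholder: a real callback might map highlight offsets back
    # onto original_text before returning the response.
    return response_compat
```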
frontend/src/components/custom-tools/prompt-card/PromptCard.jsx (1)
235-235: LGTM! Simplified confidence handling.

The change correctly delegates confidence computation to `DisplayPromptResult.jsx`, removing the averaging logic from this component. This centralizes the multi-tier confidence fallback strategy in one place.
Test Results

Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
What
Why
How
Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
Checklist
I have read and understood the Contribution Guidelines.