@ShaikAbdulHafeez03 ShaikAbdulHafeez03 commented Nov 22, 2025

Purpose
This PR fixes critical issues with tool call extraction in the OpenAIToolParser (Harmony/gpt-oss), specifically when using Custom Tools and MCP (Model Context Protocol).

Previously, the parser failed to correctly capture tool calls that resided in the active stream buffer (current_content) but hadn't yet been finalized into a message object. This led to two major issues:

Tool Call Failure: Valid tool calls were dropped or ignored.

Channel Leakage/Hallucination: The model appeared to hallucinate or leak tool call structures into the final (user-facing) message channel instead of processing them as structured tool calls.

Changes Implemented:

Active Buffer Parsing: Added logic to inspect parser.current_content and parser.current_channel. If the active channel is commentary (used for functions in Harmony), the content is now correctly extracted as a tool call.

Tag Sanitization: Added a sanitization step to strip internal tags (e.g., <|channel|>) from function names.

Robust JSON Handling: Improved JSON parsing for tool arguments to handle partial or slightly malformed buffers.
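
For illustration, here is a minimal sketch of the sanitization and JSON-normalization steps described above; the helper names and the exact regex are illustrative, not the PR's verbatim code:

import json
import re

def _sanitize_function_name(name: str) -> str:
    # Strip leaked Harmony control tags (e.g. "<|channel|>commentary") so
    # "get_customer_name<|channel|>commentary" becomes "get_customer_name".
    return re.sub(r"<\|[^|]*\|>.*$", "", name).strip()

def _normalize_tool_args(raw: str) -> str:
    # Round-trip valid JSON to drop stray newlines and odd formatting;
    # fall back to the raw buffer if it is partial or malformed.
    try:
        return json.dumps(json.loads(raw))
    except json.JSONDecodeError:
        return raw

print(_sanitize_function_name("get_customer_name<|channel|>commentary"))  # get_customer_name
print(_normalize_tool_args('{"query": "weather"}\n'))  # {"query": "weather"}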

Test Plan
Scenario: Run a request using a Custom Tool or MCP tool definition (e.g., web_search or a custom Python function).

Send a request that forces a custom tool call, or try calling a custom MCP server (it fails when calling multiple tools because of the parsing issue). An example request is sketched below.
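
A minimal request against a local vLLM server might look like this (the model name, port, and web_search schema are illustrative placeholders, not part of this PR):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Find the current weather in Paris."}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)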

Test Result
Before: When using custom tools/MCP, the tool calling would frequently fail. The model output would often "hallucinate" or leak the raw tool call JSON into the final user message string, failing to trigger the actual tool execution on the client side.

After: Tool calls are cleanly extracted from the commentary channel. The final message content remains clean, and custom tools/MCP function correctly without leakage.

Essential Elements of an Effective PR Description Checklist

[x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".

[x] The test plan, such as providing test command.

[x] The test results, such as pasting the results comparison before and after, or e2e results

[ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

[ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces important fixes for tool call parsing with gpt-oss models, addressing issues with dropped tool calls and malformed outputs. The changes involve more robust parsing of the active stream buffer, sanitizing function names, and improving JSON argument handling. The review identifies two high-severity issues in vllm/entrypoints/harmony_utils.py: one related to potential data loss in parse_chat_output due to content overwriting, and another concerning silent error handling in parse_output_into_messages which could lead to truncated outputs. Addressing these will further improve the robustness and debuggability of the parsing logic.

Comment on lines +519 to +520
except Exception:
    break
Contributor
high

The use of a broad except Exception: that silently breaks can hide underlying bugs in the parser or issues with the token stream. If an exception occurs, token processing stops, and the output is silently truncated. This can lead to incomplete results without any warning, making debugging difficult. This is a high-risk pattern that can cause silent data loss. It is highly recommended to log the exception before breaking the loop to provide visibility into why parsing was stopped.

Suggested change
except Exception:
    break
except Exception:
    # TODO: Add logging here to avoid silent failures. For example:
    # logger.warning("Error processing token %d. Stopping parsing.",
    #                token_id, exc_info=True)
    break
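
For reference, a self-contained version of that suggestion might look like the following; the surrounding loop is inferred from the quoted lines, and the logger setup is an assumption rather than the file's verbatim code:

from vllm.logger import init_logger

logger = init_logger(__name__)  # assumed; adjust to the module's existing logger

def parse_tokens(parser, token_ids):
    # Same loop shape as the quoted code, but the failure is logged
    # instead of silently truncating the parsed output.
    for token_id in token_ids:
        try:
            parser.process(token_id)
        except Exception:
            logger.warning(
                "Error processing token %d; stopping Harmony parsing early.",
                token_id,
                exc_info=True,
            )
            break
    return parser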

Comment on lines 531 to 560
    reasoning_parts = []
    final_content = None
    is_tool_call = False

    for msg in output_msgs:
        if msg.channel == "analysis":
            for content in msg.content:
                reasoning_parts.append(content.text)
        elif msg.channel == "final":
            for content in msg.content:
                final_content = content.text
        elif msg.channel == "commentary" and msg.recipient and msg.recipient.startswith("functions."):
            is_tool_call = True
            if not final_content:
                final_content = ""
            for content in msg.content:
                final_content = content.text

    if parser.current_content:
        if parser.current_channel == "analysis":
            reasoning_parts.append(parser.current_content)
        elif parser.current_channel == "final":
            final_content = parser.current_content
        elif parser.current_channel == "commentary" and parser.current_recipient and parser.current_recipient.startswith("functions."):
            is_tool_call = True
            final_content = parser.current_content

    reasoning = "\n".join(reasoning_parts) if reasoning_parts else None

    return reasoning, final_content, is_tool_call
Contributor
high

The implementation for final_content handling has several issues that can lead to data loss by overwriting content. Specifically:

  • In the loop over output_msgs, final_content is overwritten for each message in the final channel and for each tool call in the commentary channel. If there are multiple such messages, only the content of the last one will be preserved.
  • When processing parser.current_content, final_content is again overwritten, discarding any content that might have been parsed from output_msgs.

This is particularly problematic for tool calls, as it means only the arguments of the last tool call are returned in final_content, while others are dropped. To prevent this data loss, final_content should be aggregated from all relevant messages and the current buffer, rather than being reassigned.

    reasoning_parts = []
    final_content = None
    is_tool_call = False
    
    for msg in output_msgs:
        if msg.channel == "analysis":
            for content in msg.content:
                reasoning_parts.append(content.text)
        elif msg.channel == "final":
            if final_content is None:
                final_content = ""
            for content in msg.content:
                final_content += content.text
        elif msg.channel == "commentary" and msg.recipient and msg.recipient.startswith("functions."):
            is_tool_call = True
            current_call_args = "".join(c.text for c in msg.content)
            if current_call_args:
                if final_content is None:
                    final_content = current_call_args
                else:
                    final_content += "\n" + current_call_args

    if parser.current_content:
        if parser.current_channel == "analysis":
            reasoning_parts.append(parser.current_content)
        elif parser.current_channel == "final":
            if final_content is None:
                final_content = ""
            final_content += parser.current_content
        elif parser.current_channel == "commentary" and parser.current_recipient and parser.current_recipient.startswith("functions."):
            is_tool_call = True
            if parser.current_content:
                if final_content is None:
                    final_content = parser.current_content
                else:
                    final_content += "\n" + parser.current_content

    reasoning = "\n".join(reasoning_parts) if reasoning_parts else None

    return reasoning, final_content, is_tool_call

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 52 to +74
if msg.recipient and msg.recipient.startswith("functions."):
    # If no content-type is given assume JSON, as that's the
    # most common case with gpt-oss models.
    if not msg.content_type or "json" in msg.content_type:
        # load and dump the JSON text to check validity and
        # remove any extra newlines or other odd formatting
        try:
            tool_args = json.dumps(json.loads(msg_text))
        except json.JSONDecodeError:
            logger.exception(
                "Error decoding JSON tool call from response."
            )
            tool_args = msg_text
    else:
        tool_args = msg_text
    tool_calls.append(
        ToolCall(
            type="function",
            function=FunctionCall(
                name=msg.recipient.split("functions.")[1],
                arguments=tool_args,
            ),
        )
    )
    func_name = msg.recipient.split("functions.")[1]
    tool_calls.append(_create_tool_call(func_name, msg_text))


P1: Non-JSON tool calls are dropped

extract_tool_calls now only appends a tool call when msg.content_type is absent or contains "json"; the prior fallback that forwarded other content types has been removed. Any tool invocation emitted with a non-JSON content type (e.g., plain-text arguments or other MIME types) will now leave tools_called false and omit the corresponding tool_calls entry even though the model targeted functions.*, so the client will never execute that tool.
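
One way to keep that fallback while still using the new helper is sketched below; the names are illustrative, and _create_tool_call is assumed to take a function name and an arguments string as it does in this PR:

import json

def _tool_args_for(content_type: str | None, msg_text: str) -> str:
    # If no content-type is given, assume JSON (the common gpt-oss case) and
    # round-trip it to validate/normalize; otherwise, or if parsing fails,
    # forward the raw argument text verbatim, as the prior code did.
    if not content_type or "json" in content_type:
        try:
            return json.dumps(json.loads(msg_text))
        except json.JSONDecodeError:
            return msg_text
    return msg_text

# In extract_tool_calls, the append could then run for every functions.*
# recipient regardless of content type, e.g.:
#   tool_calls.append(_create_tool_call(func_name, _tool_args_for(msg.content_type, msg_text)))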


@ShaikAbdulHafeez03 ShaikAbdulHafeez03 force-pushed the fix-gpt-oss-tool-parser-v2 branch 2 times, most recently from cd557ce to 251113d on November 22, 2025 at 11:37
@DarkLight1337
Member

Can you merge from main to fix pre-commit?

@ShaikAbdulHafeez03 ShaikAbdulHafeez03 force-pushed the fix-gpt-oss-tool-parser-v2 branch from 32a6d03 to 8d94e30 on November 22, 2025 at 15:19
Signed-off-by: ShaikAbdulHafeez03 <[email protected]>
@ShaikAbdulHafeez03
Author

Can you merge from main to fix pre-commit?

Done, I have fixed the pre-commit issue.

@bbrowning
Contributor

bbrowning commented Nov 22, 2025

Previously, the parser failed to correctly capture tool calls that resided in the active stream buffer (current_content) but hadn't yet been finalized into a message object. This led to two major issues:

If there is content in current_content and not a finalized message, that means the model stopped mid-generation for some reason. Proper tool calls all end with a <|call|> token, which is considered a stop token and the tool call will end up in a message and not current_content of the parser.

The changes in #28729 will help many of the issues this proposes to fix in the parser. The root cause of many of these is we are prompting gpt-oss models wrong in many multi-turn scenarios, especially ones involving tool calling. So, what the changes in that PR do is fix how we're prompting the models which helps many of the situations where <|channel|> may leak into function recipients or we see other issues with models following their harmony format.

Some level of fuzzy parsing could be useful, but the changes here are large enough that I'd want to see what kind of impact this has on model accuracy and overall parsing success. Do you have any examples of requests that did not work before that this fixes? Have you run this against any tool calling benchmark suites to ensure it doesn't regress in other areas?

I'm pushing on the accuracy point here as I've spent a large amount of time benchmarking tool call accuracy in the gpt-oss models over the last few weeks to come up with the changes in #28729 and want to be careful about potentially regressing on any of that.

@ShaikAbdulHafeez03
Author

ShaikAbdulHafeez03 commented Nov 23, 2025

Previously, the parser failed to correctly capture tool calls that resided in the active stream buffer (current_content) but hadn't yet been finalized into a message object. This led to two major issues:

If there is content in current_content and not a finalized message, that means the model stopped mid-generation for some reason. Proper tool calls all end with a <|call|> token, which is considered a stop token and the tool call will end up in a message and not current_content of the parser.

The changes in #28729 will help many of the issues this proposes to fix in the parser. The root cause of many of these is we are prompting gpt-oss models wrong in many multi-turn scenarios, especially ones involving tool calling. So, what the changes in that PR do is fix how we're prompting the models which helps many of the situations where <|channel|> may leak into function recipients or we see other issues with models following their harmony format.

Some level of fuzzy parsing could be useful, but the changes here are large enough that I'd want to see what kind of impact this has on model accuracy and overall parsing success. Do you have any examples of requests that did not work before that this fixes? Have you run this against any tool calling benchmark suites to ensure it doesn't regress in other areas?

I'm pushing on the accuracy point here as I've spent a large amount of time benchmarking tool call accuracy in the gpt-oss models over the last few weeks to come up with the changes in #28729 and want to be careful about potentially regressing on any of that.

Hi @bbrowning,

Thanks for the feedback and for the context regarding #28729.

To address your question about previous failures: I have been testing specifically with the latest version (vllm 0.11.2) which includes the changes from #28729. While those prompting improvements definitely helped, I am still encountering significant parsing failures in scenarios where the model is provided with a large number of tools.

The issue: when the context is heavy, the model occasionally leaks the analysis/commentary tags directly into the function name string. Because the current parser expects a strict format, it treats the leaked tags as part of the tool name, causing the call to fail.

Here is a specific trace from my testing with v0.11.2:

[3900697] [Local→Remote] tools/list
[3900697] [Remote→Local] 3
⚠ Tool error attempt 1: Tool get_customer_name<|channel|>commentary not found in agent MCP Agent
[3900697] [Remote→Local] 4
⚠ Tool error attempt 2: Tool get_customer_name<|channel|>commentary not found in agent MCP Agent
As seen above, the parser is interpreting get_customer_name<|channel|>commentary as the tool name, rather than separating the function name from the leaked channel tag.

Here is another issue, where the analysis content leaked into the tool call header:

Tool error attempt 3: Error code: 500 - {'error': {'message': 'unexpected tokens remaining in message header: ["need", "to", "create", "Jira", "issue.", "The", "tool", "name", "is", "createJiraIssue.", "The", "error", "says", "tool", "not", "found", "in", "agent", "MCP", "Jira", "Agent.", "Maybe", "we", "need", "to", "call", "createJiraIssue", "with", "correct", "JSON.", "The", "tool", "signature:", "createJiraIssue({cloudId,", "projectKey,", "issueTypeName,", "summary,", "description?,", "parent?,", "assignee_account_id?,", "additional_fields?}).",

As seen above, the model's analysis text ("need to create Jira issue...") leaked into the message header, so the parser rejected the request with a 500 error instead of extracting the createJiraIssue call.

Observed impact and improvements: this PR introduces a sanitization layer ("fuzzy parsing") to handle these edge cases. The impact on reliability in my tests has been significant:

Before the fix: approximately 2 out of 3 tool calls failed due to this parsing error.

After the fix: tool calling succeeds 100% of the time in the same scenario.

Scope: These improvements were observed specifically in ChatCompletion mode, ensuring the parser doesn't hang waiting for a format the model failed to provide perfectly.

@ApostaC
Collaborator

ApostaC commented Nov 24, 2025

cc @aarnphm @chanh

@bbrowning
Contributor

To address your question about previous failures: I have been testing specifically with the latest version (vllm 0.11.2) which includes the changes from #28729. While those prompting improvements definitely helped, I am still encountering significant parsing failures in scenarios where the model is provided with a large number of tools.

Note that #28729 is not merged yet, which means 0.11.2 does NOT include those changes. It's expected you'll continue to have prompting and parsing issues until that merges.
