
JS Summarization Middleware: internal model.invoke() is streamed to UI (cannot be suppressed, unlike tools) #9455

@evanedreo

Description

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangGraph.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangGraph.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangGraph (or the specific integration package).

Example Code

import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { createAgent, summarizationMiddleware } from "langchain";
import { MemorySaver, MessagesZodState } from "@langchain/langgraph";

// --- Main agent model (intentional streaming for UI) ---
const mainModel = new ChatOpenAI({
  model: "gpt-4o-mini",
  streaming: true,
  temperature: 0,
});

// --- Internal summarization model (should NOT stream) ---
const summaryModel = new ChatOpenAI({
  model: "gpt-4o-mini",
  streaming: false, // <-- disabled
  callbacks: [],    // <-- none
  temperature: 0,
});

// --- Summarization middleware with low thresholds so it triggers instantly ---
const summarizeMiddleware = summarizationMiddleware({
  model: summaryModel,        // <-- internal model
  trigger: { tokens: 200, messages: 2 },
  keep: { messages: 1 },
  trimTokensToSummarize: 128,
  summaryPrefix: "Here is a summary of the conversation to date:",
});

// --- Minimal agent state ---
const agentState = z.object({
  messages: MessagesZodState.shape.messages,
});

// --- Agent with only summarization middleware ---
const agent = createAgent({
  model: mainModel,
  tools: [],
  middleware: [summarizeMiddleware],
  stateSchema: agentState,
  checkpointer: new MemorySaver(),
});

// --- Helper: produce enough text to trigger summarization ---
function longPrompt() {
  const base =
    "Explain in detail how to build a production-grade Node.js REST API with authentication, " +
    "queues, monitoring, error handling, and scaling considerations.";
  return Array(10).fill(base).join("\n\n"); // trigger summarization threshold
}

async function main() {
  console.log("Starting stream… Watch for a leaked summary BEFORE final answer.\n");

  const stream = await agent.stream(
    {
      messages: [{ role: "user", content: longPrompt() }],
    },
    {
      streamMode: ["messages"], // <-- required to observe the bug
      configurable: { thread_id: "summarization-repro" },
    }
  );

  for await (const [mode, chunk] of stream) {
    if (mode !== "messages") continue;

    const [msg] = chunk;
    const role = (msg as any).role ?? "unknown";
    const name = (msg as any).name ?? "";
    const content =
      typeof msg.content === "string"
        ? msg.content
        : JSON.stringify(msg.content);

    console.log(`\n[STREAMED MESSAGE] role=${role} name=${name}`);
    console.log(content.slice(0, 200) + (content.length > 200 ? "..." : ""));

    // Bug behavior:
    //
    // 1. You will see a "summary" message appear FIRST
    //    (coming from model.invoke inside summarizationMiddleware)
    //
    // 2. Then you will see the real agent reply
    //
    // The summary message should NEVER appear because it is an internal
    // compression step and summaryModel.streaming = false.
  }

  console.log("\nStream complete.");
}

main().catch((err) => console.error("Repro error:", err));

Error Message and Stack Trace (if applicable)

There is no thrown exception.
The bug is incorrect behavior: unwanted streamed assistant output.

Before: (screenshot)

After I fixed it: (screenshot)

Description

What I am doing

Using the official JS summarizationMiddleware with an agent that streams events via stream() with streamMode: ["messages"].

What I expect

Internal summarization LLM calls should behave like LLM calls inside tools (see the sketch after this list):

  • completely isolated
  • not streamed
  • invisible to the user
  • not producing assistant messages
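
For contrast, here is a minimal sketch of the tool-side behavior I am comparing against. The tool name, prompt, and schema are made up for illustration, and summaryModel is the non-streaming model from the repro above.

import { tool } from "@langchain/core/tools";
import { z } from "zod";

// A tool whose handler makes its own internal LLM call. This is the isolation
// I expect from the middleware: the call happens inside the tool and is not
// surfaced as an assistant message in the agent's "messages" stream.
const compressNotes = tool(
  async ({ notes }: { notes: string }) => {
    const result = await summaryModel.invoke(
      `Compress the following notes into one short paragraph:\n\n${notes}`
    );
    return typeof result.content === "string"
      ? result.content
      : JSON.stringify(result.content);
  },
  {
    name: "compress_notes",
    description: "Compress long notes into a short paragraph.",
    schema: z.object({ notes: z.string() }),
  }
);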

What happens instead

The internal line:

await model.invoke(formattedPrompt);

inside the middleware is always streamed to the UI as if it were a normal assistant response.

This causes:

  • an extra assistant message appearing in the UI
  • summary tokens showing up before the real response
  • no way to hide the summary-model call
  • no way to suppress it with custom filtering (see the sketch after this list)
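
For reference, this is the kind of stream-side filtering I mean. It is a sketch only: the metadata fields langgraph_node and tags, and the tag value being checked, are assumptions about the messages-mode payload, and it does not give a reliable way to suppress the summary chunks.

// Using the same stream as in the repro's main() above.
for await (const [mode, chunk] of stream) {
  if (mode !== "messages") continue;
  const [msg, metadata] = chunk as [any, Record<string, any>];
  // Attempted filter: skip chunks that look like they come from the
  // summarization step (the tag name checked here is a guess).
  const tags: string[] = metadata?.tags ?? [];
  if (tags.includes("summarization")) continue;
  console.log(metadata?.langgraph_node, String(msg.content).slice(0, 80));
}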

Key insight

Even if I remove:

return { messages: [...] }

from beforeModel, the streamed events continue because:

The leak does NOT come from the middleware return value.
It comes directly from the internal model.invoke() call.

LangGraph intercepts all LLM invocations inside the main execution context, which is where middleware runs.

Tools run in isolated contexts — middleware does not.
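
To confirm this, a stripped-down custom middleware reproduces the leak on its own. This is a sketch only: it assumes the createMiddleware helper and the beforeModel hook shape from the v1 custom-middleware API, and the prompt is made up.

import { createMiddleware } from "langchain";

// Minimal leak repro: one internal LLM call, no state update returned.
// The tokens from this invoke() still show up in the "messages" stream.
const leakReproMiddleware = createMiddleware({
  name: "LeakRepro",
  beforeModel: async () => {
    // Internal housekeeping call, conceptually the same as the summary step.
    await summaryModel.invoke("Reply with the single word: internal");
    return undefined; // nothing is added to the conversation state
  },
});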

Why this is a problem

Summarization is supposed to be an internal housekeeping step.
Users should never see:

  • a model-start event
  • streamed tokens
  • a model-end event

from that step.

Currently, JS summarization middleware behaves like a visible extra assistant turn.

Workaround

Only one thing works:

await fetch("https://api.openai.com/v1/chat/completions", ...)

This bypasses LangChain entirely and produces zero streamed events.

But this breaks consistency and prevents using LCEL or BaseChatModel.
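
For completeness, a sketch of that workaround (the model name and prompts are placeholders, and it assumes OPENAI_API_KEY is set in the environment):

// Direct HTTP call to the OpenAI Chat Completions API. No LangChain
// callbacks are involved, so nothing reaches the agent's "messages" stream.
async function summarizeOutOfBand(conversationText: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      temperature: 0,
      messages: [
        { role: "system", content: "Summarize the conversation below." },
        { role: "user", content: conversationText },
      ],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content as string;
}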

System Info

Platform: Linux
Package Manager: npm (10.9.2)

  • langchain-js: 1.0.4
  • langgraph-js: 1.0.1
  • Node: 22.17.1
