Python • TypeScript • n8n • 📖 Docs • 💡 Examples
Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.
40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.
cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.
```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

cascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It's based on the research that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models.
Use cascadeflow for:
- Cost Optimization. Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.
- Cost Control and Transparency. Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.
- Low Latency & Speed Optimization. Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving 2-10x lower latency overall (use the preset `PRESET_ULTRA_FAST`).
- Multi-Provider Flexibility. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and Hugging Face with automatic provider detection and zero vendor lock-in. Optional LiteLLM integration adds 100+ providers, and the LangChain integration covers LCEL chains and tools.
- Edge & Local-Hosted AI Deployment. Get the best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed.
ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks (research paper).
cascadeflow uses speculative execution with quality validation:
- Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
- Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
- Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
- Learns patterns to optimize future cascading decisions and domain-specific routing
Zero configuration. Works with YOUR existing models (7 providers currently supported).
In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.
Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
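Conceptually, the whole cascade reduces to a small decision loop. Below is a minimal, self-contained Python sketch of that loop; `call_model` and `passes_quality` are illustrative stubs, not cascadeflow APIs, and the real validation checks are far richer.

```python
import asyncio

# Illustrative sketch of speculative cascading, not cascadeflow internals.
# call_model() and passes_quality() are stand-in stubs for a provider call
# and the quality thresholds (completeness, confidence, format) described above.

async def call_model(model: str, query: str) -> str:
    return f"[{model}] answer to: {query}"      # stub provider call

def passes_quality(query: str, response: str) -> bool:
    return len(response.split()) >= 5           # stub check; real checks are richer

async def cascade(query: str, models: list[str]) -> str:
    response = ""
    for model in models:                        # cheapest model first
        response = await call_model(model, query)   # speculative execution
        if passes_quality(query, response):     # validate before accepting
            return response                     # most queries stop at the cheap model
    return response                             # otherwise keep the flagship answer

print(asyncio.run(cascade("What's 2+2?", ["gpt-4o-mini", "gpt-4o"])))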
```
┌───────────────────────────────────────────────────────────────┐
│                       cascadeflow Stack                       │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                      Cascade Agent                      │  │
│  │                                                         │  │
│  │  Orchestrates the entire cascade execution              │  │
│  │  • Query routing & model selection                      │  │
│  │  • Drafter -> Verifier coordination                     │  │
│  │  • Cost tracking & telemetry                            │  │
│  └─────────────────────────────────────────────────────────┘  │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                     Domain Pipeline                     │  │
│  │                                                         │  │
│  │  Automatic domain classification                        │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)        │  │
│  │  • Optional ML semantic classification                  │  │
│  │  • Domain-optimized pipelines & model selection         │  │
│  └─────────────────────────────────────────────────────────┘  │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                Quality Validation Engine                │  │
│  │                                                         │  │
│  │  Multi-dimensional quality checks                       │  │
│  │  • Length validation (too short/verbose)                │  │
│  │  • Confidence scoring (logprobs analysis)               │  │
│  │  • Format validation (JSON, structured output)          │  │
│  │  • Semantic alignment (intent matching)                 │  │
│  └─────────────────────────────────────────────────────────┘  │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │             Cascading Engine (<2ms overhead)            │  │
│  │                                                         │  │
│  │  Smart model escalation strategy                        │  │
│  │  • Try cheap models first (speculative execution)       │  │
│  │  • Validate quality instantly                           │  │
│  │  • Escalate only when needed                            │  │
│  │  • Automatic retry & fallback                           │  │
│  └─────────────────────────────────────────────────────────┘  │
│                             ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │               Provider Abstraction Layer                │  │
│  │                                                         │  │
│  │  Unified interface for 7+ providers                     │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                   │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM              │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
```bash
pip install cascadeflow[all]
```

```python
from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try the cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])

# Run a query - automatically routes to the optimal model
result = await agent.run("What's the capital of France?")
print(f"Answer: {result.content}")
print(f"Model used: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
```

💡 Optional: Use ML-based Semantic Quality Validation
For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.
Step 1: Install the optional ML package:
```bash
pip install cascadeflow[ml]  # Adds semantic similarity via FastEmbed (~80MB model)
```

Step 2: Use semantic quality validation:
```python
from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize the semantic checker (downloads the model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7,    # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."
result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")
```

What you get:
- 🎯 Semantic similarity scoring (query ↔ response alignment)
- 🛡️ Optional toxicity detection
- 📥 Automatic model download and caching
- ⚡ Fast inference (~100ms per check)
Full example: See semantic_quality_domain_detection.py
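Beyond standalone checks, the checker can serve as a post-hoc gate on cascade output. Here is a minimal sketch combining the two documented pieces above; whether the checker can be injected into the agent directly depends on your version, so this sketch simply runs the check after `agent.run`, and the low-alignment handling is illustrative.

```python
from cascadeflow import CascadeAgent, ModelConfig
from cascadeflow.quality.semantic import SemanticQualityChecker

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])
checker = SemanticQualityChecker(similarity_threshold=0.5)

query = "Explain Python decorators"
result = await agent.run(query)

# Post-hoc semantic gate on the cascade's answer
check = checker.validate(query, result.content)
if not check.passed:
    print(f"Low alignment ({check.similarity:.0%}) - consider re-running or escalating")
```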
⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without it - GPT-5 is only called when needed (typically 20-30% of requests).
📚 Learn more: Python Documentation | Quickstart Guide | Providers Guide
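The same three lines extend across providers. Below is a hedged sketch of a cross-provider cascade using the documented `ModelConfig` fields; the model names and per-token costs are placeholders, so check your provider's current pricing before copying them.

```python
from cascadeflow import CascadeAgent, ModelConfig

# Cross-provider cascade: fast/cheap drafter on Groq, flagship verifier on Anthropic.
# Model names and costs below are illustrative placeholders.
agent = CascadeAgent(models=[
    ModelConfig(name="llama-3.1-8b-instant", provider="groq", cost=0.00005),
    ModelConfig(name="claude-sonnet-4-5", provider="anthropic", cost=0.003),
])

result = await agent.run("Summarize the tradeoffs of REST vs gRPC")
print(result.model_used, result.total_cost)
```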
```bash
npm install @cascadeflow/core
```

```typescript
import { CascadeAgent } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);
```

💡 Optional: ML-based Semantic Quality Validation
For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.
Step 1: Install the optional ML packages:
```bash
npm install @cascadeflow/ml @xenova/transformers
```

Step 2: Enable semantic validation in your cascade:
```typescript
import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,              // Traditional confidence threshold
    requireMinimumTokens: 5,      // Minimum response length
    useSemanticValidation: true,  // Enable ML validation
    semanticThreshold: 0.5,       // 50% minimum similarity
  },
});

// Responses are now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');
```

Step 3: Or use semantic validation directly:
```typescript
import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();
if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );
  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}
```

What you get:
- 🎯 Query-response semantic alignment detection
- 🚫 Off-topic response filtering
- 📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
- ⚡ Fast CPU inference (~50-100ms with caching)
- 🔄 Request-scoped caching (50% latency reduction)
- 🌐 Works in Node.js, Browser, and Edge Functions
Example: semantic-quality.ts
📚 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples | Browser/Edge Guide
Migrate in 5 minutes from a direct provider integration to full cost savings, control, and transparency.

Before - Cost: $0.000113, Latency: 850ms

```python
# Using the expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)
```

After - Cost: $0.000007, Latency: 234ms

```python
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])
result = await agent.run("What's 2+2?")
```

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
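The savings math is straightforward and scales linearly with volume; a quick check of the numbers above:

```python
# Quick check of the per-query numbers above (illustrative arithmetic only)
before, after = 0.000113, 0.000007        # direct call vs cascaded call
saved = before - after                    # $0.000106 per query
print(f"{saved / before:.0%} reduction")  # -> 94%
print(f"${saved * 1_000_000:.2f} saved per 1M identical queries")  # -> $106.00
```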
📚 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization
Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!
- Open n8n
- Go to Settings → Community Nodes
- Search for `@cascadeflow/n8n-nodes-cascadeflow`
- Click Install
CascadeFlow is a Language Model sub-node that connects two AI Chat Model nodes (drafter + verifier) and intelligently cascades between them:
Setup:
- Add two AI Chat Model nodes (cheap drafter + powerful verifier)
- Add CascadeFlow node and connect both models
- Connect CascadeFlow to Basic LLM Chain or Chain nodes
- Check Logs tab to see cascade decisions in real-time!
Result: 40-85% cost savings in your n8n workflows!
Features:
- ✅ Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
- ✅ Mix providers (e.g., Ollama drafter + GPT-4o verifier)
- ✅ Real-time flow visualization in the Logs tab
- ✅ Detailed metrics: confidence scores, latency, cost savings
📚 Learn more: n8n Integration Guide | n8n Documentation
Use cascadeflow with LangChain for intelligent model cascading with full LCEL, streaming, and tools support!
```bash
npm install @cascadeflow/langchain @langchain/core @langchain/openai
```

```bash
pip install cascadeflow[langchain]
```
TypeScript - Drop-in replacement for any LangChain chat model

```typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ modelName: 'gpt-5-mini' }),             // $0.25/$2 per 1M tokens
  verifier: new ChatAnthropic({ modelName: 'claude-sonnet-4-5' }),  // $3/$15 per 1M tokens
  qualityThreshold: 0.8,  // ~80% of queries use the drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());
```
Python - Drop-in replacement for any LangChain chat model

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="gpt-4o-mini"),            # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),  # $3/$15 per 1M tokens
    quality_threshold=0.8,  # ~80% of queries use the drafter
)

# Use like any LangChain chat model
result = await cascade.ainvoke("Explain quantum computing")

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

# Or with LCEL chains
chain = prompt | cascade | StrOutputParser()
```

💡 Optional: Cost Tracking with Callbacks (Python)
Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:
```python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs, similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

print(f"Total cost: ${cb.total_cost:.6f}")
print(f"Drafter cost: ${cb.drafter_cost:.6f}")
print(f"Verifier cost: ${cb.verifier_cost:.6f}")
print(f"Total tokens: {cb.total_tokens}")
print(f"Successful requests: {cb.successful_requests}")
```

Features:
- 🎯 Compatible with the `get_openai_callback()` pattern
- 💰 Separate drafter/verifier cost tracking
- 📊 Token usage tracking (including streaming)
- 🔍 Works with LangSmith tracing
- ⚡ Near-zero overhead
Full example: See langchain_cost_tracking.py
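Because the callback is a context manager, it can also aggregate over a batch of calls, assuming it accumulates across invocations like the `get_openai_callback()` pattern it mirrors. A short hedged sketch, reusing the `cascade` object from above:

```python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Aggregate costs across a batch of queries in one callback scope
# (accumulation across calls is assumed, matching get_openai_callback()).
queries = ["What is Python?", "What is TypeScript?", "What is Rust?"]
with get_cascade_callback() as cb:
    for q in queries:
        await cascade.ainvoke(q)

print(f"Requests: {cb.successful_requests}, total: ${cb.total_cost:.6f}")
print(f"Drafter share of spend: {cb.drafter_cost / max(cb.total_cost, 1e-12):.0%}")
```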
💡 Optional: Model Discovery & Analysis Helpers (TypeScript)
For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:
```typescript
import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair,
} from '@cascadeflow/langchain';

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'gpt-4o-mini' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find the best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,             // Only pairs with ≥50% savings
  requireSameProvider: false, // Allow cross-provider cascades
});

// Validate a specific pair
const validation = validateCascadePair(drafter, verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);
```

What you get:
- 🔍 Automatic discovery of optimal cascade pairs from YOUR models
- 💰 Estimated cost savings calculations
- ⚠️ Validation warnings for misconfigured pairs
- 📊 Model tier analysis (drafter vs verifier candidates)
Full example: See model-discovery.ts
Features:
- ✅ Full LCEL support (pipes, sequences, batch; see the batching sketch below)
- ✅ Streaming with pre-routing
- ✅ Tool calling and structured output
- ✅ LangSmith cost tracking metadata
- ✅ Cost tracking callbacks (Python)
- ✅ Works with all LangChain features
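Since the wrapper behaves like a standard LangChain chat model, LangChain's stock Runnable batching should pass straight through it. A brief hedged sketch reusing the Python `cascade` from above; `abatch` is LangChain's async batch method on any Runnable, not a cascadeflow addition, and it is assumed here to route each query through the cascade independently.

```python
# LCEL batching through the cascade (abatch comes from LangChain's
# Runnable interface; each query is assumed to cascade independently).
questions = [
    "Define idempotency in one sentence.",
    "What does LCEL stand for?",
]
results = await cascade.abatch(questions)
for message in results:
    print(message.content)
```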
📦 Learn more: LangChain Integration Guide | TypeScript Package | Python Examples
Basic Examples - Get started quickly
| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup with OpenAI models | View |
| Preset Usage | Use built-in presets for quick setup | View |
| Multi-Provider | Mix multiple AI providers in one cascade | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | View |
| Tool Execution | Function calling and tool usage | View |
| Streaming Text | Stream responses from cascade agents | View |
| Cost Tracking | Track and analyze costs across queries | View |
Advanced Examples - Production & customization
| Example | Description | Link |
|---|---|---|
| Production Patterns | Best practices for production deployments | View |
| FastAPI Integration | Integrate cascades with FastAPI | View |
| Streaming Tools | Stream tool calls and responses | View |
| Batch Processing | Process multiple queries efficiently | View |
| Multi-Step Cascade | Build complex multi-step cascades | View |
| Edge Device | Run cascades on edge devices with local models | View |
| vLLM Example | Use vLLM for local model deployment | View |
| Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View |
| Multi-Instance vLLM | Run draft/verifier on separate vLLM instances | View |
| Custom Cascade | Build custom cascade strategies | View |
| Custom Validation | Implement custom quality validators | View |
| User Budget Tracking | Per-user budget enforcement and tracking | View |
| User Profile Usage | User-specific routing and configurations | View |
| Rate Limiting | Implement rate limiting for cascades | View |
| Guardrails | Add safety and content guardrails | View |
| Cost Forecasting | Forecast costs and detect anomalies | View |
| Semantic Quality Detection | ML-based domain and quality detection | View |
| Profile Database Integration | Integrate user profiles with databases | View |
| LangChain Basic | Simple LangChain cascade setup | View |
| LangChain Streaming | Stream responses with LangChain | View |
| LangChain Model Discovery | Discover and analyze LangChain models | View |
| LangChain LangSmith | Cost tracking with LangSmith integration | View |
| LangChain Cost Tracking | Track costs with callback handlers | View |
| LangChain Benchmark | Comprehensive cascade benchmarking | View |
Basic Examples - Get started quickly
| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup (Node.js) | View |
| Tool Calling | Function calling with tools (Node.js) | View |
| Multi-Provider | Mix providers in TypeScript (Node.js) | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | View |
| Cost Tracking | Track and analyze costs across queries | View |
| Semantic Quality | ML-based semantic validation with embeddings | View |
| Streaming | Stream responses in TypeScript | View |
Advanced Examples - Production, edge & LangChain
| Example | Description | Link |
|---|---|---|
| Production Patterns | Production best practices (Node.js) | View |
| Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View |
| Multi-Instance vLLM | Run draft/verifier on separate vLLM instances | View |
| Browser/Edge | Vercel Edge runtime example | View |
| LangChain Basic | Simple LangChain cascade setup | View |
| LangChain Cross-Provider | Haiku → GPT-5 with PreRouter | View |
| LangChain LangSmith | Cost tracking with LangSmith | View |
| LangChain Cost Tracking | Compare cascadeflow vs LangSmith cost tracking | View |
📚 View All Python Examples → | View All TypeScript Examples →
Getting Started - Core concepts and basics
| Guide | Description | Link |
|---|---|---|
| Quickstart | Get started with cascadeflow in 5 minutes | Read |
| Providers Guide | Configure and use different AI providers | Read |
| Presets Guide | Using and creating custom presets | Read |
| Streaming Guide | Stream responses from cascade agents | Read |
| Tools Guide | Function calling and tool usage | Read |
| Cost Tracking | Track and analyze API costs | Read |
Advanced Topics - Production, customization & integrations
| Guide | Description | Link |
|---|---|---|
| Production Guide | Best practices for production deployments | Read |
| Performance Guide | Optimize cascade performance and latency | Read |
| Custom Cascade | Build custom cascade strategies | Read |
| Custom Validation | Implement custom quality validators | Read |
| Edge Device | Deploy cascades on edge devices | Read |
| Browser Cascading | Run cascades in the browser/edge | Read |
| FastAPI Integration | Integrate with FastAPI applications | Read |
| LangChain Integration | Use cascadeflow with LangChain | Read |
| n8n Integration | Use cascadeflow in n8n workflows | Read |
📚 View All Documentation →
| Feature | Benefit |
|---|---|
| 🎯 Speculative Cascading | Tries cheap models first, escalates intelligently |
| 💰 40-85% Cost Savings | Research-backed, proven in production |
| ⚡ 2-10x Faster | Small models respond in <50ms vs 500-2000ms |
| ⚡ Low Latency | Sub-2ms framework overhead, negligible performance impact |
| 🔌 Mix Any Providers | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) + LangChain integration |
| 👤 User Profile System | Per-user budgets, tier-aware routing, enforcement callbacks |
| ✅ Quality Validation | Automatic checks + semantic similarity (optional ML, ~80MB, CPU) |
| 🎨 Cascading Policies | Domain-specific pipelines, multi-step validation strategies |
| 🧠 Domain Understanding | Auto-detects code/medical/legal/math/structured data, routes to specialists |
| 🤖 Drafter/Validator Pattern | 20-60% savings for agent/tool systems |
| 🔧 Tool Calling Support | Universal format, works across all providers |
| 📊 Cost Tracking | Built-in analytics + OpenTelemetry export (vendor-neutral) |
| 🚀 3-Line Integration | Zero architecture changes needed |
| 🏭 Production Ready | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection |
MIT © (see the LICENSE file).
Free for commercial use. Attribution appreciated but not required.
We ❤️ contributions!
📖 Contributing Guide - Python & TypeScript development setup
- Cascade Profiler - Analyzes your AI API logs to calculate cost savings potential and generate optimized cascadeflow configurations automatically
- User Tier Management - Cost controls and limits per user tier with advanced routing
- Semantic Quality Validators - Optional lightweight local quality scoring (200MB CPU model, no external API calls)
- Code Complexity Detection - Dynamic cascading based on task complexity analysis
- Domain-Aware Cascading - Multi-stage pipelines tailored to specific domains
- Benchmark Reports - Automated performance and cost benchmarking
- 💬 GitHub Discussions - Searchable Q&A
- 🐛 GitHub Issues - Bug reports & feature requests
- 📧 Email Support - Direct support
If you use cascadeflow in your research or project, please cite:
```bibtex
@software{cascadeflow2025,
  author = {Lemony Inc., Sascha Buehrle and Contributors},
  title = {cascadeflow: Smart AI model cascading for cost optimization},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}
```

Ready to cut your AI costs by 40-85%?
```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions
Built with ❤️ by Lemony Inc. and the cascadeflow Community
One cascade. Hundreds of specialists.
New York | Zurich
⭐ Star us on GitHub if cascadeflow helps you save money!