Intercept execution and control when the handler is called. Use these hooks for retries, caching, and transformation. You decide whether the handler is called zero times (short-circuit), once (normal flow), or multiple times (retry logic).
Available hooks:
wrap_model_call - Around each model call
wrap_tool_call - Around each tool call
Example:
Decorator

    from langchain.agents.middleware import wrap_model_call, ModelRequest, ModelResponse
    from typing import Callable


    @wrap_model_call
    def retry_model(
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelResponse:
        for attempt in range(3):
            try:
                return handler(request)
            except Exception as e:
                # Re-raise once the final attempt has failed
                if attempt == 2:
                    raise
                print(f"Retry {attempt + 1}/3 after error: {e}")

Class

    from langchain.agents.middleware import AgentMiddleware, ModelRequest, ModelResponse
    from typing import Callable


    class RetryMiddleware(AgentMiddleware):
        def __init__(self, max_retries: int = 3):
            super().__init__()
            self.max_retries = max_retries

        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelResponse:
            for attempt in range(self.max_retries):
                try:
                    return handler(request)
                except Exception as e:
                    # Re-raise once the final attempt has failed
                    if attempt == self.max_retries - 1:
                        raise
                    print(f"Retry {attempt + 1}/{self.max_retries} after error: {e}")
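The retry examples above show the handler being called multiple times. The short-circuit case (zero calls) follows the same shape. The sketch below is a framework-free illustration in plain Python: `Request`, `fake_handler`, and `make_caching_wrapper` are hypothetical stand-ins, not LangChain APIs. A real middleware would presumably apply the same idea inside `wrap_model_call`, using `ModelRequest` and `ModelResponse`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a model request (not ModelRequest)."""
    prompt: str


Handler = Callable[[Request], str]


def make_caching_wrapper() -> Callable[[Request, Handler], str]:
    """Build a wrap-style hook that skips the handler on a cache hit."""
    cache: Dict[str, str] = {}

    def cached_call(request: Request, handler: Handler) -> str:
        if request.prompt in cache:
            # Short-circuit: the handler is called zero times.
            return cache[request.prompt]
        # Normal flow: the handler is called exactly once.
        response = handler(request)
        cache[request.prompt] = response
        return response

    return cached_call


calls: List[str] = []


def fake_handler(request: Request) -> str:
    """Hypothetical handler that records each invocation."""
    calls.append(request.prompt)
    return f"answer to {request.prompt}"


cached_call = make_caching_wrapper()
first = cached_call(Request("hello"), fake_handler)
second = cached_call(Request("hello"), fake_handler)  # served from the cache
```

Keying the cache on the prompt alone is a simplification for this sketch. A realistic cache key would likely need to cover everything that influences the response, such as the message history, the model, and the tools.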
Class-based middleware is more powerful for complex middleware with multiple hooks or configuration. Use classes when you need to define both sync and async implementations for the same hook, or when you want to combine multiple hooks in a single middleware.
Middleware can extend the agent’s state with custom properties. This enables middleware to:
Track state across execution: Maintain counters, flags, or other values that persist throughout the agent’s execution lifecycle
Share data between hooks: Pass information from before_model to after_model or between different middleware instances
Implement cross-cutting concerns: Add functionality like rate limiting, usage tracking, user context, or audit logging without modifying the core agent logic
Make conditional decisions: Use accumulated state to determine whether to continue execution, jump to different nodes, or modify behavior dynamically
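The list above can be made concrete with a small, framework-free sketch. It assumes that hooks return partial updates that are merged into a shared state, which is only an approximation of how agent state behaves. `CounterState`, `MAX_CALLS`, and `run_agent_loop` are hypothetical names invented for this illustration, and the `before_model` and `after_model` functions here are plain functions rather than the LangChain hooks of the same name.

```python
from typing import Any, Dict, Optional, TypedDict


class CounterState(TypedDict, total=False):
    """Hypothetical state: the usual messages plus two custom properties."""
    messages: list
    model_call_count: int
    limit_reached: bool


MAX_CALLS = 2  # Hypothetical limit used for the conditional decision.


def before_model(state: CounterState) -> Optional[Dict[str, Any]]:
    """Use accumulated state to decide whether execution may continue."""
    if state.get("model_call_count", 0) >= MAX_CALLS:
        return {"limit_reached": True}
    return None


def after_model(state: CounterState) -> Dict[str, Any]:
    """Track state across execution by incrementing a counter."""
    return {"model_call_count": state.get("model_call_count", 0) + 1}


def run_agent_loop(state: CounterState, max_steps: int) -> CounterState:
    """Hypothetical driver that merges each hook's update into the state."""
    for _ in range(max_steps):
        update = before_model(state)
        if update is not None:
            state.update(update)
            break  # The limit was hit, so the model is not called again.
        # A real agent would call the model here.
        state.update(after_model(state))
    return state


final_state = run_agent_loop({"messages": []}, max_steps=5)
```

Here the counter written by one hook is read by the other, which is the "share data between hooks" pattern, and the early exit is the "make conditional decisions" pattern.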
Wrap-style hooks can also modify the request before passing it to the handler. The following example selects a model based on conversation length:
Decorator

    from langchain.agents.middleware import wrap_model_call, ModelRequest, ModelResponse
    from langchain.chat_models import init_chat_model
    from typing import Callable

    complex_model = init_chat_model("gpt-4.1")
    simple_model = init_chat_model("gpt-4.1-mini")


    @wrap_model_call
    def dynamic_model(
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelResponse:
        # Use a different model based on conversation length
        if len(request.messages) > 10:
            model = complex_model
        else:
            model = simple_model
        return handler(request.override(model=model))

Class

    from langchain.agents.middleware import AgentMiddleware, ModelRequest, ModelResponse
    from langchain.chat_models import init_chat_model
    from typing import Callable

    complex_model = init_chat_model("gpt-4.1")
    simple_model = init_chat_model("gpt-4.1-mini")


    class DynamicModelMiddleware(AgentMiddleware):
        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelResponse:
            # Use a different model based on conversation length
            if len(request.messages) > 10:
                model = complex_model
            else:
                model = simple_model
            return handler(request.override(model=model))
Select relevant tools at runtime to improve performance and accuracy. This section covers filtering pre-registered tools. For registering tools that are discovered at runtime (e.g., from MCP servers), see Runtime tool registration.
Benefits:
Shorter prompts - Reduce complexity by exposing only relevant tools
Better accuracy - Models choose correctly from fewer options
Permission control - Dynamically filter tools based on user access
Decorator

    from langchain.agents import create_agent
    from langchain.agents.middleware import wrap_model_call, ModelRequest, ModelResponse
    from typing import Callable

    # select_relevant_tools and all_tools are placeholders that are
    # assumed to be defined elsewhere in your application.


    @wrap_model_call
    def select_tools(
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelResponse:
        """Middleware to select relevant tools based on state/context."""
        # Select a small, relevant subset of tools based on state/context
        relevant_tools = select_relevant_tools(request.state, request.runtime)
        return handler(request.override(tools=relevant_tools))


    agent = create_agent(
        model="gpt-4.1",
        tools=all_tools,  # All available tools need to be registered upfront
        middleware=[select_tools],
    )

Class

    from langchain.agents import create_agent
    from langchain.agents.middleware import AgentMiddleware, ModelRequest, ModelResponse
    from typing import Callable

    # select_relevant_tools and all_tools are placeholders that are
    # assumed to be defined elsewhere in your application.


    class ToolSelectorMiddleware(AgentMiddleware):
        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelResponse:
            """Middleware to select relevant tools based on state/context."""
            # Select a small, relevant subset of tools based on state/context
            relevant_tools = select_relevant_tools(request.state, request.runtime)
            return handler(request.override(tools=relevant_tools))


    agent = create_agent(
        model="gpt-4.1",
        tools=all_tools,  # All available tools need to be registered upfront
        middleware=[ToolSelectorMiddleware()],
    )
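The examples above call `select_relevant_tools` without defining it. One possible shape for such a helper, covering the permission-control case, is sketched below in plain Python. Everything here is an assumption made for illustration: `ToolSpec`, `ROLE_RANK`, the `user_role` field, and the use of `SimpleNamespace` as a stand-in for the runtime object are hypothetical, and real tools would be actual tool objects rather than these records.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for a registered tool."""
    name: str
    required_role: str  # "viewer" or "admin" in this sketch


# Hypothetical role ordering: a higher rank can use lower-rank tools too.
ROLE_RANK = {"viewer": 0, "admin": 1}

all_tools = [
    ToolSpec("search_docs", "viewer"),
    ToolSpec("read_record", "viewer"),
    ToolSpec("delete_record", "admin"),
]


def select_relevant_tools(state: Dict[str, Any], runtime: Any) -> List[ToolSpec]:
    """Keep only the tools that the current user's role may call.

    `state` is accepted to match the call in the examples above but is
    not used in this sketch.
    """
    role = getattr(runtime.context, "user_role", "viewer")
    rank = ROLE_RANK.get(role, 0)  # Unknown roles get the lowest rank.
    return [tool for tool in all_tools if ROLE_RANK[tool.required_role] <= rank]


# Stand-ins for the runtime object passed to the middleware.
viewer_runtime = SimpleNamespace(context=SimpleNamespace(user_role="viewer"))
admin_runtime = SimpleNamespace(context=SimpleNamespace(user_role="admin"))

viewer_tools = [tool.name for tool in select_relevant_tools({}, viewer_runtime)]
admin_tools = [tool.name for tool in select_relevant_tools({}, admin_runtime)]
```

Filtering by role is only one strategy. The same function signature could equally rank tools by relevance to the latest message or by conversation stage.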
Modify system messages in middleware using the system_message field on ModelRequest. The system_message field contains a SystemMessage object (even if the agent was created with a string system_prompt).
Example: Adding context to system message
Example: Working with cache control (Anthropic)
When working with Anthropic models, you can use structured content blocks with cache control directives to cache large system prompts:
Decorator

    from langchain.agents.middleware import wrap_model_call, ModelRequest, ModelResponse
    from langchain.messages import SystemMessage
    from typing import Callable


    @wrap_model_call
    def add_cached_context(
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelResponse:
        # Always work with content blocks
        new_content = list(request.system_message.content_blocks) + [
            {
                "type": "text",
                "text": "Here is a large document to analyze:\n\n<document>...</document>",
                # Content up until this point is cached
                "cache_control": {"type": "ephemeral"},
            }
        ]
        new_system_message = SystemMessage(content=new_content)
        return handler(request.override(system_message=new_system_message))

Class

    from langchain.agents.middleware import AgentMiddleware, ModelRequest, ModelResponse
    from langchain.messages import SystemMessage
    from typing import Callable


    class CachedContextMiddleware(AgentMiddleware):
        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelResponse:
            # Always work with content blocks
            new_content = list(request.system_message.content_blocks) + [
                {
                    "type": "text",
                    "text": "Here is a large document to analyze:\n\n<document>...</document>",
                    "cache_control": {"type": "ephemeral"},  # This content will be cached
                }
            ]
            new_system_message = SystemMessage(content=new_content)
            return handler(request.override(system_message=new_system_message))
Notes:
ModelRequest.system_message is always a SystemMessage object, even if the agent was created with system_prompt="string"
Use SystemMessage.content_blocks to access content as a list of blocks, regardless of whether the original content was a string or list
When modifying system messages, use content_blocks and append new blocks to preserve existing structure
You can pass SystemMessage objects directly to create_agent’s system_prompt parameter for advanced use cases like cache control
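The advice in the notes to work with blocks and append to them can be illustrated without the framework. The sketch below is not the implementation of `SystemMessage.content_blocks`; it is a plain-Python approximation that assumes content is either a string or a list of strings and block dictionaries. `to_blocks` and `append_text_block` are hypothetical helpers named for this example.

```python
from typing import Any, Dict, List, Union

Block = Dict[str, Any]
Content = Union[str, List[Union[str, Block]]]


def to_blocks(content: Content) -> List[Block]:
    """Normalize string or list content into a list of block dicts."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    blocks: List[Block] = []
    for item in content:
        if isinstance(item, str):
            blocks.append({"type": "text", "text": item})
        else:
            blocks.append(dict(item))  # Copy so the input is not mutated.
    return blocks


def append_text_block(content: Content, text: str, cache: bool = False) -> List[Block]:
    """Return the existing blocks plus one new text block at the end."""
    block: Block = {"type": "text", "text": text}
    if cache:
        # Same directive shape as in the cache control example.
        block["cache_control"] = {"type": "ephemeral"}
    return to_blocks(content) + [block]


from_string = append_text_block("You are helpful.", "Extra context.", cache=True)
from_list = append_text_block([{"type": "text", "text": "A"}], "B")
```

Because both inputs are normalized first, the appending code does not need to branch on whether the original prompt was a string or a list, and existing blocks are carried over unchanged.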