Writing Effective Tools for Agents: Complete MCP Development Guide

Agents are only as effective as the tools we give them. This guide shares how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself.

The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?

In this guide, we describe our most effective techniques for improving performance in a variety of agentic AI systems.

What is a Tool?

In computing, deterministic systems produce the same output every time given identical inputs, while non-deterministic systems—like agents—can generate varied responses even with the same starting conditions.

When we traditionally write software, we’re establishing a contract between deterministic systems. For instance, a function call like getWeather("NYC") will always fetch the weather in New York City in the exact same manner every time it is called.

Tools are a new kind of software which reflects a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?,” an agent might call the weather tool, answer from general knowledge, or even ask a clarifying question about location first. Occasionally, an agent might hallucinate or even fail to grasp how to use a tool.

This means fundamentally rethinking our approach when writing software for agents: instead of writing tools and MCP servers the way we’d write functions and APIs for other developers or systems, we need to design them for agents.

How to Write Tools

Building a Prototype

It can be difficult to anticipate which tools agents will find ergonomic and which tools they won’t without getting hands-on yourself. Start by standing up a quick prototype of your tools.

If you’re using Claude Code to write your tools, it helps to give Claude documentation for any software libraries, APIs, or SDKs (including potentially the MCP SDK) your tools will rely on. LLM-friendly documentation can commonly be found in flat llms.txt files on official documentation sites.

Wrapping your tools in a local MCP server will allow you to connect and test your tools in Claude Code or the Claude Desktop app.
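
As a minimal sketch (assuming the official MCP Python SDK and a hypothetical get_weather tool), a local server might look like this:

weather_server.py
# A minimal local MCP server sketch using the official Python SDK's FastMCP helper.
# The weather lookup itself is a hypothetical placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")  # server name shown to MCP clients

@mcp.tool()
def get_weather(city: str) -> str:
    """Return a short weather summary for the given city."""
    # Replace with a real weather API call in your implementation.
    return f"Sunny and 72°F in {city}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, which local MCP clients can connect to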

To connect your local MCP server to Claude Code, run claude mcp add <name> <command> [args...].

Tools can also be passed directly into Anthropic API calls for programmatic testing.
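
For example, here is a sketch of a direct API call that passes a hypothetical get_weather tool definition via the Anthropic Python SDK (the model name is illustrative):

call_with_tools.py
# Sketch: passing a tool definition directly into an Anthropic API call.
# Assumes the `anthropic` SDK is installed and an API key is set in the environment.
import anthropic

client = anthropic.Anthropic()

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city. Use when the user asks about weather conditions.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'New York City'"},
        },
        "required": ["city"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    tools=[weather_tool],
    messages=[{"role": "user", "content": "Should I bring an umbrella in NYC today?"}],
)
print(response.stop_reason)  # "tool_use" if the model chose to call the tool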

Test the tools yourself to identify any rough edges. Collect feedback from your users to build an intuition around the use-cases and prompts you expect your tools to enable.

Running an Evaluation

Next, you need to measure how well Claude uses your tools by running an evaluation. Start by generating lots of evaluation tasks, grounded in real world uses. We recommend collaborating with an agent to help analyze your results and determine how to improve your tools.

See this process end-to-end in our tool evaluation cookbook.

Figure: Claude Code evaluation workflow. Building an evaluation system with Claude Code allows systematic measurement and optimization of tool performance.

Tool Performance Case Studies

Through systematic evaluation-driven development, we achieved significant performance improvements in internal tool testing:

Figure: Slack tools performance results. Held-out test set performance comparison of human-written vs. Claude-optimized Slack MCP tools.

Figure: Asana tools performance results. Held-out test set performance comparison for Asana MCP tools, demonstrating significant AI optimization effects.

Generating Evaluation Tasks

With your early prototype, Claude Code can quickly explore your tools and create dozens of prompt and response pairs. Prompts should be inspired by real-world uses and be based on realistic data sources and services. We recommend you avoid overly simplistic or superficial “sandbox” environments that don’t stress-test your tools with sufficient complexity.

Strong evaluation task examples:

  • Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room.
  • Customer ID 9182 reported that they were charged three times for a single purchase attempt. Find all relevant log entries and determine if any other customers were affected by the same issue.
  • Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they’re leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.

Weaker evaluation task examples:

  • Schedule a meeting with [email protected] next week.
  • Search the payment logs for purchase_complete and customer_id=9182.
  • Find the cancellation request by Customer ID 45892.

Each evaluation prompt should be paired with a verifiable response or outcome. Your verifier can be as simple as an exact string comparison between ground truth and sampled responses, or as advanced as enlisting Claude to judge the response.
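
As a sketch, a verifier might try a cheap exact-match check first and fall back to a Claude judge (the model name and judge prompt here are illustrative):

verify_response.py
# Sketch of a task verifier: normalized exact match first, LLM-judged fallback second.
import anthropic

client = anthropic.Anthropic()

def verify(ground_truth: str, sampled: str, use_judge: bool = True) -> bool:
    # Cheapest check: normalized exact match against the ground-truth answer.
    if sampled.strip().lower() == ground_truth.strip().lower():
        return True
    if not use_judge:
        return False
    # Fallback: ask Claude whether the sampled response conveys the same answer.
    judgment = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Ground truth: {ground_truth}\nResponse: {sampled}\n"
                "Does the response convey the same answer as the ground truth? Reply YES or NO."
            ),
        }],
    )
    return judgment.content[0].text.strip().upper().startswith("YES")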

Running the Evaluation

We recommend running your evaluation programmatically with direct LLM API calls. Use simple agentic loops (while-loops wrapping alternating LLM API and tool calls): one loop for each evaluation task.
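
A minimal sketch of one such loop, assuming the Anthropic Python SDK and a hypothetical execute_tool dispatcher of your own:

eval_loop.py
# Sketch of a simple agentic loop for a single evaluation task: alternate Anthropic
# API calls and local tool execution until the model stops requesting tools.
import anthropic

client = anthropic.Anthropic()

def execute_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher that routes a tool call to your local implementation."""
    raise NotImplementedError

def run_task(prompt: str, tools: list, system: str = "Complete the user's task, using tools when needed.") -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # illustrative model name
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No more tool calls: return the final text answer for verification.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back to the model.
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})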

In your evaluation agents’ system prompts, we recommend instructing agents to output not just structured response blocks (for verification), but also reasoning and feedback blocks. Instructing agents to output these before tool call and response blocks may increase LLMs’ effective intelligence by triggering chain-of-thought (CoT) behaviors.
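
As an illustrative sketch, such a system prompt might request output along these lines (the exact wording is an assumption, not a prescribed format):

eval_system_prompt.py
# Illustrative system-prompt snippet asking the evaluation agent to emit reasoning
# and feedback blocks before tool calls and before its final, verifiable response.
EVAL_SYSTEM_PROMPT = """
Before each tool call and before your final answer, write:
<reasoning>why you are taking this step</reasoning>
<feedback>any friction you hit with the tools (confusing parameters, missing data, etc.)</feedback>
Then give your final answer inside <response>...</response> so it can be verified automatically.
"""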

Analyzing Results

Agents are your helpful partners in spotting issues and providing feedback on everything from contradictory tool descriptions to inefficient tool implementations and confusing tool schemas. However, keep in mind that what agents omit in their feedback and responses can often be more important than what they include.

Observe where your agents get stumped or confused. Read through your evaluation agents’ reasoning and feedback (or CoT) to identify rough edges. Review the raw transcripts (including tool calls and tool responses) to catch any behavior not explicitly described in the agent’s CoT.

Collaborating with Agents

You can even let agents analyze your results and improve your tools for you. Simply concatenate the transcripts from your evaluation agents and paste them into Claude Code. Claude is an expert at analyzing transcripts and refactoring lots of tools all at once.

In fact, most of the advice in this post came from repeatedly optimizing our internal tool implementations with Claude Code. We relied on held-out test sets to ensure we did not overfit to our “training” evaluations.

Principles for Writing Effective Tools

Choosing the Right Tools for Agents

More tools don’t always lead to better outcomes. A common error we’ve observed is tools that merely wrap existing software functionality or API endpoints—whether or not the tools are appropriate for agents.

LLM agents have limited “context” (that is, there are limits to how much information they can process at once), whereas computer memory is cheap and abundant. Consider the task of searching for a contact in an address book. Traditional software programs can efficiently store and process a list of contacts one at a time, checking each one before moving on.

However, if an LLM agent uses a tool that returns ALL contacts and then has to read through each one token-by-token, it wastes its limited context on irrelevant information. The better and more natural approach is to skip straight to the relevant entry, for example by searching for the contact by name.

We recommend building a few thoughtful tools that target specific high-impact workflows and match your evaluation tasks, then scaling up from there.

Tools can consolidate functionality, handling potentially multiple discrete operations (or API calls) under the hood. For example, tools can enrich tool responses with related metadata or handle frequently chained, multi-step tasks in a single tool call.

Better tool design examples:

  • Instead of implementing list_users, list_events, and create_event tools, consider implementing a schedule_event tool which finds availability and schedules an event.
  • Instead of implementing a read_logs tool, consider implementing a search_logs tool which only returns relevant log lines and some surrounding context.
  • Instead of implementing get_customer_by_id, list_transactions, and list_notes tools, implement a get_customer_context tool which compiles all of a customer’s recent and relevant information at once.
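
As a sketch of the first pattern above (calendar_client and its methods are hypothetical stand-ins for your own API client), a consolidated tool might look like:

schedule_event.py
# Sketch of a consolidated tool that replaces chained list_users / list_events /
# create_event calls with a single operation. calendar_client is hypothetical.
def schedule_event(attendee_names: list[str], topic: str, duration_minutes: int = 30) -> str:
    """Find a shared free slot for the attendees and book the meeting in one call."""
    attendees = [calendar_client.find_user(name) for name in attendee_names]      # hypothetical
    slot = calendar_client.find_shared_availability(attendees, duration_minutes)  # hypothetical
    event = calendar_client.create_event(slot, attendees, topic)                  # hypothetical
    # Return a compact, human-readable confirmation instead of raw API payloads.
    return f"Booked '{topic}' at {slot.start} with {', '.join(attendee_names)} (event {event.id})."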

Namespacing Your Tools

Your AI agents will potentially gain access to dozens of MCP servers and hundreds of different tools, including those written by other developers. When tools overlap in function or have a vague purpose, agents can get confused about which ones to use.

Namespacing (grouping related tools under common prefixes) can help delineate boundaries between lots of tools; MCP clients sometimes do this by default. For example, namespacing tools by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search), can help agents select the right tools at the right time.

Returning Meaningful Context from Your Tools

Tool implementations should take care to return only high-signal information back to agents. They should prioritize contextual relevance over flexibility, and eschew low-level technical identifiers (for example: uuid, 256px_image_url, mime_type). Fields like name, image_url, and file_type are much more likely to directly inform agents’ downstream actions and responses.

Agents also tend to grapple with natural language names, terms, or identifiers significantly more successfully than they do with cryptic identifiers. We’ve found that merely resolving arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language significantly improves Claude’s precision in retrieval tasks.

In some instances, agents may require the flexibility to work with both natural language and technical identifier outputs, if only to trigger downstream tool calls (for example, search_user(name='jane') followed by send_message(id=12345)). You can enable both by exposing a simple response_format enum parameter in your tool, allowing your agent to control whether tools return "concise" or "detailed" responses.

response_format.py
from enum import Enum

class ResponseFormat(Enum):
    DETAILED = "detailed"
    CONCISE = "concise"
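
A sketch of how a tool might branch on this parameter (fetch_thread and the Slack-style field names are illustrative placeholders):

get_thread.py
# Sketch of a tool that uses ResponseFormat to control verbosity. fetch_thread
# and the field names are hypothetical Slack-style placeholders.
def get_thread(thread_ts: str, response_format: ResponseFormat = ResponseFormat.CONCISE) -> dict:
    thread = fetch_thread(thread_ts)  # hypothetical Slack client call
    if response_format == ResponseFormat.CONCISE:
        # Return only the content the agent needs to read and reason about.
        return {"messages": [m["text"] for m in thread["messages"]]}
    # DETAILED: also include the IDs needed to drive follow-up tool calls.
    return {
        "messages": thread["messages"],
        "thread_ts": thread_ts,
        "channel_id": thread["channel_id"],
        "user_ids": [m["user_id"] for m in thread["messages"]],
    }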

Response Format Examples

Figure: a detailed tool response example (206 tokens) compared with a concise tool response example (72 tokens).

Slack threads and thread replies are identified by a unique thread_ts, which is required to fetch thread replies. thread_ts and other IDs (channel_id, user_id) can be retrieved from a "detailed" tool response to enable further tool calls that require them. "concise" tool responses return only thread content and exclude IDs. In this example, "concise" tool responses use roughly one third (~⅓) of the tokens.

Optimizing Tool Responses for Token Efficiency

Optimizing the quality of context is important. But so is optimizing the quantity of context returned back to agents in tool responses.

We suggest implementing some combination of pagination, range selection, filtering, and/or truncation with sensible default parameter values for any tool responses that could use up lots of context. For Claude Code, we restrict tool responses to 25,000 tokens by default.

If you choose to truncate responses, be sure to steer agents with helpful instructions. You can directly encourage agents to pursue more token-efficient strategies, like making many small and targeted searches instead of a single, broad search for a knowledge retrieval task.
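
A sketch of truncation that also steers the agent (the 25,000-token cap mirrors the Claude Code default above; the characters-per-token estimate, the wording, and the page parameter are rough assumptions):

truncate_response.py
# Sketch: cap a tool response and append guidance so the agent knows how to
# narrow its next call. Uses a rough 4-characters-per-token heuristic.
MAX_TOKENS = 25_000

def truncate(response_text: str, max_tokens: int = MAX_TOKENS) -> str:
    approx_tokens = len(response_text) // 4
    if approx_tokens <= max_tokens:
        return response_text
    truncated = response_text[: max_tokens * 4]
    return (
        truncated
        + "\n\n[Response truncated. Try a more specific query, add a filter, or use "
        "the page parameter to retrieve further results.]"
    )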

Response Truncation and Error Handling Examples

Figure: a truncated tool response example, intelligently guiding the agent toward more precise searches.

Error response comparison:

Figure: an unhelpful error response example contrasted with a helpful error response example.

Tool truncation and error responses can steer agents towards more token-efficient tool-use behaviors (using filters or pagination) or give examples of correctly formatted tool inputs.
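
A sketch of the helpful variant (run_search is a hypothetical stand-in for your search backend):

search_logs.py
# Sketch: return an actionable error message instead of an opaque error code or
# raw traceback. run_search is a hypothetical backend call.
from datetime import date

def search_logs(query: str, start_date: str | None = None) -> str:
    parsed = None
    if start_date is not None:
        try:
            parsed = date.fromisoformat(start_date)
        except ValueError:
            # Helpful error: explain what was wrong and show a correctly formatted input.
            return (
                "Invalid start_date. Expected ISO 8601 format, e.g. '2025-01-31'. "
                "Omit start_date to search the most recent 24 hours."
            )
    return run_search(query, start_date=parsed)  # hypothetical search backend call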

Prompt-Engineering Your Tool Descriptions

We now come to one of the most effective methods for improving tools: prompt-engineering your tool descriptions and specs. Because these are loaded into your agents’ context, they can collectively steer agents toward effective tool-calling behaviors.

When writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team. Consider the context that you might implicitly bring—specialized query formats, definitions of niche terminology, relationships between underlying resources—and make it explicit.

With your evaluation in place, you can measure the impact of your prompt engineering with greater confidence. Even small refinements to tool descriptions can yield dramatic improvements.

Avoid ambiguity by clearly describing (and enforcing with strict data models) expected inputs and outputs. In particular, input parameters should be unambiguously named: instead of a parameter named user, try a parameter named user_id.
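
A sketch of enforcing this with a strict data model, using Pydantic (field names and descriptions are illustrative):

input_models.py
# Sketch: an unambiguous, strictly validated tool input schema using Pydantic.
from pydantic import BaseModel, Field

class SendMessageInput(BaseModel):
    user_id: str = Field(description="Recipient's unique user ID, e.g. 'U12345' (not an email address or display name)")
    message: str = Field(description="Plain-text message body to send")
    thread_ts: str | None = Field(default=None, description="Optional thread timestamp to reply within an existing thread")

    model_config = {"extra": "forbid"}  # reject unexpected or misspelled parameters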

Practical Guidance

Tool Development Workflow

  1. Quick Prototype → 2. User Testing → 3. Create Evaluation → 4. Run Tests → 5. Agent Analysis → 6. Iterate → 7. Repeat

Common Pitfalls to Avoid

Avoid these common mistakes:

  • Creating one tool per API endpoint
  • Returning too many low-level technical details
  • Using vague or overlapping tool names
  • Neglecting tool description quality
  • Not testing actual agent workflows

Performance Optimization Tips

  • Consolidate related functionality - Merge commonly chained tools into single tools
  • Smart defaults - Set reasonable default values for parameters
  • Error handling - Provide clear, actionable error messages
  • Response formats - Support both detailed and concise response modes
  • Context management - Optimize relevance and quantity of returned information

Summary

Building effective tools for agents requires re-orienting our software development practices from predictable, deterministic patterns to non-deterministic ones.

Through the iterative, evaluation-driven process we’ve described in this post, we’ve identified consistent patterns in what makes tools successful:

Effective tools are:

  • Intentionally and clearly defined
  • Judicious with agent context
  • Combinable in diverse workflows
  • Intuitive for solving real-world tasks

As agents become more capable, the tools they use will evolve alongside them. With a systematic, evaluation-driven approach to improving tools for agents, we can ensure that both tools and agents advance together.
