For two years, "tool use" has meant "JSON-mode plus crossed fingers." That era is over.
The current generation of frontier models — Claude 4.7, GPT-5.1, Gemini 2.5 — all ship native tool-use APIs that handle parallelism, error recovery, and partial-result composition. The gap between "demo-able" and "production-able" has collapsed. If you tried tool use a year ago and walked away, the cost of trying again is roughly an afternoon, and the floor of what works is dramatically higher.
But there are still ways to ship a sharp edge into your own foot. Here are the four that come up the most in our reviews.
Tools should be designed for the model, not your codebase
The first generation of tool wrappers exposed internal services more or less verbatim. Take whatever your engineers were already building — `getUserById`, `listInvoicesByCustomer`, `bulkUpdateMetadata` — and slap a JSON schema on it. The model figured it out, mostly.
Mostly is the operative word. Internal APIs are designed for humans who know the domain. Models do not know your domain. They know the world.
The cleanest shift is to treat your tool surface as a small, model-facing API. Concretely:
- Collapse pagination behind the tool. The model should not be tracking cursors.
- Provide tools at the task granularity, not the resource granularity. `find_overdue_invoices_for_customer` beats `list_invoices(filter=...)`.
- Return small, structured payloads. 200 tokens of structured data beats 2,000 tokens of nested JSON every time.
- Name things in domain language. `cancel_subscription` is unambiguous. `mutateMembershipState` is not.
This is the closest thing to a single trick: rewrite your tools as if a smart intern were the consumer. Latency drops. Token cost drops. Reliability climbs.
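For concreteness, here's a minimal TypeScript sketch of that list applied. Everything specific is an assumption: `fetchInvoicePage`, the row fields, and the canned page stand in for your own invoicing client, and the schema uses the JSON Schema dialect most tool-use APIs accept.

```typescript
// Stand-in for your own invoicing client (hypothetical): one page per call.
async function fetchInvoicePage(
  customerId: string,
  cursor?: string
): Promise<{
  invoices: { id: string; status: string; amountDue: number; daysOverdue: number }[];
  nextCursor?: string;
}> {
  // Canned single page so the sketch runs standalone.
  return {
    invoices: [
      { id: "inv_1", status: "overdue", amountDue: 4200, daysOverdue: 12 },
      { id: "inv_2", status: "paid", amountDue: 900, daysOverdue: 0 },
    ],
  };
}

type OverdueInvoice = {
  id: string;
  amountDue: number; // minor currency units
  daysOverdue: number;
};

// Tool schema at task granularity, named in domain language.
export const findOverdueInvoicesTool = {
  name: "find_overdue_invoices_for_customer",
  description: "List a customer's overdue invoices, most overdue first.",
  input_schema: {
    type: "object",
    properties: {
      customer_id: { type: "string", description: "Internal customer ID." },
    },
    required: ["customer_id"],
  },
};

// The handler owns the cursor loop; the model never sees pagination.
export async function findOverdueInvoices(
  customerId: string
): Promise<OverdueInvoice[]> {
  const results: OverdueInvoice[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchInvoicePage(customerId, cursor);
    for (const inv of page.invoices) {
      if (inv.status === "overdue") {
        // Project down to the three fields the model needs.
        results.push({ id: inv.id, amountDue: inv.amountDue, daysOverdue: inv.daysOverdue });
      }
    }
    cursor = page.nextCursor;
  } while (cursor !== undefined);
  return results.sort((a, b) => b.daysOverdue - a.daysOverdue);
}
```

The cursor never crosses the tool boundary, and the payload the model sees is three fields per invoice, not a row.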
Pin your model. Test on every minor.
Tool-use behavior shifts between minor model releases more than text-only behavior does. The prompt you tuned six weeks ago against `claude-4-7-20260201` will probably keep working — until the day it doesn't, on a tool-use edge case nobody anticipated.
The fix is uneventful: pin the snapshot, run a small eval suite (50–100 tasks) on every release, and only roll forward when the suite is green. We have a sample harness in our evals repo you can lift.
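If you'd rather sketch your own, the harness really is small. In this hypothetical TypeScript version, `runAgent` is a placeholder for your own agent loop, and the task and snapshot names are illustrative, not taken from our repo:

```typescript
// Release gate: run the suite against a candidate snapshot, exit non-zero on red.
const PINNED_MODEL = "claude-4-7-20260201";
const CANDIDATE_MODEL = process.env.CANDIDATE_MODEL ?? PINNED_MODEL;

// Placeholder for your agent loop; it should return the tool-call transcript.
declare function runAgent(
  model: string,
  prompt: string
): Promise<{ toolCalls: { name: string; input: unknown }[] }>;

type EvalTask = {
  name: string;
  prompt: string;
  // Pass/fail judged on the tool-call transcript, not the final text.
  expect: (calls: { name: string; input: unknown }[]) => boolean;
};

const tasks: EvalTask[] = [
  {
    name: "overdue lookup uses the task-level tool",
    prompt: "Which invoices is customer C-104 overdue on?",
    expect: (calls) =>
      calls.some((c) => c.name === "find_overdue_invoices_for_customer"),
  },
  // ...50-100 of these, weighted toward tool-use edge cases
];

async function main() {
  let failures = 0;
  for (const task of tasks) {
    const { toolCalls } = await runAgent(CANDIDATE_MODEL, task.prompt);
    const ok = task.expect(toolCalls);
    if (!ok) failures += 1;
    console.log(`${ok ? "PASS" : "FAIL"}  ${task.name}`);
  }
  process.exit(failures === 0 ? 0 : 1); // only a green suite rolls forward
}

main();
```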
Treat error returns as first-class
The single biggest reliability win on a real agent loop is consistent error envelopes. If half your tools return errors as `{ ok: false, error: "..." }` and half as bare strings and a few panic with HTTP 500, the model spends real cognitive budget figuring out what's going on.
A boring shape is fine:
{ "ok": false, "code": "NOT_FOUND", "message": "...", "retryable": false }
Apply it ruthlessly. The model recovers from `code: "RATE_LIMITED", retryable: true` flawlessly. It recovers from `"Internal server error 500"` stochastically.
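The easiest way to get ruthlessness is one wrapper at the tool boundary instead of per-handler discipline. A TypeScript sketch, assuming a `ToolError` class your handlers throw and an `INTERNAL` fallback code for everything else; neither is a prescribed taxonomy:

```typescript
// One envelope for every tool result.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; code: string; message: string; retryable: boolean };

// Handlers throw this when they know what went wrong.
class ToolError extends Error {
  constructor(
    public code: string,
    message: string,
    public retryable = false
  ) {
    super(message);
  }
}

// Wrap every handler once; nothing bare ever reaches the model.
function wrapTool<A extends unknown[], T>(
  handler: (...args: A) => Promise<T>
): (...args: A) => Promise<ToolResult<T>> {
  return async (...args) => {
    try {
      return { ok: true, data: await handler(...args) };
    } catch (err) {
      if (err instanceof ToolError) {
        return {
          ok: false,
          code: err.code,
          message: err.message,
          retryable: err.retryable,
        };
      }
      // Unknown failures get a stable code instead of a stack trace.
      return {
        ok: false,
        code: "INTERNAL",
        message: "Unexpected tool failure.",
        retryable: false,
      };
    }
  };
}
```

Wrap handlers at registration time and the envelope is uniform by construction.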
Stop returning the kitchen sink
We see teams return entire database rows because they "might be useful later." They will not be useful later. They will eat your context window and confuse your agent.
When a tool runs, return only what the model needs to make the next decision. If the user asks for follow-up detail, run another tool call. The token economics actually favor this: two narrow calls beat one wide one almost every time, once you account for cache hit rates.
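In code this is nothing more than a projection at the tool boundary. The row shape below is hypothetical; the point is how little of it survives:

```typescript
// The row your database hands back vs. what the model needs next.
type InvoiceRow = {
  id: string;
  customerId: string;
  status: string;
  amountDue: number;
  lineItems: unknown[]; // large, rarely relevant to the next decision
  auditLog: unknown[];  // never relevant to the model
  createdAt: string;
  updatedAt: string;
};

// Enough to decide "chase this invoice or not" and nothing more.
type InvoiceSummary = Pick<InvoiceRow, "id" | "status" | "amountDue">;

const toSummary = (row: InvoiceRow): InvoiceSummary => ({
  id: row.id,
  status: row.status,
  amountDue: row.amountDue,
});

// Follow-up detail is a second narrow call, e.g. a hypothetical
// get_invoice_line_items(invoice_id), not a wider first one.
```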
The deeper read on all of this is that tool use has moved from "novel capability" to "ordinary integration concern." Treat it like you treat any other API surface — design it deliberately, test it on every release, and respect the consumer. The models will do the rest.
