
How to Monitor AI Agent Costs Before They Destroy Your Budget

5 steps × 100 users × $0.015/call = $7.50/minute. $10,800/day if you're not watching.

Tags: aiwatch, ai-agents, costs, anthropic, budget




Here's a number that should scare you: $10,800.

That's what happens when an AI agent with a bug runs unchecked for 24 hours. The math is straightforward:

  • Your agent uses claude-sonnet-4-6 with tool calling
  • Each agent loop takes 5 steps (system prompt + 4 tool calls)
  • Each step costs about $0.015 (3,000 input tokens, 500 output tokens)
  • One user request = 5 steps = $0.075
  • 100 concurrent users = $7.50/minute
  • Running all day = $10,800
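The arithmetic above is worth sketching as code, because you'll want the same sanity check against your own numbers. The constants here are the scenario's assumptions, not universal pricing:

```javascript
// Back-of-the-envelope agent cost model (numbers from the scenario above).
const COST_PER_STEP = 0.015;   // ~3,000 input + 500 output tokens on Sonnet
const STEPS_PER_REQUEST = 5;   // system prompt + 4 tool calls
const CONCURRENT_USERS = 100;  // each firing roughly one request per minute

const costPerRequest = COST_PER_STEP * STEPS_PER_REQUEST; // $0.075
const costPerMinute = costPerRequest * CONCURRENT_USERS;  // $7.50
const costPerDay = costPerMinute * 60 * 24;               // $10,800
```

Plug in your own step count and per-step cost; the "running all day" line is the one that surprises people.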


This isn't hypothetical. We've seen it happen. A developer shipped an agent that retried on malformed JSON responses. The LLM returned malformed JSON 40% of the time on a specific edge case. The retry logic had no max attempts. It ran for 6 hours before someone noticed the Anthropic dashboard.

$4,860. Real money. Gone.

    The $10,000 mistake (and how it happens)



    AI agents are fundamentally different from traditional API calls. A REST endpoint costs the same every time -- 50ms of compute, a database query, a JSON response. Predictable. Budgetable.

    An AI agent's cost depends on what it decides to do. The same user input might trigger 3 calls or 30 calls depending on the model's reasoning. You can't predict it, and you can't cap it at the application level without breaking functionality.

    Here's what a normal agent session looks like:

    User: "What were our top 5 customers last quarter?"

Step 1: Claude reads the system prompt + user message → $0.008
Step 2: Claude calls get_quarterly_data tool → $0.003
Step 3: Tool result injected (2,000 tokens of data) → $0.012
Step 4: Claude calls get_customer_details for top 5 → $0.015
Step 5: Claude generates the final summary → $0.009

    Total: $0.047 — totally reasonable


    Now here's what happens when something goes wrong:

    User: "What were our top 5 customers last quarter?"

Step 1: Claude reads the system prompt → $0.008
Step 2: Claude calls get_quarterly_data → $0.003
Step 3: Tool returns an error (database timeout) → $0.001
Step 4: Claude decides to retry → $0.003
Step 5: Tool returns an error again → $0.001
Step 6: Claude tries a different approach → $0.005
Step 7: Claude calls get_all_customers instead → $0.003
Step 8: Tool returns 50,000 rows (context explodes) → $0.089
Step 9: Claude tries to process, hits token limit → $0.045
Step 10: Framework truncates and retries → $0.045
... (continues for 30+ steps)

    Total: $2.40 — for ONE request


    Multiply by 100 users hitting this edge case per hour. That's $240/hour. $5,760/day.

    The three ways agents blow your budget



    After analyzing thousands of agent traces in AIWatch, we see the same three patterns repeatedly:

    1. Infinite loops



    The most expensive bug. The agent encounters an error or unexpected response, retries, gets the same error, retries again. Without a max-steps limit, this runs until the context window fills up.

    Common triggers:
  • Tool returns an error the model doesn't know how to handle
  • Model generates malformed tool calls that fail validation
  • Output parsing fails, framework retries automatically
  • The model gets confused and calls the same tool repeatedly


The cost: Each retry adds $0.01-0.05. A loop of 50 retries costs $0.50-2.50 per request. At scale, this compounds fast.
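The defense is a hard cap on steps. Here's a minimal sketch of what that looks like; `runStep` is a hypothetical stand-in for whatever executes one model call plus tool call in your framework:

```javascript
// Hypothetical agent loop with a hard cap on steps.
// `runStep` stands in for one model call + tool execution.
async function runAgent(runStep, maxSteps = 10) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const result = await runStep(history);
    history.push(result);
    if (result.done) return { status: 'done', steps: step + 1, history };
  }
  // Bound the failure: stop and surface it instead of retrying forever.
  return { status: 'max_steps_exceeded', steps: maxSteps, history };
}
```

A loop that never converges now costs at most 10 steps instead of filling the context window.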

    2. Context window explosion



    Every step in an agent chain adds tokens to the conversation history. System prompt (500 tokens) + user message (100 tokens) + tool results (varies). The problem is tool results.

    A tool that returns a database query result might inject 5,000 tokens into the context. After 3 tool calls, you're at 15,000+ tokens of context. The model is now processing 15,000 input tokens for every subsequent step.

    Claude Sonnet costs $3 per million input tokens. At 15,000 tokens per call, that's $0.045 per step. Five more steps and you've spent $0.27 on one request -- 6x what it should cost.
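That per-step cost is a simple linear function of accumulated context, which makes it easy to estimate ahead of time:

```javascript
// Input-side cost per agent step, using the $3/M-token Sonnet rate above.
const INPUT_COST_PER_TOKEN = 3 / 1_000_000;

function stepCost(contextTokens) {
  return contextTokens * INPUT_COST_PER_TOKEN; // input tokens only
}
// stepCost(15000) → $0.045 per step, as in the text
```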

    The fix: Summarize tool results before injecting them. Return the top 10 rows, not all 50,000. Use streaming for large results.
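The truncation part of that fix is a few lines. A sketch, assuming tool results arrive as an array of rows; the key detail is telling the model what was cut so it can refine the query instead of guessing:

```javascript
// Truncate a tool result before injecting it into the context.
// Keeps the first `limit` rows and tells the model what was cut.
function truncateToolResult(rows, limit = 10) {
  if (rows.length <= limit) return { rows, truncated: false };
  return {
    rows: rows.slice(0, limit),
    truncated: true,
    note: `Showing ${limit} of ${rows.length} rows. Refine the query for more.`,
  };
}
```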

    3. Fallback cascades



    Your code tries claude-sonnet, it's rate limited, falls back to claude-opus. Opus costs 5x more. If the rate limiting is temporary (which it usually is), you're paying 5x for the same result.

    Or worse: your code tries Anthropic, falls back to OpenAI, falls back to a different OpenAI model. Each fallback adds cost and latency, and the context needs to be reformatted for each provider.

    The cost: A $0.015 call becomes $0.075 on the first fallback and $0.15 on the second. The user doesn't notice -- the response looks the same. But your bill notices.
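A cheap way to avoid paying 5x for transient rate limits is to retry the primary model with backoff before touching the fallback. A sketch, with hypothetical `callPrimary`/`callFallback` functions standing in for your actual model calls:

```javascript
// Retry the cheap model with exponential backoff before paying for
// a pricier fallback. Rate limits are usually transient.
async function callWithRetry(callPrimary, callFallback, retries = 3, baseDelayMs = 500) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await callPrimary();
    } catch (err) {
      if (err.status !== 429) throw err; // only retry rate limits
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  return callFallback(); // last resort: the 5x-cost model
}
```

Three retries cost a few seconds of latency; an unnecessary Opus call costs 5x the money every time.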

    How to catch it before it costs you



    AIWatch budget rules are the simplest protection against runaway costs. Set them up once, forget about them.

    Daily budget with hard stop:

    Go to your AIWatch dashboard → Budgets tab. Set:

    Monthly budget: $200
    Daily budget: $8 (monthly / 25 working days)
    Alert at: 80% ($6.40/day)
    Hard stop: Yes


    When you hit 80% of your daily budget, AIWatch sends an alert to Slack/email. When you hit 100%, it returns a 429 status code to your application. Your code should handle this:

    try {
      const response = await client.messages.create({
        model: 'claude-sonnet-4-6-20250514',
        messages: [...],
      });
      return response;
    } catch (error) {
      if (error.status === 429) {
        // Budget exceeded — graceful degradation
        return { message: "AI features are temporarily paused. Try again tomorrow." };
      }
      throw error;
    }


    This is not elegant. It's effective. A hard stop at $8/day means your maximum monthly bill is $248, no matter what bugs you ship.
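If you want belt-and-suspenders, you can also track estimated spend client-side and refuse calls before they leave your process. This is a sketch, not AIWatch's mechanism; the costs here are your own estimates, and the authoritative numbers stay in the dashboard:

```javascript
// Optional client-side guard: track estimated spend locally and refuse
// new calls once a daily cap is hit. Estimates only — the provider
// dashboard remains the source of truth.
class DailyBudget {
  constructor(capUsd) {
    this.capUsd = capUsd;
    this.spentUsd = 0;
    this.day = new Date().toDateString();
  }
  charge(estimatedUsd) {
    const today = new Date().toDateString();
    if (today !== this.day) { this.day = today; this.spentUsd = 0; } // new day, reset
    if (this.spentUsd + estimatedUsd > this.capUsd) return false;    // would exceed cap
    this.spentUsd += estimatedUsd;
    return true;
  }
}
```

Call `charge(estimate)` before each request and skip the request when it returns false.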

    Per-feature budgets:

    If your app has multiple AI features (chat, summarization, classification), set budgets per feature using the X-Luxkern-Feature header:

    const response = await client.messages.create({
      model: 'claude-sonnet-4-6-20250514',
      messages: [...],
    }, {
      headers: {
        'X-Luxkern-Feature': 'chat',
      },
    });


    Now you can see in the dashboard: chat costs $4.20/day, summarization costs $1.80/day, classification costs $0.30/day. If chat suddenly jumps to $12/day, you know exactly where to look.
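To make sure no call ships untagged, it helps to route everything through one small wrapper. A hypothetical convenience helper, assuming the Anthropic SDK's per-request options shown above:

```javascript
// Tag every call with a feature name — untagged spend is the
// hardest to debug. `client` is an Anthropic SDK instance.
function taggedCreate(client, feature, params) {
  return client.messages.create(params, {
    headers: { 'X-Luxkern-Feature': feature },
  });
}
```

Usage: `taggedCreate(client, 'chat', { model, messages })` instead of calling the SDK directly.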

    The 2-line setup



    You don't need to change your application code. Just change the base URL:

    # Python
    import anthropic

client = anthropic.Anthropic(
    base_url="https://api.luxkern.com/aiwatch/proxy/anthropic"
)


    // Node.js
    import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'https://api.luxkern.com/aiwatch/proxy/anthropic',
});


Every call now gets logged with full trace data: tokens, cost, latency, prompt preview. The proxy adds less than 50ms of overhead. If the proxy is slow or unavailable, requests automatically fall through to the direct Anthropic API -- your users never see a degradation.

    Reading the warning signs



    Once AIWatch is running, here's what to watch for in the Traces dashboard:

    Latency spikes: A call that normally takes 800ms suddenly takes 4,000ms. This usually means the context window grew (more input tokens = slower response). Check if a tool is returning more data than expected.

    Cost per request increasing: Your average request cost went from $0.04 to $0.12. Sort traces by cost descending. The expensive calls will show you which feature or which chain step is responsible.

    Error rate above 2%: Some 429s from the provider are normal (rate limiting). A sustained 5%+ error rate means something is wrong -- either you're hitting rate limits too often (you need request queuing) or the provider is having issues (check Radar for community reports).
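The request queuing mentioned above can be as simple as a concurrency limiter in front of your model calls. A sketch; production code would also want timeouts and a queue-length cap:

```javascript
// Minimal request queue: cap in-flight calls so traffic bursts
// become short waits instead of 429s from the provider.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active < maxConcurrent && waiting.length) {
      active++;
      waiting.shift()(); // release the oldest waiter
    }
  };
  return async function run(task) {
    await new Promise(resolve => { waiting.push(resolve); next(); });
    try {
      return await task();
    } finally {
      active--;
      next();
    }
  };
}
```

Wrap each model call as `run(() => client.messages.create(...))` with a limit below your rate cap.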

Chain length growing: Your average chain used to be 4 steps. Now it's 8. The model might be struggling with a specific tool or a changed prompt. Check AICanary for behavioral regressions.

    What to do when you hit a spike



    You open Slack on Monday morning. AIWatch sent an alert: "Daily budget 80% reached at 11:47 AM Sunday."

    Here's the playbook:

    1. Check the traces (2 minutes)

    Open AIWatch → Traces → sort by cost descending → filter to Sunday. Find the expensive calls. Are they all from one feature? One user? One chain?

    2. Identify the pattern (5 minutes)

    If it's one feature: check the chain traces. Is there a loop? Is the context window growing?

    If it's one user: check if their data triggers an edge case (large dataset, unusual characters, specific tool error).

    If it's all features: check Radar. The AI provider might have had latency issues, causing retries and increased costs.

    3. Fix the immediate problem (10 minutes)

  • If it's a loop: add max_steps = 10 to your agent configuration
  • If it's context growth: add result summarization or truncation
  • If it's fallback cascading: fix the primary path instead of relying on fallbacks
  • If it's a specific prompt: update the system prompt to handle the edge case


4. Prevent it from happening again (5 minutes)

  • Set a per-feature budget if you haven't already
  • Add a max-steps limit to every agent loop
  • Set up AICanary to test the specific edge case that caused the spike


Total time: 22 minutes. Cost prevented: potentially thousands of dollars per month.

    ---

    AI agents are powerful. They're also the most unpredictable cost center in your infrastructure. Without monitoring, every agent deployment is a bet that nothing goes wrong. With AIWatch, you know exactly what's happening, how much it costs, and you have a kill switch when things go sideways.

    Set up cost monitoring in 2 minutes -- free to start, EU-hosted.