LLM Traces in Production: How to Debug AI Costs and Latency Without Losing Your Mind
Your AI agent makes 47 Claude calls to answer one question. You don't know it. Until the bill arrives.
Your AI feature works. Users like it. Then you open the Anthropic dashboard and see $847 for last month. You budgeted $200.
You dig into the logs. Nothing. Your application logs say "called Claude, got response." That's it. No detail on how many calls happened, what the prompts looked like, how long each call took, or why one user's request triggered 47 separate API calls while another triggered 3.
This is what happens when you ship AI features without tracing. You're flying blind, and the turbulence is your invoice.
The invisible calls killing your budget
Here's what a typical AI agent request looks like from the outside:
User: "Summarize my last 5 support tickets"
Response: "Here's a summary of your recent tickets..."
Latency: 8.2 seconds

Looks simple. One question, one answer. But here's what actually happened behind the scenes:
get_tickets -- Claude asks for the data
get_ticket_details for ticket #1
get_ticket_details for ticket #2
get_ticket_details for ticket #3
get_ticket_details for ticket #4
get_ticket_details for ticket #5

That's 10 API round-trips. Each one has input tokens, output tokens, and latency. The total cost for this single user request? Around $0.12. Multiply by 200 daily users, and you're at $720/month for one feature.
Without traces, you see: "$720/month on Claude." With traces, you see exactly which step is expensive and why.
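To make that math concrete, here's a back-of-the-envelope calculation. The per-token prices are an assumption based on typical Sonnet-class list pricing ($3 per million input tokens, $15 per million output), not anything AIWatch reports; plug in your provider's actual rates.

// Assumed Sonnet-class pricing -- substitute your provider's real rates.
const INPUT_PER_MTOK = 3.0;   // USD per million input tokens
const OUTPUT_PER_MTOK = 15.0; // USD per million output tokens

function callCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PER_MTOK + (outputTokens / 1e6) * OUTPUT_PER_MTOK;
}

// Ten round-trips averaging ~3,500 input and ~100 output tokens each
// comes out to roughly $0.12 per user request.
const perRequest = 10 * callCost(3_500, 100);

// 200 daily users over a 30-day month: roughly $720.
const perMonth = perRequest * 200 * 30;
console.log(perRequest.toFixed(2), perMonth.toFixed(0));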
What LLM traces actually show you
An LLM trace captures every individual API call your application makes to an AI provider. For each call, you get:

The timestamp and the endpoint that triggered it
The model used
Latency for the round-trip
Input and output token counts
The cost of the call
Prompt and response previews
A chain ID that groups related calls together

That last one is critical. When your agent makes 10 calls to answer one question, the chain ID groups them together so you can see the full picture.
Here's what a trace looks like in AIWatch:
| Time | Model | Latency | Tokens | Cost |
|------|-------|---------|--------|------|
| 14:23:01 | claude-sonnet-4-6 | 342ms | 2,400 in / 89 out | $0.0085 |
| 14:23:01 | claude-sonnet-4-6 | 1,204ms | 6,200 in / 456 out | $0.0254 |
| 14:23:03 | claude-sonnet-4-6 | 890ms | 4,100 in / 234 out | $0.0158 |
Immediately you can see: the second call is the expensive one. 6,200 input tokens. That's the step where all 5 tickets got injected into the context window at once. Maybe you should paginate that.
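What "paginate that" could look like in practice: instead of injecting five full tickets into one call, summarize each ticket in a cheap call first and feed only the short summaries to the final request. This is a sketch of the idea, not a prescribed fix -- fetchTicket() is a hypothetical helper, the client setup comes from the next section, and the model names follow the ones used elsewhere in this post.

import Anthropic from '@anthropic-ai/sdk';

// Hypothetical helper: returns the full ticket text from your own data store.
declare function fetchTicket(id: string): Promise<string>;

// One small call per ticket, so no single request carries all five bodies.
async function summarizeTickets(client: Anthropic, ticketIds: string[]): Promise<string[]> {
  const summaries: string[] = [];
  for (const id of ticketIds) {
    const msg = await client.messages.create({
      model: 'claude-haiku-4-5', // cheap model for the per-ticket pass
      max_tokens: 200,
      messages: [{
        role: 'user',
        content: `Summarize this support ticket in two sentences:\n\n${await fetchTicket(id)}`,
      }],
    });
    const first = msg.content[0];
    summaries.push(first && first.type === 'text' ? first.text : '');
  }
  return summaries;
}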
How to add tracing in 2 lines
If you're using the Anthropic SDK or OpenAI SDK, you don't need to change your application code. Just point your SDK's base URL to AIWatch's proxy:
Python (Anthropic):
import anthropic
client = anthropic.Anthropic(
base_url="https://api.luxkern.com/aiwatch/proxy/anthropic"
)
# Everything else stays the same
response = client.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=1000,
messages=[{"role": "user", "content": "Hello"}]
)

Node.js (Anthropic):
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic({
baseURL: 'https://api.luxkern.com/aiwatch/proxy/anthropic',
});
// Your existing code works unchanged
const message = await client.messages.create({
model: 'claude-sonnet-4-6-20250514',
max_tokens: 1000,
messages: [{ role: 'user', content: 'Hello' }],
});

Python (OpenAI):
from openai import OpenAI
client = OpenAI(
base_url="https://api.luxkern.com/aiwatch/proxy/openai"
)

That's it. Every call now gets logged with full trace data. Your prompts stay on EU servers (Frankfurt, Germany). The proxy adds less than 50ms of latency, and if it's ever slow, it automatically falls back to the direct API so your users never notice.
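If you'd rather opt into the proxy per environment -- say, production only -- the base URL can come from config. A minimal sketch; AIWATCH_PROXY_URL is a hypothetical environment variable you'd define yourself:

import Anthropic from '@anthropic-ai/sdk';

// When AIWATCH_PROXY_URL is unset, the SDK uses the default Anthropic endpoint,
// so local development keeps hitting the API directly.
const client = new Anthropic({
  baseURL: process.env.AIWATCH_PROXY_URL,
});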
Reading your first trace
Once calls start flowing, open your AIWatch dashboard and click the Traces tab. You'll see something like this:
2 min ago | claude-sonnet-4-6 | /api/chat | 1,204ms | 3,200 in 890 out | $0.019 | 200
2 min ago | claude-sonnet-4-6 | /api/chat | 342ms | 890 in 45 out | $0.003 | 200
5 min ago | claude-haiku-4-5 | /api/classify | 89ms | 200 in 12 out | $0.0001 | 200
12 min ago | gpt-4o | /api/summary | 2,340ms | 8,100 in 1,200 out | $0.038 | 200

Sort by cost descending. Your most expensive calls float to the top. Sort by latency. The slowest calls are usually the ones with the most input tokens -- that's your context window growing.
Click on any trace to see the prompt and response previews, the exact token counts, and the cost breakdown. If the call has a prompt_hash, you'll see how many times that same prompt has been sent this week. A prompt that fires 500 times/day at $0.02 each is $300/month. Worth optimizing.

The chain problem (multi-step agents)
Single-call LLM features are straightforward. The hard debugging problems come from agents -- systems that make multiple calls in sequence, where each call depends on the result of the previous one.
A typical ReAct agent loop:
Step 1: User message + system prompt → Claude decides to call a tool
Step 2: Tool result injected → Claude decides to call another tool
Step 3: Tool result injected → Claude generates final answer

Three calls minimum. But agents can loop. If your tool returns an error, Claude might retry. If the context window fills up, some frameworks silently truncate and retry. If the model can't find the right tool, it might try several.
We've seen agents make 30+ calls for a single user request. Without chain tracing, you'd never know.
In AIWatch, when you pass an X-Chain-Id header, all calls in that chain are grouped:

const chainId = crypto.randomUUID();
// Step 1
const step1 = await client.messages.create({
model: 'claude-sonnet-4-6-20250514',
max_tokens: 1000,
messages: [...],
}, {
headers: {
'X-Chain-Id': chainId,
'X-Chain-Step': '1',
},
});
// Step 2 (uses step1 result)
const step2 = await client.messages.create({
model: 'claude-sonnet-4-6-20250514',
max_tokens: 1000,
messages: [...],
}, {
headers: {
'X-Chain-Id': chainId,
'X-Chain-Step': '2',
},
});

Now when you view step 1's trace, you see the entire chain: all steps, their individual costs, and the total. The expensive step is immediately visible.
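If your agent loop runs many steps, threading those headers by hand gets tedious. Here's a minimal sketch of a wrapper that generates the chain ID once and increments the step counter on every call. The helper name and shape are my own; only the X-Chain-Id and X-Chain-Step headers come from AIWatch.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'https://api.luxkern.com/aiwatch/proxy/anthropic',
});

// Hypothetical helper: one chain ID per user request, one step number per call.
function createChain() {
  const chainId = crypto.randomUUID();
  let step = 0;

  return (params: Anthropic.MessageCreateParamsNonStreaming) => {
    step += 1;
    return client.messages.create(params, {
      headers: {
        'X-Chain-Id': chainId,
        'X-Chain-Step': String(step),
      },
    });
  };
}

// Usage: every call made through `callClaude` lands in the same trace chain.
const callClaude = createChain();
const step1 = await callClaude({
  model: 'claude-sonnet-4-6-20250514',
  max_tokens: 1000,
  messages: [{ role: 'user', content: 'Summarize my last 5 support tickets' }],
});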
What to do when you find the problem
Once traces reveal the issue, the fixes are usually one of these:
Too many tokens in: Your system prompt or context is bloated. Trim it. Use summaries instead of full documents. Consider claude-haiku for simple classification steps.
Too many calls in a chain: Your agent is looping. Add a max-steps limit (there's a sketch after this list). Use structured outputs to reduce retries. Cache tool results that don't change.
Wrong model for the job: You're using claude-sonnet for tasks that claude-haiku handles at 1/4 the cost. Traces show you which calls are simple enough to downgrade.
Cost spikes at specific times: A scheduled job or a specific user segment is driving costs. AIWatch's budget rules can alert you at 80% of your daily limit and hard-stop at 100%.
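For the looping case, the guard can be as blunt as a counter around the agent loop. A minimal sketch, assuming a hypothetical runTool() that executes whatever tool Claude asked for:

import Anthropic from '@anthropic-ai/sdk';

const MAX_STEPS = 8; // hard ceiling so a confused agent can't burn 30+ calls

// Hypothetical tool executor -- wire this up to your real tools.
declare function runTool(block: Anthropic.ToolUseBlock): Promise<string>;

async function runAgent(
  client: Anthropic,
  tools: Anthropic.Tool[],
  messages: Anthropic.MessageParam[],
) {
  for (let step = 1; step <= MAX_STEPS; step++) {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6-20250514',
      max_tokens: 1000,
      tools,
      messages,
    });

    const toolUse = response.content.find(
      (block): block is Anthropic.ToolUseBlock => block.type === 'tool_use',
    );
    if (!toolUse || response.stop_reason !== 'tool_use') {
      return response; // final answer -- Claude didn't ask for another tool
    }

    // Feed the tool result back and let Claude take the next step.
    messages.push({ role: 'assistant', content: response.content });
    messages.push({
      role: 'user',
      content: [{ type: 'tool_result', tool_use_id: toolUse.id, content: await runTool(toolUse) }],
    });
  }
  throw new Error(`Agent exceeded ${MAX_STEPS} steps`);
}

For cost spikes, the budget rules below do the guarding for you.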
Set up a budget rule in 30 seconds:
Monthly budget: $200
Alert at: 80% ($160)
Hard stop: Yes, at 100%

When the hard stop triggers, AIWatch returns a 429 to your app. Your code should handle that gracefully -- show a "try again later" message instead of an error page.
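Here's one way to catch that 429 at the call site and degrade gracefully. The wrapper shape and the message text are just an illustration; the exact copy you show is up to your UI.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'https://api.luxkern.com/aiwatch/proxy/anthropic',
});

async function askClaude(prompt: string): Promise<string> {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6-20250514',
      max_tokens: 1000,
      messages: [{ role: 'user', content: prompt }],
    });
    const first = message.content[0];
    return first && first.type === 'text' ? first.text : '';
  } catch (err) {
    // A 429 here can mean the provider is rate-limiting you -- or the AIWatch
    // hard stop kicked in. Either way, don't surface an error page.
    if (err instanceof Anthropic.APIError && err.status === 429) {
      return 'The AI assistant is taking a break -- please try again later.';
    }
    throw err;
  }
}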
---
LLM traces aren't optional anymore. If you're running AI features in production, you need to see what's happening inside every call. The alternative is guessing why your bill doubled.
AIWatch gives you full LLM tracing, cost monitoring, and budget protection. EU-hosted. 2 lines to set up.