
How to Reduce Your LLM Costs by 60% in Production

Five production optimizations that cut LLM API costs dramatically. Real numbers, code examples, and a breakdown from $127/month to $34/month.





We cut a production LLM bill from $127/month to $34/month with five changes. Not theoretical changes. Not "consider doing X." Five specific optimizations, applied to a real support automation pipeline, with before-and-after numbers for each one. The total reduction was 73%, and the largest single lever -- model routing -- took 45 minutes to implement and accounted for $56 of the $93 in monthly savings.

Here is the thing that nobody tells you about LLM costs: the default configuration is almost always the most expensive one. You pick Claude Sonnet during prototyping because it works well, and then you ship it to production without asking whether every call actually needs Sonnet-level intelligence. You send the same 2,000-token system prompt on every request without caching. You run classification tasks synchronously when they could be batched at half price. Each of these defaults is money you are leaving on the table.

Optimization 1: Route Tasks to the Right Model



This single change saved $56/month -- 44% of the total bill. The principle is simple: most production LLM workloads are a mix of easy tasks and hard tasks, and you are probably sending all of them to the same model.

Here is the cost landscape as of early 2026:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku | $0.25 | $1.25 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Opus | $15.00 | $75.00 |

The cost difference between Haiku and Opus is 60x. Between Haiku and Sonnet, it is 12x on input and 12x on output. If 70% of your calls are simple tasks -- classification, entity extraction, yes/no decisions, format conversion -- routing them to Haiku saves you a fortune with no measurable quality loss.

type TaskType = "classify" | "extract" | "generate" | "reason";

function selectModel(task: TaskType): string {
  const routing: Record<TaskType, string> = {
    classify: "claude-haiku-4-20250414", // $0.25/M in
    extract: "claude-haiku-4-20250414", // $0.25/M in
    generate: "claude-sonnet-4-20250514", // $3.00/M in
    reason: "claude-opus-4-20250514", // $15.00/M in
  };
  return routing[task];
}

// Classify a support ticket: Haiku handles this perfectly
const category = await client.messages.create({
  model: selectModel("classify"),
  max_tokens: 50,
  messages: [{
    role: "user",
    content: `Classify this support ticket into one of: billing, technical, feature-request, other.\n\nTicket: "${ticket.body}"`,
  }],
});

// Draft a customer response: Sonnet produces better prose
const reply = await client.messages.create({
  model: selectModel("generate"),
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: `Draft a helpful response to this support ticket: "${ticket.body}"`,
  }],
});


To figure out which of your calls can move to Haiku, audit your production prompts. Any prompt where the expected output is shorter than 100 tokens and follows a rigid structure (a category name, a JSON object, a yes/no) is a strong candidate. Run a quick evaluation: send 100 production inputs to both Sonnet and Haiku, compare outputs, and measure whether the cheaper model produces equivalent results. In our experience, Haiku matches Sonnet on 92% of classification tasks and 87% of extraction tasks.
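
Here is a minimal sketch of that evaluation, assuming the `@anthropic-ai/sdk` client with an API key in the environment and a hypothetical `samples` array of ticket bodies exported from production; the model IDs mirror the routing table above.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Send each sample to both models and count how often the labels agree.
// `samples` is a hypothetical array of real ticket bodies pulled from production.
async function agreementRate(samples: string[]): Promise<number> {
  let matches = 0;
  for (const body of samples) {
    const [haiku, sonnet] = await Promise.all(
      ["claude-haiku-4-20250414", "claude-sonnet-4-20250514"].map((model) =>
        client.messages.create({
          model,
          max_tokens: 20,
          messages: [{
            role: "user",
            content: `Classify this support ticket into one of: billing, technical, feature-request, other. Reply with the category only.\n\nTicket: "${body}"`,
          }],
        })
      )
    );
    const label = (m: Anthropic.Message) => {
      const block = m.content[0];
      return block.type === "text" ? block.text.trim().toLowerCase() : "";
    };
    if (label(haiku) === label(sonnet)) matches++;
  }
  return matches / samples.length;
}

If the agreement rate lands in the low nineties, as it did for our classification prompts, the cheaper model is a safe swap; the handful of disagreements are worth reading by hand before you flip the route.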

Optimization 2: Cache Your System Prompts



Saved $17/month. If your system prompt is the same across requests (and it almost always is), you are paying full price for the same input tokens on every single call. Anthropic's prompt caching lets you mark static portions of your prompt as cacheable. The first request processes the tokens normally. Subsequent requests pay 90% less for cached input tokens and run 85% faster.

import anthropic

client = anthropic.Anthropic(
    base_url="https://aiwatch.luxkern.com/v1/proxy/anthropic"
)

SYSTEM_PROMPT = """You are a customer support agent for ShipFast...
[2000+ tokens of detailed instructions, tone guidelines, product knowledge,
escalation rules, response templates]"""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": customer_message}],
)


The key rule: do not inject variable data into the cached portion. Timestamps, user IDs, session tokens -- put those in the user message, not the system prompt. The cache key is computed from the prompt content. Any change invalidates the cache and you pay full price again.
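
The same pattern in TypeScript, as a sketch (it assumes the `SYSTEM_PROMPT` constant from above plus hypothetical `userId` and `ticketBody` values): the cached system block stays byte-identical, and everything request-specific rides in the user message.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT, // static: same bytes on every request, so the cache hits
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      // dynamic: user ID, timestamp, and ticket text live here, outside the cache
      content: `User ${userId} at ${new Date().toISOString()} wrote:\n\n${ticketBody}`,
    },
  ],
});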

For an application making 10,000 requests per month with a 2,000-token system prompt, caching reduces system prompt input costs by roughly 90% after the first request in each 5-minute cache window. On Sonnet, that is the difference between $60/month and $6/month just for the system prompt tokens.
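
The arithmetic behind those figures, as a back-of-the-envelope sketch; it ignores the one-time cache-write premium on the first request of each window and any tokens outside the system prompt.

// Assumptions: 10,000 requests/month, 2,000-token system prompt,
// Sonnet input at $3 per million tokens, cache reads billed at 10% of that rate.
const requestsPerMonth = 10_000;
const systemPromptTokens = 2_000;
const inputPricePerMTok = 3.0;
const cacheReadMultiplier = 0.1;

const promptTokensPerMonth = requestsPerMonth * systemPromptTokens; // 20M tokens
const uncachedCost = (promptTokensPerMonth / 1_000_000) * inputPricePerMTok; // ~$60
const cachedCost = uncachedCost * cacheReadMultiplier; // ~$6

console.log(`uncached: $${uncachedCost.toFixed(2)}, cached: $${cachedCost.toFixed(2)}`);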

Optimization 3: Batch Non-Urgent Work



Saved $12/month. Both Anthropic and OpenAI offer batch APIs with a 50% discount. The tradeoff: results are returned within 24 hours instead of seconds. For any workload that does not need a real-time response -- nightly report generation, bulk classification, content moderation queues, data enrichment pipelines -- batching cuts costs in half.

// Collect tickets that arrived overnight
const overnightTickets = await db.tickets.find({
  created_at: { $gte: lastProcessedAt },
  classified: false,
});

// Build batch request
const batchRequests = overnightTickets.map((ticket) => ({
  custom_id: `ticket-${ticket.id}`,
  params: {
    model: "claude-haiku-4-20250414",
    max_tokens: 100,
    messages: [
      {
        role: "user",
        content: `Classify this ticket: billing, technical, feature-request, or other.\n\n"${ticket.body}"`,
      },
    ],
  },
}));

// Submit batch (50% cheaper, results within 24h)
const batch = await client.messages.batches.create({
  requests: batchRequests,
});

console.log(`Submitted batch ${batch.id} with ${batchRequests.length} requests`);

// Poll for results (or set up a webhook); results are available once the
// batch's processing_status is "ended"
const results = await client.messages.batches.results(batch.id);
for await (const entry of results) {
  if (entry.result.type !== "succeeded") continue;
  await db.tickets.update(entry.custom_id.replace("ticket-", ""), {
    category: entry.result.message.content[0].text,
    classified: true,
  });
}


In the support pipeline we optimized, 35% of all Claude API calls were classification tasks that ran in a nightly cron job. Moving them to the batch API cut their cost from $24/month to $12/month. The classification results were available by 6 AM, well before the support team started their day.

Optimization 4: Add Circuit Breakers for Agent Loops



Saved $4/month (and prevented two potential $50-100 incidents). This is not a cost optimization in the traditional sense. It is insurance against catastrophic cost events. AI agents that call tools in loops, retry on failure without backoff, or enter recursive reasoning chains can burn through your entire monthly budget in minutes.

class LLMCircuitBreaker {
  constructor(maxCallsPerMinute = 60, maxCostPerHour = 10.0) {
    this.maxCalls = maxCallsPerMinute;
    this.maxCost = maxCostPerHour;
    this.callCount = 0;
    this.costAccumulated = 0;
    this.minuteWindowStart = Date.now();
    this.hourWindowStart = Date.now();
    this.consecutiveErrors = 0;
  }

  async call(fn, estimatedCost) {
    const now = Date.now();
    // Reset the call-rate window every minute
    if (now - this.minuteWindowStart > 60_000) {
      this.callCount = 0;
      this.minuteWindowStart = now;
    }
    // Reset the cost window every hour
    if (now - this.hourWindowStart > 3_600_000) {
      this.costAccumulated = 0;
      this.hourWindowStart = now;
    }

    if (this.callCount >= this.maxCalls) {
      throw new Error(`Circuit breaker: ${this.maxCalls} calls/min limit hit`);
    }
    if (this.costAccumulated >= this.maxCost) {
      throw new Error(`Circuit breaker: $${this.maxCost}/hour cost limit hit`);
    }
    if (this.consecutiveErrors >= 5) {
      throw new Error("Circuit breaker: 5 consecutive errors");
    }

    this.callCount++;
    this.costAccumulated += estimatedCost;

    try {
      const result = await fn();
      this.consecutiveErrors = 0;
      return result;
    } catch (err) {
      this.consecutiveErrors++;
      throw err;
    }
  }
}

// Usage
const breaker = new LLMCircuitBreaker(60, 5.0);

const result = await breaker.call(
  () =>
    client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages,
    }),
  0.02 // estimated cost per call
);


During the 30-day optimization period, the circuit breaker caught two runaway agent loops before they could accumulate significant costs. Based on the loop rate (12 calls/second), each incident would have cost $50-100 over a weekend before manual intervention. The breaker terminated them in under 60 seconds, limiting damage to $1-2 each.

Optimization 5: Budget Monitoring with Hard Stops



Saved $4/month by catching prompt bloat early. Circuit breakers stop acute incidents. Budget monitoring catches gradual cost creep -- the kind that happens when prompts get longer over time, traffic grows 15% month-over-month, or someone adds a new LLM call to a high-traffic endpoint without thinking about cost implications.

import anthropic

# Route through AIWatch for real-time cost monitoring
client = anthropic.Anthropic(
    base_url="https://aiwatch.luxkern.com/v1/proxy/anthropic"
)

# Tag every call with the feature name for cost attribution
response = client.messages.create(
    model="claude-haiku-4-20250414",
    max_tokens=100,
    messages=[{"role": "user", "content": f"Classify: {ticket_body}"}],
    extra_headers={
        "X-Luxkern-Feature": "ticket-classification",
    },
)


In the AIWatch dashboard, set a daily budget of $5 with an 80% alert to Slack and a hard stop at 100%. The 80% alert ($4) fires early enough to investigate. The hard stop prevents a single day from blowing through your monthly allocation.

During optimization, the budget monitoring caught a case where a developer had accidentally doubled the system prompt length for the response-drafting feature, increasing per-call costs by 40%. The daily spend alert flagged the increase within 24 hours. Without monitoring, it would have shown up on the monthly invoice as $18 in unexpected additional spend. Our guide on monitoring Claude API costs in production walks through the full AIWatch setup, including per-customer budgets and feature-level cost attribution.

The Full Breakdown: $127 to $34



Here is the complete optimization breakdown from the support automation pipeline:

| Optimization | Before | After | Savings | Effort |
|---|---|---|---|---|
| Model routing (70% of calls to Haiku) | $127.00 | $71.00 | $56.00 | 45 min |
| Prompt caching (2K-token system prompt) | $71.00 | $54.00 | $17.00 | 15 min |
| Batch processing (nightly classification) | $54.00 | $42.00 | $12.00 | 30 min |
| Circuit breaker (prevented 2 runaway loops) | $42.00 | $38.00 | $4.00 | 20 min |
| Budget monitoring (caught prompt bloat) | $38.00 | $34.00 | $4.00 | 10 min |
| Total | $127.00 | $34.00 | $93.00 (73%) | 2 hours |

The optimizations compound. Model routing reduces the base cost. Caching reduces the per-call cost on what remains. Batching halves the cost of deferrable work. Circuit breakers and monitoring prevent regressions. Two hours of work saves $93 every month, which adds up to $1,116 a year.

Implementation Order



If you are starting from scratch, implement in this order:

  1. Model routing -- largest impact, lowest effort. Audit your calls, move simple tasks to Haiku.
  2. Budget monitoring -- set up AIWatch with daily alerts before going further. You need visibility to measure everything else.
  3. Prompt caching -- if your system prompt is over 1,000 tokens, enable caching. Near-zero implementation cost.
  4. Circuit breakers -- mandatory if you have agent loops or tool-use patterns.
  5. Batch processing -- apply to any background workload that tolerates latency.


Each optimization is independent. You do not need to do all five. Model routing alone saves most teams 30-40% of their bill. But the compounding effect of all five is what gets you from $127 to $34.

For teams running AI agents in production, behavioral regression testing catches the quality side of the equation -- ensuring that cost optimizations do not silently degrade output quality.

Stop paying $15 per million tokens for work that a $0.25 model handles perfectly. Your LLM budget should reflect the complexity of each task, not the laziness of routing everything to one model.

Try Luxkern AIWatch free -- no credit card required.