Last month, I watched a startup burn through $50,000 in OpenAI credits in 72 hours. Their crime? Running the same customer support queries through GPT-4 thousands of times. The fix took 4 hours and reduced their costs by 95%.

Here's the thing: your users ask the same questions over and over. If you're not caching LLM responses, you're setting money on fire.

The $50K Wake-Up Call

The startup had built an AI customer support agent. Beautiful product, great UX, customers loved it. But they made a classic mistake: treating every query as unique.

Their logs told the real story:

  • "How do I reset my password?" - 8,432 times
  • "What's your refund policy?" - 6,218 times
  • "How do I upgrade my plan?" - 5,891 times

Each query cost ~$0.04 with GPT-4. Do the math: 8,432 × $0.04 comes to roughly $337 paid to OpenAI just to answer the password reset question.

Enter: The Cache

We implemented a semantic cache in one afternoon. Here's what changed:

// Before: Every request hits the API
const response = await openai.complete(userQuery);

// After: Cache handles 95% of requests
const response = await cache.get(userQuery, async () => {
  return await openai.complete(userQuery);
});

That's it. Three lines of code. $47,500 saved per month.

The Three Levels of LLM Caching

Level 1: Exact Match (The Gateway Drug)

Start here. It's dead simple and catches more than you'd think.

const cache = new Map();

async function getCachedResponse(query) {
  if (cache.has(query)) {
    return cache.get(query);
  }
  
  const response = await llm.complete(query);
  cache.set(query, response);
  return response;
}

This catches:

  • Repeated API calls from retries
  • Common queries from different users
  • Pagination requests for the same data

Hit rate: 20-30% in most applications.
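
A cheap way to push that number higher is to normalize queries before using them as keys. A minimal sketch (pick whatever rules fit your traffic):

function normalizeQuery(query) {
  // Lowercase, trim, and collapse whitespace so trivially different
  // phrasings map to the same cache key
  return query.toLowerCase().trim().replace(/\s+/g, ' ');
}

async function getCachedResponse(query) {
  const key = normalizeQuery(query);  // cache on the normalized form
  if (cache.has(key)) {
    return cache.get(key);
  }

  const response = await llm.complete(query);  // still send the original wording
  cache.set(key, response);
  return response;
}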

Level 2: Semantic Caching (The Money Maker)

This is where it gets interesting. Instead of exact matches, we use embeddings to find similar queries:

const semanticCache = new SemanticCache({
  threshold: 0.95,  // Similarity threshold
  embedding: 'text-embedding-ada-002'
});

// These all return the same cached response:
"How do I reset my password?"
"I forgot my password"
"password reset"
"can't log in, need new password"

The magic: 70-90% cache hit rate for customer support, FAQ, and documentation queries.

Level 3: Template Caching (The Pro Move)

For structured queries, cache at the template level:

// Template: "Summarize the {type} report for {month}"
const template = extractTemplate(query);
const params = extractParams(query);

const cacheKey = `${template}:${hashParams(params)}`;
const cached = await cache.get(cacheKey);

This works beautifully for:

  • Report generation
  • Data analysis queries
  • Structured chatbot responses
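
The snippet above assumes extractTemplate and extractParams exist. For a known, fixed set of templates, a simple regex-based version is enough; here's a hypothetical sketch for the report example:

import { createHash } from 'crypto';

// Hypothetical template registry for the report example above
const TEMPLATES = [
  {
    name: 'summarize-report',
    pattern: /^summarize the (\w+) report for (\w+)$/i,
    params: ['type', 'month']
  }
];

function matchTemplate(query) {
  for (const t of TEMPLATES) {
    const match = query.trim().match(t.pattern);
    if (match) {
      const params = Object.fromEntries(
        t.params.map((name, i) => [name, match[i + 1].toLowerCase()])
      );
      return { template: t.name, params };
    }
  }
  return null;  // no template matched; fall back to exact or semantic caching
}

function hashParams(params) {
  return createHash('sha256').update(JSON.stringify(params)).digest('hex');
}

// "Summarize the sales report for March" and "summarize the SALES report for march"
// now share the key "summarize-report:" + hashParams({ type: 'sales', month: 'march' })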

Real Production Patterns

Pattern 1: The Request Deduplicator

Multiple users asking the same thing simultaneously? Don't make multiple API calls:

class RequestDeduplicator {
  constructor() {
    this.inFlight = new Map();
  }
  
  async request(key, fetchFn) {
    if (this.inFlight.has(key)) {
      return this.inFlight.get(key);
    }
    
    const promise = fetchFn();
    this.inFlight.set(key, promise);
    
    try {
      return await promise;
    } finally {
      this.inFlight.delete(key);
    }
  }
}
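
Wiring it in front of the LLM call takes one line. A quick sketch, reusing the openai client and the normalizeQuery helper from earlier:

const dedupe = new RequestDeduplicator();

// Concurrent identical queries share a single in-flight API call
const response = await dedupe.request(normalizeQuery(userQuery), () =>
  openai.complete(userQuery)
);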

Pattern 2: The Sliding Window Cache

For time-sensitive data, implement sliding expiration:

class SlidingWindowCache {
  constructor() {
    this.cache = new Map();
  }

  set(key, value, ttl) {
    // ttl is in milliseconds
    const expiry = Date.now() + ttl;
    this.cache.set(key, { value, expiry, ttl });
  }

  get(key) {
    const item = this.cache.get(key);
    if (!item || item.expiry < Date.now()) {
      this.cache.delete(key);
      return null;
    }
    // Sliding window: every hit pushes the expiry forward
    item.expiry = Date.now() + item.ttl;
    return item.value;
  }
}

Pattern 3: The Hierarchical Cache

Different TTLs for different query types:

const cacheConfig = {
  'weather': { ttl: 30 * 60 },         // 30 minutes
  'stock-price': { ttl: 60 },          // 1 minute
  'definition': { ttl: 7 * 24 * 3600 }, // 1 week
  'default': { ttl: 3600 }             // 1 hour
};
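
Resolving the TTL for a query type is then a one-line lookup (a hypothetical helper; note the config values are in seconds, while the sliding-window cache above works in milliseconds):

function ttlFor(queryType) {
  return (cacheConfig[queryType] ?? cacheConfig['default']).ttl;
}

// Convert seconds to milliseconds for the sliding-window cache
cache.set(cacheKey, response, ttlFor('stock-price') * 1000);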

The Numbers Don't Lie

From three production deployments last quarter:

  Company                      Before      After       Savings   Hit Rate
  E-commerce Support Bot       $28K/mo     $1.8K/mo    94%       91%
  SaaS Analytics Agent         $45K/mo     $6K/mo      87%       83%
  Legal Document Assistant     $12K/mo     $2.1K/mo    82%       78%

Implementation Gotchas

Learn from my mistakes:

1. Cache Invalidation

The two hardest problems in computer science are cache invalidation, naming things, and off-by-one errors. For LLMs, use TTLs aggressively:

  • User-specific data: 5-15 minutes
  • General knowledge: 1-7 days
  • Real-time data: 30-60 seconds

2. Memory Management

LLM responses are big. A naive cache will eat your RAM:

// Bad: Unbounded growth
cache.set(key, hugeResponse);

// Good: LRU with a size limit (using the lru-cache package)
import { LRUCache } from 'lru-cache';

const cache = new LRUCache({
  max: 1000,  // max items
  maxSize: 100 * 1024 * 1024,  // 100MB total
  sizeCalculation: (value) => JSON.stringify(value).length
});

3. Security Considerations

Never cache:

  • Personally identifiable information
  • User-specific recommendations
  • Sensitive business data

Always:

  • Include user context in cache keys when needed (see the snippet after this list)
  • Implement proper access controls
  • Encrypt cache data at rest
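
For the first item on that list, scoping the key by tenant or user is usually enough. A sketch, where orgId is whatever identifier your auth layer already gives you:

// Responses that depend on who is asking must never leak across tenants
const cacheKey = `${orgId}:${normalizeQuery(userQuery)}`;
const response = await cache.get(cacheKey, () => openai.complete(userQuery));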

The 5-Minute Implementation

Want to start right now? Here's the core of a Redis-backed semantic cache; fleshed out, it comes in at around 50 lines:

import { Redis } from 'ioredis';
import { OpenAI } from 'openai';

class QuickCache {
  constructor() {
    this.redis = new Redis(process.env.REDIS_URL);
    this.openai = new OpenAI();
  }
  
  async get(query, generateFn) {
    // Generate an embedding vector for the query
    const { data } = await this.openai.embeddings.create({
      input: query,
      model: 'text-embedding-ada-002'
    });
    const embedding = data[0].embedding;

    // Check the cache for a semantically similar query
    const cached = await this.findSimilar(embedding);
    if (cached) return cached;
    
    // Generate fresh response
    const response = await generateFn();
    
    // Store with embedding
    await this.store(query, response, embedding);
    
    return response;
  }
  
  // Implement findSimilar and store...
}
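
The two deferred methods are where the real decisions live. Here's one minimal way to fill them in, assuming each entry is a Redis hash holding the response plus its embedding as JSON, compared with a brute-force cosine scan. (Production setups usually swap the scan for a real vector index, e.g. RediSearch's vector similarity support.)

import { createHash } from 'crypto';

const SIMILARITY_THRESHOLD = 0.95;

// Plain cosine similarity between two embedding vectors
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class QuickCache {
  // ...constructor and get() as above...

  async findSimilar(embedding) {
    // KEYS is fine for a small cache; switch to SCAN for anything serious
    const keys = await this.redis.keys('qcache:*');
    for (const key of keys) {
      const entry = await this.redis.hgetall(key);
      if (!entry.embedding) continue;
      const similarity = cosineSimilarity(embedding, JSON.parse(entry.embedding));
      if (similarity >= SIMILARITY_THRESHOLD) {
        return entry.response;
      }
    }
    return null;
  }

  async store(query, response, embedding) {
    const key = `qcache:${createHash('sha256').update(query).digest('hex')}`;
    await this.redis.hset(key, {
      response,
      embedding: JSON.stringify(embedding)
    });
    await this.redis.expire(key, 3600);  // 1 hour TTL; tune per query type
  }
}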

What's Next?

Caching is just the beginning. In production, you'll want:

  1. Monitoring: Track hit rates, latency, and savings (a minimal counter sketch follows this list)
  2. A/B Testing: Compare cached vs fresh responses
  3. Smart Invalidation: Update cache based on data changes
  4. Edge Caching: Put cache close to users
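
For the monitoring piece, even a pair of counters gets you the number that matters. A toy sketch; in production you'd push these into your metrics system:

const stats = { hits: 0, misses: 0 };

function recordLookup(hit) {
  if (hit) stats.hits++;
  else stats.misses++;
}

function hitRate() {
  const total = stats.hits + stats.misses;
  return total === 0 ? 0 : stats.hits / total;
}

// Rough savings estimate: every hit is an API call you didn't pay for
const estimatedSavings = () => stats.hits * 0.04;  // assuming ~$0.04 per GPT-4 call, as above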

The Bottom Line

Every AI application needs a cache. Period. If you're making the same API call twice, you're doing it wrong.

Start simple. Even a basic exact-match cache will save you thousands. Then graduate to semantic caching when you're ready to save serious money.

Remember: The best API call is the one you don't make.


Want to implement caching in your AI application? Check out my open source cache patterns library or try the ML cost calculator to see how much you could save.