Everyone talks about AI being expensive. I built a SaaS that handles 50K daily active users, processes 2M+ AI requests per month, and runs on $200/month infrastructure. It's been profitable since month 3.

This isn't theoretical. This is the exact architecture powering ContentOptimizer.ai (name changed), which helps e-commerce teams write product descriptions. Here's how we did it.

The Numbers First

Let's start with what matters - the money:

  • Monthly Revenue: $8,500 (170 customers × $50/mo)
  • Infrastructure Cost: $198.47/mo
  • AI API Cost: ~$300/mo (after optimizations)
  • Gross Margin: 94%

Here's the infrastructure breakdown:

Vercel (Frontend + API):        $20/mo
Supabase (Database + Auth):     $25/mo
Redis Cloud (Caching):          $30/mo
Cloudflare R2 (Storage):        $5/mo
Resend (Email):                 $20/mo
BetterUptime (Monitoring):      $29/mo
Plausible (Analytics):          $9/mo
DigitalOcean (Background Jobs): $60/mo
-----------------------------------
Total:                          $198/mo

The Architecture

Here's the full system design:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Next.js   │────▶│ Edge Function│────▶│   Redis     │
│   Frontend  │     │   (Vercel)   │     │   Cache     │
└─────────────┘     └──────────────┘     └─────────────┘
                           │                     │
                           ▼                     ▼
                    ┌──────────────┐     ┌─────────────┐
                    │  Supabase    │     │ OpenAI API  │
                    │  PostgreSQL  │     │  (Fallback) │
                    └──────────────┘     └─────────────┘
                           │
                           ▼
                     ┌──────────────┐
                     │ Background   │
                     │ Workers (DO) │
                     └──────────────┘

The Secret Sauce: Intelligent Caching

The #1 reason we can run this cheaply: 93% cache hit rate.

Three-Layer Cache Strategy

Layer 1: Exact Match Cache (40% hit rate)

// Redis exact match cache
const cacheKey = `exact:${hashString(prompt)}`;
const cached = await redis.get(cacheKey);

if (cached) {
  return { result: cached, source: 'exact-cache' };
}
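
hashString just needs to be stable across identical prompts. A minimal sketch using Node's crypto, with light whitespace normalization (an assumption on my part) so trivially different spacing still hits the cache:

// Sketch: normalize, then hash, so identical prompts map to the same key
import { createHash } from 'crypto';

function hashString(prompt) {
  const normalized = prompt.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(normalized).digest('hex');
}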

Layer 2: Semantic Cache (35% hit rate)

// Find similar prompts using embeddings
const embedding = await getEmbedding(prompt);
const similar = await findSimilarVectors(embedding, 0.95);

if (similar.length > 0) {
  return { result: similar[0].response, source: 'semantic-cache' };
}
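
getEmbedding and findSimilarVectors aren't shown above. One plausible version, assuming an OpenAI embeddings call plus a pgvector table behind a custom match_cached_prompts function in Supabase (the function name and embedding model are my assumptions, not the production setup):

// Sketch: embed the prompt, then ask Postgres/pgvector for near-identical prompts
async function getEmbedding(prompt) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: prompt,
  });
  return res.data[0].embedding;
}

async function findSimilarVectors(embedding, threshold) {
  // match_cached_prompts is a hypothetical SQL function returning
  // { prompt, response, similarity } rows above the cosine threshold
  const { data, error } = await supabase.rpc('match_cached_prompts', {
    query_embedding: embedding,
    match_threshold: threshold,
    match_count: 1,
  });
  if (error) throw error;
  return data;
}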

Layer 3: Template Cache (18% hit rate)

// Extract template and parameters
const template = extractTemplate(prompt);
const params = extractParams(prompt);

// Cache by template + param hash
const cacheKey = `template:${template}:${hashParams(params)}`;
const cached = await redis.get(cacheKey);

if (cached) {
  return { result: cached, source: 'template-cache' };
}
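
extractTemplate and extractParams do the heavy lifting here. A rough sketch of the idea, treating quoted strings and numbers as the parameters and everything else as the template (the heuristics are mine):

// Sketch: strip out the variable parts so structurally identical prompts share a key
function extractTemplate(prompt) {
  return prompt
    .replace(/"[^"]*"/g, '{str}')      // quoted product names, keywords, etc.
    .replace(/\d+(\.\d+)?/g, '{num}'); // word counts, prices, dimensions
}

function extractParams(prompt) {
  return {
    strings: prompt.match(/"[^"]*"/g) || [],
    numbers: prompt.match(/\d+(\.\d+)?/g) || [],
  };
}

function hashParams(params) {
  return hashString(JSON.stringify(params)); // reuses the hash helper sketched earlier
}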

This leaves only 7% of requests hitting the OpenAI API.
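
When all three layers miss, the request finally goes to OpenAI, and the response is written back so the next identical prompt is an exact-cache hit. A rough sketch of that last step (selectModel is covered in the next section; the 24-hour TTL matches the figures later in the post):

// Full miss: call the API, then backfill the exact-match cache
async function generateAndCache(prompt, user) {
  const completion = await openai.chat.completions.create({
    model: selectModel(prompt, user),
    messages: [{ role: 'user', content: prompt }],
  });

  const result = completion.choices[0].message.content;
  await redis.setex(`exact:${hashString(prompt)}`, 86400, result); // 24h TTL

  return { result, source: 'openai' };
}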

Cost Optimization Tricks

1. Smart Model Routing

Not every request needs GPT-4:

function selectModel(prompt, user) {
  // Simple requests → GPT-3.5 Turbo
  if (prompt.length < 100 && !hasComplexRequirements(prompt)) {
    return 'gpt-3.5-turbo';
  }
  
  // Premium users → GPT-4
  if (user.plan === 'premium') {
    return 'gpt-4-turbo-preview';
  }
  
  // Long requests → GPT-4, everything else falls back to GPT-3.5 Turbo
  return prompt.length > 500 ? 'gpt-4' : 'gpt-3.5-turbo';
}

Result: 78% of requests use the cheaper model.
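
hasComplexRequirements isn't shown above; in practice it can be as simple as a keyword scan (these heuristics are illustrative, not the production rules):

// Sketch: flag prompts that ask for structure, tone, or multi-part output
function hasComplexRequirements(prompt) {
  const signals = [/bullet/i, /table/i, /\bseo\b/i, /tone/i, /compare/i, /step[- ]by[- ]step/i];
  return signals.some((re) => re.test(prompt));
}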

2. Request Batching

Instead of processing immediately, we batch similar requests:

// Collect requests for 100ms
const batch = await collectBatch(100);

// Process as single prompt
const batchPrompt = formatBatchPrompt(batch);
const result = await openai.complete(batchPrompt);

// Parse and distribute results
const individual = parseBatchResponse(result);
batch.forEach((req, i) => req.resolve(individual[i]));

This reduces API calls by 60% during peak hours.
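
collectBatch isn't shown above. One simple implementation of the 100ms window is a shared buffer that request handlers push into, with each entry carrying its own resolve callback (which is what req.resolve(individual[i]) relies on). A sketch:

// Sketch only: shared buffer plus a fixed flush window
const pending = [];

// Request handlers call enqueue() instead of hitting OpenAI directly
function enqueue(prompt) {
  return new Promise((resolve, reject) => {
    pending.push({ prompt, resolve, reject });
  });
}

// The batch loop waits windowMs, then drains whatever arrived
function collectBatch(windowMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(pending.splice(0)), windowMs);
  });
}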

3. Aggressive Precomputation

We precompute common outputs during off-peak hours:

// Run at 3 AM daily
async function precomputeCommon() {
  const commonPrompts = await getTop1000Prompts();
  
  for (const prompt of commonPrompts) {
    const result = await generateWithCache(prompt);
    await redis.setex(`precomputed:${prompt}`, 86400, result);
  }
}
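
getTop1000Prompts can come straight from the generations table, which already stores every prompt. A sketch, assuming a supabase-js client and a small top_prompts SQL function you'd define yourself (both names are mine):

// Sketch: fetch the most frequently repeated prompts from the last 30 days
async function getTop1000Prompts() {
  // top_prompts is a hypothetical SQL function that groups generations
  // by MD5(prompt) and returns the most common prompt per group
  const { data, error } = await supabase.rpc('top_prompts', {
    since: new Date(Date.now() - 30 * 24 * 3600 * 1000).toISOString(),
    max_rows: 1000,
  });
  if (error) throw error;
  return data.map((row) => row.prompt);
}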

The Database Strategy

Supabase gives us Postgres + Auth + Realtime for $25/mo. Here's our schema:

-- Core tables only
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT UNIQUE,
  plan TEXT DEFAULT 'free',
  usage_this_month INT DEFAULT 0
);

CREATE TABLE generations (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  prompt TEXT,
  result TEXT,
  model TEXT,
  cached BOOLEAN,
  cost DECIMAL(10,6),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Minimal indexes
CREATE INDEX idx_user_created ON generations(user_id, created_at);
CREATE INDEX idx_prompt_hash ON generations(MD5(prompt));

Key decisions:

  • No complex relationships
  • Denormalized where it makes sense (see the usage-counter sketch after this list)
  • Indexes only where absolutely needed
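
The usage counter is the clearest example of that denormalization: users.usage_this_month is bumped on every generation instead of being recomputed with a COUNT(*) over generations. A sketch, assuming a one-line increment_usage SQL function (the name is mine):

// Sketch: write the generation row, then bump the denormalized counter
async function recordGeneration(userId, row) {
  await supabase.from('generations').insert(row);

  // increment_usage is hypothetical, roughly:
  // UPDATE users SET usage_this_month = usage_this_month + 1 WHERE id = p_user_id
  await supabase.rpc('increment_usage', { p_user_id: userId });
}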

Background Jobs Without Breaking the Bank

We run background jobs on a single $60/mo DigitalOcean droplet:

// Simple job queue using Redis (assumes an ioredis-style client)
class JobQueue {
  async add(type, data) {
    const job = { id: uuid(), type, data, attempts: 0 };
    await redis.lpush('jobs', JSON.stringify(job));
    return job.id; // callers need the id to report status
  }
  
  async process() {
    while (true) {
      // brpop resolves to [listName, payload], or null after the 1s timeout
      const res = await redis.brpop('jobs', 1);
      if (res) await this.handleJob(JSON.parse(res[1]));
    }
  }
}

This handles four job types (dispatch sketch after the list):

  • Email notifications
  • Usage tracking
  • Cache warming
  • Data exports
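
handleJob just switches on job.type and retries a couple of times before giving up; the individual handlers (sendEmail, recordUsage, exportUserData) are stand-ins for whatever you wire up. A sketch:

// Sketch of JobQueue.handleJob: route by type, requeue on failure
async handleJob(job) {
  try {
    switch (job.type) {
      case 'email':      await sendEmail(job.data); break;      // hypothetical helper
      case 'usage':      await recordUsage(job.data); break;    // hypothetical helper
      case 'cache-warm': await precomputeCommon(); break;       // from the caching section
      case 'export':     await exportUserData(job.data); break; // hypothetical helper
      default:           console.warn('Unknown job type:', job.type);
    }
  } catch (err) {
    job.attempts += 1;
    if (job.attempts < 3) await redis.lpush('jobs', JSON.stringify(job));
  }
}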

Monitoring on a Budget

You can't optimize what you don't measure:

// Custom metrics to Plausible
function trackMetric(event, props) {
  // Plausible for user-facing metrics
  plausible(event, { props });
  
  // Console.log for CloudWatch (free tier)
  console.log(JSON.stringify({
    metric: event,
    ...props,
    timestamp: Date.now()
  }));
}

Key metrics we track:

  • Cache hit rates by type
  • API costs by endpoint
  • Response times by model
  • User actions by plan

Scaling Challenges & Solutions

Challenge 1: Redis Memory Limits

Problem: The cache was eating too much memory

Solution: Intelligent eviction + compression

// Compress large values (gzip is Node's zlib.gzip, promisified)
import { promisify } from 'util';
import zlib from 'zlib';

const gzip = promisify(zlib.gzip);

async function cacheSet(key, value, ttl) {
  const size = Buffer.byteLength(JSON.stringify(value));

  if (size > 1024) { // 1KB threshold
    const compressed = await gzip(JSON.stringify(value));
    await redis.setex(`${key}:gz`, ttl, compressed);
  } else {
    await redis.setex(key, ttl, JSON.stringify(value));
  }
}
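
The read path has to check both keys. A matching sketch, assuming an ioredis client (getBuffer returns the raw bytes) and gunzip promisified the same way as gzip above:

// Matching read path: try the compressed key first, then the plain one
const gunzip = promisify(zlib.gunzip);

async function cacheGet(key) {
  const compressed = await redis.getBuffer(`${key}:gz`);
  if (compressed) {
    const raw = await gunzip(compressed);
    return JSON.parse(raw.toString());
  }

  const plain = await redis.get(key);
  return plain ? JSON.parse(plain) : null;
}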

Challenge 2: Vercel Function Timeouts

Problem: Complex requests were timing out at Vercel's 10-second function limit

Solution: Async processing with webhooks

// Return immediately, process async
app.post('/generate', async (req, res) => {
  const jobId = await jobQueue.add('generate', req.body);
  
  res.json({ 
    jobId, 
    status: 'processing',
    webhook: `/status/${jobId}`
  });
});
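
The worker writes its output to Redis when it finishes, and the client polls the status URL (or we POST to a customer-supplied webhook). A sketch of the status endpoint, assuming results land under a result:{jobId} key (a naming convention I'm assuming here):

// Sketch: the worker stores the finished output at result:{jobId}; this just reads it
app.get('/status/:jobId', async (req, res) => {
  const result = await redis.get(`result:${req.params.jobId}`);

  if (!result) {
    return res.json({ status: 'processing' });
  }

  res.json({ status: 'done', result: JSON.parse(result) });
});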

Challenge 3: Cost Spikes

Problem: A handful of users were hammering the API

Solution: Dynamic rate limiting

// Adjust limits based on usage patterns
async function getRateLimit(userId) {
  const usage = await getUsageLastHour(userId);
  
  if (usage > 100) return { limit: 10, window: '1h' };
  if (usage > 50) return { limit: 50, window: '1h' };
  return { limit: 100, window: '1h' };
}
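
Enforcement is then a fixed-window counter in Redis, checked before any generation runs. A sketch:

// Sketch: fixed-window counter keyed by user and hour
async function enforceRateLimit(userId) {
  const { limit } = await getRateLimit(userId);
  const windowKey = `rl:${userId}:${Math.floor(Date.now() / 3600000)}`;

  const count = await redis.incr(windowKey);
  if (count === 1) await redis.expire(windowKey, 3600); // first hit starts the window

  if (count > limit) {
    throw new Error('Rate limit exceeded');
  }
}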

Lessons Learned

1. Start with Boring Tech

We could have used Kubernetes, microservices, and a fancy ML pipeline. Instead:

  • Next.js + Vercel = Dead simple deployment
  • Supabase = Database + Auth solved
  • Redis = Cache that just works

2. Cache Everything, Trust Nothing

Our cache strategy evolved from "cache some things" to "cache everything possible". Here are the TTLs we settled on (a config sketch follows the list):

  • API responses: 24 hour TTL
  • User data: 5 minute TTL
  • Embeddings: 7 day TTL
  • Static content: 30 day TTL
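
A minimal sketch of those TTLs kept in one place (the names are mine):

// Sketch: single source of truth for cache TTLs, in seconds
const TTL = {
  apiResponse: 24 * 3600,        // API responses: 24 hours
  userData: 5 * 60,              // user data: 5 minutes
  embedding: 7 * 24 * 3600,      // embeddings: 7 days
  staticContent: 30 * 24 * 3600, // static content: 30 days
};

// usage: redis.setex(cacheKey, TTL.apiResponse, result)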

3. Optimize for the Common Case

80% of our users generate similar content. We optimized ruthlessly for them:

  • Precomputed templates
  • Suggested prompts
  • One-click regeneration

What's Next?

At current growth, we'll need to scale up at around 200K users. The plan:

  1. Move to dedicated Redis: ~$200/mo for 5GB
  2. Add read replicas: ~$100/mo for Postgres
  3. CDN for static assets: ~$50/mo
  4. Multi-region deployment: ~$200/mo

Even then, we're looking at ~$750/mo infrastructure for 200K users. That's $0.00375 per user.

The Uncomfortable Truth

Most AI startups fail not because AI is expensive, but because they:

  1. Over-engineer from day one
  2. Don't implement proper caching
  3. Use GPT-4 for everything
  4. Ignore usage patterns
  5. Scale infrastructure before product-market fit

Start small. Cache aggressively. Use boring tech. The $200/mo SaaS isn't a limitation - it's a feature.

Want to Build Your Own?

I've open-sourced the core caching logic at github.com/BinaryBourbon/llm-cache-patterns. The patterns there will get you 80% of the way.

Questions? Hit me up on Twitter. I love talking about infrastructure that doesn't break the bank.


This is part of my series on building profitable AI products. Next up: "The Hidden Costs of GPT-4" - a deep dive into when premium models are actually worth it.