Everyone talks about AI being expensive. I built a SaaS that handles 50K daily active users, processes 2M+ AI requests per month, and runs on $200/month infrastructure. It's been profitable since month 3.
This isn't theoretical. This is the exact architecture powering ContentOptimizer.ai (name changed), which helps e-commerce teams write product descriptions. Here's how we did it.
The Numbers First
Let's start with what matters - the money:
- Monthly Revenue: $8,500 (170 customers × $50/mo)
- Infrastructure Cost: $198.47/mo
- AI API Cost: ~$300/mo (after optimizations)
- Gross Margin: 94%
Here's the infrastructure breakdown:
Vercel (Frontend + API):          $20/mo
Supabase (Database + Auth):       $25/mo
Redis Cloud (Caching):            $30/mo
Cloudflare R2 (Storage):           $5/mo
Resend (Email):                   $20/mo
BetterUptime (Monitoring):        $29/mo
Plausible (Analytics):             $9/mo
DigitalOcean (Background Jobs):   $60/mo
----------------------------------------
Total:                           $198/mo
The Architecture
Here's the full system design:
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Next.js   │────▶│ Edge Function│────▶│    Redis    │
│  Frontend   │     │   (Vercel)   │     │    Cache    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │                    │
                           ▼                    ▼
                    ┌──────────────┐     ┌─────────────┐
                    │   Supabase   │     │  OpenAI API │
                    │  PostgreSQL  │     │  (Fallback) │
                    └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  Background  │
                    │ Workers (DO) │
                    └──────────────┘
The Secret Sauce: Intelligent Caching
The #1 reason we can run this cheaply: 93% cache hit rate.
Three-Layer Cache Strategy
Layer 1: Exact Match Cache (40% hit rate)
// Redis exact match cache
const cacheKey = `exact:${hashString(prompt)}`;
const cached = await redis.get(cacheKey);
if (cached) {
  return { result: cached, source: 'exact-cache' };
}
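hashString can be as simple as a SHA-256 digest of the normalized prompt; a minimal sketch using Node's built-in crypto (the normalization is just so trivially different prompts land on the same key):
// Minimal sketch of hashString: SHA-256 over a normalized prompt
const crypto = require('crypto');

function hashString(input) {
  return crypto
    .createHash('sha256')
    .update(input.trim().toLowerCase())
    .digest('hex');
}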
Layer 2: Semantic Cache (35% hit rate)
// Find similar prompts using embeddings
const embedding = await getEmbedding(prompt);
const similar = await findSimilarVectors(embedding, 0.95);
if (similar.length > 0) {
  return { result: similar[0].response, source: 'semantic-cache' };
}
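getEmbedding and findSimilarVectors do the real work here. One way to back them, sketched with OpenAI embeddings and pgvector on Supabase (the match_prompts function is a hypothetical name, not our exact setup):
// Sketch: semantic lookup backed by pgvector on Supabase
// Assumes a cache table with an embedding vector column and a match_prompts
// SQL function doing the similarity search (hypothetical names)
async function getEmbedding(text) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return res.data[0].embedding;
}

async function findSimilarVectors(embedding, threshold) {
  const { data, error } = await supabase.rpc('match_prompts', {
    query_embedding: embedding,
    match_threshold: threshold,
    match_count: 1
  });
  if (error) return [];
  return data; // [{ response, similarity }]
}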
Layer 3: Template Cache (18% hit rate)
// Extract template and parameters
const template = extractTemplate(prompt);
const params = extractParams(prompt);
// Cache by template + param hash
const cacheKey = `template:${template}:${hashParams(params)}`;
const cached = await redis.get(cacheKey);
if (cached) {
  return { result: cached, source: 'template-cache' };
}
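extractTemplate and extractParams just separate the variable bits (product names, numbers) from the instruction around them. The regexes below are illustrative rather than our production rules, and hashParams reuses the hashString sketch above:
// Illustrative template extraction: swap variable tokens for placeholders
function extractTemplate(prompt) {
  return prompt
    .replace(/"[^"]+"/g, '"{QUOTED}"')  // quoted product names
    .replace(/\d+(\.\d+)?/g, '{NUM}')   // prices, quantities
    .trim()
    .toLowerCase();
}

function extractParams(prompt) {
  return {
    quoted: prompt.match(/"[^"]+"/g) || [],
    numbers: prompt.match(/\d+(\.\d+)?/g) || []
  };
}

function hashParams(params) {
  return hashString(JSON.stringify(params));
}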
This leaves only 7% of requests hitting the OpenAI API.
Cost Optimization Tricks
1. Smart Model Routing
Not every request needs GPT-4:
function selectModel(prompt, user) {
  // Simple requests → GPT-3.5 Turbo
  if (prompt.length < 100 && !hasComplexRequirements(prompt)) {
    return 'gpt-3.5-turbo';
  }

  // Premium users → GPT-4
  if (user.plan === 'premium') {
    return 'gpt-4-turbo-preview';
  }

  // Complex requests → GPT-4 with fallback
  return prompt.length > 500 ? 'gpt-4' : 'gpt-3.5-turbo';
}
Result: 78% of requests use the cheaper model.
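hasComplexRequirements doesn't need to be clever; a few keyword and structure checks cover most cases. The signal list below is illustrative:
// Cheap heuristic for prompts that justify the bigger model
function hasComplexRequirements(prompt) {
  const signals = ['compare', 'table', 'translate', 'seo', 'variants', 'tone'];
  const text = prompt.toLowerCase();
  return signals.some((s) => text.includes(s)) ||
    text.split('\n').length > 5; // multi-part briefs count as complex
}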
2. Request Batching
Instead of processing immediately, we batch similar requests:
// Collect requests for 100ms
const batch = await collectBatch(100);
// Process as single prompt
const batchPrompt = formatBatchPrompt(batch);
const result = await openai.complete(batchPrompt);
// Parse and distribute results
const individual = parseBatchResponse(result);
batch.forEach((req, i) => req.resolve(individual[i]));
This reduces API calls by 60% during peak hours.
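collectBatch is the interesting piece: callers get a promise back immediately, and the queue is flushed on a short timer. A stripped-down sketch of the idea (error handling and per-model grouping omitted):
// Stripped-down batcher: queue requests, flush everything after windowMs
const pending = [];

function enqueueRequest(prompt) {
  return new Promise((resolve, reject) => {
    pending.push({ prompt, resolve, reject });
  });
}

async function collectBatch(windowMs) {
  await new Promise((resolve) => setTimeout(resolve, windowMs));
  return pending.splice(0, pending.length); // take whatever queued up
}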
3. Aggressive Precomputation
We precompute common outputs during off-peak hours:
// Run at 3 AM daily
async function precomputeCommon() {
  const commonPrompts = await getTop1000Prompts();

  for (const prompt of commonPrompts) {
    const result = await generateWithCache(prompt);
    await redis.setex(`precomputed:${prompt}`, 86400, result);
  }
}
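getTop1000Prompts is a single aggregation over the generations table. A sketch via a Supabase RPC (top_prompts is a hypothetical SQL function; the query it wraps is in the comment):
// Sketch: most-repeated prompts over the last 30 days
// Backing SQL (wrapped in a hypothetical top_prompts function):
//   SELECT prompt, COUNT(*) AS uses FROM generations
//   WHERE created_at > NOW() - INTERVAL '30 days'
//   GROUP BY prompt ORDER BY uses DESC LIMIT 1000;
async function getTop1000Prompts() {
  const { data, error } = await supabase.rpc('top_prompts', { max_rows: 1000 });
  if (error) return [];
  return data.map((row) => row.prompt);
}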
The Database Strategy
Supabase gives us Postgres + Auth + Realtime for $25/mo. Here's our schema:
-- Core tables only
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT UNIQUE,
  plan TEXT DEFAULT 'free',
  usage_this_month INT DEFAULT 0
);

CREATE TABLE generations (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  prompt TEXT,
  result TEXT,
  model TEXT,
  cached BOOLEAN,
  cost DECIMAL(10,6),
  created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Minimal indexes
CREATE INDEX idx_user_created ON generations(user_id, created_at);
CREATE INDEX idx_prompt_hash ON generations(MD5(prompt));
Key decisions:
- No complex relationships
- Denormalized where it makes sense (see the usage counter sketch below)
- Indexes only where absolutely needed
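The main denormalization is usage_this_month on users: rather than summing generations on every request, we bump a counter whenever a generation is logged. A rough sketch (increment_usage is a hypothetical SQL function; a monthly cron resets the column):
// Sketch: log the generation and bump the denormalized counter
async function recordGeneration(userId, generation) {
  await supabase.from('generations').insert({ user_id: userId, ...generation });

  // Wraps: UPDATE users SET usage_this_month = usage_this_month + 1 WHERE id = $1
  await supabase.rpc('increment_usage', { uid: userId });
}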
Background Jobs Without Breaking the Bank
We run background jobs on a single $60/mo DigitalOcean droplet:
// Simple job queue using Redis
const { v4: uuid } = require('uuid');

class JobQueue {
  async add(type, data) {
    const job = { id: uuid(), type, data, attempts: 0 };
    await redis.lpush('jobs', JSON.stringify(job));
    return job.id; // hand the id back so the API can report job status
  }

  async process() {
    while (true) {
      // brpop resolves to [listName, payload], or null after the 1s timeout
      const job = await redis.brpop('jobs', 1);
      if (job) await this.handleJob(JSON.parse(job[1]));
    }
  }
}
This handles:
- Email notifications
- Usage tracking
- Cache warming
- Data exports
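handleJob is just a switch over those job types, plus a retry counter so a bad job can't loop forever. A sketch (the individual handlers are elided):
// Sketch of the dispatcher behind JobQueue.process()
async handleJob(job) {
  try {
    switch (job.type) {
      case 'email':      return await sendEmail(job.data);
      case 'usage':      return await trackUsage(job.data);
      case 'cache-warm': return await warmCache(job.data);
      case 'export':     return await exportData(job.data);
      default:           console.warn('Unknown job type:', job.type);
    }
  } catch (err) {
    job.attempts += 1;
    if (job.attempts < 3) await redis.lpush('jobs', JSON.stringify(job)); // retry
  }
}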
Monitoring on a Budget
You can't optimize what you don't measure:
// Custom metrics to Plausible
function trackMetric(event, props) {
  // Plausible for user-facing metrics
  plausible(event, { props });

  // Console.log for CloudWatch (free tier)
  console.log(JSON.stringify({
    metric: event,
    ...props,
    timestamp: Date.now()
  }));
}
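For example, every cache lookup and every model call logs its outcome, which is how we get the hit-rate and cost numbers above (illustrative calls):
// Illustrative call sites
trackMetric('cache_lookup', { layer: 'exact', hit: true });
trackMetric('generation', { model: 'gpt-3.5-turbo', cached: false, costUsd: 0.0004 });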
Key metrics we track:
- Cache hit rates by type
- API costs by endpoint
- Response times by model
- User actions by plan
Scaling Challenges & Solutions
Challenge 1: Redis Memory Limits
Problem: Cache was eating too much memory
Solution: Intelligent eviction + compression
// Compress large values
const { promisify } = require('util');
const gzip = promisify(require('zlib').gzip);

async function cacheSet(key, value, ttl) {
  const size = Buffer.byteLength(JSON.stringify(value));
  if (size > 1024) { // 1KB threshold
    const compressed = await gzip(JSON.stringify(value));
    await redis.setex(`${key}:gz`, ttl, compressed);
  } else {
    await redis.setex(key, ttl, JSON.stringify(value));
  }
}
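The read path mirrors this: check the compressed key first and gunzip if it's there. A sketch, assuming an ioredis-style client (getBuffer returns raw bytes):
// Read path for compressed entries; assumes an ioredis-style client
const { promisify } = require('util');
const gunzip = promisify(require('zlib').gunzip);

async function cacheGet(key) {
  const compressed = await redis.getBuffer(`${key}:gz`);
  if (compressed) {
    return JSON.parse((await gunzip(compressed)).toString());
  }
  const plain = await redis.get(key);
  return plain ? JSON.parse(plain) : null;
}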
Challenge 2: Vercel Function Timeouts
Problem: Complex requests were timing out at the 10s function limit
Solution: Async processing with webhooks
// Return immediately, process async
app.post('/generate', async (req, res) => {
  const jobId = await jobQueue.add('generate', req.body);
  res.json({
    jobId,
    status: 'processing',
    webhook: `/status/${jobId}`
  });
});
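The /status/:jobId endpoint referenced above just reads the finished result out of Redis. A minimal sketch, assuming workers write results under a job:{id} key:
// Minimal status endpoint; assumes workers store results under job:{id}
app.get('/status/:jobId', async (req, res) => {
  const raw = await redis.get(`job:${req.params.jobId}`);
  if (!raw) return res.json({ status: 'processing' });
  res.json({ status: 'done', result: JSON.parse(raw) });
});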
Challenge 3: Cost Spikes
Problem: A handful of users were hammering the API and spiking costs
Solution: Dynamic rate limiting
// Adjust limits based on usage patterns
async function getRateLimit(userId) {
  const usage = await getUsageLastHour(userId);
  if (usage > 100) return { limit: 10, window: '1h' };
  if (usage > 50) return { limit: 50, window: '1h' };
  return { limit: 100, window: '1h' };
}
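Enforcement is a plain Redis counter per user per hour, checked before we touch a model. A sketch (req.userId is assumed to be set by auth middleware):
// Hourly counter per user; compared against the dynamic limit above
async function checkRateLimit(userId) {
  const { limit } = await getRateLimit(userId);
  const key = `ratelimit:${userId}:${new Date().getUTCHours()}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 3600); // first hit opens the window
  return count <= limit;
}

app.use(async (req, res, next) => {
  if (!(await checkRateLimit(req.userId))) {
    return res.status(429).json({ error: 'Rate limit exceeded' });
  }
  next();
});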
Lessons Learned
1. Start with Boring Tech
We could have used Kubernetes, microservices, and a fancy ML pipeline. Instead:
- Next.js + Vercel = Dead simple deployment
- Supabase = Database + Auth solved
- Redis = Cache that just works
2. Cache Everything, Trust Nothing
Our cache strategy evolved from "cache some things" to "cache everything possible":
- API responses: 24 hour TTL
- User data: 5 minute TTL
- Embeddings: 7 day TTL
- Static content: 30 day TTL
3. Optimize for the Common Case
80% of our users generate similar content. We optimized ruthlessly for them:
- Precomputed templates
- Suggested prompts
- One-click regeneration
What's Next?
At our current growth rate, we'll need to scale up at around 200K users. The plan:
- Move to dedicated Redis: ~$200/mo for 5GB
- Add read replicas: ~$100/mo for Postgres
- CDN for static assets: ~$50/mo
- Multi-region deployment: ~$200/mo
Even then, we're looking at ~$750/mo infrastructure for 200K users. That's $0.00375 per user.
The Uncomfortable Truth
Most AI startups fail not because AI is expensive, but because they:
- Over-engineer from day one
- Don't implement proper caching
- Use GPT-4 for everything
- Ignore usage patterns
- Scale infrastructure before product-market fit
Start small. Cache aggressively. Use boring tech. The $200/mo SaaS isn't a limitation - it's a feature.
Want to Build Your Own?
I've open-sourced the core caching logic at github.com/BinaryBourbon/llm-cache-patterns. The patterns there will get you 80% of the way.
Questions? Hit me up on Twitter. I love talking about infrastructure that doesn't break the bank.
This is part of my series on building profitable AI products. Next up: "The Hidden Costs of GPT-4" - a deep dive into when premium models are actually worth it.