The AI Stack That Actually Ships: Local-First LLMs
OpenAI's API costs will bankrupt you. Here's how we use Ollama for local inference and only call cloud models when necessary.
Core Concept
Every AI startup in 2025 burned through their seed round on OpenAI API bills.
Typical scenario:
- $100K seed raised
- GPT-4 costs $0.03 per 1K tokens
- Average query: 2K tokens (input + output)
- Cost per query: $0.06
Math at scale:
- 1,000 users × 50 queries/day = 50K queries/day
- 50K × $0.06 = $3,000/day
- $90K/month in API costs
You’re bankrupt in about five weeks ($100K ÷ $3K/day ≈ 33 days).
The Constraint
Cloud LLMs are expensive and slow:
- Cost: GPT-4 pricing makes unit economics impossible
- Latency: 1-3 second API response (kills UX)
- Privacy: Sending user data to OpenAI (GDPR nightmare)
- Reliability: API rate limits (429 errors kill your demo day)
- Vendor lock-in: You’re training OpenAI’s model with your data
The dirty secret: most AI features don’t need GPT-4. They need GPT-3.5 at best, but most teams never benchmark that assumption.
The Solution
Run LLMs locally with Ollama. Only use cloud models for complex reasoning.
The Architecture:
User Query
↓
Local Classifier (Ollama - Llama 3.2, 3B params)
↓
├─ Simple query? → Local LLM (Ollama - Mistral 7B)
│ ↓
│ Response (0.5s, $0 cost)
│
└─ Complex query? → Cloud LLM (OpenAI GPT-4)
↓
Response (2s, $0.06 cost)
Result: 80% of queries handled locally. 20% sent to OpenAI.
New cost: $18K/month (vs $90K/month)
Savings: $864K/year.
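The cost math above can be checked with a small blended-cost model. This is a sketch using the article’s own numbers (50K queries/day, $0.06 per cloud query, 80/20 local/cloud split); local inference is treated as $0 marginal cost.

```python
# Blended cost model for the local/cloud split described above.
QUERIES_PER_DAY = 50_000
CLOUD_FRACTION = 0.20        # share of queries routed to GPT-4
CLOUD_COST_PER_QUERY = 0.06  # $0.03 per 1K tokens * 2K tokens per query

def monthly_cost(queries_per_day, cloud_fraction, cost_per_query, days=30):
    """Only cloud-routed queries are billed; local queries cost ~$0."""
    return queries_per_day * days * cloud_fraction * cost_per_query

hybrid = monthly_cost(QUERIES_PER_DAY, CLOUD_FRACTION, CLOUD_COST_PER_QUERY)
all_cloud = monthly_cost(QUERIES_PER_DAY, 1.0, CLOUD_COST_PER_QUERY)
print(f"hybrid: ${hybrid:,.0f}/month, all-cloud: ${all_cloud:,.0f}/month")
# prints "hybrid: $18,000/month, all-cloud: $90,000/month"
```

The $864K/year savings is just the $72K/month gap between the two figures, annualized.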
The Example: Stella
Stella (mental health AI companion) needs to respond to users in real-time.
Requirements:
- Sub-second response time
- Privacy (can’t send therapy logs to OpenAI)
- Works offline (therapy sessions on planes)
Old architecture (GPT-3.5 Turbo):
- Latency: 2-4 seconds per message
- Cost: $0.02 per message
- Privacy: Poor (all data sent to OpenAI)
- Monthly cost: $60K at 100K users
New architecture (Ollama + selective GPT-4):
- 90% of queries: Ollama Llama 3.2 (local, 500ms, $0)
- 10% of queries: GPT-4 for complex advice (cloud, 2s, $0.06)
- Monthly cost: $18K at 100K users (10% of 3M messages × $0.06)
How we decide:
import ollama
from openai import OpenAI
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Classify the query
if is_simple_question(user_input):
    # Local inference (free, fast)
    response = ollama.generate(model="llama3.2:3b", prompt=user_input)["response"]
else:
    # Cloud inference (expensive, more capable)
    completion = openai_client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": user_input}]
    )
    response = completion.choices[0].message.content
Simple questions: “How do I feel less anxious?”
Complex questions: “I’m having suicidal thoughts” (requires GPT-4 nuance)
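The source doesn’t show `is_simple_question`, so here is one hedged sketch of it: a conservative keyword screen where any crisis language always escalates to the cloud model. The term list and word-count threshold are illustrative assumptions, not production values.

```python
# Hypothetical routing check: crisis language must NEVER be handled by
# the small local model, so any risk signal forces cloud escalation.
CRISIS_TERMS = ("suicid", "self-harm", "kill myself", "end my life")

def is_simple_question(user_input: str) -> bool:
    """Route to the local model only when no risk signal is present."""
    text = user_input.lower()
    if any(term in text for term in CRISIS_TERMS):
        return False  # escalate to GPT-4 (and ideally a human reviewer)
    # Short, everyday questions are safe for the small local model.
    return len(text.split()) < 40

print(is_simple_question("How do I feel less anxious?"))   # True
print(is_simple_question("I'm having suicidal thoughts"))  # False
```

In production you would layer this under the local classifier model from the architecture diagram; a keyword screen alone misses paraphrased risk, so it should only ever widen, never narrow, the set of escalated queries.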
The Infrastructure
Server setup:
# Install Ollama on your server
curl -fsSL https://ollama.com/install.sh | sh
# Pull models (a few GB each)
ollama pull llama3.2:3b
ollama pull mistral:7b
# Run the inference server (listens on port 11434 by default)
ollama serve
API call from your app:
const response = await fetch('http://your-server:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:3b',
    prompt: userInput,
    stream: false,
  }),
});
const data = await response.json();  // data.response holds the generated text
Cost: $80/month for GPU server (vs $90K/month for OpenAI)
The Trade-Off
What you lose:
- Bleeding-edge reasoning (GPT-4 Turbo is smarter than Llama 3)
- Zero DevOps (you must manage Ollama servers)
What you gain:
- 95% cost reduction
- Sub-second latency
- Privacy compliance (HIPAA, GDPR)
- Works offline
- No vendor lock-in
The Investor Pitch
Bad unit economics:
Cost per user: $30/month (OpenAI API)
Revenue per user: $10/month
Gross margin: -200% (you lose money on every user)
Good unit economics:
Cost per user: $0.50/month (Ollama + selective GPT-4)
Revenue per user: $10/month
Gross margin: 95%
VCs invest in the second company, not the first.
The Hybrid Strategy
Our production stack:
- Tier 1 queries (80%): Ollama Llama 3.2 (local)
- Tier 2 queries (15%): Anthropic Claude Sonnet (cloud, cheaper than GPT-4)
- Tier 3 queries (5%): GPT-4 (only for complex reasoning)
Cost breakdown:
- Tier 1: $0 (local inference)
- Tier 2: $3K/month (Claude)
- Tier 3: $3K/month (GPT-4)
- Total: $6K/month (vs $90K with GPT-4 only)
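The three-tier routing above boils down to a small dispatch table. This is a sketch, not the production code: `classify` and the handlers are stubs standing in for the real Llama classifier and the Ollama/Claude/GPT-4 calls.

```python
# Sketch of three-tier dispatch. In production, tier 1 calls Ollama,
# tier 2 Claude Sonnet, tier 3 GPT-4, as described above.
def route(query, classify, handlers):
    """classify(query) -> tier 1, 2, or 3; handlers maps tier -> callable."""
    tier = classify(query)
    if tier not in handlers:
        raise ValueError(f"unknown tier: {tier}")
    return handlers[tier](query)

# Toy demo with stub handlers standing in for the real model calls:
handlers = {
    1: lambda q: f"[local llama3.2] {q}",
    2: lambda q: f"[claude sonnet] {q}",
    3: lambda q: f"[gpt-4] {q}",
}
# Placeholder classifier: crisis -> tier 3, long queries -> tier 2.
classify = lambda q: 3 if "crisis" in q else (2 if len(q) > 50 else 1)
print(route("hello", classify, handlers))  # prints "[local llama3.2] hello"
```

Keeping the routing logic in one place like this makes the tier shares (80/15/5) easy to measure and tune, which is what the cost breakdown above depends on.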
The First Principle: Use the minimum viable model for each task. Don’t pay for GPT-4 reasoning when Llama 3 retrieval is enough.