
LLM Cost Comparison

Side-by-side monthly cost across 12 major LLMs. Enter your token volume, find the cheapest match.

Quick Answer

Cheapest tier: Gemini Flash-Lite ($0.10/$0.40), GPT-4o-mini ($0.15/$0.60), DeepSeek V3 ($0.27/$1.10). Frontier tier: Gemini 2.5 Pro ($1.25/$5) undercuts GPT-4o ($2.50/$10) and Claude Sonnet ($3/$15). For absolute peak quality, Claude Opus 4.7 ($15/$75).

Figures assume 2,000 input tokens and 800 output tokens per call at 10,000 calls per month (the default volume implied by the per-call figures).

| Model | Provider | In / Out ($/M tokens) | Per call | Monthly |
|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | $0.10 / $0.40 | $0.000520 | $5.20 |
| GPT-4o-mini | OpenAI | $0.15 / $0.60 | $0.000780 | $7.80 |
| DeepSeek V3 | DeepSeek | $0.27 / $1.10 | $0.001420 | $14.20 |
| Llama 3.3 70B (Together) | Together | $0.88 / $0.88 | $0.002464 | $24.64 |
| Gemini 2.5 Flash | Google | $0.30 / $2.50 | $0.002600 | $26.00 |
| Claude Haiku 4.5 | Anthropic | $1.00 / $5.00 | $0.006000 | $60.00 |
| Gemini 2.5 Pro | Google | $1.25 / $5.00 | $0.006500 | $65.00 |
| Mistral Large 2 | Mistral | $2.00 / $6.00 | $0.008800 | $88.00 |
| GPT-4o | OpenAI | $2.50 / $10.00 | $0.013000 | $130.00 |
| Claude Sonnet 4.6 | Anthropic | $3.00 / $15.00 | $0.018000 | $180.00 |
| GPT-4-turbo | OpenAI | $10.00 / $30.00 | $0.044000 | $440.00 |
| Claude Opus 4.7 | Anthropic | $15.00 / $75.00 | $0.090000 | $900.00 |

Cheapest: Gemini 2.5 Flash-Lite, $5.20/month
Premium: Claude Opus 4.7, $900.00/month (173.1x the cheapest)

About This Tool

The LLM Cost Comparison tool puts every major language model side-by-side at your specific token volume. Enter input tokens per call, output tokens per call, and monthly request count. The tool computes per-call and monthly cost across 12 production models from OpenAI, Anthropic, Google, Mistral, DeepSeek, and hosted open-source providers.
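Under the hood the arithmetic is simple. Here is a minimal sketch of the calculation in Python, using prices from the table above; the PRICES dict and function names are illustrative, not the tool's actual source:

```python
# USD per million tokens (input, output), from the comparison table above.
PRICES = {
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-opus-4.7": (15.00, 75.00),
}

def llm_cost(model: str, input_tokens: int, output_tokens: int, calls_per_month: int):
    """Return (per-call cost, monthly cost) in USD for one model."""
    in_price, out_price = PRICES[model]
    per_call = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_call, per_call * calls_per_month

# Reproduce the table's figures: 2,000 in + 800 out tokens, 10,000 calls/month.
per_call, monthly = llm_cost("gemini-2.5-flash-lite", 2_000, 800, 10_000)
print(f"${per_call:.6f}/call, ${monthly:.2f}/month")  # $0.000520/call, $5.20/month
```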

The 2026 LLM pricing landscape

Three pricing tiers have emerged. Frontier ($1.25-$15 per million input tokens): Claude Opus, GPT-4o, Gemini 2.5 Pro. Balanced ($0.30-$3): Sonnet 4.6, Gemini Flash, Mistral Large. Cheap ($0.10-$1): Flash-Lite, GPT-4o-mini, DeepSeek V3, Haiku 4.5. Hosted open-source models (Llama, Mixtral) typically land in the balanced tier.

How to read the comparison

The cheapest model isn't always the right one. A 50% cheaper model that fails 5% more often may cost more in retry overhead, support tickets, and brand damage. Run your own evals — pick three candidates from this comparison, evaluate on 100+ representative examples, then choose by quality-adjusted cost.
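One way to make quality-adjusted cost concrete is to divide per-call cost by the eval pass rate, which prices in the retries that failures trigger. A sketch with hypothetical pass rates (substitute your own eval numbers):

```python
def quality_adjusted(per_call_cost: float, pass_rate: float) -> float:
    """Expected cost per successful call: each success consumes
    1 / pass_rate attempts on average if failures are retried."""
    return per_call_cost / pass_rate

# Per-call costs from the table; the pass rates are made-up placeholders.
mini = quality_adjusted(0.000780, 0.90)  # GPT-4o-mini at a hypothetical 90%
pro = quality_adjusted(0.006500, 0.98)   # Gemini 2.5 Pro at a hypothetical 98%
print(f"{pro / mini:.1f}x")  # the 8.3x list-price gap narrows to ~7.7x
```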

Output token cost dominates most workloads. Most models here charge 3-5x more per output token than per input token (Gemini 2.5 Flash is above 8x; only flat-priced Llama 3.3 escapes), so long responses multiply your bill accordingly. Cap max_tokens. Prefer structured outputs over prose. Use cheaper models for first-pass generation and a flagship for final review.
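To see the dominance directly, split a call's cost into input and output components. This quick check uses GPT-4o's table prices with illustrative token counts:

```python
IN_PRICE, OUT_PRICE = 2.50, 10.00  # GPT-4o, USD per million tokens

def cost_split(input_tokens: int, output_tokens: int):
    """Return (input cost, output cost, output share of total)."""
    in_cost = input_tokens * IN_PRICE / 1e6
    out_cost = output_tokens * OUT_PRICE / 1e6
    return in_cost, out_cost, out_cost / (in_cost + out_cost)

print(cost_split(2_000, 800))    # output is already 62% of the bill
print(cost_split(2_000, 2_000))  # let responses run long: output hits 80%
```

Capping max_tokens bounds the output term directly, which is why it is the first lever to pull.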

Caching shifts the math

Anthropic's prompt cache cuts cached input by 90%. OpenAI auto-caches at 50% off. Gemini's context cache varies by tier. If your prompt has a stable 5K+ token prefix reused across many turns, caching can flip the cost ranking — Sonnet with caching often beats GPT-4o without it on long-context workloads.
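A back-of-envelope check on that claim, using table prices, Anthropic's 90% cache-read discount, and an uncached GPT-4o baseline. The 5K-token prefix and other volumes are illustrative, and the one-time cache-write premium is ignored since it amortizes across many turns:

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00  # USD per million tokens, from the table
GPT4O_IN, GPT4O_OUT = 2.50, 10.00
CACHE_READ_FACTOR = 0.10  # Anthropic cache reads cost 10% of base input price

def sonnet_with_cache(prefix_tokens, fresh_tokens, out_tokens):
    return (prefix_tokens * SONNET_IN * CACHE_READ_FACTOR
            + fresh_tokens * SONNET_IN
            + out_tokens * SONNET_OUT) / 1e6

def gpt4o_no_cache(in_tokens, out_tokens):
    return (in_tokens * GPT4O_IN + out_tokens * GPT4O_OUT) / 1e6

# 5K-token stable prefix + 1K fresh input + 500 output tokens per call:
print(sonnet_with_cache(5_000, 1_000, 500))  # $0.0120
print(gpt4o_no_cache(6_000, 500))            # $0.0200: the ranking flips
```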

Drill deeper with the GPT cost calculator, Claude cost calculator, Gemini cost calculator, and prompt caching savings calculator. To go from raw text to token estimates, use the token counter.

Frequently Asked Questions

Which LLM has the cheapest API in 2026?
Gemini 2.5 Flash-Lite at $0.10 / $0.40 per million tokens. DeepSeek V3 ($0.27 / $1.10) and GPT-4o-mini ($0.15 / $0.60) compete closely. For frontier-quality work, Gemini 2.5 Pro at $1.25 / $5 is the cheapest top-tier option.
How should I choose between GPT, Claude, and Gemini?
By task. Claude tends to win on coding and instruction-following. GPT excels at function calling and multimodal vision. Gemini dominates long-context (1M tokens) and is generally cheapest at the frontier. Test all three on your evals before committing.
Are open-source models really cheaper?
Hosted open models like Llama 3.3 70B run $0.50-$1 per million tokens on Together, Groq, or Fireworks. Self-hosting on your own GPUs can be cheaper at scale but requires DevOps investment. Below ~5M tokens/day, hosted closed models win on TCO.
What about prompt caching savings?
Caching cuts cached-prefix cost by 50-90% depending on provider. Anthropic offers the deepest discount (90% off cache reads). OpenAI auto-caches at 50% off. Gemini's context cache varies. Use our prompt caching calculator for precise estimates.
Why is output 4-5x more than input?
Generation requires sequential GPU passes — one per output token — while input is processed in parallel. The 4-5x premium reflects that compute asymmetry. To control cost, cap max_tokens and prefer extractive responses over generative ones where possible.
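In practice that cap is a single request parameter. A minimal example with OpenAI's Python SDK; the same knob appears as max_tokens or max_output_tokens in other providers' SDKs:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this ticket in 3 bullets: ..."}],
    max_tokens=150,  # hard ceiling on output tokens: bounds the expensive side
)
print(resp.choices[0].message.content)
```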