LLM Latency Calculator
Estimate TTFT and total response time across GPT, Claude, Gemini, and Groq.
Quick Answer
Median throughput in 2026: GPT-4o ~110 tps, Claude Sonnet ~95 tps, Gemini Flash ~240 tps, Groq Llama 70B ~800 tps. TTFT baselines run roughly 150-850ms, plus ~50ms per 1K input tokens. For real-time UX, target sub-1-second TTFT.
| Model | TPS (tokens/sec) | TTFT | Generation time | Total |
|---|---|---|---|---|
| Groq Llama 3.3 70B | 800 | 150ms | 0.63s | 0.78s |
| Gemini 2.5 Flash-Lite | 320 | 200ms | 1.56s | 1.76s |
| Gemini 2.5 Flash | 240 | 250ms | 2.08s | 2.33s |
| Claude Haiku 4.5 | 200 | 270ms | 2.50s | 2.77s |
| GPT-4o-mini | 180 | 300ms | 2.78s | 3.08s |
| GPT-4o | 110 | 400ms | 4.55s | 4.95s |
| Claude Sonnet 4.6 | 95 | 500ms | 5.26s | 5.76s |
| Gemini 2.5 Pro | 90 | 550ms | 5.56s | 6.11s |
| Claude Opus 4.7 | 65 | 850ms | 7.69s | 8.54s |
Estimates use median throughput from April 2026 public benchmarks and assume a 500-token response. Real latency varies by region, time of day, prompt complexity, and provider load. Streaming masks total time: users perceive TTFT as the start of the response.
About This Tool
The LLM Latency Calculator estimates time-to-first-token (TTFT) and total response time for nine major models. Enter input and output token counts — the tool computes per-model TTFT (which scales with input length) plus generation time (output tokens / model throughput).
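The computation is simple enough to replicate yourself. A minimal sketch in Python, using the baseline TTFTs and throughputs from the table above; the function name and model keys are illustrative, and the ~50ms-per-1K-input-tokens factor is this article's rule of thumb, not a provider-published figure:

```python
# Latency estimate: TTFT (baseline + input scaling) plus generation time.
# Baselines and throughputs are the median figures from the table above.

MODELS = {
    # name: (baseline TTFT in seconds, throughput in tokens/sec)
    "groq-llama-3.3-70b": (0.150, 800),
    "gemini-2.5-flash":   (0.250, 240),
    "gpt-4o":             (0.400, 110),
    "claude-sonnet-4.6":  (0.500, 95),
}

TTFT_PER_1K_INPUT = 0.050  # ~50ms per 1K input tokens (rule of thumb)

def estimate_latency(model: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (ttft_seconds, total_seconds) for the given token counts."""
    baseline, tps = MODELS[model]
    ttft = baseline + TTFT_PER_1K_INPUT * (input_tokens / 1000)
    generation = output_tokens / tps
    return ttft, ttft + generation

# Example: a 4K-token prompt and a 500-token response on GPT-4o
ttft, total = estimate_latency("gpt-4o", 4000, 500)
print(f"TTFT ~{ttft:.2f}s, total ~{total:.2f}s")  # TTFT ~0.60s, total ~5.15s
```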
Why latency matters as much as cost
User-facing AI features live or die by perceived speed. Studies on chat interfaces show user satisfaction drops sharply past 1 second of TTFT. Beyond 3 seconds, users assume something broke. For voice agents, TTFT must stay under 600ms or the conversation feels stilted. Cost optimization isn't enough; you also need to budget latency.
The two latency components
TTFT (time-to-first-token) is how long before the model emits the first output token. It is driven by network round-trip, queue depth, and prompt-processing time. Baselines range from roughly 150ms (Groq) to 850ms (Claude Opus), and each additional 1K input tokens adds roughly 50ms.

Generation time is output tokens divided by tokens-per-second throughput, so it grows linearly with output length.
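Worked example using the table figures: a 2K-token prompt to Claude Sonnet 4.6 gives TTFT ≈ 500ms + 2 × 50ms = 600ms, and a 300-token response takes 300 / 95 ≈ 3.16s to generate, for roughly 3.76s total.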
2026 throughput benchmarks
Median tokens per second from public benchmarks: Gemini 2.5 Flash-Lite ~320, Flash ~240, Claude Haiku ~200, GPT-4o-mini ~180, GPT-4o ~110, Sonnet ~95, Gemini 2.5 Pro ~90, Opus ~65. Groq, running the open-source Llama 3.3 70B on its custom LPU silicon, hits 800 tps, by far the fastest production option for a frontier-class model.
Latency optimization techniques
- Stream responses so users see tokens as they generate.
- Cap max_tokens aggressively: capping output at 200 tokens instead of 2,000 cuts generation time by 10x (see the sketch below).
- Use prompt caching to skip reprocessing repeated input.
- Pick regional endpoints close to your users.
- For real-time chat, default to a fast-tier model (Haiku, Flash, or Groq) and reach for slower flagships only when reasoning quality demands it.
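As a concrete illustration of the max_tokens cap, a minimal sketch with the OpenAI Python SDK; the model choice and the 200-token cap are illustrative values, not recommendations:

```python
# Cap output length to bound generation time (OpenAI Python SDK).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give a two-sentence summary of TTFT."}],
    max_tokens=200,  # 200 tokens instead of 2,000 cuts generation time ~10x
)
print(resp.choices[0].message.content)
```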
The streaming trick
Streaming dramatically improves perceived latency without changing actual generation time. Even a 5-second total response feels fast if the first token arrives in 400ms and tokens flow continuously. Implement Server-Sent Events (SSE) or WebSockets for chat UIs. Most LLM SDKs (OpenAI, Anthropic, Google) support streaming as a flag.
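To see the effect in practice, you can time the first streamed chunk separately from the full response. A minimal sketch, assuming the OpenAI Python SDK's streaming flag; the Anthropic and Google SDKs follow the same pattern:

```python
# Measure TTFT and total time from a streaming response.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Explain streaming latency."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start  # the wait users actually perceive
print(f"TTFT: {first_token_at:.2f}s, total: {time.perf_counter() - start:.2f}s")
```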
Pair this tool with the LLM cost comparison to balance speed and price, the token counter for input estimation, and the context window calculator for prompt size limits. For agentic workflows, see the function calling cost calculator.
Frequently Asked Questions
What is TTFT and why does it matter?
TTFT (time-to-first-token) is the delay before the model emits its first output token. In a streaming UI it is the wait users actually perceive, so it dominates how fast a response feels.
Why is Groq so much faster than GPT-4o?
Groq serves the open-source Llama 3.3 70B on custom LPU silicon purpose-built for inference, reaching ~800 tps versus GPT-4o's ~110 tps.
How does input length affect latency?
TTFT scales with prompt size: each additional 1K input tokens adds roughly 50ms of prompt-processing time.
Can I reduce latency without changing models?
Yes. Stream responses, cap max_tokens, use prompt caching, and pick regional endpoints close to your users.
Do these speeds reflect real production?
They are medians from April 2026 public benchmarks. Real latency varies by region, time of day, prompt complexity, and provider load.
You might also like
Hash Generator
Generate MD5, SHA-1, SHA-256, and SHA-512 hashes from text.
Cron Expression Generator
Build and parse cron expressions with next run time preview.
LLM Cost Comparison
Side-by-side monthly cost across 12 major LLMs from every major provider.