Dev Tools

LLM Latency Calculator

Estimate TTFT and total response time across GPT, Claude, Gemini, and Groq.

Quick Answer

Median throughput in 2026: GPT-4o ~110 tps, Claude Sonnet ~95 tps, Gemini Flash ~240 tps, Groq Llama 70B ~800 tps. TTFT baselines run roughly 150-850ms, plus ~50ms per 1K input tokens. For real-time UX, target sub-1-second TTFT.

Model | TPS | TTFT | Generation | Total
Groq Llama 3.3 70B | 800 | 150ms | 0.63s | 0.78s
Gemini 2.5 Flash-Lite | 320 | 200ms | 1.56s | 1.76s
Gemini 2.5 Flash | 240 | 250ms | 2.08s | 2.33s
Claude Haiku 4.5 | 200 | 270ms | 2.50s | 2.77s
GPT-4o-mini | 180 | 300ms | 2.78s | 3.08s
GPT-4o | 110 | 400ms | 4.55s | 4.95s
Claude Sonnet 4.6 | 95 | 500ms | 5.26s | 5.76s
Gemini 2.5 Pro | 90 | 550ms | 5.56s | 6.11s
Claude Opus 4.7 | 65 | 850ms | 7.69s | 8.54s

Sample times above assume a 500-token output and baseline TTFT (short prompt). Estimates use median throughput from April 2026 public benchmarks. Real latency varies by region, time of day, prompt complexity, and provider load. Streaming masks total time: users perceive TTFT as the start of the response.

About This Tool

The LLM Latency Calculator estimates time-to-first-token (TTFT) and total response time for nine major models. Enter input and output token counts — the tool computes per-model TTFT (which scales with input length) plus generation time (output tokens / model throughput).
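A minimal Python sketch of that arithmetic (the function name is illustrative, not the tool's actual code; the ~50ms-per-1K rule and the baseline/throughput figures come from this page):

```python
def estimate_latency(input_tokens: int, output_tokens: int,
                     ttft_baseline_ms: float, tps: float) -> tuple[float, float, float]:
    """Return (ttft_s, generation_s, total_s) for one model."""
    # TTFT = per-model baseline plus ~50 ms per 1K input tokens
    ttft_s = (ttft_baseline_ms + 50 * input_tokens / 1000) / 1000
    # Generation time = output tokens divided by throughput (tokens/sec)
    generation_s = output_tokens / tps
    return ttft_s, generation_s, ttft_s + generation_s

# GPT-4o row from the table: 400 ms baseline TTFT, ~110 tps
print(estimate_latency(input_tokens=4000, output_tokens=500,
                       ttft_baseline_ms=400, tps=110))
# ≈ (0.6, 4.55, 5.15): 0.6 s to first token, ~5.2 s total
```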

Why latency matters as much as cost

User-facing AI features live or die by perceived speed. Studies on chat interfaces show user satisfaction drops sharply past 1-second TTFT. Beyond 3 seconds, users assume something broke. For voice agents, TTFT must be under 600ms or conversation feels stilted. Cost-optimization isn't enough — you also need to budget latency.

The two latency components

TTFT (time-to-first-token): how long before the model emits the first output token. Driven by network round-trip, queue depth, and prompt-processing time. Baselines range from roughly 150ms (Groq) to 850ms (Claude Opus), and each additional 1K input tokens adds roughly 50ms.

Generation time: output tokens divided by tokens-per-second throughput. It grows linearly with output length.
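Worked example using the figures above: a 10,000-token prompt with a 300-token reply on Claude Sonnet 4.6 (500ms baseline, ~95 tps) gives TTFT ≈ 500ms + 10 × 50ms = 1.0s, generation ≈ 300 / 95 ≈ 3.2s, so roughly 4.2s total.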

2026 throughput benchmarks

Median tokens per second from public benchmarks: Gemini 2.5 Flash-Lite ~320, Flash ~240, Claude Haiku ~200, GPT-4o-mini ~180, GPT-4o ~110, Sonnet ~95, Gemini 2.5 Pro ~90, Opus ~65. Groq running open-source Llama 3.3 70B hits 800 tps on their custom LPU silicon — by far the fastest production option for any frontier-class model.

Latency optimization techniques

Stream responses so users see tokens as they generate. Cap max_tokens aggressively: limiting output to 200 tokens instead of 2,000 cuts generation time by 10x. Use prompt caching to skip re-processing repeated input. Pick regional endpoints close to your users. For real-time chat, default to a fast-tier model (Haiku, Flash, Groq) and reach for slower flagships only when reasoning quality demands it.
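One way the max_tokens cap and prompt caching look in practice, sketched with the Anthropic Python SDK (the model id and prompt contents are placeholders; check current provider docs for caching behavior and pricing):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # large, reused instructions worth caching

response = client.messages.create(
    model="claude-haiku-4-5",   # placeholder: use the current fast-tier model id
    max_tokens=200,             # hard cap on output keeps generation time short
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache the big static prefix
    }],
    messages=[{"role": "user", "content": "Summarize today's ticket queue."}],
)
print(response.content[0].text)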

The streaming trick

Streaming dramatically improves perceived latency without changing actual generation time. Even a 5-second total response feels fast if the first token arrives in 400ms and tokens flow continuously. Implement Server-Sent Events (SSE) or WebSockets for chat UIs. Most LLM SDKs (OpenAI, Anthropic, Google) support streaming as a flag.
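A minimal streaming sketch with the OpenAI Python SDK, measuring TTFT on the client side (model name and prompt are illustrative; the Anthropic and Google SDKs expose equivalent streaming flags):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative fast-tier choice
    messages=[{"role": "user", "content": "Explain SSE in two sentences."}],
    max_tokens=200,
    stream=True,           # deltas arrive as the model generates
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.monotonic()
        print(f"[TTFT {first_token_at - start:.2f}s] ", end="")
    print(delta, end="", flush=True)

print(f"\nTotal: {time.monotonic() - start:.2f}s")
```

On the web side, forward these deltas over SSE or a WebSocket so the browser renders tokens as they arrive.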

Pair with the LLM cost comparison to balance speed and price, the token counter for input estimation, and the context window calculator for prompt size limits. For agentic workflows, see the function calling cost calculator.

Frequently Asked Questions

What is TTFT and why does it matter?
Time-to-first-token: the delay before the model emits its first output token. Critical for chat UX — users perceive a response as 'slow' if TTFT exceeds 1 second. Streaming helps mask total generation time, but TTFT is the perceived starting gun.
Why is Groq so much faster than GPT-4o?
Groq runs open-source models on custom LPU (Language Processing Unit) hardware that specializes in inference. They hit 500-800 tokens/second on Llama 3.3 70B vs GPT-4o's ~110 tps. The trade-off: limited model selection and capacity caps.
How does input length affect latency?
Longer prompts increase TTFT, since the model must process the whole prompt before generating. Rule of thumb: roughly 50ms per 1K input tokens, though in practice TTFT often scales sub-linearly, so a 50K-token input adds about 1-3 seconds versus a 1K-token input. Generation time, by contrast, scales linearly with output length.
Can I reduce latency without changing models?
Yes. Cap max_tokens aggressively. Stream responses (UI feels faster). Use prompt caching to skip repeat input processing. Place compute close to users (regional endpoints). For real-time use, pick fast-tier models (Haiku, Flash, Groq).
Do these speeds reflect real production?
These are median values from public benchmarks at typical load. Peak-time latency can be 2-3x higher. Always test with your actual prompt size and traffic pattern. Also note: TTFT spikes during model launches and outages are common.