Dev Tools

LLM Latency Calculator

Estimate TTFT and total response time across GPT, Claude, Gemini, and Groq.

Quick Answer

Median throughput in 2026: GPT-4o ~110 tps, Claude Sonnet ~95 tps, Gemini Flash ~240 tps, Groq Llama 70B ~800 tps. TTFT baselines run roughly 150-850ms, plus ~50ms per 1K input tokens. For real-time UX, target sub-1-second TTFT.

Model | TPS | TTFT | Generation | Total
Groq Llama 3.3 70B | 800 | 150ms | 0.63s | 0.78s
Gemini 2.5 Flash-Lite | 320 | 200ms | 1.56s | 1.76s
Gemini 2.5 Flash | 240 | 250ms | 2.08s | 2.33s
Claude Haiku 4.5 | 200 | 270ms | 2.50s | 2.77s
GPT-4o-mini | 180 | 300ms | 2.78s | 3.08s
GPT-4o | 110 | 400ms | 4.55s | 4.95s
Claude Sonnet 4.6 | 95 | 500ms | 5.26s | 5.76s
Gemini 2.5 Pro | 90 | 550ms | 5.56s | 6.11s
Claude Opus 4.7 | 65 | 850ms | 7.69s | 8.54s

Sample times above assume a 500-token output and baseline TTFT (short prompt). Estimates use median throughput from April 2026 public benchmarks. Real latency varies by region, time of day, prompt complexity, and provider load. Streaming masks total time: users perceive TTFT as the start of the response.

About This Tool

The LLM Latency Calculator estimates time-to-first-token (TTFT) and total response time for nine major models. Enter input and output token counts — the tool computes per-model TTFT (which scales with input length) plus generation time (output tokens / model throughput).
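A minimal Python sketch of that arithmetic (the function name is illustrative, not the tool's actual code; the ~50ms-per-1K rule and the baseline/throughput figures come from this page):

```python
def estimate_latency(input_tokens: int, output_tokens: int,
                     ttft_baseline_ms: float, tps: float) -> tuple[float, float, float]:
    """Return (ttft_s, generation_s, total_s) for one model."""
    # TTFT = per-model baseline plus ~50 ms per 1K input tokens
    ttft_s = (ttft_baseline_ms + 50 * input_tokens / 1000) / 1000
    # Generation time = output tokens divided by throughput (tokens/sec)
    generation_s = output_tokens / tps
    return ttft_s, generation_s, ttft_s + generation_s

# GPT-4o row from the table: 400 ms baseline TTFT, ~110 tps
print(estimate_latency(input_tokens=4000, output_tokens=500,
                       ttft_baseline_ms=400, tps=110))
# ≈ (0.6, 4.55, 5.15): 0.6 s to first token, ~5.2 s total
```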

Why latency matters as much as cost

User-facing AI features live or die by perceived speed. Studies on chat interfaces show user satisfaction drops sharply past 1-second TTFT. Beyond 3 seconds, users assume something broke. For voice agents, TTFT must be under 600ms or conversation feels stilted. Cost-optimization isn't enough — you also need to budget latency.

The two latency components

TTFT (time-to-first-token): how long before the model emits the first output token. Driven by network round-trip, queue depth, and prompt-processing time. Baselines range from roughly 150ms (Groq) to 850ms (Claude Opus), and each additional 1K input tokens adds roughly 50ms.

Generation time: output tokens divided by tokens-per-second throughput. It grows linearly with output length.
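Worked example using the figures above: a 10,000-token prompt with a 300-token reply on Claude Sonnet 4.6 (500ms baseline, ~95 tps) gives TTFT ≈ 500ms + 10 × 50ms = 1.0s, generation ≈ 300 / 95 ≈ 3.2s, so roughly 4.2s total.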

2026 throughput benchmarks

Median tokens per second from public benchmarks: Gemini 2.5 Flash-Lite ~320, Flash ~240, Claude Haiku ~200, GPT-4o-mini ~180, GPT-4o ~110, Sonnet ~95, Gemini 2.5 Pro ~90, Opus ~65. Groq running open-source Llama 3.3 70B hits 800 tps on their custom LPU silicon — by far the fastest production option for any frontier-class model.

Latency optimization techniques

Stream responses so users see tokens as they generate. Cap max_tokens aggressively: limiting output to 200 tokens instead of 2,000 cuts generation time by 10x. Use prompt caching to skip re-processing repeated input. Pick regional endpoints close to your users. For real-time chat, default to a fast-tier model (Haiku, Flash, Groq) and reach for slower flagships only when reasoning quality demands it.
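One way the max_tokens cap and prompt caching look in practice, sketched with the Anthropic Python SDK (the model id and prompt contents are placeholders; check current provider docs for caching behavior and pricing):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # large, reused instructions worth caching

response = client.messages.create(
    model="claude-haiku-4-5",   # placeholder: use the current fast-tier model id
    max_tokens=200,             # hard cap on output keeps generation time short
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache the big static prefix
    }],
    messages=[{"role": "user", "content": "Summarize today's ticket queue."}],
)
print(response.content[0].text)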

The streaming trick

Streaming dramatically improves perceived latency without changing actual generation time. Even a 5-second total response feels fast if the first token arrives in 400ms and tokens flow continuously. Implement Server-Sent Events (SSE) or WebSockets for chat UIs. Most LLM SDKs (OpenAI, Anthropic, Google) support streaming as a flag.
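A minimal streaming sketch with the OpenAI Python SDK, measuring TTFT on the client side (model name and prompt are illustrative; the Anthropic and Google SDKs expose equivalent streaming flags):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative fast-tier choice
    messages=[{"role": "user", "content": "Explain SSE in two sentences."}],
    max_tokens=200,
    stream=True,           # deltas arrive as the model generates
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.monotonic()
        print(f"[TTFT {first_token_at - start:.2f}s] ", end="")
    print(delta, end="", flush=True)

print(f"\nTotal: {time.monotonic() - start:.2f}s")
```

On the web side, forward these deltas over SSE or a WebSocket so the browser renders tokens as they arrive.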

Pair with the LLM cost comparison to balance speed and price, the token counter for input estimation, and the context window calculator for prompt size limits. For agentic workflows, see the function calling cost calculator.

Frequently Asked Questions

What is TTFT and why does it matter?
Time-to-first-token: the delay before the model emits its first output token. Critical for chat UX — users perceive a response as 'slow' if TTFT exceeds 1 second. Streaming helps mask total generation time, but TTFT is the perceived starting gun.
Why is Groq so much faster than GPT-4o?
Groq runs open-source models on custom LPU (Language Processing Unit) hardware that specializes in inference. They hit 500-800 tokens/second on Llama 3.3 70B vs GPT-4o's ~110 tps. The trade-off: limited model selection and capacity caps.
How does input length affect latency?
Longer prompts increase TTFT, since the model must process the whole prompt before generating. Rule of thumb: roughly 50ms per 1K input tokens, though in practice TTFT often scales sub-linearly, so a 50K-token input adds about 1-3 seconds versus a 1K-token input. Generation time, by contrast, scales linearly with output length.
Can I reduce latency without changing models?
Yes. Cap max_tokens aggressively. Stream responses (UI feels faster). Use prompt caching to skip repeat input processing. Place compute close to users (regional endpoints). For real-time use, pick fast-tier models (Haiku, Flash, Groq).
Do these speeds reflect real production?
These are median values from public benchmarks at typical load. Peak-time latency can be 2-3x higher. Always test with your actual prompt size and traffic pattern. Also note: TTFT spikes during model launches and outages are common.