Introduction: what this calculator estimates
When serving large language models (LLMs) or other neural networks, you can often process multiple requests at once by grouping them into a batch. Batching can increase GPU utilization and reduce cost per token, but it can also increase end-user latency due to per-batch overhead and queueing. This page provides a simple, transparent model to estimate per-request latency, token throughput, request throughput, and cost per 1,000 tokens as you change batch size and token rates.
The model here is intentionally lightweight: it assumes a single effective token processing rate and a fixed per-batch overhead. That makes it useful for quick capacity planning, comparing configurations, and sanity-checking whether a target service-level objective (SLO) is feasible. For production decisions, validate with measurements from your actual serving stack (including P95/P99 latency and real traffic patterns).
How to use the calculator
- Enter your single-request token processing rate in tokens/sec. This should be a realized rate you observe (or expect) for one request on your hardware. If you only have a batch benchmark, you can approximate the single-request rate by dividing batch token throughput by batch size; this assumes throughput scales roughly linearly with batch size.
- Enter prompt tokens and completion tokens per request. Use typical, P50, or P95 values depending on what you want to plan for. If you are planning for tail latency, use a conservative token count.
- Choose a batch size. Larger batches usually increase throughput but can increase latency and may be limited by VRAM. If you use dynamic batching, the realized batch size may be smaller during low traffic.
- Set per-batch overhead (milliseconds). This captures fixed costs like request collation, padding, kernel launch overhead, synchronization, and framework overhead.
- Enter hardware cost per hour (USD). This can be your cloud GPU hourly price or an internal amortized cost.
- Click Compute to see the metrics. Use Copy Result to paste the table into a doc, ticket, or runbook.
Tip: if your server dispatches a batch when either (a) it reaches size B or (b) a timer expires, the realized batch size depends on traffic. Run multiple scenarios (for example, B=1, B=4, B=8) to see how sensitive the results are to it.
Formulas and assumptions
Let: R = single-request token processing rate (tokens/sec), P = prompt tokens per request, C = completion tokens per request, B = batch size (requests), O = per-batch overhead (ms), K = hardware cost ($/hour).
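Tokens per request:
$$T_r = P + C$$
Processing time per request (seconds):
$$T_{proc} = \frac{T_r}{R}$$
Total time per batch (seconds):
$$T_{batch} = \frac{O}{1000} + T_{proc}$$
In this simplified model, per-request latency is the batch time because all requests in the batch complete together.
Token throughput (tokens/sec):
$$\text{Token throughput} = R \times B$$
Request throughput (requests/sec):
$$\text{Request throughput} = \frac{R \times B}{T_r}$$
Cost per batch and cost per 1,000 tokens:
$$\text{Cost}_{batch} = \frac{K}{3600} \times T_{batch}, \qquad \text{Cost}_{1k} = \frac{\text{Cost}_{batch}}{B \times T_r} \times 1000$$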
Assumption summary: (1) near-linear scaling of token throughput with batch size, (2) fixed overhead per batch, (3) no explicit queueing delay, (4) token rate is treated as constant across prompt and generation. These assumptions are often good enough for first-order planning, but they can deviate from reality depending on model architecture, sequence lengths, and serving framework.
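The model in this section can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function and field names are hypothetical, and the cost formula assumes the batch's share of the hourly price is spread evenly over all tokens in the batch.

```python
from dataclasses import dataclass


@dataclass
class BatchEstimate:
    latency_ms: float            # per-request latency (equals batch time)
    token_throughput: float      # tokens/sec
    request_throughput: float    # requests/sec
    cost_per_1k_tokens: float    # USD per 1,000 tokens


def estimate_batch(rate, prompt_tokens, completion_tokens, batch_size,
                   overhead_ms, cost_per_hour):
    """First-order batch model: fixed overhead, linear throughput scaling."""
    tokens_per_request = prompt_tokens + completion_tokens   # Tr = P + C
    processing_s = tokens_per_request / rate                 # Tproc = Tr / R
    batch_s = overhead_ms / 1000.0 + processing_s            # Tbatch = O/1000 + Tproc
    token_throughput = rate * batch_size                     # R * B
    request_throughput = token_throughput / tokens_per_request
    cost_per_batch = cost_per_hour / 3600.0 * batch_s
    # Assumption: batch cost is divided across all B * Tr tokens in the batch.
    cost_per_1k = cost_per_batch / (batch_size * tokens_per_request) * 1000.0
    return BatchEstimate(batch_s * 1000.0, token_throughput,
                         request_throughput, cost_per_1k)


# Illustrative inputs: R=40, P=100, C=50, B=8, O=30 ms, K=$2.50/hour.
est = estimate_batch(rate=40, prompt_tokens=100, completion_tokens=50,
                     batch_size=8, overhead_ms=30, cost_per_hour=2.50)
print(est)
```

Note that batch size only enters the throughput and cost lines; latency in this sketch does not depend on it, which mirrors the "no explicit queueing delay" assumption.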
Worked example (step-by-step)
Suppose you measure R = 40 tokens/sec for a single request, and your typical request has P = 100 prompt tokens and C = 50 completion tokens. You plan to batch B = 8 requests, and you estimate a fixed overhead of O = 30 ms per batch. Your GPU costs K = $2.50/hour.
- Tokens per request: Tr = 100 + 50 = 150 tokens.
- Processing time: Tproc = 150 / 40 = 3.75 seconds.
- Batch time: Tbatch = 0.03 + 3.75 = 3.78 seconds → latency ≈ 3780 ms.
- Token throughput: 40 × 8 = 320 tokens/sec.
- Request throughput: 320 / 150 ≈ 2.13 req/sec.
- Cost: the batch costs (2.50 / 3600) × 3.78 ≈ $0.002625. Dividing across all 8 × 150 = 1,200 tokens in the batch gives 0.002625 / 1200 × 1000 ≈ $0.0022 per 1,000 tokens.
The table below shows the same scenario in a compact form (values are illustrative and will vary with your hardware and model):
| Metric | Value |
|---|---|
| Per-request Latency (ms) | 3780.0 |
| Token Throughput (tokens/sec) | 320 |
| Request Throughput (req/sec) | 2.13 |
| Cost per 1000 tokens ($) | ≈ 0.0022 |
Interpretation: in this model, throughput scales in proportion to batch size, while latency is set by tokens per request plus overhead and does not change with B. If you reduce batch size (for example, to cut queueing delay in a real system), expect proportionally lower throughput and a higher cost per token, because the same batch time is spread over fewer tokens.
Limitations and practical considerations
This calculator is a first-order model. Real systems can differ due to:
- Queueing delay: if requests arrive asynchronously, users may wait in a batching queue until enough requests accumulate or a dispatch timer fires. That waiting time can dominate tail latency and is not included here.
- Variable sequence lengths and padding: many runtimes pad to the longest sequence in the batch, wasting compute on shorter requests. If you see heavy padding, consider using effective token counts that reflect padded lengths.
- Nonlinear scaling: throughput may not scale linearly with batch size due to memory bandwidth limits, KV-cache pressure, kernel inefficiencies, or scheduler overhead.
- Different rates for prefill vs decode: some stacks have very different performance characteristics for prompt processing (prefill) versus token generation (decode). A single rate R may hide that split.
- Streaming and tool calls: streaming responses, retrieval steps, or tool/function execution can reduce batching opportunities and add external latency.
- VRAM constraints: batch size is often capped by memory. If you hit OOM, you may need smaller batches, shorter contexts, quantization, or KV-cache optimizations.
Use this page to explore trade-offs quickly, then validate with profiling and load tests (including P95/P99 latency) before committing to an SLO.
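To get a rough feel for the queueing delay that this calculator leaves out, the sketch below estimates how long requests wait for a batch to fill. It rests on assumptions that are not part of the calculator: Poisson arrivals at a steady rate, dispatch as soon as the batch is full, and a crude cap for a dispatch timer. The function name and parameters are hypothetical.

```python
def batching_wait_s(arrival_rate, batch_size, max_wait_s=None):
    """Rough mean and first-arrival wait (seconds) to fill a batch.

    Assumes Poisson arrivals at `arrival_rate` requests/sec.
    """
    if arrival_rate <= 0 or batch_size < 1:
        raise ValueError("arrival_rate must be > 0 and batch_size >= 1")
    # The first request in a batch waits for B - 1 more arrivals,
    # each taking 1 / arrival_rate seconds on average.
    first_wait = (batch_size - 1) / arrival_rate
    # Averaged over all B positions in the batch, the wait is half of that.
    mean_wait = first_wait / 2.0
    if max_wait_s is not None:
        # Crude cap: a dispatch timer bounds how long any request waits.
        first_wait = min(first_wait, max_wait_s)
        mean_wait = min(mean_wait, max_wait_s)
    return mean_wait, first_wait


mean_wait, first_wait = batching_wait_s(arrival_rate=2.0, batch_size=8)
print(f"mean wait ~{mean_wait:.2f} s, first request ~{first_wait:.2f} s")
```

Adding an estimate like this to the computed batch time gives a more cautious latency figure at low traffic, but it is no substitute for load testing.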
Planning guidance: turning calculator outputs into decisions
The four outputs are most useful when you connect them to a concrete goal. Here are common planning questions and how to use the metrics:
- “Can we hit a latency target?” Compare per-request latency to your SLO. If the computed latency is already above your target without queueing, you likely need to reduce tokens per request, increase the token rate (faster hardware or better kernels), or reduce overhead.
- “How many GPUs do we need?” Use request throughput to estimate capacity. If your expected arrival rate is 10 req/sec and the calculator shows 2.5 req/sec per GPU, you need at least 4 GPUs for average load, plus headroom for bursts and tail latency.
- “What does batching do to unit economics?” Use cost per 1,000 tokens to compare against internal budgets or external API pricing. If your revenue is $0.10/1k tokens and your cost is $0.03/1k tokens at batch size 8, you have margin—until traffic drops and realized batch size shrinks.
- “What should we set as max batch size?” Treat batch size as a lever. Increase B until you see diminishing returns in real benchmarks or until memory limits appear. In many stacks, the best setting is not the maximum possible batch size, but the maximum that still keeps latency within budget.
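The GPU-count question above reduces to a small calculation. The sketch below is illustrative only; `target_utilization` is a hypothetical headroom parameter (not a calculator input) that keeps each GPU below full load to absorb bursts.

```python
import math


def gpus_needed(arrival_rate, per_gpu_throughput, target_utilization=1.0):
    """Minimum GPU count for a given arrival rate (requests/sec).

    `target_utilization` (0 < u <= 1) is an assumed headroom factor:
    0.7 means each GPU is planned to run at 70% of its modeled throughput.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable = per_gpu_throughput * target_utilization
    return math.ceil(arrival_rate / usable)


# 10 req/sec arriving, 2.5 req/sec per GPU from the calculator.
print(gpus_needed(10, 2.5))                          # average load only
print(gpus_needed(10, 2.5, target_utilization=0.7))  # with burst headroom
```

With no headroom this reproduces the 4-GPU figure from the text; planning for 70% utilization raises it to 6.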
A practical workflow is to run three scenarios: (1) best case (high batch size, low overhead), (2) typical (expected batch size), and (3) worst case (batch size 1). If the worst case is unacceptable, you may need autoscaling, request shaping, or a fallback model.
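The three-scenario workflow can be scripted as a quick sweep. This is a sketch under the same first-order model; the scenario values (batch sizes and overheads) are placeholders to replace with your own, and the helper name is hypothetical.

```python
def latency_throughput_cost(rate, tokens, batch_size, overhead_ms, cost_per_hour):
    """Return (latency_ms, requests/sec, USD per 1,000 tokens)."""
    batch_s = overhead_ms / 1000.0 + tokens / rate
    req_per_s = rate * batch_size / tokens
    # Assumption: batch cost is spread evenly over all tokens in the batch.
    cost_per_1k = cost_per_hour / 3600.0 * batch_s / (batch_size * tokens) * 1000.0
    return batch_s * 1000.0, req_per_s, cost_per_1k


# (batch_size, overhead_ms) per scenario; illustrative values only.
scenarios = {"best": (16, 10), "typical": (8, 30), "worst": (1, 30)}

rows = {}
for name, (batch_size, overhead_ms) in scenarios.items():
    rows[name] = latency_throughput_cost(
        rate=40, tokens=150, batch_size=batch_size,
        overhead_ms=overhead_ms, cost_per_hour=2.50)
    latency_ms, req_per_s, cost_per_1k = rows[name]
    print(f"{name:8s} latency={latency_ms:7.1f} ms  "
          f"req/s={req_per_s:5.2f}  $/1k tokens={cost_per_1k:.4f}")
```

Comparing the worst-case row against your budget shows how much margin depends on traffic keeping batches full.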
Notes on choosing realistic inputs
The calculator is only as good as the inputs. The suggestions below help you pick values that match what happens in production.
- Token rate (R): measure it on the same model, precision, and runtime you plan to deploy. If you use speculative decoding, quantization, or tensor parallelism, the effective rate can change significantly.
- Prompt and completion tokens: use a distribution-aware value. For customer-facing chat, P95 completion length can be much larger than the median. If you cap output tokens, reflect that cap.
- Overhead (O): include fixed work that happens once per batch (collation, padding, host-to-device copies, synchronization). If you are unsure, start with 10–50 ms and refine.
- Cost per hour (K): include the full cost you care about. For cloud, that is usually the instance hourly price. For on-prem, you might use amortized hardware + power + ops.
If you want to approximate padding waste, one simple approach is to inflate token counts to the padded length. For example, if your typical prompt is 220 tokens but you pad to 256, use 256 in the prompt field. This is not perfect, but it often moves estimates in the right direction.
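A small helper can make the padding adjustment concrete. The sketch below assumes two common padding schemes, rounding up to a fixed bucket size and padding to the longest sequence in a batch; your runtime may do neither, and the function names and sample lengths are hypothetical.

```python
def pad_to_bucket(tokens, bucket):
    """Round a token count up to the next multiple of `bucket`."""
    return -(-tokens // bucket) * bucket  # ceiling division, then scale back


def padding_waste(lengths):
    """Fraction of computed tokens that are padding when every
    sequence is padded to the longest one in the batch."""
    padded_total = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_total


padded_prompt = pad_to_bucket(220, 64)        # value to enter as prompt tokens
waste = padding_waste([220, 100, 256, 180])   # illustrative batch of prompts
print(padded_prompt, f"{waste:.1%}")
```

If the waste fraction is large, entering the padded length rather than the typical length keeps the latency and cost estimates from being too optimistic.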
Mini FAQ
Does batching always reduce latency?
Not necessarily. In this model, per-request latency does not depend on batch size at all: processing time is based on tokens per request, and the overhead is a fixed amount per batch. In real systems, batching can increase latency due to queueing (waiting to form a batch) and due to padding or memory pressure. Batching is primarily a throughput and cost optimization; latency improvements usually come from faster kernels, smaller token counts, or better scheduling.
Why does the calculator use a single token rate for both prompt and generation?
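Many serving stacks have different performance for prefill (prompt) and decode (generation). This calculator uses one rate to stay simple. If you have separate rates, convert tokens into time at each rate and then back into a single effective rate: R = (P + C) / (P / R_prefill + C / R_decode), where R_prefill and R_decode are your measured prefill and decode rates. This is a token-weighted harmonic mean; a token-weighted arithmetic average of the two rates is a rougher approximation that overstates the effective rate.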
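As a minimal sketch of the time-based conversion, the function below turns separate prefill and decode rates into one effective rate for the calculator's token-rate field. The function name and the example rates are hypothetical; substitute your own measurements.

```python
def effective_rate(prompt_tokens, completion_tokens, prefill_rate, decode_rate):
    """Single token rate (tokens/sec) that gives the same total time as
    processing the prompt at `prefill_rate` and the completion at `decode_rate`."""
    total_time_s = prompt_tokens / prefill_rate + completion_tokens / decode_rate
    return (prompt_tokens + completion_tokens) / total_time_s


# Illustrative rates: fast prefill (1000 tokens/sec), slower decode (40 tokens/sec).
r_eff = effective_rate(prompt_tokens=100, completion_tokens=50,
                       prefill_rate=1000, decode_rate=40)
print(f"effective rate: {r_eff:.1f} tokens/sec")
```

Because the effective rate depends on the prompt-to-completion ratio, recompute it whenever you change the token counts.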
How should I interpret “cost per 1,000 tokens”?
It is the infrastructure cost implied by your hourly hardware price and the computed batch time, divided across all tokens in the batch. It does not include engineering time, networking, storage, or other platform costs. Use it as a unit-cost baseline, not a full accounting model.
