Batch Inference Throughput and Latency Calculator

JJ Ben-Joseph

Enter batching parameters to evaluate throughput and cost.

Understanding the Trade-offs of Batch Inference

Large language models and other deep neural architectures often process individual requests sequentially, but modern hardware enables executing multiple requests in parallel by grouping them into a batch. Batching can dramatically increase throughput and reduce per-request cost, yet it introduces complexities such as queueing delays, variable sequence lengths, and memory constraints. This calculator offers a transparent way to explore how batch size and token processing rate affect both latency and resource efficiency, empowering teams to design serving stacks that balance responsiveness with cost savings.

At its core, the calculator models the time required to process a batch of requests. Let R denote the single-request token processing rate in tokens per second. Each request contains P prompt tokens and produces C completion tokens, yielding a per-request token count T_r = P + C. When B requests are grouped into a batch, the model assumes near-perfect parallelism, meaning the core token processing time stays close to T_{proc} = T_r / R seconds. However, batching introduces an overhead O to marshal tensors, synchronize kernels, or pad sequences, so the total time per batch becomes T_{batch} = O + T_{proc}. Because every request in the batch shares this time, the per-request latency is also T_{batch}.
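To make the model concrete, here is a minimal Python sketch of the same relationship. The function name and the split into prompt and completion tokens are illustrative only, since the model cares about the total T_r:

```python
def batch_latency_s(prompt_tokens: int, completion_tokens: int,
                    rate_tps: float, overhead_s: float) -> float:
    """Idealized per-request latency: T_batch = O + (P + C) / R,
    shared by every request in the batch."""
    tokens_per_request = prompt_tokens + completion_tokens  # T_r = P + C
    processing_s = tokens_per_request / rate_tps            # T_proc = T_r / R
    return overhead_s + processing_s                        # T_batch = O + T_proc
```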

Throughput reflects how many tokens or requests the system can process per second when batching. With perfect scaling, token throughput grows linearly with batch size: Throughput_{tokens} = R × B. To derive requests per second, divide the token throughput by the tokens per request: Throughput_{req} = (R × B) / T_r. These metrics guide decisions about backend capacity. For a given service-level objective, engineers can experiment with larger or smaller batches to hit latency targets while utilizing the hardware fully.
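A corresponding sketch for the throughput formulas, using hypothetical helper names and the same perfect-scaling assumption:

```python
def token_throughput_tps(rate_tps: float, batch_size: int) -> float:
    # Perfect scaling: token throughput grows linearly with batch size.
    return rate_tps * batch_size

def request_throughput_rps(rate_tps: float, batch_size: int,
                           tokens_per_request: int) -> float:
    # Requests per second = token throughput divided by tokens per request.
    return rate_tps * batch_size / tokens_per_request
```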

Cost is another critical dimension, especially when inference happens on cloud GPUs billed by the hour. Suppose the hardware cost is K dollars per hour. Each batch consumes T_{batch} / 3600 hours, so the cost per batch is Cost_{batch} = K × T_{batch} / 3600. Converting to cost per token or per request helps compare against API pricing: dividing by the B × T_r tokens processed per batch gives cost per token, and multiplying by 1000 yields the cost per thousand tokens: Cost_{1k} = (K × T_{batch} × 1000) / (3600 × B × T_r). The calculator performs this conversion automatically.
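The cost conversion can be sketched the same way; the function name is illustrative and the body simply restates Cost_{1k} above:

```python
def cost_per_1k_tokens(gpu_hourly_usd: float, batch_time_s: float,
                       batch_size: int, tokens_per_request: int) -> float:
    """Cost_{1k} = K * T_batch * 1000 / (3600 * B * T_r)."""
    cost_per_batch = gpu_hourly_usd * batch_time_s / 3600.0  # dollars per batch
    tokens_per_batch = batch_size * tokens_per_request       # B * T_r
    return cost_per_batch / tokens_per_batch * 1000.0
```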

The table below demonstrates a sample scenario using a rate of 40 tokens/sec, 150 total tokens per request, batch size 8, 30 ms overhead, and a $2.50 hourly GPU rate:

Metric | Value
------ | -----
Per-request latency (ms) | 3780.0
Token throughput (tokens/sec) | 320
Request throughput (req/sec) | 2.13
Cost per 1000 tokens ($) | 0.0022
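Plugging the sample inputs into the sketches above reproduces the table; the 100/50 prompt/completion split is arbitrary because only the 150-token total matters:

```python
t_batch = batch_latency_s(prompt_tokens=100, completion_tokens=50,
                          rate_tps=40, overhead_s=0.030)
print(f"per-request latency: {t_batch * 1000:.1f} ms")                          # 3780.0 ms
print(f"token throughput:    {token_throughput_tps(40, 8):.0f} tokens/sec")     # 320
print(f"request throughput:  {request_throughput_rps(40, 8, 150):.2f} req/sec") # 2.13
print(f"cost per 1k tokens:  ${cost_per_1k_tokens(2.50, t_batch, 8, 150):.4f}") # ~$0.0022
```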

While the numbers are illustrative, they reveal typical trade-offs. With batch size 8, each request experiences roughly 3.8 seconds of latency, but the system processes 320 tokens per second. Serving requests individually would shave off only the batching overhead and any queueing delay, yet throughput, and thus cost efficiency, would fall by a factor of eight. Many organizations therefore implement dynamic batching queues that accumulate requests for a short window before dispatching them together, balancing latency budgets with GPU utilization.

Batching becomes more complex when request lengths vary. In frameworks like Transformers, sequences must often be padded to the longest in the batch, meaning shorter requests waste computation. Advanced scheduling algorithms group similarly sized prompts together or split long sequences into micro-batches to mitigate padding overhead. Users of this calculator can approximate such scenarios by adjusting the prompt and completion token counts to reflect padded lengths rather than actual lengths.
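One way to approximate the padding penalty is to treat every request in the batch as if it were as long as the longest one; a rough sketch under that assumption:

```python
def padded_batch_latency_s(sequence_lengths: list[int],
                           rate_tps: float, overhead_s: float) -> float:
    """Naive padded batching: each request is processed as if it were as long
    as the longest sequence in the batch, so shorter requests waste compute."""
    padded_len = max(sequence_lengths)
    return overhead_s + padded_len / rate_tps

# Four requests averaging ~213 tokens still pay for 400 padded tokens each.
print(padded_batch_latency_s([120, 400, 180, 150], rate_tps=40, overhead_s=0.03))  # ~10.03 s
```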

Memory limitations also cap batch size. Each additional request adds its own activations for forward and backward passes (when training in batches) or its own key/value cache during inference, so memory grows roughly linearly with batch size. On a 40 GB GPU, long sequences may restrict batches to only a few requests. Quantization or activation offloading can alleviate memory pressure, enabling larger batches and better throughput. The calculator focuses on timing and cost, but practitioners should cross-reference memory requirements, perhaps using the companion VRAM Requirement Calculator elsewhere in this project.
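For a rough sense of why memory caps batch size, the sketch below estimates the key/value cache footprint of a dense transformer. The layer counts and dimensions are illustrative assumptions loosely resembling a 7B-parameter model, not measurements:

```python
def kv_cache_gib(batch_size: int, seq_len: int, num_layers: int,
                 num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: two tensors (K and V) per layer, each of shape
    [batch, kv_heads, seq_len, head_dim], stored at bytes_per_elem precision."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / (1024 ** 3)

# Assumed 7B-class settings: 32 layers, 32 KV heads, head_dim 128, fp16 cache, 4k context.
for b in (1, 8, 16):
    print(b, round(kv_cache_gib(b, seq_len=4096, num_layers=32,
                                num_kv_heads=32, head_dim=128), 1), "GiB")  # 2.0, 16.0, 32.0
```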

Queueing delay is another hidden cost. If requests arrive faster than the batch can process them, new requests wait until the next batch. A naive approach might simply batch as many requests as possible, but this can induce tail latency spikes. Production systems often define a maximum queue time: the batch dispatches either when it reaches the desired size or the timer expires, whichever comes first. Modeling such behavior analytically is tricky, but this calculator offers a starting point by revealing how latency scales with batch size under ideal conditions.
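A common implementation of that dispatch policy is sketched below: gather requests from a thread-safe queue until the batch fills or the timer expires. The function and parameter names are hypothetical:

```python
import time
from queue import Queue, Empty

def collect_batch(request_queue: Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Dispatch policy sketch: return a batch once it holds max_batch_size
    requests or once max_wait_s has elapsed, whichever comes first."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch
```

Real servers typically start the timer when the first request of a pending batch arrives; this sketch starts it when the collector begins waiting, which is close enough for illustration.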

The mathematics underpinning batch throughput highlight the benefit of parallel hardware. Because tokens from multiple requests are processed concurrently, the effective throughput is proportional to B, yet latency remains dominated by sequence length and overhead. In symbols, per-request latency remains O + T_r / R regardless of B, while throughput increases linearly. This asymmetry motivates micro-batching strategies that accumulate a handful of requests—enough to keep the GPU busy—without letting latency balloon uncontrollably.
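The asymmetry is easy to see numerically with a short loop over batch sizes using the sample inputs from the table:

```python
# Under the ideal model, per-request latency is flat in B while throughput scales linearly.
R, T_r, O = 40.0, 150, 0.03
for B in (1, 2, 4, 8, 16):
    latency_s = O + T_r / R   # independent of B
    tokens_per_s = R * B      # linear in B
    print(f"B={B:2d}  latency={latency_s:.2f} s  throughput={tokens_per_s:.0f} tokens/sec")
```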

Cost analysis is similarly nuanced. Suppose API users are charged $0.10 per thousand tokens while your infrastructure cost is roughly $0.002 per thousand at a batch size of 8, as in the sample scenario above. Margins depend heavily on maintaining that batch size: if traffic drops so that batches shrink to size 1, cost per thousand rises roughly eightfold, and once idle GPU time is factored in it can climb past what you charge. Operators therefore track realized batch size in production and may scale down GPUs during off-peak hours to avoid paying for underutilized hardware.
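One way to see the sensitivity, under the added assumption that idle GPU time still accrues hourly cost, is to divide the ideal cost by a utilization factor. The function and the utilization figures below are illustrative:

```python
def realized_cost_per_1k(gpu_hourly_usd: float, overhead_s: float, rate_tps: float,
                         tokens_per_request: int, batch_size: int,
                         utilization: float) -> float:
    """Cost per 1k tokens when the GPU is busy only `utilization` of the time;
    utilization=1.0 reduces to the ideal formula used elsewhere on this page."""
    t_batch = overhead_s + tokens_per_request / rate_tps
    ideal = gpu_hourly_usd * t_batch * 1000.0 / (3600.0 * batch_size * tokens_per_request)
    return ideal / utilization

print(realized_cost_per_1k(2.50, 0.03, 40, 150, batch_size=8, utilization=1.0))   # ~0.0022
print(realized_cost_per_1k(2.50, 0.03, 40, 150, batch_size=1, utilization=0.10))  # ~0.175
```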

Some workloads introduce additional overhead beyond the fixed per-batch figure. For example, streaming responses require keeping connections open and may flush tokens to the client as they are generated, limiting batching opportunities. Others involve retrieval-augmented generation where each request performs vector searches; these external calls do not benefit from batching unless the search engine supports it. In such cases, the model throughput calculated here may overestimate actual service throughput. Teams should augment the calculator’s results with empirical measurements.

Despite its simplifications, the Batch Inference Throughput and Latency Calculator provides an accessible lens into a complex optimization problem. By exposing fundamental relationships among rate, sequence length, batch size, and cost, it encourages experimentation and evidence-based capacity planning. Whether you are tuning a hobby project on a single GPU or operating a global-scale API, understanding batching dynamics helps deliver faster, cheaper, and more predictable results to end users.

Extending this model is straightforward. One might incorporate probabilistic arrival processes to estimate queueing delay, or model diminishing returns when kernel launches saturate memory bandwidth. Another extension could consider heterogeneous batches where some requests use advanced features like tool calling or function execution, altering token processing rates. Regardless, the core formulas here remain a valuable first approximation that demystifies the batching lever available in modern deep learning frameworks.
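As a starting point for the probabilistic-arrival extension, here is a self-contained Monte Carlo sketch of dynamic batching with Poisson arrivals; all names and parameter values are illustrative assumptions, not part of the calculator:

```python
import random

def simulated_mean_latency_s(arrival_rate_rps: float, batch_service_s: float,
                             max_batch: int, max_wait_s: float,
                             n_requests: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo sketch of dynamic batching with Poisson arrivals: a batch
    dispatches when it holds max_batch requests or when its oldest request has
    waited max_wait_s (and the server is free), whichever comes first.
    Returns the mean arrival-to-completion latency in seconds."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate_rps)
        arrivals.append(t)

    server_free_at = 0.0
    total_latency = 0.0
    i = 0
    while i < len(arrivals):
        ready = max(arrivals[i], server_free_at)        # earliest possible dispatch
        timeout = max(ready, arrivals[i] + max_wait_s)  # latest dispatch if not full
        j = i + 1
        while j < len(arrivals) and j - i < max_batch and arrivals[j] <= timeout:
            j += 1
        full = (j - i) == max_batch
        dispatch = max(ready, arrivals[j - 1]) if full else timeout
        finish = dispatch + batch_service_s
        total_latency += sum(finish - arrivals[k] for k in range(i, j))
        server_free_at = finish
        i = j
    return total_latency / n_requests

# Example: 1.5 requests/sec arriving at random, 3.78 s per batch, batches of up
# to 8, and a 100 ms maximum queue time.
print(f"{simulated_mean_latency_s(1.5, 3.78, 8, 0.10):.2f} s mean latency")
```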

Related Calculators

AI Inference Energy Cost Calculator - Estimate Electricity Use

Estimate energy consumption, electricity cost, and carbon emissions for running AI inference workloads. Enter token counts, throughput, GPU wattage, and energy price.


Model Ensemble Inference Cost Calculator

Analyze latency and expense of deploying multiple models together in an ensemble for inference.


Inference Autoscaling Cost Calculator

Plan GPU instance counts, cold start latency, and monthly spend when autoscaling inference services.
