What this calculator estimates
When people talk about the cost of large language models, the conversation often jumps straight to training. In practice, many teams feel inference costs more often because inference is the bill that keeps arriving: every prompt, every completion, every chat turn, and every background API call triggers more compute. This calculator focuses on that recurring part of the picture. It converts a practical set of inputs into daily estimates for GPU runtime, electricity use, carbon emissions, and direct compute cost. Instead of hiding the logic in a black box, it keeps the assumptions visible so you can use it for quick planning, internal budgeting, sustainability reporting, or simple comparison work between deployment options.
The estimates are intentionally operational. You enter a model size in billions of parameters, the average tokens processed per request, how many requests arrive each day, and the approximate performance and power draw of the GPUs serving the workload. From there, the tool calculates how much floating point work is implied by your traffic, how long the GPUs would need to stay busy to process it, and how that runtime translates into kilowatt-hours, kilograms of CO2, and dollars. The result is not a full production simulator, but it is a useful baseline that makes tradeoffs legible. If you reduce tokens, choose more efficient hardware, or run in a cleaner grid region, you can see the effect immediately.
How to use the inputs well
The easiest way to get a sensible estimate is to think in terms of one average request. Start with model parameters, entered in billions. A dense 7B model should be entered as 7, a 13B model as 13, and so on. Then decide what you mean by tokens per request. For most operational planning, the most consistent choice is total tokens processed per request, meaning prompt tokens plus generated output tokens together. If you only care about generation cost, you can enter generated tokens only, but then you should interpret the whole estimate through that narrower lens.
Next, enter requests per day. This is the daily traffic you expect the system to serve. It can be a real measured value from logs, a forecast for launch, or a scenario number for planning. Per-GPU compute should represent a realistic sustained figure when possible. Peak marketing TFLOPS can be much higher than what a serving stack actually sustains under latency constraints, batching limits, memory pressure, and model sharding overhead. If you have benchmark data for your deployment, use that. If you do not, start with a conservative number rather than an optimistic one.
The last three inputs connect runtime to money and environmental impact. GPU cost per hour can be an on-demand cloud price, a reserved effective rate, or an amortized on-prem estimate that folds hardware, depreciation, and facility costs into one hourly number. GPU power draw should be a typical serving power in watts, not necessarily the board maximum on a data sheet. If you want to account for cooling and other facility overhead without making the model complicated, you can simply increase the power figure. Finally, grid CO2 intensity expresses how much carbon is associated with each kilowatt-hour in your region or provider mix. That input is what turns energy into emissions.
- Enter the dense model parameter count in billions.
- Enter average tokens per request, ideally prompt plus output together.
- Enter daily request volume.
- Enter per-GPU TFLOPS and the number of serving GPUs.
- Enter hourly GPU cost, power draw, and grid carbon intensity.
- Click Estimate Inference Cost to see daily totals and the detailed breakdown.
After you calculate, the Copy Result button copies a short summary line. That is useful when you want to drop the headline figures into a doc, email, or spreadsheet without retyping anything.
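The inputs described in this section can be gathered into a single record, which makes it easier to keep the units straight when you reproduce an estimate outside the page. The sketch below is a minimal illustration, not the calculator's own code: the class name `InferenceInputs`, its field names, and the helper `total_tokens` are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InferenceInputs:
    """Hypothetical container for the calculator inputs (names are illustrative)."""

    params_billions: float       # dense model size, e.g. 7 for a 7B model
    tokens_per_request: float    # ideally prompt plus output tokens
    requests_per_day: float      # measured, forecast, or scenario traffic
    tflops_per_gpu: float        # sustained serving figure, not peak
    gpu_count: int               # number of serving GPUs
    cost_per_gpu_hour: float     # dollars per GPU-hour
    power_watts_per_gpu: float   # typical serving draw in watts
    grid_kg_co2_per_kwh: float   # grid carbon intensity

    def __post_init__(self) -> None:
        # Throughput and GPU count are used as divisors, so they must be positive.
        if self.tflops_per_gpu <= 0 or self.gpu_count <= 0:
            raise ValueError("tflops_per_gpu and gpu_count must be positive")
        # No input is meaningful as a negative number.
        for field in fields(self):
            if getattr(self, field.name) < 0:
                raise ValueError(f"{field.name} must not be negative")


def total_tokens(prompt_tokens: float, output_tokens: float) -> float:
    """Total tokens processed per request: prompt plus generated output."""
    return prompt_tokens + output_tokens
```

For example, `InferenceInputs(7, total_tokens(150, 50), 50_000, 150, 8, 2.50, 300, 0.40)` describes a 7B model serving 200-token requests on 8 GPUs.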
Formula and assumptions
The calculator uses a deliberately simple approximation that is common in first-pass inference sizing for dense transformer models. The idea is that each token processed requires work that scales roughly with the number of model parameters. A widely used rule of thumb is about 2 × N FLOPs per token, where N is the parameter count (the value you enter in billions, multiplied by 10^9). That is not a law of nature, and real systems can do better or worse depending on architecture and serving optimizations, but it gives a transparent starting point.
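Written out, the approximation can be sketched as the chain of equations below. This is a reconstruction from the description on this page, not a verbatim copy of the calculator's internals. N, T, R, and G follow the page's own definitions. The remaining symbols are labels introduced here for illustration: P is sustained throughput per GPU in FLOP/s (TFLOPS × 10^12), W is power draw per GPU in watts, C is cost per GPU-hour, and I is grid intensity in kg CO2 per kWh. Scaling energy by the GPU count is an assumption.

$$
\begin{aligned}
F_{\text{token}} &= 2N \\
F_{\text{day}} &= 2N \cdot T \cdot R \\
t_{\text{hours}} &= \frac{F_{\text{day}}}{G \cdot P \cdot 3600} \\
E_{\text{kWh}} &= \frac{W}{1000} \cdot G \cdot t_{\text{hours}} \\
\text{CO}_2 &= E_{\text{kWh}} \cdot I \\
\text{Cost} &= C \cdot G \cdot t_{\text{hours}}
\end{aligned}
$$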
Throughout, N is the parameter count, T is tokens per request, R is requests per day, and G is the number of GPUs. Daily work is 2 × N × T × R FLOPs, and runtime is that work divided by the combined sustained throughput of all G GPUs. Once compute hours are known, emissions and cost are straightforward multiplications. Energy equals power draw times GPU count times runtime. Carbon equals energy times grid intensity. Daily cost equals GPU hourly price times GPU count times runtime. The calculator also estimates throughput in tokens per second and requests per second, then compares that capacity with your daily traffic to show utilization.
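The same chain of steps can be sketched in Python. This is a minimal illustration under the assumptions stated on this page (the 2N rule, per-GPU power and price scaled by GPU count); the function name `estimate_daily` and the dictionary keys are hypothetical and not taken from the calculator itself.

```python
def estimate_daily(
    params_billions: float,
    tokens_per_request: float,
    requests_per_day: float,
    tflops_per_gpu: float,
    gpu_count: int,
    cost_per_gpu_hour: float,
    power_watts_per_gpu: float,
    grid_kg_co2_per_kwh: float,
) -> dict:
    """First-pass daily estimate using the 2N FLOPs-per-token rule of thumb."""
    # 2 x N FLOPs per token, with N converted from billions to a raw count.
    flops_per_token = 2 * params_billions * 1e9
    flops_per_request = flops_per_token * tokens_per_request
    flops_per_day = flops_per_request * requests_per_day

    # Combined sustained throughput of the whole cluster in FLOP/s.
    cluster_flops_per_sec = tflops_per_gpu * 1e12 * gpu_count

    # Wall-clock hours the cluster must stay busy, and total GPU-hours.
    runtime_hours = flops_per_day / cluster_flops_per_sec / 3600
    gpu_hours = runtime_hours * gpu_count

    # Assumption: every GPU draws the stated power while active.
    energy_kwh = power_watts_per_gpu / 1000 * gpu_hours
    return {
        "flops_per_request": flops_per_request,
        "runtime_hours": runtime_hours,
        "gpu_hours": gpu_hours,
        "energy_kwh": energy_kwh,
        "co2_kg": energy_kwh * grid_kg_co2_per_kwh,
        "cost_usd": cost_per_gpu_hour * gpu_hours,
    }
```

Because every step is a plain multiplication or division, doubling tokens per request or model size doubles each output, which is the sensitivity the page describes.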
Why use such a simple model? Because in early planning, clarity matters. You often want to know whether a change in prompt length or model size matters more than a change in hourly GPU rate, or how much more efficient hardware might reduce both budget and emissions. A transparent estimate is easier to reason about than a more complex estimate whose assumptions are hidden. For many conversations, the direction and rough scale are more important than squeezing out every last percent of realism.
How to interpret the result
The headline result gives three daily numbers: energy, cost, and CO2. Energy is helpful when you care about power planning, sustainability, or comparing hardware efficiency. Cost translates the same runtime into a budget figure your team can act on. CO2 becomes especially useful when you are comparing regions or cloud providers with different electricity mixes. Because all three derive from the same runtime estimate, they are tightly connected: a faster or more efficient serving setup generally lowers runtime, which lowers both energy and cost, and lower energy in turn lowers emissions.
The table underneath adds context. FLOPs per request tells you the amount of compute implied by one average request. Compute time per day tells you how many total GPU-hours of active work are needed to process the traffic. Max requests per second is a rough capacity estimate under the current token assumption. Utilization compares your daily workload with that maximum. If utilization is very high, small traffic spikes or longer prompts may create latency pressure. If utilization is very low, you may be provisioning far more capacity than your average load actually needs.
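The capacity and utilization rows can be sketched as follows. This is an illustration under the same 2N assumption, with a hypothetical function name; it treats utilization as average requests per second divided by the maximum the cluster could sustain, which is one plausible reading of the description above.

```python
SECONDS_PER_DAY = 86_400


def capacity_and_utilization(
    params_billions: float,
    tokens_per_request: float,
    requests_per_day: float,
    tflops_per_gpu: float,
    gpu_count: int,
) -> dict:
    """Rough serving capacity and average utilization under the 2N rule."""
    cluster_flops_per_sec = tflops_per_gpu * 1e12 * gpu_count
    flops_per_token = 2 * params_billions * 1e9

    # Upper bound on tokens and requests the cluster can process each second.
    max_tokens_per_sec = cluster_flops_per_sec / flops_per_token
    max_requests_per_sec = max_tokens_per_sec / tokens_per_request

    # Average arrival rate if traffic were spread evenly over the day.
    avg_requests_per_sec = requests_per_day / SECONDS_PER_DAY

    return {
        "max_tokens_per_sec": max_tokens_per_sec,
        "max_requests_per_sec": max_requests_per_sec,
        "utilization": avg_requests_per_sec / max_requests_per_sec,
    }
```

Because the arrival rate is a daily average, a low utilization figure here says nothing about peak-hour headroom.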
Two smaller numbers are often useful in budgeting discussions: cost per request and cost per 1k tokens. These help translate infrastructure math into product pricing or unit economics. For example, if a product manager asks what happens when the average response grows from 200 tokens to 600 tokens, you can often answer that question more clearly by comparing the per-request and per-1k-token outputs than by only discussing daily totals.
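The 200-versus-600-token question can be checked with a short sketch. The function name `unit_costs` is hypothetical, and the figures used in the usage note are illustrative inputs rather than measured values. Under this linear model, GPU count and request volume cancel out of the unit costs.

```python
def unit_costs(
    params_billions: float,
    tokens_per_request: float,
    tflops_per_gpu: float,
    cost_per_gpu_hour: float,
) -> tuple[float, float]:
    """Return (cost per request, cost per 1k tokens) under the 2N rule."""
    # GPU-seconds of work one request needs on a single GPU.
    flops_per_request = 2 * params_billions * 1e9 * tokens_per_request
    gpu_seconds = flops_per_request / (tflops_per_gpu * 1e12)

    cost_per_request = gpu_seconds / 3600 * cost_per_gpu_hour
    cost_per_1k_tokens = cost_per_request / tokens_per_request * 1000
    return cost_per_request, cost_per_1k_tokens


if __name__ == "__main__":
    for tokens in (200, 600):
        per_request, per_1k = unit_costs(7, tokens, 150, 2.50)
        print(f"{tokens} tokens: ${per_request:.8f}/request, ${per_1k:.8f}/1k tokens")
```

With these assumptions, tripling the response length triples the cost per request while the cost per 1k tokens stays flat, because the model charges the same work for every token.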
Worked example
Suppose you serve a 7B model, your average request uses 200 tokens, and you handle 50,000 requests per day. Your deployment runs on 8 GPUs with an effective serving throughput of 150 TFLOPS each. The GPUs draw about 300 watts each during inference, your effective price is $2.50 per GPU-hour, and the electricity available to the deployment corresponds to 0.40 kg CO2 per kWh.
Using the 2N rule, the model needs about 14 billion FLOPs per token. Multiply that by 200 tokens and then by 50,000 requests to estimate daily work. Divide by the combined GPU throughput to get runtime. Once you have runtime, energy becomes per-GPU power times GPU count times runtime, cost becomes hourly rate times GPU count times runtime, and carbon becomes energy times grid intensity. The calculator performs those steps instantly, but the underlying logic stays visible. That matters because it shows where the estimate is most sensitive: model size, tokens per request, and sustained hardware throughput dominate the final answer.
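Carried through by hand, the arithmetic for this example looks roughly like the block below. It assumes the formulas are applied exactly as described, with energy and cost both scaled by the 8 GPUs, so the calculator's displayed figures may differ slightly through rounding.

$$
\begin{aligned}
F_{\text{token}} &= 2 \times 7 \times 10^{9} = 1.4 \times 10^{10}\ \text{FLOPs} \\
F_{\text{day}} &= 1.4 \times 10^{10} \times 200 \times 50{,}000 = 1.4 \times 10^{17}\ \text{FLOPs} \\
\text{Throughput} &= 8 \times 150 \times 10^{12} = 1.2 \times 10^{15}\ \text{FLOP/s} \\
t &= \frac{1.4 \times 10^{17}}{1.2 \times 10^{15}} \approx 116.7\ \text{s} \approx 0.0324\ \text{h} \\
E &\approx 8 \times 0.3\ \text{kW} \times 0.0324\ \text{h} \approx 0.078\ \text{kWh} \\
\text{Cost} &\approx \$2.50 \times 8 \times 0.0324 \approx \$0.65 \\
\text{CO}_2 &\approx 0.078 \times 0.40 \approx 0.031\ \text{kg}
\end{aligned}
$$

These totals cover active compute only. At this traffic level the cluster is busy for about two minutes per day, so a bill for GPUs provisioned around the clock would be far higher.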
If your result surprises you, the first thing to check is almost always the token count. Many teams underestimate total tokens by looking only at output length and forgetting prompt length, system instructions, retrieval context, or hidden tool traffic. The second thing to check is whether the TFLOPS figure reflects actual serving conditions rather than a theoretical maximum. Those two adjustments alone often explain most of the gap between a rough estimate and real-world bills.
Important limitations
This calculator is best treated as a planning aid, not a replacement for measurements. Real inference systems are shaped by more than pure multiply-add throughput. Memory bandwidth, batch formation, queueing policy, latency targets, quantization strategy, interconnect limits, KV cache reuse, speculative decoding, sparsity, and prompt length distribution all matter. Some of those factors reduce effective work per token; others cap the throughput you can actually sustain.
- Dense-model approximation: the 2N rule fits dense transformers best and can misrepresent mixture-of-experts (MoE) or sparsity-heavy models, where only part of the parameters is active for each token.
- Peak versus sustained performance: published TFLOPS may exceed what your serving stack can realize under production latency constraints.
- No explicit idle billing: the estimate is based on active compute time. If you keep GPUs provisioned around the clock, your real bill may be higher.
- Facility overhead is simplified: CPUs, memory, networking, storage, load balancers, and cooling are not broken out separately.
- Traffic variation matters: a daily average can hide burstiness. A system that looks comfortable on average may still need more headroom for peak hours.
None of those caveats make the calculator useless. They simply define its job. It is strongest when you want a defensible first estimate, when you want to compare scenarios on equal footing, or when you want to explain the shape of inference costs to someone who does not need a full performance engineering model.
Reference table: FLOPs per token by model size
The table below uses the same 2N approximation as the calculator. It provides a quick intuition check. Even before you plug in request volume, you can see why model choice is such a strong driver of serving economics.
| Model Size (B parameters) | Approximate FLOPs per token (billions) |
|---|---|
| 7 | 14 |
| 13 | 26 |
| 33 | 66 |
| 65 | 130 |
Practical notes for better estimates
For realistic planning, it helps to run several scenarios instead of one. Try a typical day and a busy day. Try a p50 token count and a p95 token count. Try a smaller model and a larger one. Compare a region with a clean grid and a region with a carbon-intensive grid. The goal is not merely to obtain one number, but to learn which inputs dominate the outcome. In many deployments, prompt length discipline and model choice produce larger savings than tiny changes in hourly GPU rate.
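A scenario sweep of this kind can be sketched in a few lines. Everything here is illustrative: the function name, the baseline values, and the scenario overrides are hypothetical choices, not recommendations. GPU count is omitted because it cancels out of total GPU-hours under this model.

```python
def daily_cost_and_co2(
    params_billions: float,
    tokens_per_request: float,
    requests_per_day: float,
    tflops_per_gpu: float,
    cost_per_gpu_hour: float,
    power_watts_per_gpu: float,
    grid_kg_co2_per_kwh: float,
) -> tuple[float, float]:
    """Return (daily cost in dollars, daily CO2 in kg) under the 2N rule."""
    flops_per_day = 2 * params_billions * 1e9 * tokens_per_request * requests_per_day
    gpu_hours = flops_per_day / (tflops_per_gpu * 1e12) / 3600
    cost = gpu_hours * cost_per_gpu_hour
    co2 = gpu_hours * power_watts_per_gpu / 1000 * grid_kg_co2_per_kwh
    return cost, co2


# Hypothetical baseline and one-at-a-time overrides for comparison.
BASE = dict(
    params_billions=7, tokens_per_request=200, requests_per_day=50_000,
    tflops_per_gpu=150, cost_per_gpu_hour=2.50,
    power_watts_per_gpu=300, grid_kg_co2_per_kwh=0.40,
)
SCENARIOS = {
    "baseline": {},
    "busy day (3x requests)": {"requests_per_day": 150_000},
    "p95 tokens (800)": {"tokens_per_request": 800},
    "larger model (13B)": {"params_billions": 13},
    "cleaner grid (0.10)": {"grid_kg_co2_per_kwh": 0.10},
    "cheaper GPUs (10% off)": {"cost_per_gpu_hour": 2.25},
}


def run_scenarios() -> dict:
    """Evaluate every scenario by overlaying its overrides on the baseline."""
    return {
        name: daily_cost_and_co2(**{**BASE, **overrides})
        for name, overrides in SCENARIOS.items()
    }


if __name__ == "__main__":
    for name, (cost, co2) in run_scenarios().items():
        print(f"{name:24s} cost=${cost:.4f}  co2={co2:.4f} kg")
```

Changing one input at a time makes it easy to see which lever moves the totals most; in this sketch the p95 token count quadruples cost while the 10% price cut trims it by a tenth.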
If you want to approximate data center overhead, a simple trick is to inflate the power draw input with a PUE-like adjustment. For instance, if you believe total facility energy is about 1.3 times GPU energy, multiply the wattage input by 1.3 before calculating. This keeps the page simple while still letting you bring the estimate closer to measured energy use. Likewise, if your cloud bill includes idle capacity, compare the calculator's active-runtime estimate with a 24-hour reserved-capacity estimate to bracket the likely real cost.
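Both adjustments are simple enough to sketch directly. The function names are hypothetical, the 1.3 factor is the illustrative figure from the paragraph above, and the bracket assumes every GPU is billed for the full 24 hours in the reserved case.

```python
def adjusted_power_watts(gpu_watts: float, pue_like_factor: float = 1.3) -> float:
    """Inflate the GPU power input to approximate facility overhead."""
    return gpu_watts * pue_like_factor


def cost_bracket(
    active_runtime_hours: float,
    gpu_count: int,
    cost_per_gpu_hour: float,
) -> tuple[float, float]:
    """Return (active-runtime cost, 24-hour reserved cost) for one day."""
    # Lower bound: pay only for the hours the GPUs are actually busy.
    active_cost = active_runtime_hours * gpu_count * cost_per_gpu_hour
    # Upper bound: pay for every GPU around the clock, busy or idle.
    reserved_cost = 24 * gpu_count * cost_per_gpu_hour
    return active_cost, reserved_cost


if __name__ == "__main__":
    print(adjusted_power_watts(300))        # wattage to enter instead of 300
    print(cost_bracket(0.0324, 8, 2.50))    # (active, reserved) in dollars
```

The real bill usually lands somewhere between the two figures, closer to the reserved number when capacity is provisioned for peak traffic.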
The most useful habit is consistency. If your team always counts prompt plus output tokens the same way, always uses the same sustained TFLOPS assumption for a given hardware profile, and always documents whether power includes overhead, then repeated runs of the calculator become a practical decision tool rather than just an isolated number generator.
Mini-game: Dispatch Rush
If you want a faster intuition for the same tradeoff this calculator models, try the optional mini-game below. Each request packet carries a token load, and your job is to route it into the most efficient GPU lane before the dispatch gate is missed. Small requests belong in the small lane, larger prompts belong in wider throughput windows, and mistakes build heat. The scoring loop mirrors the calculator's logic: token volume, throughput fit, and operating conditions determine whether a cluster runs efficiently or wastes energy.
The game uses your current calculator inputs as a theme for the run, so a page set up for a heavier workload feels different from a lighter one. Use touch or mouse to tap a lane, or press 1, 2, or 3 on a keyboard. A good streak lowers heat and lifts score; bad dispatches increase waste. It is optional, separate from the calculator result, and meant to teach the same concepts through action.
Tip: The most common way to waste score is the same way teams waste energy in production: sending large token loads through a lane sized for short requests.
