AI Inference Energy Cost Calculator

Estimate the real electricity footprint of serving AI output

AI inference often gets discussed in abstract language such as tokens, latency, and model quality, but the underlying work is physical. GPUs stay busy for a measurable amount of time, they draw power while they run, utilities charge for that electricity, and every kilowatt-hour has some associated carbon intensity depending on the grid. This calculator turns that chain into a practical estimate. If you know roughly how many tokens you need to process, how fast each GPU can serve them, how many GPUs are active, and how much power those GPUs draw, you can produce a useful first-pass estimate of runtime, electricity cost, and emissions without building a spreadsheet.

That makes the tool useful in several situations. A developer can use it to estimate what a product launch might cost before traffic arrives. An infrastructure team can compare a faster but hungrier deployment against a slower, more efficient one. A sustainability lead can turn serving volume into an approximate carbon figure for internal reporting. Even if you already monitor production systems, a calculator like this is still valuable because it helps you reason about hypothetical scenarios: a larger workload, a different GPU count, a new region with cheaper power, or a cleaner grid mix.

The most important idea is that the calculator is not guessing blindly. It follows a straightforward causal chain. Tokens determine how much work must be done. Throughput determines how quickly that work finishes. Runtime, multiplied by power, determines energy. Energy multiplied by electricity price gives cost, and energy multiplied by grid carbon intensity gives emissions. Once you see the chain clearly, the numbers become easier to sanity-check and discuss with other people.

How to choose each input so the estimate means what you think it means

Tokens to Process should represent the total token workload over the period you care about. For a large language model, that usually means prompt tokens plus generated tokens across all requests, not just the output tokens. If you are estimating a batch job, enter the total tokens in that job. If you are planning for a day of traffic, enter the day-level total. The calculator does not care whether the tokens come from one very long request or many shorter requests; it only sees the total amount of text the system must handle. That is why being explicit about the planning horizon matters.

Throughput (tokens/sec per GPU) is the input most people misread. The label is per GPU, not total cluster throughput. If your benchmark says a four-GPU system serves 480 tokens per second in aggregate, a good per-GPU entry is 120, not 480. Using total cluster throughput in the per-GPU field would make the model look four times faster than it really is. The best throughput value comes from a benchmark that resembles your production workload in batch size, sequence length, model size, quantization settings, and precision. If you only have a rough number, be conservative and test a lower and higher scenario.
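The conversion described above is small enough to sketch in a few lines. The function name `per_gpu_throughput` is an illustrative assumption, not part of the calculator; the 10,000,000-token workload is borrowed from the worked example on this page.

```python
def per_gpu_throughput(aggregate_tokens_per_sec: float, gpu_count: int) -> float:
    """Convert a cluster-wide benchmark figure into the per-GPU value the field expects."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be a positive integer")
    return aggregate_tokens_per_sec / gpu_count


# Benchmark from the text: a four-GPU system serving 480 tokens/sec in aggregate.
per_gpu = per_gpu_throughput(480, 4)

# Effect of the common mistake: entering the aggregate figure in the per-GPU field.
tokens = 10_000_000
correct_runtime = tokens / (per_gpu * 4)
mistaken_runtime = tokens / (480 * 4)
ratio = correct_runtime / mistaken_runtime

print(f"per-GPU throughput: {per_gpu:.0f} tokens/sec")          # 120 tokens/sec
print(f"mistaken entry understates runtime by {ratio:.1f}x")    # 4.0x
```

The ratio equals the GPU count, which is why this mistake is easy to spot once you know to look for it.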

Number of GPUs should reflect how many devices are actively doing the inference work for the workload you entered. This is the knob that trades latency against hardware footprint. More GPUs usually reduce runtime, but they do not automatically reduce total energy. In an ideal linear world, doubling GPU count doubles total power draw while halving runtime, leaving energy roughly unchanged. Real systems are messier: memory pressure, communication overhead, underutilization, and batching effects can all push the result up or down. Still, the ideal case is a useful baseline because it explains why adding hardware is often more about time and concurrency than magical energy savings.

Power per GPU should be an estimate of sustained power draw during inference, not necessarily the nameplate thermal design power. If you have monitoring data from a similar deployment, use the average wattage under load. If you only know the card's advertised maximum, that can act as an upper bound, but it may overstate energy if your workload is lighter than the card's worst case. On the other hand, entering idle power would understate the result. Try to match the number to the operating point your throughput measurement came from, because throughput and power are linked.

Electricity Price and Grid Carbon Intensity translate engineering activity into business and environmental terms. Price should be in dollars per kilowatt-hour. Carbon intensity should be in grams of CO₂e per kilowatt-hour. If you run in a colocation facility or cloud region with time-of-use pricing, you may want to test several values instead of relying on a single national average. The same is true for carbon intensity: a renewable-heavy grid can make the exact same hardware workload look much cleaner than a fossil-heavy one.

What the calculator computes

The calculator outputs four related quantities. First, it estimates runtime in seconds from total tokens divided by total serving throughput. Second, it converts runtime and GPU power into energy used in kilowatt-hours. Third, it applies your electricity price to estimate cost. Fourth, it multiplies energy by grid carbon intensity to estimate carbon emissions in kilograms of CO₂e. Those results are simple on purpose. They give you a clean baseline before you layer on more complex factors such as cooling overhead, networking, storage, or cloud markups.

The formulas used by the page are the direct versions of that logic:

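\[
\begin{aligned}
\text{time (s)} &= \frac{\text{tokens}}{\text{throughput per GPU} \cdot \text{GPUs}} \\[4pt]
\text{energy (kWh)} &= \frac{\text{power per GPU (W)} \cdot \text{GPUs} \cdot \text{time (s)}}{1000 \cdot 3600} \\[4pt]
\text{cost} &= \text{energy (kWh)} \cdot \text{electricity price} \\[4pt]
\text{emissions (kg CO}_2\text{e)} &= \frac{\text{energy (kWh)} \cdot \text{grid carbon intensity (g CO}_2\text{e per kWh)}}{1000}
\end{aligned}
\]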
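For readers who would rather check the formulas in code, here is a minimal sketch of the same chain. The names `InferenceEstimate` and `estimate_inference` are assumptions made for illustration and are not taken from the page's own script; the inputs are the ones used in the worked example further down the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceEstimate:
    runtime_seconds: float
    energy_kwh: float
    cost_dollars: float
    emissions_kg: float


def estimate_inference(
    tokens: float,
    throughput_per_gpu: float,   # tokens/sec per GPU
    gpus: int,
    watts_per_gpu: float,        # sustained draw during inference
    price_per_kwh: float,        # dollars per kWh
    grams_co2e_per_kwh: float,   # grid carbon intensity
) -> InferenceEstimate:
    if throughput_per_gpu <= 0 or gpus <= 0:
        raise ValueError("throughput_per_gpu and gpus must be positive")
    # time = tokens / (throughput per GPU * GPUs)
    runtime_seconds = tokens / (throughput_per_gpu * gpus)
    # watts * seconds gives joules; dividing by 1000 * 3600 converts to kWh
    energy_kwh = watts_per_gpu * gpus * runtime_seconds / (1000 * 3600)
    cost_dollars = energy_kwh * price_per_kwh
    # grams to kilograms
    emissions_kg = energy_kwh * grams_co2e_per_kwh / 1000
    return InferenceEstimate(runtime_seconds, energy_kwh, cost_dollars, emissions_kg)


result = estimate_inference(10_000_000, 120, 4, 300, 0.12, 400)
print(f"{result.runtime_seconds / 3600:.2f} h")   # 5.79 h
print(f"{result.energy_kwh:.4f} kWh")             # 6.9444 kWh
print(f"${result.cost_dollars:.2f}")              # $0.83
print(f"{result.emissions_kg:.2f} kg CO2e")       # 2.78 kg CO2e
```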

If you prefer the abstract view, the same model can be read as a function of several inputs and a weighted combination of factors. Those generic forms are preserved below because they describe the same idea at a higher level.

\[
R = f(x_1, x_2, \ldots, x_n) \qquad T = \sum_{i=1}^{n} w_i \cdot x_i
\]

Worked example with realistic inference inputs

Suppose you need to process 10,000,000 tokens. Your benchmark shows about 120 tokens per second per GPU, you plan to use 4 GPUs, each GPU averages 300 watts during inference, your electricity price is $0.12 per kWh, and your grid carbon intensity is 400 g CO₂e per kWh. Total throughput is 120 × 4 = 480 tokens per second. Runtime is therefore 10,000,000 ÷ 480 = 20,833.33 seconds, or about 5.79 hours. Energy is 300 × 4 × 20,833.33 ÷ 3,600,000 = 6.9444 kWh. Electricity cost is 6.9444 × 0.12 = about $0.83. Emissions are 6.9444 × 400 ÷ 1000 = about 2.78 kg CO₂e.

That example reveals an important sanity check. If the only thing you change is GPU count, while assuming perfectly linear scaling and the same power per GPU, runtime changes a lot but energy does not change at all. Doubling GPUs halves the time but doubles total power draw, so the two effects cancel exactly. In the real world, the cancellation is not perfect, yet the pattern is still useful because it reminds you that a faster cluster is not automatically a cheaper or greener cluster. Often the main benefit is finishing sooner or handling more concurrent demand.

Idealized scaling comparison

The table below uses the same workload and per-GPU performance as the worked example. It shows why runtime is usually the first quantity to move when you add GPUs, while energy stays flat in an ideal linear model.

Example: 10,000,000 tokens, 120 tokens/sec per GPU, 300 W per GPU, $0.12 per kWh, 400 g CO₂e per kWh

| Scenario | GPUs | Estimated time | Energy used | Electricity cost | Emissions | Interpretation |
|---|---|---|---|---|---|---|
| Smaller cluster | 2 | 11.57 hours | 6.9444 kWh | $0.83 | 2.778 kg CO₂e | Slower completion, but ideal energy stays similar because power is lower while runtime is longer. |
| Baseline | 4 | 5.79 hours | 6.9444 kWh | $0.83 | 2.778 kg CO₂e | A balanced reference case for comparing alternate deployments. |
| Larger cluster | 8 | 2.89 hours | 6.9444 kWh | $0.83 | 2.778 kg CO₂e | Much faster turnaround, but not necessarily lower energy unless scaling and power behavior improve in practice. |
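The comparison above can be reproduced with a short loop. This is a sketch under the same ideal-scaling assumption as the table; the helper name `ideal_scaling_row` is hypothetical.

```python
TOKENS = 10_000_000
THROUGHPUT_PER_GPU = 120   # tokens/sec per GPU
WATTS_PER_GPU = 300


def ideal_scaling_row(gpus: int) -> tuple[float, float]:
    """Return (runtime in hours, energy in kWh) assuming perfectly linear scaling."""
    runtime_hours = TOKENS / (THROUGHPUT_PER_GPU * gpus) / 3600
    # watts * hours gives watt-hours; dividing by 1000 converts to kWh
    energy_kwh = WATTS_PER_GPU * gpus * runtime_hours / 1000
    return runtime_hours, energy_kwh


rows = {gpus: ideal_scaling_row(gpus) for gpus in (2, 4, 8)}
for gpus, (hours, kwh) in rows.items():
    print(f"{gpus} GPUs: {hours:.2f} h, {kwh:.4f} kWh")
# 2 GPUs: 11.57 h, 6.9444 kWh
# 4 GPUs: 5.79 h, 6.9444 kWh
# 8 GPUs: 2.89 h, 6.9444 kWh
```

The GPU count appears once in the runtime denominator and once in the power numerator, so it cancels out of the energy figure. Any real-world deviation from flat energy therefore has to come from throughput or power per GPU changing with cluster size.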

How to read the result panel

When you click calculate, the result panel gives you four numbers that answer different planning questions. Estimated time tells you how long the workload keeps the hardware occupied. Energy used is the direct electricity consumed by the GPUs based on the power figure you entered. Electricity cost converts that energy into money using your local rate. Carbon emissions gives an emissions estimate tied to the grid mix, not to the GPU brand itself. If one number looks surprising, work backward through the chain. An unexpectedly high cost can come from a long runtime, a high wattage assumption, an expensive electricity rate, or some combination of all three.

A quick sanity routine helps avoid most mistakes. First, confirm the unit on every field. Second, ask whether throughput is per GPU or already aggregated. Third, check whether your token count includes both input and output tokens. Fourth, compare the result to a back-of-the-envelope expectation. If you halve throughput, runtime should double. If you double electricity price, cost should double. If you keep energy fixed but change carbon intensity, only emissions should move. These simple relationship checks catch the majority of data-entry errors.
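The relationship checks in that routine can also be written as executable assertions. The sketch below uses a hypothetical `estimate` helper that mirrors the page's formulas, changes one input at a time, and records whether each expected relationship holds.

```python
import math


def estimate(tokens, throughput_per_gpu, gpus, watts_per_gpu,
             price_per_kwh, grams_co2e_per_kwh):
    """Mirror of the calculator's formulas, returned as a plain dict."""
    runtime_s = tokens / (throughput_per_gpu * gpus)
    energy_kwh = watts_per_gpu * gpus * runtime_s / 3_600_000
    return {
        "runtime_s": runtime_s,
        "energy_kwh": energy_kwh,
        "cost": energy_kwh * price_per_kwh,
        "emissions_kg": energy_kwh * grams_co2e_per_kwh / 1000,
    }


base_inputs = dict(tokens=10_000_000, throughput_per_gpu=120, gpus=4,
                   watts_per_gpu=300, price_per_kwh=0.12, grams_co2e_per_kwh=400)
base = estimate(**base_inputs)

# Vary exactly one input per scenario.
slow = estimate(**{**base_inputs, "throughput_per_gpu": 60})
pricey = estimate(**{**base_inputs, "price_per_kwh": 0.24})
cleaner = estimate(**{**base_inputs, "grams_co2e_per_kwh": 200})

checks = {
    "halving throughput doubles runtime":
        math.isclose(slow["runtime_s"], 2 * base["runtime_s"]),
    "doubling price doubles cost":
        math.isclose(pricey["cost"], 2 * base["cost"]),
    "carbon intensity leaves energy and cost untouched":
        cleaner["energy_kwh"] == base["energy_kwh"] and cleaner["cost"] == base["cost"],
    "carbon intensity moves emissions":
        cleaner["emissions_kg"] != base["emissions_kg"],
}

for description, passed in checks.items():
    print(f"{'ok  ' if passed else 'FAIL'} {description}")
```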

Assumptions and what the model leaves out

This calculator is intentionally compact, which means it also makes simplifying assumptions. For many planning conversations that is a strength, not a weakness, because it keeps the estimate understandable. Still, it helps to know what is not included:

  • Constant throughput: the calculation assumes your measured tokens per second remains stable across the workload. In production, throughput can vary with sequence length, batching, context size, and traffic shape.
  • Constant power draw: the model treats GPU wattage as steady during the run. Actual power may bounce around with utilization, memory access patterns, and clock behavior.
  • Direct GPU energy only: the estimate does not automatically include CPUs, RAM, storage, networking, or facility overhead such as cooling and power usage effectiveness.
  • Ideal scaling baseline: if you compare GPU counts, remember that real systems may scale less than linearly because of communication overhead or may scale better than expected if larger deployments improve batching.
  • Grid-average carbon: carbon intensity is simplified to a single value, even though real-time grid emissions can change by hour and by region.

Those limitations do not make the calculator unhelpful. They simply tell you what kind of question it answers best. It is strongest as a transparent baseline for planning and comparison. If you later need a more complete total-cost or lifecycle view, you can layer additional factors onto the output rather than replacing it entirely.

Practical advice before you rely on the number

If you have benchmark data from multiple operating modes, run at least three scenarios: conservative, expected, and optimistic. That gives you a range instead of a single figure. If you run in the cloud, compare the calculator's electricity estimate with the cloud bill to understand how much of total spend comes from energy versus margin and managed-service pricing. If you operate your own infrastructure, consider whether you want to multiply the GPU energy result by a data-center overhead factor to account for cooling and distribution losses. And if the result will be used in external reporting, document where each input came from so the estimate can be reproduced later.
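A scenario range with an optional overhead multiplier might look like the sketch below. Every number in `SCENARIOS` and the value of `OVERHEAD_FACTOR` are hypothetical placeholders chosen only to illustrate the pattern; substitute your own benchmark and facility data. The token count and GPU count come from the worked example on this page.

```python
TOKENS = 10_000_000
GPUS = 4

# Hypothetical bracketing values, not measurements.
SCENARIOS = {
    "conservative": {"throughput_per_gpu": 90, "watts_per_gpu": 350},
    "expected": {"throughput_per_gpu": 120, "watts_per_gpu": 300},
    "optimistic": {"throughput_per_gpu": 150, "watts_per_gpu": 280},
}

# Hypothetical multiplier for cooling and distribution losses.
OVERHEAD_FACTOR = 1.3


def gpu_energy_kwh(tokens, throughput_per_gpu, gpus, watts_per_gpu):
    """Direct GPU energy only, using the calculator's formulas."""
    runtime_seconds = tokens / (throughput_per_gpu * gpus)
    return watts_per_gpu * gpus * runtime_seconds / (1000 * 3600)


results = {}
for name, params in SCENARIOS.items():
    gpu_kwh = gpu_energy_kwh(TOKENS, params["throughput_per_gpu"],
                             GPUS, params["watts_per_gpu"])
    results[name] = {
        "gpu_kwh": gpu_kwh,
        "facility_kwh": gpu_kwh * OVERHEAD_FACTOR,
    }
    print(f"{name:>12}: {gpu_kwh:.2f} kWh GPU-only, "
          f"{gpu_kwh * OVERHEAD_FACTOR:.2f} kWh with overhead")
```

Keeping the GPU-only figure and the overhead-adjusted figure side by side makes it clear which part of the estimate comes from the calculator and which part comes from an added assumption.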

In short, this page is most useful when you treat it as a decision aid rather than a magic answer. The math is simple, but the simplicity is valuable because it exposes which assumptions actually drive the outcome. Tokens, throughput, GPU count, watts, electricity price, and grid carbon intensity are all quantities you can measure or at least bound. Once they are written down clearly, tradeoffs become easier to explain.

Common questions about AI inference energy estimates

Should I use average or peak GPU power?

Use the best average under inference load that you can get. Peak power is useful as a worst case, but it can overstate energy if your workload rarely drives the card to its limits. The key is consistency: if your throughput value comes from a benchmark at a certain operating point, your wattage should describe that same operating point as closely as possible.

What if my throughput changes with prompt length or batch size?

Then the cleanest approach is to run several scenarios instead of forcing one number to cover everything. Short prompts, long contexts, small batches, and large batches can behave very differently. If you know your traffic mix, you can calculate a weighted average throughput. If you do not, bracket the answer with a slower and faster case. That range is usually more honest and more useful than a single false-precision result.
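One detail is worth spelling out when computing that weighted average. If the traffic mix is expressed as each category's share of total tokens, total time is the sum of each category's tokens divided by its throughput, so the figure consistent with the runtime formula is a token-weighted harmonic mean rather than an arithmetic mean. The sketch below shows the difference; the function name `effective_throughput` and the mix values are hypothetical.

```python
def effective_throughput(mix):
    """Token-weighted harmonic mean of per-GPU throughput.

    mix: list of (token_share, tokens_per_sec_per_gpu) pairs whose shares sum to 1.
    """
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    # Average seconds spent per token across the whole mix.
    seconds_per_token = sum(share / tps for share, tps in mix)
    return 1.0 / seconds_per_token


# Hypothetical mix: half the tokens served at 200 tokens/sec, half at 50 tokens/sec.
traffic_mix = [(0.5, 200.0), (0.5, 50.0)]

harmonic = effective_throughput(traffic_mix)
arithmetic = sum(share * tps for share, tps in traffic_mix)

print(f"harmonic: {harmonic:.1f} tokens/sec")      # 80.0
print(f"arithmetic: {arithmetic:.1f} tokens/sec")  # 125.0
```

Entering the arithmetic figure would make the slow half of the traffic look cheaper than it is. If your mix is expressed as shares of serving time rather than shares of tokens, the arithmetic mean is the right one instead, so it pays to note which kind of share you have.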

Does a lower-emissions grid always mean lower cost?

No. Electricity price and carbon intensity often move together, but not always. A region with cleaner power can still have expensive electricity, and a cheap region can still be carbon-intensive. That is exactly why the calculator keeps those inputs separate. The same inference job can look economically attractive in one place and environmentally attractive in another, so it helps to see cost and emissions side by side instead of collapsing them into one score.

Calculator inputs

Enter workload size, hardware throughput, and energy pricing to estimate runtime, cost, and emissions.

Tokens to Process: Total workload tokens. For LLM workloads, this often means prompt tokens plus generated tokens across all requests in the scenario you are modeling.

Throughput (tokens/sec per GPU): Enter throughput per GPU, not total cluster throughput. If your benchmark is for several GPUs together, divide by the GPU count first.

Number of GPUs: Use the number of GPUs actively serving the workload in this scenario.

Power per GPU: Use average GPU wattage during inference if you have it. If not, the card's rated maximum can serve as a rough upper bound.

Electricity Price: Local electricity price in dollars per kilowatt-hour.

Grid Carbon Intensity: Average grams of CO₂e per kilowatt-hour for the grid supplying the workload.

Tip: in an ideal linear system, adding GPUs mainly reduces runtime. Total energy may change less than expected unless throughput and power characteristics also improve.

Optional mini-game: Inference Load Balancer

This arcade mini-game turns the calculator's tradeoff into something you can feel. Your goal is to keep the request queue near the green target zone while switching between three GPU modes: Eco, Balanced, and Turbo. Eco is efficient but slower. Turbo clears bursts fast but burns more power. The highest scores come from using just enough throughput at the right moment instead of sitting at maximum power all the time.

Keep the queue in the green band. Tap a mode on the canvas or press 1, 2, or 3 to switch between Eco, Balanced, and Turbo GPU settings. Clear request bursts without wasting power.

Educational takeaway: Higher throughput cuts waiting time, but efficient modes usually win between bursts.
