Large language models and vision transformers are no longer confined to research labs. They power chatbots, code assistants, recommendation engines, and countless other services. Every time a user sends a prompt, the model performs billions of operations across multiple processing units. The electricity used for each answer may seem small, yet at scale it becomes significant. Businesses deploying artificial intelligence want to know how much energy their workloads require so they can forecast operating expenses, reduce environmental impact, and size their infrastructure correctly. This calculator translates common inference parameters into tangible energy and monetary figures, helping teams make informed choices about model sizes, hardware allocations, and deployment regions.
Inference workloads are usually measured in tokens rather than raw characters. A token is a word fragment or symbol understood by the model's tokenizer. The more tokens a task involves, the longer the GPU must run. Throughput, often described as tokens per second, indicates how fast a single GPU can process the workload. Modern accelerators handle thousands of tokens each second, yet effective throughput can drop when sequence lengths grow, when batching is inefficient, or when multiple users compete for resources. Entering the expected token count and the per-GPU throughput shows how long the job will run, and adjusting these numbers reveals whether upgrading hardware or optimizing batching strategies could shave seconds off response time and reduce energy draw.
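As a quick illustration, runtime follows directly from these two inputs. This is a minimal sketch, not the calculator's actual script; the function name and example figures are our own:

```python
def runtime_seconds(total_tokens: float, tokens_per_sec: float) -> float:
    """Seconds a GPU needs to process total_tokens at a given throughput."""
    return total_tokens / tokens_per_sec

# One million tokens at 1,700 tokens/sec works out to roughly 588 seconds,
# just under ten minutes of GPU time.
print(runtime_seconds(1_000_000, 1_700))
```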
Inference services may use a single GPU for a small experiment or hundreds for a global product. Power consumption scales with the number of devices, so our form asks for the count and for the wattage of each unit. The wattage, sometimes referred to as Thermal Design Power, represents the maximum draw under load. Actual draw can be lower thanks to power management features, but using the rated wattage gives a conservative estimate. As companies scale up, it becomes vital to understand whether an extra GPU truly improves performance or simply adds cost. Inputting various GPU counts and wattages lets you model how infrastructure choices affect energy bills and carbon emissions when traffic surges.
The relationship between hardware power, runtime, and total energy can be expressed mathematically. If P is the wattage of one GPU, G is the number of GPUs, N is the total tokens to process, and S is the throughput in tokens per second per GPU, then the energy in kilowatt-hours is:

Energy (kWh) = (P × G × N) / (S × 1000 × 3600)
Dividing by 1000 converts watts to kilowatts, while the 3600 term converts seconds to hours. Our script computes this value and uses it to determine both monetary cost and greenhouse emissions based on your electricity price and grid carbon intensity. Understanding the formula empowers engineers to reason about which variable offers the biggest opportunity for savings—sometimes improving throughput yields greater gains than simply adding more GPUs.
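In code, the formula reduces to a few lines. The sketch below is our own illustration of the arithmetic, not the calculator's internal script:

```python
def inference_energy_kwh(watts_per_gpu: float, gpu_count: int,
                         total_tokens: float, tokens_per_sec: float) -> float:
    """Energy in kWh: (P * G * N) / (S * 1000 * 3600).

    Dividing by 1000 converts watts to kilowatts; dividing by 3600
    converts the runtime from seconds to hours.
    """
    runtime_hours = total_tokens / tokens_per_sec / 3600
    return (watts_per_gpu * gpu_count / 1000) * runtime_hours

# Example: one 400 W GPU pushing a million tokens at 1,700 tokens/sec
# consumes about 0.065 kWh.
print(inference_energy_kwh(400, 1, 1_000_000, 1_700))
```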
The table below lists four popular accelerators with approximate throughput numbers for a 7B-parameter language model at batch size one. Actual performance varies by workload, sequence length, and software stack, but these figures provide a sense of scale when planning deployments.
GPU Model | TDP (W) | Tokens/sec*
---|---|---
NVIDIA A10G | 300 | 900 |
NVIDIA A100 | 400 | 1700 |
NVIDIA H100 | 700 | 3000 |
AMD MI250X | 500 | 1800 |
*Throughput estimates are for reference only and assume optimized kernels.
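One way to read the table is in tokens per joule (throughput divided by TDP), a rough efficiency metric. The comparison below uses only the reference figures above and inherits their caveats:

```python
gpus = {
    "NVIDIA A10G": (300, 900),   # (TDP in watts, tokens/sec)
    "NVIDIA A100": (400, 1700),
    "NVIDIA H100": (700, 3000),
    "AMD MI250X": (500, 1800),
}

for name, (tdp_w, tok_per_s) in gpus.items():
    # Tokens processed per joule of energy at full rated power.
    print(f"{name}: {tok_per_s / tdp_w:.2f} tokens/joule")
```

By this rough measure the A100 and H100 land near 4.3 tokens per joule, ahead of the A10G at 3.0, which is why raw TDP alone is a poor guide to energy cost.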
Electricity tariffs can vary by an order of magnitude between regions. Cloud providers often offer discounted rates for long-term commitments or for using renewable energy zones. By entering the price per kilowatt-hour, you immediately see how location affects operational expense. The carbon intensity field multiplies energy use by grams of carbon dioxide equivalent per kilowatt-hour. Regions rich in hydropower may have intensities below 20 gCO₂e/kWh, while coal-heavy grids exceed 800 gCO₂e/kWh. This number helps sustainability teams estimate emissions and identify opportunities for offsets or renewable energy procurement. Pairing cost and carbon in a single report fosters both fiscal discipline and environmental accountability.
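Translating kilowatt-hours into money and emissions is then a pair of multiplications. In the sketch below, the prices and grid intensities are illustrative values, not quoted rates:

```python
def cost_and_carbon(energy_kwh: float, price_per_kwh: float,
                    grams_co2e_per_kwh: float) -> tuple[float, float]:
    """Return (cost in currency units, emissions in kg CO2e)."""
    cost = energy_kwh * price_per_kwh
    kg_co2e = energy_kwh * grams_co2e_per_kwh / 1000  # grams -> kilograms
    return cost, kg_co2e

# The same 10 kWh job on a hydro-rich grid vs. a coal-heavy one:
hydro = cost_and_carbon(10, 0.05, 20)    # ~$0.50 and 0.2 kg CO2e
coal = cost_and_carbon(10, 0.15, 800)    # ~$1.50 and 8 kg CO2e
```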
Once you understand the baseline energy use, you can explore ways to reduce it:

- Quantize weights and activations to lower precision, raising throughput per watt.
- Batch concurrent requests so each forward pass serves more tokens.
- Distill or prune large models into smaller ones that meet the same quality bar.
- Cache frequent responses and reuse computed context to avoid redundant work.
- Choose hardware and regions with better efficiency and lower carbon intensity.
Even small efficiency gains multiply when scaled across millions of users.
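The payoff of a throughput optimization drops straight out of the energy formula. The what-if below is hypothetical; the 2x speed-up figure is illustrative, not a measured result:

```python
def energy_kwh(watts: float, gpus: int, tokens: float, tok_per_s: float) -> float:
    # (P * G * N) / (S * 1000 * 3600), as in the formula above.
    return watts * gpus * tokens / tok_per_s / 3_600_000

baseline = energy_kwh(400, 8, 50_000_000, 1700)  # unoptimized serving
faster = energy_kwh(400, 8, 50_000_000, 3400)    # hypothetical 2x from quantization

# Doubling throughput halves the energy for the same token volume.
print(baseline, faster)
```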
Start-ups can estimate the cost of offering a free-tier chatbot. Researchers can compare the energy implications of running experiments locally versus on the cloud. Product managers can justify budget requests for model optimization projects by showing expected energy savings. Sustainability officers can calculate the carbon footprint of customer interactions to include in corporate responsibility reports. The calculator’s quick feedback encourages iterative exploration, allowing you to adjust parameters and instantly visualize results without running a single real inference pass. This lightweight planning stage can save time and money, especially when exploring new architectures or service models.
The computation assumes that GPUs run at their rated power throughout the inference task. In practice, dynamic voltage and frequency scaling may reduce consumption, while host CPU and memory overhead add extra draw not captured here. Network latency and storage operations can also contribute to total energy. Additionally, throughput numbers may decrease with longer sequences or higher batch sizes. Treat the results as approximations rather than precise audits. For high-stakes decisions, integrate measured power data from your deployment environment and consult utility bills or smart-meter readings to validate estimates.
Advances in hardware and algorithms continually shift the energy landscape. Custom accelerators tailored for transformer operations promise more tokens per joule. Compiler-level optimizations and more efficient attention mechanisms are emerging to handle long contexts without linear scaling. As regulations around energy disclosure tighten, transparent calculators like this one help organizations stay ahead of reporting requirements. In the long term, coupling inference workloads with renewable energy sources or waste-heat recovery systems could make AI services far greener. By routinely evaluating energy and cost, teams contribute to a culture of responsible innovation where powerful models are deployed thoughtfully rather than wastefully.