AI Training Compute Cost Calculator
Introduction
Training a modern AI model is not just a research decision; it is a budgeting decision. Before a team launches a run, it usually needs a practical answer to a very simple question: how much compute, time, energy, and money will this experiment consume? That is exactly what this calculator is built to estimate. You enter the size of the model, the amount of training data, the speed of each GPU, the number of GPUs in the cluster, the power draw of each device, and the electricity price. The result is a compact first-pass estimate of total training FLOPs, wall-clock runtime, GPU-hours, energy use, and electricity cost.
These estimates matter because AI training scales quickly. Training compute grows with the product of model size and token count, so scaling both up together can turn a weekend experiment into a multi-week run. That makes rough compute arithmetic extremely useful early in planning, long before profiling logs or cluster reservations exist. If you are comparing hardware, pitching a training budget, checking whether a project fits a grant, or teaching students how training cost grows, this page gives you a clean way to reason about the tradeoffs.
How to Use This Calculator
Start with the first two fields, because they determine the scale of the job itself. Model size is entered in billions of parameters. A 7 billion parameter model should be entered as 7, a 13 billion model as 13, and a 70 billion model as 70. Training tokens are also entered in billions. If your run will process 300 billion tokens, enter 300. Those two values are the heart of the estimate, since they control the total amount of training work.
Next, describe the hardware. Enter the approximate training throughput of one GPU in TFLOPS, then the total number of GPUs you expect to use. After that, enter the power draw per GPU in watts and your electricity price in dollars per kilowatt-hour. Once you run the calculation, the tool reports the total compute requirement, the idealized runtime, the aggregate GPU-hours, the total electrical energy, and the direct electricity bill implied by those assumptions.
- Enter model size in billions of parameters.
- Enter training tokens in billions.
- Enter per-GPU training throughput in TFLOPS.
- Enter GPU count, watts per GPU, and electricity price.
- Read the result as a planning estimate, then compare multiple scenarios.
A useful way to work with the calculator is to keep the model and token values fixed while changing the hardware assumptions. That lets you answer practical questions such as whether doubling the GPU count meaningfully shortens the schedule, whether a newer accelerator justifies its higher power draw, or whether a cheaper electricity region changes the overall operating cost enough to matter.
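One way to run that kind of comparison outside the form is sketched below. It holds the job fixed at 7 billion parameters and 300 billion tokens and varies only the GPU count, using the factor-of-six rule of thumb described under Formula and Assumptions. The function name `runtime_days` and the 200 TFLOPS figure are illustrative assumptions, not part of the calculator itself.

```python
def runtime_days(params_b, tokens_b, tflops_per_gpu, gpu_count):
    """Idealized runtime in days, assuming ~6 FLOPs per parameter per token."""
    total_flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)
    cluster_flops_per_s = tflops_per_gpu * 1e12 * gpu_count
    return total_flops / cluster_flops_per_s / 86_400  # 86,400 seconds per day


# Hold the job fixed (7B parameters, 300B tokens) and vary only the cluster size.
for gpus in (8, 16, 32, 64):
    days = runtime_days(7, 300, tflops_per_gpu=200, gpu_count=gpus)
    print(f"{gpus:>3} GPUs: {days:6.1f} days")
```

Under the perfect-scaling assumption, each doubling of the GPU count halves the runtime; real clusters will fall short of that.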
Formula and Assumptions
Developing a contemporary transformer model is as much a question of resource budgeting as it is of algorithmic cleverness. The billions of parameters that allow a network to generalize across language, images, or audio are not free; they require an enormous number of floating point operations to adjust through gradient descent. Researchers and engineers often quote aggregate compute in floating point operations, or FLOPs, when discussing a training run. A popular rule of thumb for dense transformers is that the total training FLOPs can be approximated by the expression $F \approx 6ND$, where $N$ is the number of parameters and $D$ is the number of tokens seen during training. The factor of six comes from the forward pass, which costs roughly 2 FLOPs per parameter per token, and the backward pass, which costs roughly twice as much. Although the constant varies by architecture, the equation offers a remarkably useful starting point for budgeting experiments. For example, a 7 billion parameter model trained on 300 billion tokens demands roughly $6 \times (7 \times 10^{9}) \times (3 \times 10^{11}) \approx 1.26 \times 10^{22}$ FLOPs.
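The rule of thumb above can be sketched in a few lines. The function name `training_flops` and the adjustable `flops_per_param_token` constant are illustrative assumptions; the default of six follows the rule described in the text.

```python
def training_flops(params, tokens, flops_per_param_token=6.0):
    """Approximate total training FLOPs for a dense transformer."""
    return flops_per_param_token * params * tokens


# The example from the text: 7 billion parameters, 300 billion tokens.
flops = training_flops(7e9, 300e9)
print(f"{flops:.2e} FLOPs")
```

Keeping the constant as a parameter makes it easy to test how sensitive a budget is to architectures where six is not a good fit.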
Knowing how many operations training requires is only the first step. To transform theoretical FLOPs into an estimate of wall-clock time, you must consider the hardware throughput. Modern accelerators advertise their performance in teraFLOPS, or TFLOPS, where 1 TFLOPS corresponds to $10^{12}$ floating point operations per second. If a GPU can sustain 200 TFLOPS on tensor cores and you employ 32 of them, the peak throughput is $200 \times 10^{12} \times 32 = 6.4 \times 10^{15}$ FLOPs per second. Dividing the training FLOPs by this figure reveals the runtime in seconds. In practice, data loading, communication overhead, and optimizer state updates reduce effective throughput, so the calculator should be read as an optimistic baseline rather than a guaranteed schedule.
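The division described above, applied to the 7 billion parameter example on 32 GPUs at 200 TFLOPS each, might look like the sketch below. The `utilization` parameter is an assumption added here to illustrate the "optimistic baseline" caveat; the calculator itself does not take such an input, and the 0.4 value is purely illustrative.

```python
def runtime_seconds(total_flops, tflops_per_gpu, gpu_count, utilization=1.0):
    """Runtime in seconds; `utilization` is an assumed fraction of peak (1.0 = peak)."""
    effective_flops_per_s = tflops_per_gpu * 1e12 * gpu_count * utilization
    return total_flops / effective_flops_per_s


total_flops = 6 * 7e9 * 300e9  # 7B parameters, 300B tokens
peak = runtime_seconds(total_flops, tflops_per_gpu=200, gpu_count=32)
derated = runtime_seconds(total_flops, 200, 32, utilization=0.4)

print(f"At peak throughput: {peak / 86_400:.1f} days")
print(f"At 40% utilization (assumed): {derated / 86_400:.1f} days")
```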
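Once runtime is known, the remaining outputs follow naturally. GPU-hours are the runtime in hours multiplied by the number of GPUs. Energy consumption comes from multiplying runtime by total cluster power draw. Electricity cost is then the product of energy use in kilowatt-hours and the price per kilowatt-hour. In compact form, with $F$ denoting total training FLOPs for $N$ parameters and $D$ training tokens, the calculator is effectively using the relationships below.

$$F = 6ND, \qquad t = \frac{F}{T \cdot G}, \qquad H = \frac{t \cdot G}{3600}, \qquad E = \frac{P \cdot G \cdot t}{3600 \times 1000}, \qquad C = E \cdot p$$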
In these formulas, t is runtime in seconds, H is GPU-hours, E is energy in kilowatt-hours, C is electricity cost, T is per-GPU throughput in FLOPs per second, G is the number of GPUs, P is power draw in watts per GPU, and p is your electricity price. The calculator assumes that every GPU contributes equally and that throughput scales linearly with GPU count. That is a helpful starting assumption for back-of-the-envelope planning, even though real clusters rarely achieve perfect scaling.
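Put together, these relationships amount to a short function. The sketch below is one possible arrangement, not the calculator's actual source; the names `TrainingEstimate` and `estimate_training` are hypothetical, and the 400 W and $0.12 inputs in the usage line are illustrative values.

```python
from dataclasses import dataclass


@dataclass
class TrainingEstimate:
    total_flops: float
    runtime_hours: float
    gpu_hours: float
    energy_kwh: float
    electricity_cost: float


def estimate_training(params_b, tokens_b, tflops_per_gpu, gpu_count,
                      watts_per_gpu, price_per_kwh):
    """First-pass estimate assuming perfect linear scaling across GPUs."""
    total_flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)           # 6 * N * D
    runtime_s = total_flops / (tflops_per_gpu * 1e12 * gpu_count)   # t
    runtime_h = runtime_s / 3600
    gpu_hours = runtime_h * gpu_count                               # H
    energy_kwh = watts_per_gpu * gpu_count * runtime_h / 1000       # E
    cost = energy_kwh * price_per_kwh                               # C
    return TrainingEstimate(total_flops, runtime_h, gpu_hours, energy_kwh, cost)


result = estimate_training(7, 300, tflops_per_gpu=200, gpu_count=32,
                           watts_per_gpu=400, price_per_kwh=0.12)
print(result)
```

Inputs are taken in the same units as the form (billions of parameters, billions of tokens, TFLOPS, watts, dollars per kilowatt-hour), so the unit conversions live in one place.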
Worked Example
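Suppose you plan to train a 13 billion parameter model on 400 billion tokens. You expect each accelerator to deliver roughly 312 TFLOPS of effective training throughput, and you plan to use 16 GPUs rated at 400 watts each. If electricity costs $0.12 per kilowatt-hour, the calculator will translate the model and dataset size into total FLOPs, then divide by the combined throughput of the 16-GPU cluster to estimate runtime. From there it derives GPU-hours, multiplies the power draw by runtime to estimate kilowatt-hours, and finally multiplies by the electricity rate to estimate direct power cost. With these inputs, the total is about 3.12 × 10^22 FLOPs and the cluster delivers about 4.99 × 10^15 FLOPs per second, so the idealized runtime is 6.25 million seconds, or roughly 1,736 hours (about 72 days). That corresponds to about 27,800 GPU-hours, about 11,100 kilowatt-hours of energy, and about $1,333 in electricity.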
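The arithmetic for this example can be reproduced step by step with the short script below. It is a sketch of the calculation as described in the text, with variable names chosen here for readability.

```python
# Inputs from the worked example.
params = 13e9            # 13 billion parameters
tokens = 400e9           # 400 billion tokens
flops_per_gpu = 312e12   # 312 TFLOPS per GPU
gpu_count = 16
watts_per_gpu = 400
price_per_kwh = 0.12     # dollars per kilowatt-hour

total_flops = 6 * params * tokens
runtime_s = total_flops / (flops_per_gpu * gpu_count)
runtime_h = runtime_s / 3600
gpu_hours = runtime_h * gpu_count
energy_kwh = watts_per_gpu * gpu_count * runtime_h / 1000
cost = energy_kwh * price_per_kwh

print(f"Total FLOPs:  {total_flops:.3e}")
print(f"Runtime:      {runtime_h:,.0f} hours ({runtime_h / 24:.1f} days)")
print(f"GPU-hours:    {gpu_hours:,.0f}")
print(f"Energy:       {energy_kwh:,.0f} kWh")
print(f"Electricity:  ${cost:,.2f}")
```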
This example highlights an important interpretation point. The electricity bill is usually only one part of the total cost of training. Cloud rental rates, engineering time, storage, networking, checkpoint retention, and cooling overhead can all exceed the direct utility charge. Even so, the electricity estimate remains useful because it exposes the physical scale of the run and helps compare alternative hardware choices in a concrete way.
Example Hardware Comparison
The table below lists several accelerators and their approximate capabilities for training in mixed precision. Throughput and power figures change with software optimizations, batch sizes, memory limits, and model shape, but these values are good enough for scenario planning.
| GPU Model | TFLOPS (FP16) | Power (W) |
|---|---|---|
| NVIDIA A100 | 312 | 400 |
| NVIDIA H100 | 800 | 700 |
| AMD MI250X | 383 | 500 |
| Google TPU v4 | 275 | 450 |
Using this table, a practitioner could estimate the cost difference between running on an H100 cluster versus an older A100 deployment. Suppose you train a 70 billion parameter model on 1 trillion tokens, which is about 4.2 × 10^23 FLOPs under the factor-of-six rule of thumb. At equal GPU counts, the table's throughput figures imply that the H100 cluster finishes roughly 2.6 times sooner than the A100 cluster (800 versus 312 TFLOPS per device), so a run that takes weeks on H100s could stretch to months on A100s. The older hardware may appear cheaper at first glance, yet it costs more elapsed project time. That is why it helps to compare both runtime and energy use instead of focusing on a single price tag.
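A comparison along these lines can be sketched by looping over the table. The throughput and power figures are copied from the table above; the cluster size of 256 devices is an assumption chosen only for illustration, and the function name `compare` is hypothetical.

```python
# (TFLOPS, watts) per device, copied from the table above.
ACCELERATORS = {
    "NVIDIA A100": (312, 400),
    "NVIDIA H100": (800, 700),
    "AMD MI250X": (383, 500),
    "Google TPU v4": (275, 450),
}


def compare(params, tokens, gpu_count):
    """Return {name: (runtime_days, energy_kwh)} under perfect-scaling assumptions."""
    total_flops = 6 * params * tokens
    rows = {}
    for name, (tflops, watts) in ACCELERATORS.items():
        runtime_h = total_flops / (tflops * 1e12 * gpu_count) / 3600
        energy_kwh = watts * gpu_count * runtime_h / 1000
        rows[name] = (runtime_h / 24, energy_kwh)
    return rows


# 70B parameters, 1T tokens; 256 devices is an assumed cluster size.
results = compare(params=70e9, tokens=1e12, gpu_count=256)
for name, (days, kwh) in results.items():
    print(f"{name:<14} {days:6.1f} days  {kwh:10,.0f} kWh")
```

Because energy here is power multiplied by time, the device with the higher wattage can still use less total energy if it finishes proportionally faster.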
Planning Insights
Recent work on scaling laws suggests approximate relationships between model size, dataset size, and the loss achieved after training. One practical takeaway is that data and model size should usually scale together. If you double parameters while keeping token count too low, you may pay for a bigger model without fully benefiting from it. If you dramatically increase token count, you should expect training compute to rise proportionally. The calculator makes those tradeoffs visible immediately because the FLOP estimate responds directly to both variables.
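The proportional response described above is easy to check numerically. The sketch below assumes the same factor-of-six rule used elsewhere on this page and compares a 7 billion parameter, 300 billion token baseline against doubled inputs.

```python
def training_flops(params, tokens):
    """Approximate training FLOPs, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


base = training_flops(7e9, 300e9)

double_params = training_flops(14e9, 300e9) / base   # parameters doubled
double_tokens = training_flops(7e9, 600e9) / base    # tokens doubled
double_both = training_flops(14e9, 600e9) / base     # both doubled

print(f"Double parameters: {double_params:.1f}x compute")
print(f"Double tokens:     {double_tokens:.1f}x compute")
print(f"Double both:       {double_both:.1f}x compute")
```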
GPU-hours are especially useful when coordinating teams. A result of 20,000 GPU-hours can be realized in more than one way: 100 GPUs for about eight days, 50 GPUs for about seventeen days, or 25 GPUs for about a month. The total work is comparable, but the operational reality is not. Schedules, reservation queues, fault tolerance, and model iteration speed all change with cluster size. In other words, a faster run may not change the theoretical compute budget, but it can still be strategically valuable.
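Converting a GPU-hour budget into calendar time for different cluster sizes is a one-line division. The helper name `days_for_budget` below is hypothetical, and the sketch assumes the cluster runs continuously with perfect scaling.

```python
def days_for_budget(gpu_hours, gpu_count):
    """Calendar days needed to spend a GPU-hour budget on a given cluster size."""
    return gpu_hours / gpu_count / 24


for gpus in (100, 50, 25):
    print(f"{gpus:>3} GPUs: {days_for_budget(20_000, gpus):.1f} days")
```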
Energy use is another planning lens. Two clusters that deliver similar throughput may differ sharply in power draw, and regional electricity prices can vary from a few cents to well above thirty cents per kilowatt-hour. If you are deciding where to run a training job, the difference can be meaningful, particularly for repeated experiments or fine-tuning campaigns at scale. The calculator does not directly compute carbon emissions, but the energy result can be multiplied by regional carbon intensity if you also want an environmental estimate.
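The multiplication mentioned above can be sketched as follows. The region names, prices, and carbon intensities are hypothetical placeholders, not measured values; substitute figures for your own grid before drawing conclusions.

```python
def electricity_cost(energy_kwh, price_per_kwh):
    """Direct electricity cost in dollars."""
    return energy_kwh * price_per_kwh


def emissions_kg(energy_kwh, kg_co2_per_kwh):
    """Estimated emissions in kilograms of CO2 for a given grid carbon intensity."""
    return energy_kwh * kg_co2_per_kwh


# Hypothetical regions: (price in $/kWh, carbon intensity in kg CO2/kWh).
REGIONS = {
    "region_a": (0.06, 0.10),
    "region_b": (0.12, 0.40),
    "region_c": (0.30, 0.70),
}

energy_kwh = 10_000  # illustrative energy result taken from the calculator
for name, (price, intensity) in REGIONS.items():
    cost = electricity_cost(energy_kwh, price)
    co2 = emissions_kg(energy_kwh, intensity)
    print(f"{name}: ${cost:,.0f}, {co2:,.0f} kg CO2")
```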
Limitations and Practical Caveats
This calculator intentionally simplifies a complicated system. Large training runs seldom use a single monolithic server. Instead, they employ distributed data parallelism, model parallelism, or pipeline parallelism to spread work across many accelerators. Real clusters lose time to communication overhead, synchronization waits, I/O bottlenecks, validation passes, checkpoint writes, and occasional hardware failures. Networking fabric, storage design, and software stack quality all influence how close you can get to the theoretical throughput entered into the form.
Data quality and preprocessing also matter. Tokenization, filtering, deduplication, and batching may consume substantial CPU time and storage bandwidth. If the pipeline cannot feed the GPUs fast enough, utilization drops and actual runtime expands. Cooling, host CPUs, memory, interconnect switches, and data center overhead can also push real electricity cost above the simple GPU-only estimate shown here. Many organizations therefore add a margin of safety, often 10 to 30 percent or more, when converting these numbers into an actual budget.
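Applying a safety margin of the kind described above is simple to sketch. The function name `budget_range` is hypothetical, the default margins mirror the 10 to 30 percent range mentioned in the text, and the base figure is an illustrative calculator output.

```python
def budget_range(base_cost, low_margin=0.10, high_margin=0.30):
    """Return (low, high) budget figures after adding a safety margin."""
    return base_cost * (1 + low_margin), base_cost * (1 + high_margin)


base_estimate = 1_000.0  # illustrative electricity estimate in dollars
low, high = budget_range(base_estimate)
print(f"Base estimate: ${base_estimate:,.2f}")
print(f"Budget range:  ${low:,.2f} to ${high:,.2f}")
```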
Checkpointing and fault tolerance introduce another real-world wrinkle. Long runs need periodic saves, and those saves consume both time and infrastructure resources. Recovering from failures may repeat a portion of training work. For that reason, the result should be interpreted as a first-order estimate rather than an invoice. It is most valuable when you compare scenarios, communicate the scale of a proposed run, or build intuition about how parameters, tokens, throughput, and power interact.
Mini-game: GPU Sync Sprint
This optional arcade mini-game turns the same training concepts into a quick timing challenge. Launch green token batches when they pass through the cluster’s efficiency window, skip red power surges, and grab blue boosts to widen your timing window. A clean run represents higher sustained utilization, which is exactly what lowers wasted wall-clock time in real training jobs.
Takeaway: better utilization means the same FLOPs finish in less wall-clock time, which reduces GPU-hours and electricity use.
