Overview
Training a large language model (LLM) can require enormous compute, electricity, and budget. This calculator gives an order-of-magnitude estimate of training compute (FLOPs), wall-clock training time, direct GPU rental cost, electricity consumption, and carbon emissions based on a few inputs you can usually estimate early in project planning: model size (parameter count), training tokens, and hardware throughput/cost/power.
Use the outputs to compare scenarios (e.g., fewer tokens vs. more GPUs, different GPU generations, or different grid carbon intensity). The results are best interpreted as a planning baseline—not a quote—because real training runs are affected by utilization, parallelism efficiency, checkpointing, restarts, and non-GPU power.
What each input means
- Model parameters (billions): Total trainable weights. A “7B” model means ~7×10⁹ parameters.
- Training tokens (billions): Total number of tokens processed across the full training run (after filtering/deduplication). If you do multiple epochs over a dataset, tokens increase accordingly.
- Per-GPU compute (TFLOPS): Your assumed sustained throughput per GPU (not peak marketing TFLOPS). Sustained throughput depends on precision (BF16/FP16/FP8), sequence length, batch size, kernel efficiency, and communication overhead.
- Number of GPUs: Total accelerators used concurrently.
- GPU cost per hour ($): What you pay per GPU-hour (cloud on-demand, reserved, or internal accounting rate).
- GPU power draw (watts): Average electrical draw per GPU while training (often below TDP, sometimes near it). This typically excludes CPU/network/storage unless you intentionally bake those into the per-GPU wattage.
- Grid CO₂ intensity (kg/kWh): Carbon intensity of your electricity source (location/time dependent). Lower values generally mean cleaner power.
Method and formulas
The core estimate is a widely used transformer training heuristic: total training compute is proportional to parameters and tokens. A common baseline is:
F ≈ 6 × N × T
Where:
- F = total floating-point operations (FLOPs)
- N = number of model parameters
- T = number of training tokens processed
Why “6”? It roughly accounts for forward + backward passes and typical transformer training arithmetic. The true constant varies with architecture details (attention implementations, MoE routing, activation checkpointing), optimizer choice, and how you count FLOPs. Treat it as a practical rule of thumb.
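As a quick sanity check, here is a minimal Python sketch of this heuristic; the function name and billion-scaled argument units are illustrative choices, not part of the calculator itself.

```python
def training_flops(params_billions: float, tokens_billions: float) -> float:
    """Estimate total training compute with the ~6 * N * T heuristic.

    params_billions: model size in billions of parameters (N)
    tokens_billions: training tokens in billions (T)
    Returns total FLOPs for forward + backward passes.
    """
    n = params_billions * 1e9
    t = tokens_billions * 1e9
    return 6.0 * n * t

# Example: a 7B-parameter model trained on 100B tokens
# -> 6 * 7e9 * 100e9 = 4.2e21 FLOPs
print(f"{training_flops(7, 100):.2e}")  # ~4.20e+21
```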
Training time
If each GPU sustains X TFLOPS (teraFLOPs per second) and you have G GPUs, the aggregate throughput is approximately G×X TFLOPS. Convert TFLOPS to FLOP/s (1 TFLOP/s = 10¹² FLOP/s) and estimate:
- seconds ≈ F ÷ (G × X × 10¹²)
- hours = seconds ÷ 3600
- days = hours ÷ 24
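The same arithmetic as a small Python sketch, assuming near-ideal scaling so that aggregate throughput really is G × X; the helper name is illustrative.

```python
def training_time_hours(total_flops: float, num_gpus: int, sustained_tflops: float) -> float:
    """Estimate wall-clock hours assuming perfectly scaled, sustained throughput."""
    flops_per_second = num_gpus * sustained_tflops * 1e12  # G * X, TFLOPS -> FLOP/s
    seconds = total_flops / flops_per_second
    return seconds / 3600.0

# 4.2e21 FLOPs on 8 GPUs sustaining 150 TFLOPS each -> ~972 hours (~40.5 days)
hours = training_time_hours(4.2e21, num_gpus=8, sustained_tflops=150)
print(f"{hours:.0f} h, {hours / 24:.1f} days")
```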
Cost
Direct GPU rental/infrastructure cost is estimated as:
- cost ($) = hours × (GPU cost per hour) × (number of GPUs)
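A minimal sketch of the cost arithmetic, assuming the training duration in hours is already known; the function name is illustrative.

```python
def gpu_rental_cost(hours: float, cost_per_gpu_hour: float, num_gpus: int) -> float:
    """Direct GPU rental cost: hours * hourly rate * GPU count (excludes CPU, storage, failed runs)."""
    return hours * cost_per_gpu_hour * num_gpus

# ~972 hours on 8 GPUs at $2.50/GPU-hour -> roughly $19,440
print(f"${gpu_rental_cost(972, 2.50, 8):,.0f}")
```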
Energy and emissions
Electrical energy (kWh) from GPU power alone:
- kWh = (GPU watts ÷ 1000) × hours × (number of GPUs)
Then CO₂ emissions:
- kg CO₂ = kWh × (grid CO₂ intensity in kg/kWh)
Note: If you want to approximate full datacenter energy, you can scale energy by a factor reflecting PUE (power usage effectiveness) and non-GPU components; see limitations below.
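A sketch of the energy and emissions arithmetic; the optional pue argument reflects the PUE scaling mentioned in the note above and is not an input of the calculator itself.

```python
def training_energy_kwh(gpu_watts: float, hours: float, num_gpus: int, pue: float = 1.0) -> float:
    """GPU-only electrical energy in kWh, optionally scaled by a PUE factor
    to roughly approximate whole-datacenter consumption (cooling, networking, etc.)."""
    return (gpu_watts / 1000.0) * hours * num_gpus * pue

def training_co2_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """CO2 emissions in kg, given grid carbon intensity in kg CO2 per kWh."""
    return energy_kwh * grid_intensity_kg_per_kwh

# 300 W per GPU, ~972 hours, 8 GPUs -> ~2,333 kWh; at 0.4 kg/kWh -> ~933 kg CO2
kwh = training_energy_kwh(300, 972, 8)
print(f"{kwh:,.0f} kWh, {training_co2_kg(kwh, 0.4):,.0f} kg CO2")
```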
Interpreting the results
- FLOPs lets you compare training plans independently of hardware. If you change the token count or parameter count, FLOPs changes proportionally.
- Time is extremely sensitive to sustained throughput and scaling efficiency. If your sustained TFLOPS estimate is optimistic, time (and cost) will be understated.
- Cost here is mainly GPU-hour cost. Real budgets also include CPU instances, storage, networking, engineering time, experimentation, and failed runs.
- Energy/CO₂ is useful for footprint comparisons (different grids, different power draw assumptions, different time-to-train). It is not a verified lifecycle assessment.
Worked example
Suppose you plan to train a 7B parameter model on 100B tokens. You expect 8 GPUs sustaining 150 TFLOPS each, at $2.50/GPU-hour, drawing 300 W/GPU, on a grid with 0.4 kg CO₂/kWh.
- Compute (FLOPs): F = 6 × N × T = 6 × 7×10⁹ × 100×10⁹ = 4.2×10²¹ FLOPs.
- Throughput: 8 × 150 TFLOPS = 1200 TFLOPS = 1.2×10¹⁵ FLOP/s.
- Time: seconds ≈ 4.2×10²¹ / 1.2×10¹⁵ = 3.5×10⁶ s ≈ 972 h ≈ 40.5 days.
- GPU cost: 972 h × $2.50 × 8 ≈ $19,440.
- Energy: (300/1000) kW × 972 h × 8 ≈ 2,333 kWh.
- CO₂: 2,333 kWh × 0.4 ≈ 933 kg CO₂.
This is best read as a baseline under high utilization. If utilization drops (e.g., due to communication overhead or data pipeline stalls), the same FLOPs will take longer, increasing cost and energy.
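For reference, here is a self-contained sketch that reproduces the numbers above under the same simplifying assumptions (the 6 × N × T heuristic and near-ideal utilization); all variable names are illustrative.

```python
# End-to-end reproduction of the worked example above.
params, tokens = 7e9, 100e9           # 7B parameters, 100B tokens
gpus, tflops = 8, 150                 # 8 GPUs sustaining 150 TFLOPS each
rate, watts, co2_intensity = 2.50, 300, 0.4  # $/GPU-h, W/GPU, kg CO2/kWh

flops = 6 * params * tokens                     # 4.2e21 FLOPs
hours = flops / (gpus * tflops * 1e12) / 3600   # ~972 h (~40.5 days)
cost = hours * rate * gpus                      # ~$19.4k
kwh = (watts / 1000) * hours * gpus             # ~2,333 kWh
kg_co2 = kwh * co2_intensity                    # ~933 kg CO2

print(f"{flops:.2e} FLOPs, {hours:.0f} h ({hours / 24:.1f} days), "
      f"${cost:,.0f}, {kwh:,.0f} kWh, {kg_co2:,.0f} kg CO2")
```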
Scenario comparison (illustrative)
The table below uses the same simple method to show how scale changes outcomes. These are illustrative examples to help build intuition; real training runs vary widely.
| Scenario | Parameters | Tokens | GPUs | Per-GPU TFLOPS (sust.) | Est. time | Est. GPU cost |
| --- | --- | --- | --- | --- | --- | --- |
| Small fine-tune style run | 1B | 10B | 4 | 120 | ~1.4 days | depends on rate |
| Mid-size pretraining | 7B | 100B | 8 | 150 | ~40.5 days | ~$19k at $2.50/GPU-h |
| Larger scale run | 70B | 300B | 128 | 200 | ~57 days | depends on rate |
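The time column can be recomputed with the same heuristic; the short sketch below does exactly that, using the scenario values listed in the table.

```python
# Recompute estimated training time for each illustrative scenario.
scenarios = [
    ("Small fine-tune style run",  1e9,  10e9,   4, 120),
    ("Mid-size pretraining",       7e9,  100e9,  8, 150),
    ("Larger scale run",           70e9, 300e9, 128, 200),
]
for name, n, t, gpus, tflops in scenarios:
    days = 6 * n * t / (gpus * tflops * 1e12) / 86400
    print(f"{name}: ~{days:.1f} days")
# -> ~1.4, ~40.5, and ~57.0 days respectively
```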
Assumptions & limitations (read before using)
- Heuristic compute model: The 6×N×T rule is a simplification. Different architectures (e.g., Mixture-of-Experts), different sequence lengths, and different implementations can shift effective compute substantially.
- Perfect scaling/utilization: Time assumes near-ideal scaling across GPUs and steady utilization. In practice, all-reduce/communication overhead, pipeline bubbles, kernel inefficiencies, and input pipeline stalls can reduce achieved TFLOPS.
- Sustained TFLOPS is hard to estimate: Peak TFLOPS is not the same as achieved throughput. Mixed precision (BF16/FP16/FP8), tensor core usage, and memory bandwidth constraints strongly affect sustained performance.
- Optimizer and training recipe effects: Extra compute for certain optimizers, regularization, longer context windows, or frequent evaluation/checkpointing is not explicitly modeled.
- Retries and failed jobs: Restarts, spot interruptions, debugging, and hyperparameter searches can multiply real cost beyond a single “clean run.”
- Energy scope: Energy is calculated from GPU power only unless you include other components in the “GPU power draw” input. Datacenters also consume power for CPUs, networking, storage, and cooling (often summarized by PUE).
- CO₂ intensity variability: Grid intensity can vary by location and time of day. If you use renewable-backed contracts or dedicated clean power, the effective intensity may differ from the regional average.
References (for further reading)
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020) — discussion of compute/data/model scaling relationships.
- Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla, 2022) — compute and data trade-offs for transformers.
- Patterson et al., “Carbon Emissions and Large Neural Network Training” (2021) — approaches to estimating training energy/emissions.