Large Language Model Training Cost Calculator

JJ Ben-Joseph

Overview

Training a large language model (LLM) can require enormous compute, electricity, and budget. This calculator gives an order-of-magnitude estimate of training compute (FLOPs), wall-clock training time, direct GPU rental cost, electricity consumption, and carbon emissions based on a few inputs you can usually estimate early in project planning: model size (parameter count), training tokens, and hardware throughput/cost/power.

Use the outputs to compare scenarios (e.g., fewer tokens vs. more GPUs, different GPU generations, or different grid carbon intensity). The results are best interpreted as a planning baseline—not a quote—because real training runs are affected by utilization, parallelism efficiency, checkpointing, restarts, and non-GPU power.

What each input means

  Model size (N): the number of trainable parameters (e.g., 7B = 7×10⁹).
  Training tokens (T): the total number of tokens processed during training.
  GPUs (G): how many accelerators run in parallel.
  Sustained per-GPU throughput (X): the TFLOPS each GPU realistically delivers during training, not the datasheet peak.
  Price per GPU-hour: the rental or amortized cost of one GPU for one hour.
  Per-GPU power draw: watts consumed by each GPU under training load.
  Grid carbon intensity: kilograms of CO₂ emitted per kWh of electricity.

Method and formulas

The core estimate is a widely used transformer training heuristic: total training compute is proportional to parameters and tokens. A common baseline is:

F = 6 × N × T

Where F is the total training compute in FLOPs, N is the number of model parameters, and T is the number of training tokens.

Why “6”? It roughly accounts for forward + backward passes and typical transformer training arithmetic. The true constant varies with architecture details (attention implementations, MoE routing, activation checkpointing), optimizer choice, and how you count FLOPs. Treat it as a practical rule of thumb.
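A minimal Python sketch of this heuristic (the function and variable names here are illustrative, not the calculator's actual implementation):

```python
def training_flops(params: float, tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Rule-of-thumb training compute: F = 6 * N * T (the 6 is a heuristic, not exact)."""
    return flops_per_param_token * params * tokens

# 7B parameters trained on 100B tokens
print(f"{training_flops(7e9, 100e9):.2e}")  # 4.20e+21 FLOPs
```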

Training time

If each GPU sustains X TFLOPS (trillions of floating-point operations per second) and you have G GPUs, the aggregate throughput is approximately G × X TFLOPS. Convert TFLOPS to FLOP/s (1 TFLOPS = 10¹² FLOP/s) and estimate:

seconds ≈ F ÷ (G × X × 10¹²)

Divide by 3,600 for hours or by 86,400 for days.
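As a sketch (same illustrative naming as above, assuming the stated throughput is sustained for the whole run):

```python
def training_hours(total_flops: float, num_gpus: int, tflops_per_gpu: float) -> float:
    """Wall-clock hours assuming every GPU sustains the stated TFLOPS throughout."""
    flops_per_second = num_gpus * tflops_per_gpu * 1e12  # TFLOPS -> FLOP/s
    return total_flops / flops_per_second / 3600

hours = training_hours(4.2e21, num_gpus=8, tflops_per_gpu=150)
print(round(hours))  # ~972 hours (~40.5 days)
```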

Cost

Direct GPU rental/infrastructure cost is estimated as:

cost ≈ training hours × G × price per GPU-hour
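A sketch of the cost formula, assuming the hours come from the time estimate above:

```python
def gpu_rental_cost(hours: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    """Direct GPU rental/infrastructure cost, per the formula above."""
    return hours * num_gpus * price_per_gpu_hour

print(gpu_rental_cost(972, num_gpus=8, price_per_gpu_hour=2.50))  # 19440.0
```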

Energy and emissions

Electrical energy (kWh) from GPU power alone:

energy ≈ (per-GPU power in watts ÷ 1000) × training hours × G

Then CO₂ emissions:

CO₂ (kg) ≈ energy (kWh) × grid carbon intensity (kg CO₂ per kWh)

Note: If you want to approximate full datacenter energy, you can scale energy by a factor reflecting PUE (power usage effectiveness) and non-GPU components; see limitations below.
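A sketch combining both steps; the optional pue parameter is an assumption added here for readers who want the rough datacenter-level scaling mentioned in the note above:

```python
def energy_kwh(watts_per_gpu: float, hours: float, num_gpus: int, pue: float = 1.0) -> float:
    """GPU-only electrical energy in kWh; set pue > 1 to roughly include datacenter overhead."""
    return (watts_per_gpu / 1000.0) * hours * num_gpus * pue

def co2_kg(kwh: float, grid_kg_per_kwh: float) -> float:
    """Carbon emissions at the given grid intensity (kg CO2 per kWh)."""
    return kwh * grid_kg_per_kwh

kwh = energy_kwh(300, hours=972, num_gpus=8)    # ~2,333 kWh from GPU power alone
print(round(co2_kg(kwh, grid_kg_per_kwh=0.4)))  # ~933 kg CO2
```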

Interpreting the results

Treat the outputs as an optimistic baseline: they assume every GPU sustains the stated throughput for the entire run. Real runs usually take longer and cost more once utilization losses, restarts, and non-GPU power are included, so the numbers are most useful for comparing scenarios against each other rather than as a quote.

Worked example

Suppose you plan to train a 7B parameter model on 100B tokens. You expect 8 GPUs sustaining 150 TFLOPS each, at $2.50/GPU-hour, drawing 300 W/GPU, on a grid with 0.4 kg CO₂/kWh.

  1. Compute (FLOPs): F = 6 × N × T = 6 × 7×10⁹ × 100×10⁹ = 4.2×10²¹ FLOPs.
  2. Throughput: 8 × 150 TFLOPS = 1,200 TFLOPS = 1.2×10¹⁵ FLOP/s.
  3. Time: seconds ≈ 4.2×10²¹ ÷ 1.2×10¹⁵ = 3.5×10⁶ s ≈ 972 h ≈ 40.5 days.
  4. GPU cost: 972 h × $2.50 × 8 GPUs ≈ $19,440.
  5. Energy: (300 ÷ 1000) kW × 972 h × 8 GPUs ≈ 2,333 kWh.
  6. CO₂: 2,333 kWh × 0.4 kg/kWh ≈ 933 kg CO₂.

This is best read as a baseline under high utilization. If utilization drops (e.g., due to communication overhead or data pipeline stalls), the same FLOPs will take longer, increasing cost and energy.
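The whole example can be reproduced in a few lines (a sketch that simply restates the formulas above with the example's inputs):

```python
# Inputs from the worked example above
N, T = 7e9, 100e9                    # parameters, training tokens
gpus, tflops = 8, 150                # GPU count, sustained TFLOPS per GPU
price, watts, grid = 2.50, 300, 0.4  # $/GPU-hour, W per GPU, kg CO2/kWh

flops = 6 * N * T                                # 4.2e21 FLOPs
hours = flops / (gpus * tflops * 1e12) / 3600    # ~972 h (~40.5 days)
cost  = hours * gpus * price                     # ~$19.4k
kwh   = (watts / 1000) * hours * gpus            # ~2,333 kWh (GPU power only)
co2   = kwh * grid                               # ~933 kg CO2

print(f"{hours:,.0f} h | ${cost:,.0f} | {kwh:,.0f} kWh | {co2:,.0f} kg CO2")
```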

Scenario comparison (illustrative)

The table below uses the same simple method to show how scale changes outcomes. These are illustrative examples to help build intuition; real training runs vary widely.

Scenario | Parameters | Tokens | GPUs | Per-GPU TFLOPS (sustained) | Est. time | Est. GPU cost
Small fine-tune-style run | 1B | 10B | 4 | 120 | ~1.4 days | depends on rate
Mid-size pretraining | 7B | 100B | 8 | 150 | ~40.5 days | ~$19k at $2.50/GPU-h
Larger-scale run | 70B | 300B | 128 | 200 | ~57 days | depends on rate
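The time column can be reproduced with the same formula (a sketch; the cost column additionally needs an hourly GPU rate):

```python
# Recompute the "Est. time" column with the same simple method
scenarios = [
    ("Small fine-tune-style run",  1e9,  10e9,   4, 120),
    ("Mid-size pretraining",       7e9, 100e9,   8, 150),
    ("Larger-scale run",          70e9, 300e9, 128, 200),
]
for name, params, tokens, gpus, tflops in scenarios:
    days = 6 * params * tokens / (gpus * tflops * 1e12) / 86400
    print(f"{name}: ~{days:.1f} days")
```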

Assumptions & limitations (read before using)

  The 6 × N × T rule is a heuristic; the true constant depends on architecture details, optimizer choice, and how FLOPs are counted.
  Time, cost, and energy assume the stated TFLOPS are sustained for the entire run; communication overhead, data pipeline stalls, checkpointing, and restarts all push the real numbers higher.
  Energy covers GPU power only; multiply by a PUE factor and account for non-GPU components to approximate full datacenter consumption.
  Cost covers direct GPU rental/infrastructure only.
