Large neural networks achieve state-of-the-art accuracy but impose heavy computational burdens. Distillation compresses knowledge from a powerful teacher into a lightweight student, preserving performance while easing deployment. This calculator quantifies the trade-offs by modeling training time, cost, memory footprint, and inference speed. Researchers planning to release mobile-ready models or organizations aiming to trim cloud bills can use these estimates to gauge whether distillation justifies the extra effort.
The form requests parameter counts for teacher and student models, expressed in billions of parameters. It also accepts the number of dataset tokens processed during distillation, the per-GPU throughput in tokens per second, the number of GPUs, and an hourly GPU cost reflecting on-demand pricing or amortized hardware expense. With these inputs, the script compares baseline student training against distillation, where each token requires an additional forward pass through the teacher.
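For readers who want to script these estimates themselves, the inputs map naturally onto a small data structure. The field names below are illustrative and not taken from the calculator's actual source:

```typescript
// Hypothetical shape of the calculator's inputs, mirroring the form fields described above.
interface DistillationInputs {
  teacherParamsBillions: number; // teacher size, in billions of parameters
  studentParamsBillions: number; // student size, in billions of parameters
  datasetTokens: number;         // tokens processed during distillation
  tokensPerSecPerGpu: number;    // per-GPU throughput, tokens per second
  gpuCount: number;              // number of GPUs used for training
  costPerGpuHour: number;        // on-demand or amortized dollars per GPU hour
}
```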
Training time for the student alone equals dataset tokens divided by aggregate throughput: \(T_{\text{student}} = D / R\), where \(D\) is the number of dataset tokens and \(R\) is tokens per second across all GPUs. Distillation introduces a multiplicative factor because the teacher must compute logits for each token. Assuming teacher inference cost scales with parameter count, effective time becomes \(T_{\text{distill}} = \frac{D}{R}\left(1 + \frac{P_t}{P_s}\right)\), with \(P_t\) and \(P_s\) as teacher and student parameters respectively.
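A minimal TypeScript sketch of these two formulas (the function names are illustrative, not the calculator's source):

```typescript
// T_student = D / R, with R the aggregate throughput across all GPUs.
function studentTrainTime(tokens: number, tokensPerSecPerGpu: number, gpuCount: number): number {
  const aggregateThroughput = tokensPerSecPerGpu * gpuCount; // tokens per second
  return tokens / aggregateThroughput;                       // seconds
}

// T_distill = (D / R) * (1 + P_t / P_s): each token also needs a teacher forward pass.
function distillTrainTime(
  tokens: number,
  tokensPerSecPerGpu: number,
  gpuCount: number,
  teacherParamsBillions: number,
  studentParamsBillions: number
): number {
  const overhead = 1 + teacherParamsBillions / studentParamsBillions;
  return studentTrainTime(tokens, tokensPerSecPerGpu, gpuCount) * overhead; // seconds
}
```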
For example, a 7-billion-parameter student distilled from a 70-billion-parameter teacher incurs roughly an 11× slowdown if run on identical hardware, since \(1 + 70/7 = 11\). Optimized pipelines may reduce the penalty by caching activations or using mixed precision, but the equation captures the intuition that bigger teachers demand more compute.
Total training cost multiplies time by the GPU hourly rate: \(C = T_{\text{hours}} \cdot G \cdot c\), where \(T_{\text{hours}}\) is training time in hours, \(G\) is GPU count, and \(c\) is cost per GPU hour. Comparing \(C_{\text{student}}\) for student-only training with \(C_{\text{distill}}\) for distillation reveals the monetary overhead of leveraging a teacher model. Many teams accept the extra expense in exchange for smaller inference footprints that unlock edge deployment or reduce cloud bills.
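A matching cost helper, under the same assumptions, converts seconds to hours and multiplies by the cluster's hourly spend:

```typescript
// C = T_hours * G * c: training time in hours times GPU count times hourly rate per GPU.
function trainCost(timeSeconds: number, gpuCount: number, costPerGpuHour: number): number {
  const hours = timeSeconds / 3600;
  return hours * gpuCount * costPerGpuHour; // dollars
}
```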
Parameter count also dictates memory consumption. Assuming two bytes per parameter (appropriate for 16-bit precision), the memory footprint in gigabytes is \(M = \frac{2P \times 10^9}{1024^3} \approx 1.86P\), where \(P\) is the parameter count in billions. The calculator outputs teacher and student requirements for quick comparison. Inference speed roughly scales inversely with parameter count, so the student enjoys a \(P_t / P_s\) speedup factor, ignoring other bottlenecks like memory bandwidth.
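The memory and speedup outputs reduce to one-liners. This sketch assumes weights-only memory at 2 bytes per parameter and inference latency proportional to parameter count:

```typescript
// Memory footprint in GB for 16-bit weights (2 bytes per parameter), ignoring activations and KV cache.
function memoryGb(paramsBillions: number): number {
  return (2 * paramsBillions * 1e9) / 1024 ** 3;
}

// Relative inference speedup of the student, assuming latency scales with parameter count.
function inferenceSpeedup(teacherParamsBillions: number, studentParamsBillions: number): number {
  return teacherParamsBillions / studentParamsBillions;
}

console.log(memoryGb(70).toFixed(1)); // ≈ 130.4 GB for a 70B teacher
console.log(memoryGb(7).toFixed(1));  // ≈ 13.0 GB for a 7B student
console.log(inferenceSpeedup(70, 7)); // 10
```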
Consider distilling a 70-billion-parameter language model into a 7-billion student using one billion tokens. Each GPU processes 2,000 tokens per second, and eight GPUs are available at $2.50 per hour. Aggregate throughput is \(2{,}000 \times 8 = 16{,}000\) tokens per second, so the student alone would finish in \(10^9 / 16{,}000 = 62{,}500\) seconds, or roughly 17.4 hours. Distillation multiplies this by 11, yielding about 191 hours. Student training costs roughly $347, while distillation costs about $3,819. Yet inference memory plummets from 130 GB for the teacher to 13 GB for the student, and speed increases tenfold. The table summarizes these outcomes.
Metric | Student Only | With Distillation
---|---|---
Training Time (hours) | 17.4 | 191
Training Cost ($) | 347 | 3,819
Inference Memory (GB) | 13 | 13
Inference Speedup | 10× vs teacher | 10× vs teacher
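Plugging the example's inputs into the helpers sketched earlier reproduces these figures (again, an illustration rather than the calculator's actual code):

```typescript
const tokens = 1e9;            // one billion dataset tokens
const tokensPerSecPerGpu = 2000;
const gpuCount = 8;
const costPerGpuHour = 2.5;    // dollars
const teacherParamsBillions = 70;
const studentParamsBillions = 7;

const tStudent = studentTrainTime(tokens, tokensPerSecPerGpu, gpuCount); // 62,500 s ≈ 17.4 h
const tDistill = distillTrainTime(
  tokens, tokensPerSecPerGpu, gpuCount, teacherParamsBillions, studentParamsBillions
);                                                                       // 687,500 s ≈ 191 h

console.log(trainCost(tStudent, gpuCount, costPerGpuHour).toFixed(0));   // ≈ 347
console.log(trainCost(tDistill, gpuCount, costPerGpuHour).toFixed(0));   // ≈ 3819
```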
At first glance, distillation’s training overhead seems daunting. However, many applications run inference millions of times, so even modest per-request savings accumulate. If deploying the teacher would cost $0.002 per query and the student $0.0002, the roughly $3,470 distillation premium in the example above pays off after about 1.9 million requests. Users can extend this analysis by multiplying the speedup and memory reduction outputs by expected traffic to project long-term savings.
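The break-even point itself is a one-line division; the helper below assumes the per-query prices quoted above:

```typescript
// Requests needed before the distillation premium is recouped by cheaper inference.
function breakEvenRequests(
  distillCost: number,
  studentOnlyCost: number,
  teacherCostPerQuery: number,
  studentCostPerQuery: number
): number {
  const premium = distillCost - studentOnlyCost;
  const savingsPerQuery = teacherCostPerQuery - studentCostPerQuery;
  return premium / savingsPerQuery;
}

console.log(breakEvenRequests(3819, 347, 0.002, 0.0002)); // ≈ 1.93 million requests
```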
Distillation draws inspiration from the concept of learning through imitation. The student minimizes a divergence between its output distribution and the teacher’s, often using temperature-scaled soft targets. Mathematically, the loss function may blend cross-entropy with a Kullback–Leibler divergence term: \(\mathcal{L} = \alpha\,\mathcal{L}_{\text{CE}}(y, \sigma(z_s)) + (1-\alpha)\,\tau^2\,\mathrm{KL}\!\left(\sigma(z_t/\tau)\,\|\,\sigma(z_s/\tau)\right)\), where \(z_t\) and \(z_s\) are teacher and student logits, \(\tau\) is the temperature, and \(\alpha\) balances the two terms. While the calculator does not model such intricacies, understanding them contextualizes the compute overhead: the teacher must generate full probability distributions, not just hard labels.
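For intuition, here is a toy soft-target term for a single token position; production training code computes this over whole batches on the GPU, but the arithmetic is the same:

```typescript
// Temperature-scaled softmax over raw logits (numerically stabilized).
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((z) => z / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// KL(teacher || student) between temperature-softened distributions.
function distillationKl(teacherLogits: number[], studentLogits: number[], temperature: number): number {
  const p = softmax(teacherLogits, temperature);
  const q = softmax(studentLogits, temperature);
  return p.reduce((acc, pi, i) => acc + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0);
}
```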
The throughput input assumes identical efficiency for teacher and student. In practice, teacher inference may benefit from optimized kernels or require larger batch sizes, altering the overhead ratio. The model also ignores data loading bottlenecks, communication overhead among GPUs, and the effect of longer sequence lengths on throughput. Users should treat the outputs as order-of-magnitude estimates and supplement them with small-scale benchmarks whenever possible.
Another consideration is curriculum. Some practitioners distill using progressively smaller subsets of data or intermediate layer matching, which can reduce tokens processed and thus time. Others perform multi-stage distillation where the student becomes the teacher for an even smaller apprentice model. The calculator can be applied iteratively in such cases, chaining scenarios to approximate multi-hop compression.
Model distillation offers a principled path toward efficient neural networks, trading additional upfront training for long-term deployment gains. By quantifying training time, cost, memory footprint, and expected speedup, this calculator helps decision makers evaluate whether the benefits align with their operational constraints. The computations run entirely in the browser, enabling rapid experimentation with hypothetical models and hardware setups.