Transformers process sequences by letting every token attend to every other token in the current context window. The number of tokens that can be seen at once is called the context window or sequence length. Extending this window from, say, 2k tokens to 8k or 32k enables models to reason over longer documents, handle large code files, or maintain long-running conversations without losing track of earlier messages.
The trade-off is cost. Standard dense self-attention scales quadratically with sequence length, so going from 2k to 8k tokens (a 4× increase) can require roughly 16× more attention compute and memory, and often leads to much lower tokens-per-second throughput. This calculator helps you quantify those trade-offs for a given hidden size, number of layers, precision, and hardware cost.
By plugging in a baseline context length and a longer target length, you can estimate how memory usage, throughput, and cost per million tokens change as you scale up context. The goal is not exact capacity planning, but a quick, transparent way to build intuition about long-context deployment economics.
The calculator uses simplified scaling relationships that reflect how standard transformer architectures behave under dense self-attention. The most important variables are:
- Sequence length L (the baseline length L_b and the target length L_t)
- Number of transformer layers N
- Hidden size d (also written d_model, the embedding dimension)
- Numeric precision b, in bits per value
- Baseline throughput T_b and hardware cost per hour H_c

During inference or training, the model must store activations and attention key–value (KV) states. A common back-of-the-envelope approximation is that activation memory grows linearly with sequence length, while attention memory grows quadratically.
An approximate formula for total activation memory in bytes is:

M_a = 2 × L × N × d × b / 8
The factor of 2 loosely accounts for forward activations plus gradients (during training) or KV caches (during inference). This is intentionally simple and is not meant to model every detail of a specific implementation.
Attention memory is dominated by the quadratic dependency on sequence length:
M_att = L^2 × N × b / 8
Total memory is then approximated as M_total = M_a + M_att. In practice, frameworks and kernels add overhead (optimizer states, padding, allocator fragmentation), so real values can be higher.
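For readers who prefer code, the sketch below implements these approximations in Python. It is illustrative only, not the calculator's actual implementation; the function names and the 32-layer, 4096-hidden-size, 16-bit example model are assumptions chosen to be consistent with the formulas above.

```python
# Illustrative implementation of the memory approximations above.
# L: sequence length (tokens), N: layers, d: hidden size (d_model), b: precision in bits.

def activation_memory_bytes(L: int, N: int, d: int, b: int) -> float:
    """Linear-in-L activation/KV memory; the factor of 2 loosely covers
    forward activations plus gradients (training) or KV caches (inference)."""
    return 2 * L * N * d * b / 8  # b is in bits, so divide by 8 for bytes


def attention_memory_bytes(L: int, N: int, b: int) -> float:
    """Quadratic-in-L attention memory."""
    return L**2 * N * b / 8


def total_memory_bytes(L: int, N: int, d: int, b: int) -> float:
    return activation_memory_bytes(L, N, d, b) + attention_memory_bytes(L, N, b)


# Hypothetical 32-layer model, hidden size 4096, 16-bit precision.
for L in (2048, 8192):
    gib = total_memory_bytes(L, N=32, d=4096, b=16) / 2**30
    print(f"L = {L:>5}: ~{gib:.1f} GiB before framework overhead")
```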
The dominant cost in transformers with dense self-attention arises from the attention operations and large matrix multiplications. Under the quadratic assumption, compute grows with L^2. If you know the baseline throughput T_b at baseline length L_b, the calculator estimates throughput T_t at target length L_t as:
T_t = T_b × (L_b / L_t)^2
For example, going from 2k to 8k tokens (4× longer) yields a throughput factor of (1/4)^2 = 1/16, so a system doing 100 tokens/s at 2k might only manage about 6.25 tokens/s at 8k under these assumptions.
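A minimal Python sketch of this projection, under the same quadratic assumption (function and argument names are illustrative):

```python
def projected_throughput(t_baseline: float, l_baseline: int, l_target: int) -> float:
    """Project tokens/s at a longer context under the quadratic
    dense-attention assumption: T_t = T_b * (L_b / L_t)^2."""
    return t_baseline * (l_baseline / l_target) ** 2


# 100 tokens/s measured at a 2k context, projected out to an 8k context.
print(projected_throughput(t_baseline=100.0, l_baseline=2048, l_target=8192))  # 6.25
```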
If your hardware cost per hour is H_c (for example, a single GPU at $2/hour), and your throughput is T tokens per second, then the cost to process 1 million tokens is:
C = H_c / (T × 3600 / 10^6)
The calculator computes this for both the baseline and target context lengths:
- C_b: cost per million tokens at baseline length L_b with throughput T_b
- C_t: cost per million tokens at target length L_t with projected throughput T_t
The difference ΔC = C_t - C_b indicates the incremental cost per million tokens purely from increasing the context window, assuming hardware and pricing stay constant.
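The same cost arithmetic as a short Python sketch. The $2/hour price follows the example above; the 100 vs 25 tokens/s throughput figures are hypothetical inputs (25 tokens/s corresponds to doubling the context under the quadratic model), not defaults from the calculator:

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Hourly hardware cost divided by millions of tokens processed per hour."""
    millions_per_hour = tokens_per_second * 3600 / 1e6
    return hourly_cost / millions_per_hour


# Hypothetical: a $2/hour GPU at 100 tokens/s (baseline) vs 25 tokens/s (target).
c_b = cost_per_million_tokens(hourly_cost=2.0, tokens_per_second=100.0)
c_t = cost_per_million_tokens(hourly_cost=2.0, tokens_per_second=25.0)
print(f"C_b ≈ ${c_b:.2f}, C_t ≈ ${c_t:.2f}, ΔC ≈ ${c_t - c_b:.2f} per million tokens")
```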
When you run the calculator, you will typically see three main effects as you increase the target context length:
1. Memory usage climbs sharply, since the attention component grows with L^2; a configuration that fits comfortably on one GPU at 2k may not fit at much longer lengths.
2. Throughput drops roughly as 1 / L^2 under dense attention. This directly impacts latency and capacity (e.g., concurrent users you can serve).
3. Cost per million tokens rises by roughly the same factor that throughput falls, since the hardware cost per hour stays fixed.

Large jumps in memory and cost do not automatically mean long context is a bad idea, but they do highlight where you might need architectural tricks (sparse attention, sliding windows) or system-level optimizations (KV caching strategies, batching) to keep serving costs under control.
Consider a configuration similar to the defaults in the form above:

- Baseline context length L_b = 2,048 tokens
- Target context length L_t = 8,192 tokens
- Baseline throughput T_b = 100 tokens/s
- Hardware cost H_c = $2/hour
Using the quadratic scaling formula:
T_t = 100 × (2048 / 8192)^2 = 100 × (1/4)^2 = 100 / 16 = 6.25 tokens/s
So, a 4× longer context leads to a 16× throughput decrease under this model.
Baseline cost per million tokens at 2k context:
C_b = 2 / (100 × 3600 / 10^6) = 2 / 0.36 ≈ $5.56 per million tokens
At 8k context with T_t = 6.25 tokens/s:
C_t = 2 / (6.25 × 3600 / 10^6) = 2 / 0.0225 ≈ $88.89 per million tokens
This is a 16× increase in cost per million tokens, mirroring the 16× throughput decrease. The example table on the page illustrates the same 16× shift with rounder numbers under its own example inputs (e.g., cost rising from around $0.20 to $3.20) to emphasize how dramatically the economics change when context windows are extended aggressively.
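Putting the pieces together, a short self-contained script reproduces this worked example end to end (again an illustrative sketch, not the calculator's code):

```python
def projected_throughput(t_baseline: float, l_baseline: int, l_target: int) -> float:
    """Quadratic dense-attention projection: T_t = T_b * (L_b / L_t)^2."""
    return t_baseline * (l_baseline / l_target) ** 2


def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Hourly hardware cost divided by millions of tokens processed per hour."""
    return hourly_cost / (tokens_per_second * 3600 / 1e6)


t_b, l_b, l_t, h_c = 100.0, 2048, 8192, 2.0

t_t = projected_throughput(t_b, l_b, l_t)
c_b = cost_per_million_tokens(h_c, t_b)
c_t = cost_per_million_tokens(h_c, t_t)

print(f"T_t ≈ {t_t:.2f} tokens/s ({t_b / t_t:.0f}x slower)")
print(f"C_b ≈ ${c_b:.2f}/M tokens, C_t ≈ ${c_t:.2f}/M tokens ({c_t / c_b:.0f}x more expensive)")
```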
The table below summarizes typical qualitative differences between shorter and longer context windows, assuming the same model architecture and hardware.
| Aspect | Shorter context (e.g., 2k–4k) | Longer context (e.g., 8k–32k+) |
|---|---|---|
| GPU memory usage | Lower; easier to fit on a single GPU; more headroom for batching. | Much higher; may require larger GPUs, model/tensor parallelism, or offloading. |
| Throughput (tokens/s) | Higher; better latency and higher user concurrency. | Lower due to quadratic scaling; latency and capacity can degrade sharply. |
| Cost per million tokens | Lower; better hardware utilization per dollar. | Higher; cost can grow roughly with the square of the context length. |
| Suitability for long documents | Requires chunking or retrieval; may miss cross-chunk interactions. | Can hold entire documents or conversations in one window. |
| Implementation complexity | Simpler; standard dense-attention inference is usually enough. | Often needs optimized kernels, caching strategies, or alternative attention patterns. |
Under dense self-attention, the attention part of memory scales with L^2. Going from 2k to 8k is a 4× increase in L, which implies roughly 16× more attention memory. Activations also grow linearly with L, so total memory typically increases by well over 4×. The calculator approximates this growth based on your chosen hidden size, layers, and precision.
The model here assumes throughput scales as T ∝ 1 / L^2 for dense self-attention. That means doubling the context length cuts throughput to roughly a quarter, and quadrupling it (e.g., 2k → 8k) cuts throughput to roughly one sixteenth.
Real systems may deviate from this idealized scaling, but it is a useful upper bound on how much slower things can get.
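As a quick illustration of what that scaling implies, the sketch below prints the throughput and cost multipliers for a few target lengths relative to an assumed 2k baseline; the specific lengths are arbitrary, only the multipliers matter:

```python
baseline_length = 2048  # illustrative baseline context length

for target_length in (4096, 8192, 16384, 32768):
    factor = (target_length / baseline_length) ** 2
    print(f"{baseline_length} -> {target_length:>6} tokens: "
          f"~{factor:.0f}x lower throughput, ~{factor:.0f}x higher cost per million tokens")
```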
A long-context model becomes problematic when the cost per million tokens and the required GPU count exceed your budget for the workload you care about. Use this calculator to see how cost scales with context, then compare that against business constraints like cost per chat session or per document processed. Often, a hybrid approach (moderate context length plus retrieval-augmented generation) gives a better trade-off than pushing context length to the maximum.
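As a sketch of that kind of comparison, the snippet below converts cost per million tokens into a cost per chat session. The 50k-tokens-per-session figure is a hypothetical workload assumption, and the throughput numbers are carried over from the worked example above:

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Hourly hardware cost divided by millions of tokens processed per hour."""
    return hourly_cost / (tokens_per_second * 3600 / 1e6)


# Hypothetical workload assumption: an average chat session consumes ~50k tokens.
TOKENS_PER_SESSION = 50_000

for label, tokens_per_second in (("2k context", 100.0), ("8k context", 6.25)):
    per_million = cost_per_million_tokens(2.0, tokens_per_second)
    per_session = per_million * TOKENS_PER_SESSION / 1e6
    print(f"{label}: ~${per_session:.2f} per chat session")
```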
This calculator is intentionally simplified. It provides rough planning numbers, not production-grade capacity estimates. Key assumptions and limitations include:

- Dense self-attention with strict quadratic scaling in sequence length; sparse attention, sliding windows, and other optimized attention patterns are not modeled.
- Memory estimates use simple factor-of-2 approximations for activations and KV states and ignore optimizer states, padding, and allocator fragmentation.
- Throughput at the target length is projected purely from the (L_b / L_t)^2 ratio applied to your measured baseline.
- Hardware and pricing are assumed constant; batching, parallelism strategies, and serving-stack overheads are not modeled.
Because of these limitations, you should use the calculator for intuition and relative comparisons (e.g., how costly is 8k vs 2k), and then validate key configurations with empirical benchmarks on your actual hardware and software stack.
If your calculations indicate that extremely long contexts (like 32k tokens) are too expensive, consider strategies such as retrieval-augmented generation, document chunking with smart linking, or architectures that reduce attention complexity. Combining a moderate context window (for example, 4k or 8k tokens) with retrieval can often capture most of the benefits of long context while keeping GPU memory, throughput, and cost per million tokens within acceptable bounds.