Retrieval augmented generation (RAG) systems blend neural information retrieval with large language model (LLM) reasoning. Each query typically flows through three stages: a user prompt triggers retrieval of semantically similar documents, the retrieved passages are collated into a context window, and an LLM synthesizes a response. This calculator follows the same structure by evaluating the token budget, retrieval fan-out, and refresh overhead that collectively determine the per-query cost and latency envelopes teams must respect when scaling. Rather than rely on a single cost estimate, the script simulates cold-cache, warm-cache, and aggressively optimized scenarios so that operators can see how caching, fan-out reduction, and retrieval latency improvements reshape their budgets. The inputs align with those used across the site’s AI tooling—plain numeric fields, contextual copy buttons, and narrative walkthroughs that empower decision makers to adapt the formulas to their stacks without touching backend infrastructure.
The computation begins with token accounting. Prompt tokens encompass instructions, conversation history, and retrieved context, while completion tokens cover the model’s response. Most providers quote prices per thousand (or per million) tokens, so the first step is normalizing prompt and completion counts into a per-thousand-token rate. Because vector stores charge per retrieval or per thousand read operations, the calculator treats fan-out as a multiplier on a per-thousand retrieval rate. Embedding refreshes, the silent cost of RAG, are modeled using document counts, average tokens per document, and the share of the corpus re-embedded each month. These costs are amortized over the monthly query volume to deliver a per-query refresh burden that can be compared with the serving expense.
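In LaTeX form, and assuming a single blended price M covers both prompt and completion tokens while the refresh share D is written as a fraction, the cold-cache cost per query can be written as:

$$
C_{\text{cold}} = \frac{(P + R)\,M}{1000} + \frac{F \cdot V}{1000} + \frac{E \cdot T \cdot D \cdot p}{1000 \cdot Q}
$$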
The expression above defines the cold-cache cost per query. P and R represent prompt and completion tokens, M is the model price per thousand tokens, F is the retrieval fan-out, and V denotes the vector read price per thousand retrievals. The final fraction captures the embedding refresh load where E is the number of documents in the index, T the tokens per document, D the monthly refresh percentage, p the embedding price per thousand tokens, and Q the monthly query volume used to amortize the expense.
Warm-cache cost introduces the hit rate h, reducing both the LLM token bill and the vector reads by the fraction of queries served directly from cache. Latency modeling parallels these formulas: cold latency is the base LLM latency plus fan-out multiplied by per-retrieval latency, while warm latency scales both components by (1 − h). The optimized profile reduces fan-out by 20% and nudges the cache hit rate upward, capping it at 95%, mirroring common engineering efforts such as better reranking and cache eviction tuning.
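In the same notation, with the amortized refresh term unchanged and with L_model and L_retrieval standing in for the base LLM latency and the per-retrieval latency (symbols introduced here for clarity, not used elsewhere on the page), the warm-cache cost and the two latency profiles look roughly like:

$$
C_{\text{warm}} = (1 - h)\left(\frac{(P + R)\,M}{1000} + \frac{F \cdot V}{1000}\right) + \frac{E \cdot T \cdot D \cdot p}{1000 \cdot Q}
$$

$$
\ell_{\text{cold}} = L_{\text{model}} + F \cdot L_{\text{retrieval}}, \qquad \ell_{\text{warm}} = (1 - h)\,\ell_{\text{cold}}
$$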
| Scenario | Cost per Query | Monthly Cost | Latency (ms) |
| --- | --- | --- | --- |
| Cold Cache | $0.00 | $0.00 | 0 |
| Warm Cache | $0.00 | $0.00 | 0 |
| Optimized | $0.00 | $0.00 | 0 |
The cold-cache row imagines every request falling through to the model and vector store. It highlights the worst case operators must provision for during cache flushes or traffic spikes. The warm-cache row applies the entered hit rate, giving a more realistic steady-state view once responses and retrieved passages are re-used. The optimized row extrapolates the effect of tuning that nudges hit rate upward while reducing fan-out through reranking, deduplication, or hybrid search filters. Each row adds the amortized embedding refresh cost because document updates are a background process independent of cache behavior.
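A minimal TypeScript sketch shows how the three rows could be computed from the formulas above. The field names, the blended per-thousand-token price, and the size of the optimized hit-rate bump (a 10% relative increase before the 95% cap) are illustrative assumptions, not the calculator’s actual implementation:

```typescript
// Illustrative sketch; field names are assumptions, not the calculator's actual input IDs.
interface RagInputs {
  promptTokens: number;        // P: prompt tokens per query (instructions + history + context)
  completionTokens: number;    // R: completion tokens per query
  modelPricePer1k: number;     // M: blended $ per 1,000 LLM tokens
  fanOut: number;              // F: passages retrieved per query
  vectorPricePer1k: number;    // V: $ per 1,000 vector reads
  docs: number;                // E: documents in the index
  tokensPerDoc: number;        // T: average tokens per document
  refreshShare: number;        // D: fraction of the corpus re-embedded each month (0..1)
  embedPricePer1k: number;     // p: $ per 1,000 embedding tokens
  monthlyQueries: number;      // Q: monthly query volume used to amortize refreshes
  cacheHitRate: number;        // h: fraction of queries served from cache (0..1)
  llmLatencyMs: number;        // base LLM latency in milliseconds
  retrievalLatencyMs: number;  // latency per retrieval call in milliseconds
}

interface ScenarioRow {
  costPerQuery: number;
  monthlyCost: number;
  latencyMs: number;
}

function scenarioRows(i: RagInputs): Record<"cold" | "warm" | "optimized", ScenarioRow> {
  // Amortized embedding refresh cost, added to every scenario regardless of cache behavior.
  const refreshPerQuery =
    (i.docs * i.tokensPerDoc * i.refreshShare * i.embedPricePer1k) / (1000 * i.monthlyQueries);

  const row = (fanOut: number, hitRate: number): ScenarioRow => {
    const llmCost = ((i.promptTokens + i.completionTokens) / 1000) * i.modelPricePer1k;
    const readCost = (fanOut / 1000) * i.vectorPricePer1k;
    const costPerQuery = (1 - hitRate) * (llmCost + readCost) + refreshPerQuery;
    const latencyMs = (1 - hitRate) * (i.llmLatencyMs + fanOut * i.retrievalLatencyMs);
    return { costPerQuery, monthlyCost: costPerQuery * i.monthlyQueries, latencyMs };
  };

  return {
    cold: row(i.fanOut, 0),                // every request misses the cache
    warm: row(i.fanOut, i.cacheHitRate),   // steady state at the entered hit rate
    // 20% lower fan-out; the 1.1x hit-rate bump is an assumed nudge, capped at 95%.
    optimized: row(i.fanOut * 0.8, Math.min(i.cacheHitRate * 1.1, 0.95)),
  };
}
```

Running the same inputs through all three rows makes it easy to spot whether cache hit rate or fan-out dominates the budget before committing to either optimization.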
Beyond raw numbers, the explanatory narrative dives into instrumentation strategies, back-of-the-envelope sensitivity analysis, and the trade-offs between retrieval fan-out and hallucination risk. Readers can adapt the formulas to batched serving, multi-tenant caches, or streaming completions. If latency targets are missed, the narrative suggests options such as distilling prompts to cut token counts, precomputing summaries to shorten context length, or adding tiered caches that separate frequently used knowledge articles from long-tail corpus entries. The walkthrough also covers budget governance (tying monthly query volume to quota policies) and the operational implications of embedding refresh cadence when new documents arrive in bursts. Throughout, the discussion mirrors the tone and depth of other AI calculators on the site, offering real-world anecdotes, integration pointers, and encouragement to validate assumptions against observability dashboards before locking budgets.
Teams experimenting with hybrid retrieval approaches will find the MathML derivations handy for mixing sparse keyword signals with dense vectors. The warm-cache formula, for instance, can be extended by splitting fan-out across two retrieval channels and assigning each a distinct vector read cost, as sketched in the formula below. Likewise, latency modeling can be decomposed into parallel segments to reflect asynchronous retrieval. The calculator’s structure is intentionally transparent, allowing architects to fold in reranker costs, reranking latencies, or guardrail model invocations. Because all arithmetic runs client-side, the page also doubles as documentation: product managers can screenshot the table during planning reviews, while engineers can export the numbers via the copy button for spreadsheet validation. This storytelling-driven approach keeps the page well past a thousand words, matching the long-form guidance that power users expect from the catalog of calculators.
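As one possible shape for that hybrid extension, with F_s and F_d denoting sparse and dense fan-out and V_s and V_d their per-thousand read prices (symbols introduced here, not defined on the page), the vector-read term of the warm-cache formula splits into two channels:

$$
C_{\text{warm}}^{\text{hybrid}} = (1 - h)\left(\frac{(P + R)\,M}{1000} + \frac{F_s V_s + F_d V_d}{1000}\right) + \frac{E \cdot T \cdot D \cdot p}{1000 \cdot Q}
$$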