RAG Query Cost and Latency Calculator
Introduction: What this calculator estimates
Retrieval-augmented generation (RAG) turns a single user request into a small pipeline: (1) retrieve relevant documents/chunks from a vector database, (2) build an LLM prompt that includes your system instructions, chat history, and retrieved context, and (3) generate the final answer with the model. This calculator estimates cost and latency for that pipeline using the inputs you provide, and compares outcomes across caching scenarios.
It is designed for planning and “what-if” analysis: budgeting monthly spend, estimating the ROI of improving cache hit rate, choosing between models or vector stores, and understanding which knob (tokens, fan-out, cache hit rate, retrieval latency) is driving your p50-style response time and your unit economics.
Inputs explained (units and how they’re used)
- Monthly Queries (Q): total queries served per month.
- Prompt Tokens per Query (P): tokens sent to the model per query (instructions + history + retrieved context). If your provider charges different prompt/completion rates, this calculator assumes a blended rate (see limitations).
- Completion Tokens per Query (R): tokens generated by the model per query.
- Model Price per 1K Tokens ($) (M): price per 1,000 tokens. The calculator treats P + R as billable tokens at this single rate.
- Retrieval Fan-out (documents) (F): how many documents/chunks you fetch per query. Each “document” here means one vector read operation (or equivalent) for pricing purposes.
- Cache Hit Rate (0–1) (h): fraction of queries served from cache. Enter as a decimal: 0.35 = 35%.
- Latency per Retrieval (ms) (L_r): average time to perform one retrieval. Used linearly with fan-out as F × L_r.
- Base LLM Latency (ms) (L_llm): average model latency excluding retrieval (think: prompt processing + generation time under typical load).
- Vector Read Cost per 1K Retrievals ($) (V): cost per 1,000 retrieval operations.
- Embedding Price per 1K Tokens ($) (p): price per 1,000 tokens to embed documents during refresh/reindex.
- Documents in Index (E): total documents/chunks stored.
- Average Tokens per Document (T): approximate tokens per document/chunk that you embed.
- Monthly Refresh Percentage (%) (D%): what fraction of the index is re-embedded per month (0–100). In formulas this becomes D = D% / 100.
Formula: Cost model
The total monthly cost is modeled as the sum of: (a) LLM inference (prompt + completion), (b) vector reads for retrieval, and (c) embedding refresh cost amortized across your monthly query volume. The calculator then reports per-query values by dividing by Q where appropriate.
1) LLM cost
Billable tokens per query are approximated as P + R. With a single blended price per 1K tokens (M):
LLM cost per query:
C_llm = ((P + R) / 1000) × M
2) Vector read cost
Each query triggers F retrievals. If your vector store charges V dollars per 1,000 retrievals:
Vector cost per query:
C_vector = (F × V) / 1000
3) Embedding refresh amortization
RAG systems often pay a recurring “silent” cost to keep the index fresh: re-embedding new/changed content. If you have E documents, each averaging T tokens to embed, and you refresh a fraction D of them per month, then:
- Monthly embedding tokens: E × T × D
- Monthly embedding cost: (E × T × D × p) / 1000
To compare embedding upkeep to serving, the calculator amortizes the monthly embedding cost over monthly queries Q:
Embedding refresh cost per query:
C_embed = (E × T × D × p) / (1000 × Q)
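The three per-query components above can be sketched in Python as follows. This is a minimal sketch rather than the calculator's actual source: the function names are illustrative, and the parameters simply mirror the symbols defined in the inputs section.

```python
def llm_cost_per_query(P: float, R: float, M: float) -> float:
    """C_llm: (P + R) billable tokens at one blended price M per 1K tokens."""
    return (P + R) / 1000 * M


def vector_cost_per_query(F: float, V: float) -> float:
    """C_vector: F retrievals per query at V dollars per 1K retrievals."""
    return F * V / 1000


def embed_cost_per_query(E: float, T: float, D_pct: float, p: float, Q: float) -> float:
    """C_embed: monthly re-embedding cost amortized over Q monthly queries."""
    D = D_pct / 100  # convert the percentage input to a fraction
    monthly_embedding_cost = E * T * D * p / 1000
    return monthly_embedding_cost / Q


if __name__ == "__main__":
    # Default inputs from the worked example.
    print(f"{llm_cost_per_query(800, 600, 0.003):.7f}")        # 0.0042000
    print(f"{vector_cost_per_query(8, 0.15):.7f}")             # 0.0012000
    print(f"{embed_cost_per_query(250_000, 750, 20, 0.0001, 100_000):.7f}")  # 0.0000375
```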
Cold-cache vs warm-cache cost
This calculator treats a cache hit as avoiding both retrieval and LLM inference for the cached portion of queries (i.e., a hit returns a stored answer). Under that simplifying assumption:
- Cold-cache (no cache): C_cold = C_llm + C_vector + C_embed
- Warm-cache (hit rate h): C_warm = (1 − h) × (C_llm + C_vector) + C_embed
Note that embedding refresh is typically paid regardless of cache performance, so it is not reduced by h in this model.
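As a rough illustration of the cold and warm formulas, the sketch below takes the three per-query components as plain numbers and applies the hit rate only to the serving part. The function names are hypothetical and the range check on h is an added convenience, not something the calculator is known to do.

```python
def cold_cost(c_llm: float, c_vector: float, c_embed: float) -> float:
    """C_cold: every query pays inference, retrieval, and amortized embeddings."""
    return c_llm + c_vector + c_embed


def warm_cost(c_llm: float, c_vector: float, c_embed: float, h: float) -> float:
    """C_warm: only the (1 - h) share of queries pays inference and retrieval.

    The embedding term is added outside the (1 - h) factor because refresh
    is paid regardless of cache performance.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must be a fraction between 0 and 1")
    return (1 - h) * (c_llm + c_vector) + c_embed


if __name__ == "__main__":
    print(f"{cold_cost(0.0042, 0.0012, 0.0000375):.7f}")        # 0.0054375
    print(f"{warm_cost(0.0042, 0.0012, 0.0000375, 0.35):.7f}")  # 0.0035475
```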
Latency model
Latency is modeled as retrieval time plus base model time. Retrieval is approximated as fan-out times per-retrieval latency (linear fan-out):
- Cold latency: L_cold = L_llm + (F × L_r)
- Warm latency: L_warm = (1 − h) × L_cold (cache hits are treated as ~0 ms incremental compute in this simplified view)
In practice, cached responses still incur some overhead (routing, cache lookup, network). If you want that reflected, enter a lower effective hit rate so that cache hits receive less than the full h benefit.
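One way to pick that lower effective hit rate is sketched below. It assumes a hypothetical per-hit overhead L_cache (in ms), which is not one of the calculator's inputs. The expected latency with overhead is h × L_cache + (1 − h) × L_cold, and the same number falls out of the calculator's formula if you enter h_eff = h × (1 − L_cache / L_cold).

```python
def cold_latency_ms(L_llm: float, F: float, L_r: float) -> float:
    """L_cold: base model latency plus linear fan-out retrieval time."""
    return L_llm + F * L_r


def warm_latency_ms(h: float, L_cold: float, L_cache: float = 0.0) -> float:
    """Expected latency when hits cost L_cache ms (assumed) instead of ~0 ms."""
    return h * L_cache + (1 - h) * L_cold


def effective_hit_rate(h: float, L_cold: float, L_cache: float) -> float:
    """Hit rate to enter so that (1 - h_eff) * L_cold matches warm_latency_ms."""
    return h * (1 - L_cache / L_cold)


if __name__ == "__main__":
    L_cold = cold_latency_ms(700, 8, 45)            # 1060 ms with the defaults
    print(warm_latency_ms(0.35, L_cold))            # simplified model, no overhead
    print(warm_latency_ms(0.35, L_cold, L_cache=20))
    print(round(effective_hit_rate(0.35, L_cold, 20), 4))
```

With the default inputs and an assumed 20 ms overhead per hit, the effective hit rate comes out slightly below 0.35, so the warm latency estimate rises by a few milliseconds.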
How to use: Worked example (using the default inputs)
Defaults: Q=100,000, P=800, R=600, M=$0.003, F=8, h=0.35, L_r=45ms, L_llm=700ms, V=$0.15, p=$0.0001, E=250,000, T=750, D%=20 (so D=0.20).
- LLM cost/query: ((800+600)/1000)×0.003 = 1.4×0.003 = $0.0042
- Vector cost/query: (8×0.15)/1000 = $0.0012
- Embedding monthly cost: tokens = 250,000×750×0.20 = 37,500,000; cost = 37,500,000/1000×0.0001 = $3.75
- Embedding cost/query: $3.75 / 100,000 = $0.0000375
Cold cost/query ≈ 0.0042 + 0.0012 + 0.0000375 = $0.0054375 (~$0.0054). That implies monthly cold serving ≈ $543.75.
Warm cost/query (35% hits) ≈ (1−0.35)×(0.0042+0.0012) + 0.0000375 ≈ 0.65×0.0054 + 0.0000375 ≈ $0.0035475 (~$0.00355). Monthly warm serving ≈ $354.75.
Cold latency ≈ 700 + 8×45 = 1060ms. Warm latency under the simplified hit model ≈ 0.65×1060 = 689ms. Interpretation: with these parameters, latency is dominated by base LLM time, while cost is split between the model and retrieval reads; embeddings are small per query but can matter if query volume is low or refresh rates are high.
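To see how the embedding share grows as query volume drops, the sketch below recomputes the cold cost per query for a few values of Q while holding the other default inputs fixed. The function name and the chosen Q values are illustrative assumptions.

```python
def embed_share_of_cold_cost(Q, P=800, R=600, M=0.003, F=8, V=0.15,
                             p=0.0001, E=250_000, T=750, D_pct=20):
    """Fraction of the cold per-query cost that comes from embedding refresh."""
    c_llm = (P + R) / 1000 * M
    c_vector = F * V / 1000
    c_embed = E * T * (D_pct / 100) * p / (1000 * Q)
    return c_embed / (c_llm + c_vector + c_embed)


if __name__ == "__main__":
    for Q in (1_000, 10_000, 100_000):
        print(f"Q={Q:>7,}: embedding share = {embed_share_of_cold_cost(Q):.1%}")
    # Q=  1,000: embedding share = 41.0%
    # Q= 10,000: embedding share = 6.5%
    # Q=100,000: embedding share = 0.7%
```

The monthly embedding bill is the same $3.75 in all three rows; only the number of queries it is spread across changes.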
How to interpret the results
- If per-query cost is high, first check token counts (P and R). Large retrieved context or long completions usually dominate.
- If retrieval cost is high, reduce F, improve filtering, or add a reranker to keep fan-out small while preserving quality.
- If embedding amortization looks large, it typically means either (a) low query volume Q (so fixed refresh is spread over few queries) or (b) high refresh percentage and large corpus.
- If latency is high, determine whether it’s L_llm-dominated (model choice, output length, throughput limits) or retrieval-dominated (F and L_r).
Scenario comparison (what changes when you improve caching/fan-out)
| Lever | What you change | Primary impact | Secondary effects / notes |
|---|---|---|---|
| Increase cache hit rate (h) | Better cache keys, longer TTLs, semantic caching | Reduces serving cost and latency on cached queries | In this model, embeddings are not reduced by caching |
| Reduce prompt tokens (P) | Smaller context, better chunking, tighter instructions | Strong cost reduction; often latency reduction | May impact answer quality if context becomes insufficient |
| Reduce fan-out (F) | Tighter filtering, reranking, better embeddings | Lower vector cost and retrieval latency | Also can reduce prompt tokens if fewer passages are included |
| Lower LLM latency (L_llm) | Faster model, lower max tokens, higher throughput | Direct latency improvement | May change price per 1K tokens (M) and output quality |
| Lower refresh rate (D%) | Refresh only changed docs, incremental pipelines | Lowers embedding overhead | Risk of stale answers if freshness requirements are strict |
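Two of these levers can be compared side by side with a small helper like the one below. It is a sketch under the same simplified model, with a hypothetical `scenario` function whose defaults are the worked-example inputs. It changes one input at a time, so it does not capture the secondary effects listed in the table (for example, a lower F also shrinking P).

```python
def scenario(Q=100_000, P=800, R=600, M=0.003, F=8, h=0.35,
             L_r=45, L_llm=700, V=0.15, p=0.0001,
             E=250_000, T=750, D_pct=20):
    """Warm-cache cost and latency for one set of inputs."""
    c_llm = (P + R) / 1000 * M
    c_vector = F * V / 1000
    c_embed = E * T * (D_pct / 100) * p / (1000 * Q)
    cost = (1 - h) * (c_llm + c_vector) + c_embed
    latency = (1 - h) * (L_llm + F * L_r)
    return {"cost_per_query": cost, "monthly_cost": cost * Q, "latency_ms": latency}


if __name__ == "__main__":
    runs = {
        "baseline (h=0.35, F=8)": scenario(),
        "higher hit rate (h=0.50)": scenario(h=0.50),
        "lower fan-out (F=4)": scenario(F=4),
    }
    for name, r in runs.items():
        print(f"{name:<26} ${r['monthly_cost']:>7.2f}/month  {r['latency_ms']:.0f} ms")
```

Under these inputs, raising h from 0.35 to 0.50 lowers both monthly cost and modeled latency more than halving F does, because the hit rate scales the LLM term as well as the retrieval term.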
Assumptions and limitations
- Blended token price: Many providers price prompt and completion tokens differently. This calculator uses one rate (M) for both.
- Cache hit semantics: A cache hit is treated as avoiding both retrieval and LLM inference (returning a stored final answer). If you only cache retrieval results or prompt prefixes, real savings differ.
- Latency is an average-style estimate: Real systems care about p95/p99. Queueing, throttling, cold starts, and regional network variance are not modeled.
- Retrieval latency linearity: The formula uses F × L_r. Some systems parallelize retrievals, making latency closer to max(L_r) plus overhead.
- Not included: reranker model costs, tool/function calls, retries, guardrails/moderation, streaming overhead, vector index build costs, writes, storage, observability, and application/server costs.
- Token estimates: Tokens per document and per query can vary widely depending on chunking strategy and prompt template. Treat results as directional until validated with logging.
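Two of these limitations can be worked around before entering inputs, as the sketch below suggests. The first helper derives a token-weighted blended M from separate prompt and completion prices, so that ((P + R) / 1000) × M equals the split-rate cost exactly. The second estimates cold latency when retrievals run in parallel. Both function names, the example prices, and the `overhead_ms` parameter are assumptions for illustration, not calculator inputs.

```python
def blended_price_per_1k(P: float, R: float, price_in: float, price_out: float) -> float:
    """Token-weighted average of prompt and completion prices per 1K tokens."""
    return (P * price_in + R * price_out) / (P + R)


def cold_latency_parallel_ms(L_llm: float, retrieval_ms: list[float],
                             overhead_ms: float = 0.0) -> float:
    """Cold latency when all retrievals are issued concurrently.

    The slowest retrieval bounds the retrieval phase, instead of the
    sum F * L_r used by the calculator's linear model.
    """
    return L_llm + max(retrieval_ms) + overhead_ms


if __name__ == "__main__":
    # Hypothetical split pricing: $0.001 per 1K prompt, $0.005 per 1K completion.
    M = blended_price_per_1k(800, 600, 0.001, 0.005)
    print(f"blended M = ${M:.6f} per 1K tokens")

    # Eight parallel retrievals at 45 ms each, plus an assumed 10 ms of fan-in overhead.
    print(cold_latency_parallel_ms(700, [45.0] * 8, overhead_ms=10.0))  # 755.0
```

Note that the blended rate depends on the P:R ratio, so it should be recomputed whenever you change either token count.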
Sourcing note
Use your provider’s current published pricing for M and p, your vector database pricing for V, and measured production timings for L_r and L_llm. The most reliable way to calibrate is to sample real traces and plug in p50 and p95 as separate runs.
