Skip to content

Cost Optimisation

Purpose

Concrete techniques and a cost estimation tool to keep AI system costs manageable during the Development and Operations phases.

When to use this?

You want to estimate the monthly costs of your AI system or are looking for concrete techniques to reduce API, infrastructure and operational costs.

Concrete techniques and a cost estimation tool for AI systems. Use this document in the Development and Monitoring & Optimisation phases to keep costs manageable.


1. Cost Estimation (Calculator)

Complete the table below for a quick monthly estimate.

LLM API Costs

Parameter Your value Example
Requests per day 500
Average input tokens per request 800
Average output tokens per request 300
Price per 1M input tokens (€) €2.50
Price per 1M output tokens (€) €10.00
Monthly input costs  = (requests/day × 30 × input tokens) / 1,000,000 × price
Monthly output costs = (requests/day × 30 × output tokens) / 1,000,000 × price
Total API costs/month = input costs + output costs

Example: 500 requests/day → 500 × 30 × 800 / 1,000,000 × €2.50 = €30/month input + 500 × 30 × 300 / 1,000,000 × €10 = €45/month output = €75/month total

Total Monthly Cost Estimate

Cost item Monthly (€)
LLM API (inference)
Compute (servers/GPU)
Storage (vector store, logs, artefacts)
Monitoring & observability tools
Development/maintenance (internal)
Total

Scenarios:

Scenario Volume Estimated costs
Best case (low volume) 20% of expected
Expected 100%
Worst case (high volume) 300%
Scale scenario (10× growth) 1000%

2. Optimisation Techniques

Technique 1 — Prompt Optimisation

Expected saving: 20–40% on input tokens

Unnecessary tokens in system prompts and user instructions increase costs without quality gains.

Action Approach
Remove redundant instructions Check for overlap between system prompt and user instructions
Use shorter examples Compress few-shot examples without quality loss
System caching Reuse identical system prompts via provider caching
Remove unnecessary context Send only relevant document sections, not the full document

Technique 2 — Response Caching

Expected saving: 30–60% for repetitive queries

Identifiable, repeated questions (FAQ, standard reports) are cached rather than re-sent to the API.

Cache type Suitable for TTL recommendation
Exact match Identical queries 24–72 hours
Semantic match Similar questions (cosine similarity > 0.95) 6–24 hours
Template output Generated documents based on fixed structure Up to 7 days

Technique 3 — Model Tiering

Expected saving: 40–60% for mixed workloads

Not every question requires the heaviest (most expensive) model. Route based on complexity.

Tier Model (example) Suitable for Relative cost
Light Claude Haiku, GPT-4o mini Classification, extraction, simple questions
Medium Claude Sonnet Analysis, summarisation, Q&A 5–10×
Heavy Claude Opus Complex reasoning, legal, medical 15–30×

Example routing logic (Python):

def select_model(query: str, complexity_score: float) -> str:
    if complexity_score < 0.3:
        return "claude-haiku-4-5-20251001"   # Light — fast & cheap
    elif complexity_score < 0.7:
        return "claude-sonnet-4-6"           # Medium — balanced
    else:
        return "claude-opus-4-6"             # Heavy — complex reasoning

Technique 4 — Chunking & RAG Optimisation

Expected saving: 20–40% on context length for document processing

Parameter Suboptimal Optimised
Chunk size 2000 tokens 400–600 tokens
Chunks per query 10 3–5 (with reranking)
Similarity threshold 0.70 0.82+
Chunk compression No Yes (extractive summarisation)

Technique 5 — Batch Processing

Expected saving: 30–50% for non-real-time workloads

  • Use Batch API endpoints (Anthropic, OpenAI offer 50% discounts)
  • Schedule heavy processing outside peak hours
  • Combine multiple documents in one API request where possible

3. Monitoring & Cost Management

KPIs for Cost Management

Metric Threshold (warning) Action
Cost per successful task > 2× baseline Investigate model tiering
Token usage per request > 130% of average Prompt optimisation
Cache hit rate \< 20% Increase TTL or cache scope
Cost/month vs. budget > 80% of budget Review and adjust

Budget Alert Configuration

Always configure budget alerts at:

  • 70% of monthly budget → warning notification
  • 90% of monthly budget → escalation to AI PM + CAIO
  • 100% of monthly budget → automatic rate limiting or stop

Cost Allocation

Allocate costs per system, team or use case via tags/labels in your cloud environment. This enables ROI calculation per project (see Benefits Realisation).


4. Cost Optimisation per Phase

Phase Priority Action
Discovery Basic Use light model for prototyping; set budget cap
Validation Basic Measure cost per test case; calculate monthly cost at production volume
Development High Implement caching and model tiering; set up monitoring
Delivery High Validate costs vs. Business Case; automate budget alerts
Monitoring Ongoing Review monthly; optimise when > 10% deviation from baseline