Cost Optimisation¶
Purpose
Concrete techniques and a cost estimation tool to keep AI system costs manageable during the Development and Operations phases.
When to use this?
You want to estimate the monthly costs of your AI system or are looking for concrete techniques to reduce API, infrastructure and operational costs.
Concrete techniques and a cost estimation tool for AI systems. Use this document in the Development and Monitoring & Optimisation phases to keep costs manageable.
1. Cost Estimation (Calculator)¶
Complete the table below for a quick monthly estimate.
LLM API Costs¶
| Parameter | Your value | Example |
|---|---|---|
| Requests per day | 500 | |
| Average input tokens per request | 800 | |
| Average output tokens per request | 300 | |
| Price per 1M input tokens (€) | €2.50 | |
| Price per 1M output tokens (€) | €10.00 |
Monthly input costs = (requests/day × 30 × input tokens) / 1,000,000 × price
Monthly output costs = (requests/day × 30 × output tokens) / 1,000,000 × price
Total API costs/month = input costs + output costs
Example: 500 requests/day → 500 × 30 × 800 / 1,000,000 × €2.50 = €30/month input + 500 × 30 × 300 / 1,000,000 × €10 = €45/month output = €75/month total
Total Monthly Cost Estimate¶
| Cost item | Monthly (€) |
|---|---|
| LLM API (inference) | |
| Compute (servers/GPU) | |
| Storage (vector store, logs, artefacts) | |
| Monitoring & observability tools | |
| Development/maintenance (internal) | |
| Total |
Scenarios:
| Scenario | Volume | Estimated costs |
|---|---|---|
| Best case (low volume) | 20% of expected | |
| Expected | 100% | |
| Worst case (high volume) | 300% | |
| Scale scenario (10× growth) | 1000% |
2. Optimisation Techniques¶
Technique 1 — Prompt Optimisation¶
Expected saving: 20–40% on input tokens
Unnecessary tokens in system prompts and user instructions increase costs without quality gains.
| Action | Approach |
|---|---|
| Remove redundant instructions | Check for overlap between system prompt and user instructions |
| Use shorter examples | Compress few-shot examples without quality loss |
| System caching | Reuse identical system prompts via provider caching |
| Remove unnecessary context | Send only relevant document sections, not the full document |
Technique 2 — Response Caching¶
Expected saving: 30–60% for repetitive queries
Identifiable, repeated questions (FAQ, standard reports) are cached rather than re-sent to the API.
| Cache type | Suitable for | TTL recommendation |
|---|---|---|
| Exact match | Identical queries | 24–72 hours |
| Semantic match | Similar questions (cosine similarity > 0.95) | 6–24 hours |
| Template output | Generated documents based on fixed structure | Up to 7 days |
Technique 3 — Model Tiering¶
Expected saving: 40–60% for mixed workloads
Not every question requires the heaviest (most expensive) model. Route based on complexity.
| Tier | Model (example) | Suitable for | Relative cost |
|---|---|---|---|
| Light | Claude Haiku, GPT-4o mini | Classification, extraction, simple questions | 1× |
| Medium | Claude Sonnet | Analysis, summarisation, Q&A | 5–10× |
| Heavy | Claude Opus | Complex reasoning, legal, medical | 15–30× |
Example routing logic (Python):
def select_model(query: str, complexity_score: float) -> str:
if complexity_score < 0.3:
return "claude-haiku-4-5-20251001" # Light — fast & cheap
elif complexity_score < 0.7:
return "claude-sonnet-4-6" # Medium — balanced
else:
return "claude-opus-4-6" # Heavy — complex reasoning
Technique 4 — Chunking & RAG Optimisation¶
Expected saving: 20–40% on context length for document processing
| Parameter | Suboptimal | Optimised |
|---|---|---|
| Chunk size | 2000 tokens | 400–600 tokens |
| Chunks per query | 10 | 3–5 (with reranking) |
| Similarity threshold | 0.70 | 0.82+ |
| Chunk compression | No | Yes (extractive summarisation) |
Technique 5 — Batch Processing¶
Expected saving: 30–50% for non-real-time workloads
- Use Batch API endpoints (Anthropic, OpenAI offer 50% discounts)
- Schedule heavy processing outside peak hours
- Combine multiple documents in one API request where possible
3. Monitoring & Cost Management¶
KPIs for Cost Management¶
| Metric | Threshold (warning) | Action |
|---|---|---|
| Cost per successful task | > 2× baseline | Investigate model tiering |
| Token usage per request | > 130% of average | Prompt optimisation |
| Cache hit rate | \< 20% | Increase TTL or cache scope |
| Cost/month vs. budget | > 80% of budget | Review and adjust |
Budget Alert Configuration¶
Always configure budget alerts at:
- 70% of monthly budget → warning notification
- 90% of monthly budget → escalation to AI PM + CAIO
- 100% of monthly budget → automatic rate limiting or stop
Cost Allocation¶
Allocate costs per system, team or use case via tags/labels in your cloud environment. This enables ROI calculation per project (see Benefits Realisation).
4. Cost Optimisation per Phase¶
| Phase | Priority | Action |
|---|---|---|
| Discovery | Basic | Use light model for prototyping; set budget cap |
| Validation | Basic | Measure cost per test case; calculate monthly cost at production volume |
| Development | High | Implement caching and model tiering; set up monitoring |
| Delivery | High | Validate costs vs. Business Case; automate budget alerts |
| Monitoring | Ongoing | Review monthly; optimise when > 10% deviation from baseline |
Related Modules¶
- Cloud vs. On-Premise
- MLOps Standards
- Benefits Realisation
- Business Case Template
- Agentic AI Engineering — Cost Management
- Engineering Patterns