Cost Tracking
Machina tracks LLM costs per action via the ActionTracer. Every LLM call
records token counts and estimated USD cost.
How It Works
When the agent makes an LLM call, the tracer records:
- prompt_tokens: input tokens sent to the model
- completion_tokens: output tokens generated
- total_tokens: sum of prompt_tokens and completion_tokens
- usd_cost: estimated cost based on the model's pricing
- model: which model was used (e.g., openai/gpt-4o)
These fields appear in every trace entry where an LLM call occurs.
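As an illustration, the cost fields of a trace entry might be read like this. This is a minimal sketch: the field names come from the list above and the `action` key from the analysis scripts further down, while the sample values and the `LLMCost` and `parse_cost` helpers are hypothetical and not part of Machina's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMCost:
    """Cost fields of one trace entry (hypothetical helper, not a Machina class)."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    usd_cost: float
    model: str


def parse_cost(line: str) -> Optional[LLMCost]:
    """Return the cost fields of a JSONL trace line, or None if it has no usd_cost."""
    entry = json.loads(line)
    if entry.get("usd_cost") is None:
        return None  # assumed to be an entry without an LLM call
    return LLMCost(
        prompt_tokens=entry["prompt_tokens"],
        completion_tokens=entry["completion_tokens"],
        total_tokens=entry["total_tokens"],
        usd_cost=entry["usd_cost"],
        model=entry["model"],
    )


# Made-up sample entry; real traces may carry additional keys.
sample = json.dumps({
    "action": "diagnose_failure",
    "prompt_tokens": 1800,
    "completion_tokens": 400,
    "total_tokens": 2200,
    "usd_cost": 0.04,
    "model": "openai/gpt-4o",
})

cost = parse_cost(sample)
assert cost is not None
# total_tokens should equal the sum of the two component counts.
assert cost.total_tokens == cost.prompt_tokens + cost.completion_tokens
```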
Analyzing Costs
Per-Conversation Cost
python -c "
import json, sys

total = 0
for line in open(sys.argv[1]):
    e = json.loads(line)
    cost = e.get('usd_cost', 0)
    if cost:  # skip entries without an LLM call
        total += cost
        print(f'  {e[\"action\"]}: \${cost:.4f} ({e[\"total_tokens\"]} tokens, {e[\"model\"]})')
print(f'\nTotal: \${total:.4f}')
" traces/2026-04-20_conv-abc123.jsonl
Daily Cost Summary
python -c "
import json, glob

total = 0
for f in sorted(glob.glob('traces/2026-04-20_*.jsonl')):
    for line in open(f):
        # 'or 0' also covers entries where usd_cost is present but null
        total += json.loads(line).get('usd_cost', 0) or 0
print(f'Daily total: \${total:.4f}')
"
Cost Anomaly Detection
Flag conversations that exceed a budget threshold:
python -c "
import json, glob, sys

BUDGET = float(sys.argv[1])  # e.g., 0.50
for f in sorted(glob.glob('traces/*.jsonl')):
    # Each trace file is summed as one conversation
    total = sum(json.loads(l).get('usd_cost', 0) or 0 for l in open(f))
    if total > BUDGET:
        print(f'OVER BUDGET: {f}: \${total:.4f}')
" 0.50
Cost Benchmarks
Typical costs for common maintenance operations (GPT-4o pricing):
| Operation | Tokens | Est. Cost |
|---|---|---|
| Simple asset lookup + response | 500–1,500 | $0.01–$0.03 |
| Failure diagnosis with manual search | 2,000–5,000 | $0.03–$0.08 |
| Work order creation with context | 1,500–3,000 | $0.02–$0.05 |
| Full alarm-to-WO workflow (6 steps) | 5,000–15,000 | $0.08–$0.25 |
Costs vary significantly by model. A local model served through Ollama incurs no per-token charge, although the hardware it runs on is not free. Claude and GPT-4o have similar per-token pricing. Smaller models (GPT-4o-mini, Haiku) can reduce costs 5–10x for simpler tasks.
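The relationship between token counts and cost can be sketched as follows. The per-million-token prices are placeholders chosen for illustration only, not actual vendor rates, and `estimate_cost` is a hypothetical helper; how Machina computes `usd_cost` internally is not described here.

```python
# Placeholder prices in USD per 1M tokens: (input, output).
# These numbers are illustrative assumptions, not real vendor pricing.
PRICES = {
    "large-model": (5.00, 15.00),
    "small-model": (0.50, 1.50),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost: each token count times its per-token price, summed."""
    price_in, price_out = PRICES[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


# Same workload (4,000 prompt + 1,000 completion tokens) on both tiers.
large = estimate_cost("large-model", 4000, 1000)
small = estimate_cost("small-model", 4000, 1000)
print(f"large: ${large:.4f}, small: ${small:.4f}, ratio: {large / small:.1f}x")
```

Because input and output tokens are usually priced differently, two calls with the same `total_tokens` can differ in cost, which is one reason to log `prompt_tokens` and `completion_tokens` separately.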
Cost Optimization
- Use cheaper models for simple tasks. Asset lookups and spare part checks don't need GPT-4o — GPT-4o-mini or Haiku work fine.
- Reduce context size. The entity resolver and context injection functions control how much data is sent to the LLM. Trim large asset registries or maintenance histories to relevant subsets.
- Cache common queries. If technicians ask the same questions repeatedly, consider caching LLM responses (not built into Machina — implement at the application layer).
- Monitor with budgets. Use the per-conversation cost analysis above to set alerts and catch runaway conversations.
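The caching suggestion above could look roughly like this at the application layer. It is a sketch under stated assumptions: `CachedLLM`, `normalize`, and `fake_llm` are hypothetical names, the cache is a plain in-memory dictionary, and `fake_llm` stands in for whatever function actually calls the model.

```python
from typing import Callable, Dict


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different phrasings share a key."""
    return " ".join(query.lower().split())


class CachedLLM:
    """Wraps an LLM call with an in-memory cache (hypothetical, application-layer)."""

    def __init__(self, call_llm: Callable[[str], str]) -> None:
        self._call_llm = call_llm
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def ask(self, query: str) -> str:
        key = normalize(query)
        if key in self._cache:
            self.hits += 1  # served from cache, no LLM cost incurred
            return self._cache[key]
        self.misses += 1
        answer = self._call_llm(query)
        self._cache[key] = answer
        return answer


# Stand-in for a real model call so the example runs offline.
def fake_llm(query: str) -> str:
    return f"answer to: {query}"


llm = CachedLLM(fake_llm)
llm.ask("What is the torque spec for pump P-101?")
llm.ask("what is the torque spec for  pump P-101?")  # same key after normalization
print(f"hits={llm.hits} misses={llm.misses}")
```

Cached answers can go stale when the underlying asset or maintenance data changes, so a real deployment would likely need an expiry or invalidation policy on top of this.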
See Also
- Action Traces — full trace format and export
- Scaling — cost-per-conversation as a scaling metric