Uptime and Resilience¶
How Machina handles transient failures, restarts, and degraded environments.
Design Principle: Stateless by Design¶
Machina does not persist state across restarts. All durable state lives in your CMMS (work orders, maintenance plans, asset records) and your vector store (document embeddings).
A restart loses agent conversation context but not maintenance data. This is a deliberate design choice — it keeps Machina simple to operate and eliminates an entire class of state-corruption bugs.
Transient Failure Handling¶
CMMS Connector (HTTP)¶
All CMMS HTTP calls go through request_with_retry (machina.connectors.cmms.retry):
- Retryable conditions: HTTP 429 (Too Many Requests), HTTP 503 (Service Unavailable),
TimeoutException,ConnectError,ReadError - Strategy: Exponential backoff —
min(0.5s × 2^attempt, 8s), up to 3 retries - Retry-After: Honored for 429 responses (numeric seconds)
- Non-retryable errors: 4xx (except 429) return immediately — the caller decides how to handle
If the CMMS is down for longer than the retry window (~15 seconds), the operation fails and the agent reports the error to the user. Machina does not queue failed writes.
OPC-UA Connector¶
The OPC-UA connector (machina.connectors.iot.opcua) does not auto-reconnect.
If the OPC-UA server drops the connection:
health_check()returnsFalse(reads theServerStatusnode).- Active subscriptions are not re-established automatically.
- The MCP server's
/healthendpoint reflects the degraded state.
To reconnect, the operator (or an orchestrator) must call disconnect() then connect().
Subscriptions must be re-created after reconnection.
MQTT Connector¶
The MQTT connector uses aiomqtt, which raises MqttError on disconnect.
Reconnection follows the same manual pattern as OPC-UA.
Graceful Shutdown¶
On SIGTERM (systemd stop, Docker stop, Ctrl+C):
- The MCP server stops accepting new requests.
- In-flight MCP requests are given up to 30 seconds to complete.
MachinaRuntime.disconnect_all()runs, closing every connector. Each connector'sdisconnect()is called independently — a failure in one does not block the others.- The process exits.
# Graceful stop (sends SIGTERM, waits 90s by default)
sudo systemctl stop machina
# Check for clean shutdown in logs
grep "disconnect_all" /var/log/machina/machina.log
Behavior Matrix¶
What happens under every combination of degraded subsystems:
| CMMS | LLM | Sandbox | In-flight WF | Behavior |
|---|---|---|---|---|
| :material-check: Up | :material-check: Up | Off | No | Normal operation. Reads and writes execute against live CMMS. |
| :material-check: Up | :material-check: Up | Off | Yes | Normal. Workflow steps execute, WOs created in CMMS. |
| :material-check: Up | :material-check: Up | On | No | Reads succeed. Writes are logged but not executed. |
| :material-check: Up | :material-check: Up | On | Yes | Workflow steps run. Write steps logged only. |
| :material-check: Up | :material-close: Down | Off | No | Agent cannot reason. MCP tools still callable directly by the MCP client. |
| :material-check: Up | :material-close: Down | Off | Yes | In-flight workflow halts at next LLM-dependent step. CMMS state unchanged for unexecuted steps. |
| :material-check: Up | :material-close: Down | On | No | Same as LLM-down + sandbox-off: tools callable, no agent reasoning. |
| :material-check: Up | :material-close: Down | On | Yes | Workflow halts at LLM step. No writes attempted (sandbox). |
| :material-close: Down | :material-check: Up | Off | No | CMMS reads/writes fail after retry window (~15s). Agent can still reason and report the outage. |
| :material-close: Down | :material-check: Up | Off | Yes | Workflow halts at CMMS-dependent step. Agent reports failure. |
| :material-close: Down | :material-check: Up | On | No | CMMS reads fail. Writes would be logged only anyway. Agent can reason about the outage. |
| :material-close: Down | :material-check: Up | On | Yes | Workflow halts at CMMS read step. Write steps would have been logged only. |
| :material-close: Down | :material-close: Down | Off | No | Fully degraded. MCP server responds to /health but all tool calls fail. |
| :material-close: Down | :material-close: Down | Off | Yes | Workflow halts. No state changes. |
| :material-close: Down | :material-close: Down | On | No | Fully degraded, sandbox active. Same as above — nothing to sandbox when nothing works. |
| :material-close: Down | :material-close: Down | On | Yes | Workflow halts. No state changes. Sandbox irrelevant. |
Key takeaway: Machina fails open for reads (returns errors to the caller) and fails safe for writes in sandbox mode (logs but does not execute). There is no silent data loss — every failure is surfaced to the MCP client.
Health Endpoint¶
The MCP server exposes GET /health on the streamable-http transport:
Use this for load-balancer health checks and monitoring:
Monitoring Recommendations¶
- Log aggregation: Ship
/var/log/machina/machina.logto your SIEM or log platform. Logs are structured (structlog) and machine-parseable. - Trace analysis: Action trace JSONL files include cost tracking (
llm_cost_usd) and timing. Use them for cost dashboards and latency monitoring. - Alerting: Alert on
systemctl is-active machinareturninginactiveor on/healthreturning non-200 for >60 seconds.