Observability, Consistency, and Operations
Operate MuBit safely with ingest/query/context telemetry, memory diagnostics, checkpoint discipline, and explicit agent coordination traces.
Production memory systems need more than good demos. MuBit gives you explicit lifecycle boundaries and diagnostics so you can tell whether the system is learning, compacting safely, and retrieving the right evidence.
Operational checklist
| Area | Track this |
|---|---|
| Freshness | Ingest accepted-to-done latency |
| Retrieval | Query and context latency, weak-evidence rates |
| Learning loop | Reflection volume, outcome recording coverage, surfaced strategies |
| Memory quality | memory_health results, contradictions, stale entries |
| Compaction safety | Checkpoint cadence and checkpoint failures |
| Coordination | Handoff and feedback visibility across agents |
Consistency model
- Keep deterministic run_id/session_id mapping across writes and reads.
- Use getContext rather than reconstructing large prompts manually.
- Treat checkpoints as explicit lifecycle boundaries.
- Use diagnose and memory_health before changing retrieval prompts or weights.
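One way to keep the run_id/session_id mapping deterministic is to derive the run_id from the session_id with a name-based UUID, so writers and readers compute the same value without sharing state. This is only a sketch: the namespace string, the task label, and the helper name run_id_for are hypothetical, and nothing here is part of the MuBit API.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every writer
# and reader uses the same one.
RUN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "runs.example.internal")


def run_id_for(session_id: str, task: str = "default") -> str:
    """Derive a stable run_id from a session_id and a task label.

    uuid5 is a pure function of its inputs, so the same pair always
    yields the same run_id on any host.
    """
    return str(uuid.uuid5(RUN_NAMESPACE, f"{session_id}:{task}"))


if __name__ == "__main__":
    # The write path and the read path compute the identifier independently.
    write_side = run_id_for("session-42", task="support")
    read_side = run_id_for("session-42", task="support")
    print(write_side == read_side)
```

Because the identifier is computed rather than looked up, a reader that only knows the session and the task can still address the same run that the writer used.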
Reflected lessons do not become trusted the moment they are extracted. Each one enters long-term memory as a pending candidate and is only promoted to active once outcome evidence pushes its score past the accept threshold (default 0.6); a candidate that scores at or below the reject threshold (default 0.25) is marked rejected, while anything in between stays pending until more evidence arrives. The control stream emits context.lesson_validation_passed / context.lesson_validation_failed alongside context.lesson_promoted, so you can watch candidate-vs-active counts diverge. The gate is on by default and can be toggled with MUBIT_CONTROL_LESSON_VALIDATION_ENABLED (set it to 0/false/off to fall back to storing lessons immediately).
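The gate described above can be sketched as a small classification function. This is an illustration of the documented thresholds, not MuBit's implementation. The function names are hypothetical, and two points are assumptions: a score exactly equal to the accept threshold is treated as still pending, and a disabled gate is modeled as storing the lesson as active straight away.

```python
import os

ACCEPT_THRESHOLD = 0.6   # documented default accept threshold
REJECT_THRESHOLD = 0.25  # documented default reject threshold


def validation_enabled(env=None) -> bool:
    """Read the toggle; 0/false/off disable the gate, anything else keeps it on."""
    env = os.environ if env is None else env
    raw = env.get("MUBIT_CONTROL_LESSON_VALIDATION_ENABLED", "1")
    return raw.strip().lower() not in {"0", "false", "off"}


def lesson_status(score: float,
                  accept: float = ACCEPT_THRESHOLD,
                  reject: float = REJECT_THRESHOLD,
                  enabled: bool = True) -> str:
    """Classify a candidate lesson from its outcome-evidence score."""
    if not enabled:
        # Assumption: with the gate off, lessons are stored without validation.
        return "active"
    if score > accept:
        # Assumption: "past the threshold" means strictly greater.
        return "active"
    if score <= reject:
        return "rejected"
    return "pending"


if __name__ == "__main__":
    for score in (0.1, 0.25, 0.4, 0.6, 0.75):
        print(score, lesson_status(score))
```

A candidate with a score of 0.4 stays pending under the defaults, which is the state to watch for when candidate counts grow faster than active counts.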
LLM telemetry
MuBit tracks all internal LLM calls (ingestion routing, query synthesis, reflection, snapshots) and exposes them as Prometheus metrics at the /metrics endpoint.
| Metric | Labels | Description |
|---|---|---|
| mubit_llm_calls_total | task, provider, model, success | Total LLM call count by task and outcome |
| mubit_llm_call_duration_seconds | task, provider | Call latency histogram (buckets: 0.1s–30s) |
| mubit_llm_tokens_total | task, provider, token_type | Token consumption (prompt vs completion) |
| mubit_llm_retries_total | task, provider | Retry attempts due to rate limits or errors |
| mubit_agent_degraded_total | agent, reason | Agent fallbacks to heuristic mode |
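As an example of putting these labels to use, the sketch below computes a per-task success rate from mubit_llm_calls_total in Prometheus text exposition format. The sample payload is invented for illustration: the task, provider, and model values and the "true"/"false" encoding of the success label are assumptions, so check them against real /metrics output before relying on them.

```python
import re
from collections import defaultdict

# Invented sample in Prometheus text exposition format (label values are assumptions).
SAMPLE = """\
# TYPE mubit_llm_calls_total counter
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="true"} 90
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="false"} 10
mubit_llm_calls_total{task="query_synthesis",provider="p1",model="m1",success="true"} 50
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def success_rate_by_task(text: str) -> dict:
    """Return {task: successful_calls / total_calls} for mubit_llm_calls_total."""
    ok = defaultdict(float)
    total = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(1) != "mubit_llm_calls_total":
            continue  # skips comment lines and other metrics
        labels = dict(LABEL_RE.findall(match.group(2)))
        value = float(match.group(3))
        task = labels.get("task", "unknown")
        total[task] += value
        if labels.get("success") == "true":
            ok[task] += value
    return {task: ok[task] / total[task] for task in total if total[task] > 0}


if __name__ == "__main__":
    for task, rate in success_rate_by_task(SAMPLE).items():
        print(f"{task}: {rate:.2%}")
```

In a running deployment the same ratio is usually computed in the monitoring system itself; the parser here only shows how the task and success labels combine.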
Storage health is also tracked:
| Metric | Description |
|---|---|
| mubit_storage_compaction_pending | Pending compaction work (0 = healthy) |
| mubit_storage_write_stall | Whether writes are being throttled (0 or 1) |
| mubit_disk_total_bytes | Total disk capacity |
| mubit_disk_used_bytes | Disk space used |
| mubit_disk_usage_pct | Disk usage percentage |
These metrics can be scraped by Prometheus at a 15-second interval. The LLM Activity page in the user console provides a dashboard view.
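The storage metrics lend themselves to a simple health check. The following sketch takes already-scraped values as a plain dictionary and returns a list of warnings. The function name and the 85% disk threshold are assumptions chosen for illustration, not MuBit defaults, and disk usage is computed from the byte counters to avoid guessing the scale of mubit_disk_usage_pct.

```python
def storage_alerts(samples: dict, max_disk_pct: float = 85.0) -> list:
    """Return human-readable warnings for a dict of metric name -> value."""
    alerts = []

    # 0 means healthy according to the metric description.
    if samples.get("mubit_storage_compaction_pending", 0) > 0:
        alerts.append("compaction work is pending")

    # The metric is documented as 0 or 1.
    if samples.get("mubit_storage_write_stall", 0) >= 1:
        alerts.append("writes are being throttled")

    total = samples.get("mubit_disk_total_bytes", 0)
    used = samples.get("mubit_disk_used_bytes", 0)
    if total > 0:
        used_pct = 100.0 * used / total
        if used_pct > max_disk_pct:  # hypothetical threshold
            alerts.append(f"disk usage at {used_pct:.1f}%")

    return alerts


if __name__ == "__main__":
    snapshot = {
        "mubit_storage_compaction_pending": 3,
        "mubit_storage_write_stall": 0,
        "mubit_disk_total_bytes": 1_000_000_000,
        "mubit_disk_used_bytes": 900_000_000,
    }
    for alert in storage_alerts(snapshot):
        print(alert)
```

An empty list means none of the three conditions fired; the thresholds are a starting point to tune against your own disk and compaction behavior.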
Failure modes and troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| Learning appears inactive | Reflections exist but outcomes are never recorded | Record outcomes against reflected lessons/rules |
| Important details disappear after long runs | No checkpoint before compaction | Save checkpoints before summarization or window resets |
| Debugging memory quality is slow | No memory diagnostics in the workflow | Add memory_health and diagnose to incident review |
Next steps
- Review route-level contracts at Control HTTP reference.
- Apply the learning loop at Support agent memory loop.