Observability, Consistency, and Operations

Operate MuBit safely with ingest/query/context telemetry, memory diagnostics, checkpoint discipline, and explicit agent coordination traces.

Production memory systems need more than good demos. MuBit gives you explicit lifecycle boundaries and diagnostics so you can tell whether the system is learning, compacting safely, and retrieving the right evidence.

Operational checklist

Area	Track this
Freshness	Ingest accepted-to-done latency
Retrieval	Query and context latency, weak-evidence rates
Learning loop	Reflection volume, outcome recording coverage, surfaced strategies
Memory quality	`memory_health` results, contradictions, stale entries
Compaction safety	Checkpoint cadence and checkpoint failures
Coordination	Handoff and feedback visibility across agents

Consistency model

Keep a deterministic `run_id` / `session_id` mapping across writes and reads.
Use `getContext` rather than reconstructing large prompts manually.
Treat checkpoints as explicit lifecycle boundaries.
Run `diagnose` and `memory_health` before changing retrieval prompts or weights.
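
One way to keep the `run_id` / `session_id` mapping deterministic is to derive both from stable business keys instead of generating random values at call time. The sketch below is only an illustration: it assumes MuBit accepts caller-chosen string identifiers, and the namespace, helper names, and key format are hypothetical.

```python
import uuid

# Hypothetical namespace for this application. Any fixed UUID works,
# as long as every writer and reader uses the same one.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agents.example.com")


def session_id_for(tenant: str, conversation: str) -> str:
    """Derive a stable session_id from keys the application already owns."""
    return str(uuid.uuid5(APP_NAMESPACE, f"session:{tenant}:{conversation}"))


def run_id_for(session_id: str, attempt: int) -> str:
    """Derive a stable run_id for a given attempt within a session."""
    return str(uuid.uuid5(APP_NAMESPACE, f"run:{session_id}:{attempt}"))


if __name__ == "__main__":
    sid = session_id_for("acme", "ticket-1042")
    # Re-running this prints the same pair, so reads can find earlier writes.
    print(sid, run_id_for(sid, 1))
```

Because the identifiers are a pure function of their inputs, a reader process can recompute them without a lookup table, which removes one common source of write/read mismatches.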

ℹ️ Note

Reflected lessons are not trusted the moment they are extracted. Each one enters long-term memory as a pending candidate and is promoted to active only once outcome evidence pushes its score past the accept threshold (default 0.6). A candidate that scores at or below the reject threshold (default 0.25) is marked rejected; anything in between stays pending until more evidence arrives. The control stream emits `context.lesson_validation_passed` and `context.lesson_validation_failed` alongside `context.lesson_promoted`, so you can track how candidate and active counts diverge. The gate is on by default; set `MUBIT_CONTROL_LESSON_VALIDATION_ENABLED` to `0`, `false`, or `off` to disable it and fall back to storing lessons immediately.
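
The gate described in this note can be modeled in a few lines, which may help when reasoning about why a lesson is still pending. This is a rough sketch, not MuBit's implementation: the function names are hypothetical, the status strings are illustrative, and it assumes that a score exactly equal to the accept threshold stays pending (the note only says the score must move "past" it).

```python
import os

ACCEPT_THRESHOLD = 0.6   # default accept threshold quoted in the note
REJECT_THRESHOLD = 0.25  # default reject threshold quoted in the note


def validation_enabled(env=None) -> bool:
    """Return False when the toggle is set to 0/false/off, True otherwise."""
    env = os.environ if env is None else env
    raw = env.get("MUBIT_CONTROL_LESSON_VALIDATION_ENABLED", "1")
    return raw.strip().lower() not in {"0", "false", "off"}


def lesson_status(score: float,
                  accept: float = ACCEPT_THRESHOLD,
                  reject: float = REJECT_THRESHOLD) -> str:
    """Classify a candidate lesson by its outcome-evidence score."""
    if score > accept:       # assumption: strictly above the accept threshold
        return "active"
    if score <= reject:      # at or below the reject threshold
        return "rejected"
    return "pending"         # in between: wait for more evidence


if __name__ == "__main__":
    for score in (0.1, 0.25, 0.4, 0.6, 0.75):
        print(score, lesson_status(score))
```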

LLM telemetry

MuBit tracks all internal LLM calls (ingestion routing, query synthesis, reflection, snapshots) and exposes them as Prometheus metrics at the `/metrics` endpoint.

Metric	Labels	Description
`mubit_llm_calls_total`	task, provider, model, success	Total LLM call count by task and outcome
`mubit_llm_call_duration_seconds`	task, provider	Call latency histogram (buckets: 0.1s–30s)
`mubit_llm_tokens_total`	task, provider, token_type	Token consumption (prompt vs completion)
`mubit_llm_retries_total`	task, provider	Retry attempts due to rate limits or errors
`mubit_agent_degraded_total`	agent, reason	Agent fallbacks to heuristic mode

Storage health is also tracked:

Metric	Description
`mubit_storage_compaction_pending`	Pending compaction work (0 = healthy)
`mubit_storage_write_stall`	Whether writes are being throttled (0 or 1)
`mubit_disk_total_bytes`	Total disk capacity
`mubit_disk_used_bytes`	Disk space used
`mubit_disk_usage_pct`	Disk usage percentage

These metrics can be scraped by Prometheus at a 15-second interval. The LLM Activity page in the user console provides a dashboard view.
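
For a quick check without a full Prometheus setup, the text returned by `/metrics` can be parsed directly. The sketch below uses only the metric and label names from the tables above; the sample payload, the label values (including `success="true"` / `"false"`), the 0 to 100 scale for `mubit_disk_usage_pct`, and the 85% limit are assumptions made for illustration.

```python
import re

# Hypothetical payload in the Prometheus text exposition format.
SAMPLE = """\
# TYPE mubit_llm_calls_total counter
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="true"} 90
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="false"} 10
mubit_storage_compaction_pending 0
mubit_storage_write_stall 1
mubit_disk_usage_pct 91.5
"""

LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


def llm_failure_ratio(samples, task):
    """Share of failed LLM calls for one task, from mubit_llm_calls_total."""
    ok = failed = 0.0
    for name, labels, value in samples:
        if name == "mubit_llm_calls_total" and labels.get("task") == task:
            if labels.get("success") == "true":  # assumed label value
                ok += value
            else:
                failed += value
    total = ok + failed
    return failed / total if total else 0.0


def storage_warnings(samples, disk_pct_limit=85.0):
    """Flag the storage gauges that are outside their healthy range."""
    gauges = {name: value for name, labels, value in samples if not labels}
    warnings = []
    if gauges.get("mubit_storage_compaction_pending", 0) > 0:
        warnings.append("compaction backlog")
    if gauges.get("mubit_storage_write_stall", 0) == 1:
        warnings.append("writes throttled")
    if gauges.get("mubit_disk_usage_pct", 0) > disk_pct_limit:
        warnings.append("disk usage high")
    return warnings


if __name__ == "__main__":
    parsed = parse(SAMPLE)
    print(llm_failure_ratio(parsed, "reflection"))
    print(storage_warnings(parsed))
```

In a real deployment the same conditions would more naturally live in Prometheus alerting rules; a script like this is mainly useful during incident review or in a smoke test.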

Failure modes and troubleshooting

Symptom	Root cause	Fix
Learning appears inactive	Reflections exist but outcomes are never recorded	Record outcomes against reflected lessons/rules
Important details disappear after long runs	No checkpoint before compaction	Save checkpoints before summarization or window resets
Debugging memory quality is slow	No memory diagnostics in the workflow	Add `memory_health` and `diagnose` to incident review

Next steps

Review route-level contracts in the Control HTTP reference.
Apply the learning loop by following Support agent memory loop.