Observability, Consistency, and Operations

Operate MuBit safely with ingest/query/context telemetry, memory diagnostics, checkpoint discipline, and explicit agent coordination traces.

Production memory systems need more than good demos. MuBit gives you explicit lifecycle boundaries and diagnostics so you can tell whether the system is learning, compacting safely, and retrieving the right evidence.

Operational checklist

| Area | Track this |
| --- | --- |
| Freshness | Ingest accepted-to-done latency |
| Retrieval | Query and context latency, weak-evidence rates |
| Learning loop | Reflection volume, outcome recording coverage, surfaced strategies |
| Memory quality | memory_health results, contradictions, stale entries |
| Compaction safety | Checkpoint cadence and checkpoint failures |
| Coordination | Handoff and feedback visibility across agents |

Consistency model

  • Keep deterministic run_id / session_id mapping across writes and reads.
  • Use getContext rather than reconstructing large prompts manually.
  • Treat checkpoints as explicit lifecycle boundaries.
  • Use diagnose and memory_health before changing retrieval prompts or weights.
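
One way to keep the run_id / session_id mapping deterministic is to derive each ID from stable inputs instead of generating it randomly at call time. The sketch below is a minimal illustration: the `stable_id` helper, the namespace, and the tenant/user/ticket values are hypothetical and not part of MuBit, and it assumes MuBit accepts caller-chosen string IDs.

```python
import uuid

# Hypothetical application namespace; any fixed UUID works as long as
# writers and readers share it.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agents.example.com")


def stable_id(kind: str, *parts: str) -> str:
    """Return the same ID for the same inputs on every call.

    The parts are joined with the ASCII unit separator so that
    ("a/b", "c") and ("a", "b/c") cannot collide.
    """
    name = "\x1f".join((kind, *parts))
    return f"{kind}-{uuid.uuid5(APP_NAMESPACE, name)}"


# Illustrative inputs only: both the writer and the reader recompute
# the IDs from the same business keys instead of passing them around.
session_id = stable_id("session", "tenant-42", "user-7")
run_id = stable_id("run", "tenant-42", "ticket-1138")
```

Because the IDs are a pure function of their inputs, a reader that knows the tenant and ticket can address the same run the writer used without a lookup table.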
ℹ️ Note

Reflected lessons do not become trusted the moment they are extracted. Each one enters long-term memory as a pending candidate and is only promoted to active once outcome evidence pushes its score past the accept threshold (default 0.6); a candidate that scores at or below the reject threshold (default 0.25) is marked rejected, while anything in between stays pending until more evidence arrives. The control stream emits context.lesson_validation_passed / context.lesson_validation_failed alongside context.lesson_promoted, so you can watch candidate-vs-active counts diverge. The gate is on by default and can be toggled with MUBIT_CONTROL_LESSON_VALIDATION_ENABLED (set it to 0/false/off to fall back to storing lessons immediately).
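
The gate described in this note can be mirrored client-side, for example to sanity-check dashboards that compare candidate and active counts. This is a minimal sketch, not MuBit's implementation: the function names are hypothetical, the thresholds are the documented defaults, and it reads "past the accept threshold" as strictly greater than, which is an assumption.

```python
import os

ACCEPT_THRESHOLD = 0.6   # documented default
REJECT_THRESHOLD = 0.25  # documented default


def gate_enabled(env=os.environ) -> bool:
    """Return False when the toggle is set to 0/false/off.

    Assumption: an unset variable means the gate is on, matching the
    documented default.
    """
    raw = env.get("MUBIT_CONTROL_LESSON_VALIDATION_ENABLED", "1")
    return raw.strip().lower() not in {"0", "false", "off"}


def classify(score: float,
             accept: float = ACCEPT_THRESHOLD,
             reject: float = REJECT_THRESHOLD) -> str:
    """Map an outcome-evidence score to a lesson state."""
    if score > accept:    # assumption: exactly 0.6 is not yet promoted
        return "active"
    if score <= reject:   # "at or below" the reject threshold
        return "rejected"
    return "pending"      # wait for more evidence


for score in (0.7, 0.4, 0.25):
    print(score, classify(score))
```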

LLM telemetry

MuBit tracks all internal LLM calls (ingestion routing, query synthesis, reflection, snapshots) and exposes them as Prometheus metrics at the /metrics endpoint.

| Metric | Labels | Description |
| --- | --- | --- |
| mubit_llm_calls_total | task, provider, model, success | Total LLM call count by task and outcome |
| mubit_llm_call_duration_seconds | task, provider | Call latency histogram (buckets: 0.1s–30s) |
| mubit_llm_tokens_total | task, provider, token_type | Token consumption (prompt vs completion) |
| mubit_llm_retries_total | task, provider | Retry attempts due to rate limits or errors |
| mubit_agent_degraded_total | agent, reason | Agent fallbacks to heuristic mode |

Storage health is also tracked:

| Metric | Description |
| --- | --- |
| mubit_storage_compaction_pending | Pending compaction work (0 = healthy) |
| mubit_storage_write_stall | Whether writes are being throttled (0 or 1) |
| mubit_disk_total_bytes | Total disk capacity |
| mubit_disk_used_bytes | Disk space used |
| mubit_disk_usage_pct | Disk usage percentage |

These metrics can be scraped by Prometheus at a 15-second interval. The LLM Activity page in the user console provides a dashboard view.
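
For ad hoc checks outside Prometheus, the text returned by /metrics can be parsed directly. The sketch below computes a per-task failure ratio from mubit_llm_calls_total using only the standard library. The helper names are hypothetical, the sample payload is invented for illustration, and it assumes the success label takes the values "true" and "false", which the table above does not specify.

```python
import re
from collections import defaultdict

# One sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def failure_ratio_by_task(text, failure_value="false"):
    """Return {task: failed_calls / total_calls}.

    Assumption: failed calls carry success="false".
    """
    total = defaultdict(float)
    failed = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != "mubit_llm_calls_total":
            continue
        task = labels.get("task", "unknown")
        total[task] += value
        if labels.get("success") == failure_value:
            failed[task] += value
    return {task: failed[task] / total[task] for task in total if total[task] > 0}


# Invented payload; provider and model names are placeholders.
SAMPLE = """\
# TYPE mubit_llm_calls_total counter
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="true"} 90
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="false"} 10
mubit_llm_calls_total{task="query_synthesis",provider="p1",model="m1",success="true"} 5
"""

ratios = failure_ratio_by_task(SAMPLE)
```

Counters are cumulative, so a ratio computed from a single scrape covers the whole process lifetime; for a recent window, take the difference between two scrapes or use a rate query in Prometheus.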

Failure modes and troubleshooting

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Learning appears inactive | Reflections exist but outcomes are never recorded | Record outcomes against reflected lessons/rules |
| Important details disappear after long runs | No checkpoint before compaction | Save checkpoints before summarization or window resets |
| Debugging memory quality is slow | No memory diagnostics in the workflow | Add memory_health and diagnose to incident review |
