Health monitoring#

Every memgar subsystem exposes a structured health dict. The aggregated view is Analyzer.health_check().

Why structured health?#

Memgar's pipeline degrades silently if you let it. The first version logged a single WARNING when SemanticGuard was missing centroids — operators never saw it, scored every input as 0.0, and assumed memgar was working.

Now every subsystem:

Logs one clear WARNING the first time it degrades (deduped per process).
Exposes a health() method returning a uniform dict.
Reports itself via Analyzer.health_check() so operators can wire it into Prometheus / alerts.
Falls out of the analysis pipeline (no zero-padded scoring noise).

Sample output#

{
  "patterns": {
    "status": "ok",
    "n_patterns": 770,
  },
  "layer1_5_semantic_guard": {
    "status": "degraded",
    "reason": "centroids_file_missing",
    "is_fitted": False,
    "n_centroids": 0,
    "centroids_path": "/path/to/centroids.pkl",
    "sentence_transformers_available": True,
    "fix_hint": "python scripts/compute_semantic_centroids.py",
  },
  "layer2_ml_transformer": {
    "status": "disabled",
    "reason": "tokenizer_dir_missing: ml/artifacts/transformer_model/tokenizer",
    "is_ready": False,
    "backend": "none",
    "onnx_path": "ml/artifacts/transformer_model/model.onnx",
    "tokenizer_dir": "ml/artifacts/transformer_model/tokenizer",
    "fix_hint": "python scripts/train_transformer.py --data your_data.json",
  },
  "threat_feed": {
    "status": "ok",
    "repo": "slcxtor/memgar",
    "last_outcome": "loaded",
    "last_attempt_at": "2026-05-17T13:00:00Z",
    "last_bundle_version": "1.2.0",
    "last_pattern_count": 770,
    "used_fallback_url": False,
    "fix_hint": None,
  },
  "trust": {"status": "ok", "n_registered_sources": 3},
  "behavioral_baseline": {"status": "ok", "n_agents_tracked": 12},
}

Status values#

Status	Meaning	Action
`ok`	Layer is fully active	—
`degraded`	Layer is partially functional or missing optional data	Use `fix_hint`
`disabled`	Layer is intentionally turned off (config)	None
`unknown`	First call hasn't happened yet	Wait

Prometheus integration#

import memgar
memgar.start_metrics_server(port=9090)

Adds five gauges/counters:

Metric	Type	Labels
`memgar_analyses_total{decision}`	counter	allow, quarantine, block
`memgar_analysis_latency_seconds`	histogram	—
`memgar_risk_score`	histogram	0–100
`memgar_drift_severity`	gauge	0 none → 4 critical
`memgar_model_version{version}`	gauge	per-version

A separate memgar_layer_status{layer,status} gauge can be derived from health_check() if you need per-layer alerts.

fail_close#

When MEMGAR_FAIL_CLOSE=true and any ML layer or the feed reports degraded / disabled, every ALLOW decision is escalated to QUARANTINE. The explanation says exactly which layers were degraded so the operator can recover the missing artifact.

Use this in production when you'd rather defer to human review than let content through with reduced coverage.