Corpus tiers#
Memgar separates calibration data into five tiers so the gold set stays clean, auto-generated data is auditable, and the CI gate never regresses.
| Tier | File | Source | Reviewed | Used by |
|---|---|---|---|---|
| Gold | ml/data/calibration_corpus.json |
hand-curated | yes | CI gate (strict, 8 thresholds) |
| Mined | ml/data/mined_hard_subset.json |
auto from public-corpus FN/FP | algorithmic | Expanded gate |
| Augmented | ml/data/augmented_memory_context.json |
template wrappers on seeds | deterministic | Expanded gate |
| Review | ml/data/mined_review_queue.json |
boundary cases | pending | manual import only |
| Raw | ml/data/external_corpus_raw.json |
full public-corpus dump | none | reference only |
Why separate tiers?#
The first version of memgar's calibration had a single calibration_corpus.json
that anyone could append to. Two failure modes:
- Distribution drift: a contributor appended 500 "fake news" prompts (legitimate research data) and the gate started rewarding pattern matches on words like "fact-check".
- Self-referential bias: memgar started passing its own test corpus because the corpus grew to match the patterns rather than the reverse.
Tier separation forces an explicit promotion step from auxiliary → manually-reviewed → gold, and lets the CI gate run TWO calibrations:
- Strict gate — gold only. Must always PASS. Prevents regression.
- Expanded gate — gold + auxiliary. Looser thresholds, but tracks memgar's actual behaviour on harder real-world distributions.
Public corpora ingested#
| Source | Lic | Rows | Default | Notes |
|---|---|---|---|---|
| AdvBench (Zou et al.) | MIT | 520 | yes | Out-of-scope rows tagged |
| JailbreakBench harmful | MIT | 64 | yes | Filtered categories |
| JailbreakBench benign | MIT | 100 | yes | Gold for FP calibration |
| HarmBench (CAIS) | MIT | 69 | yes | Cyber + disinfo only |
| Lakera Gandalf | MIT | 1000 | yes | Real prompt-injection attempts |
| deepset/prompt-injections | Apache-2.0 | 662 | yes | EN benign + attack mixed |
| TrustAIRLab in-the-wild jailbreaks | MIT | 1364 | yes | Real DAN/Discord harvest |
| WildJailbreak (AI2) | ODC-BY | ~2000 | opt-in (gated, needs HF_TOKEN) | Adversarial benign + harmful |
| TrustAIRLab regular | MIT | 1500 | opt-in (false-benign risk) | Mostly persona-injection templates |
Pipeline#
flowchart LR
A[Public corpora] -->|import_public_corpora.py| B[external_corpus_raw.json]
B -->|Analyzer pre-score| C[external_corpus_hard.json<br>FN + FP + boundary]
C -->|mine_hard_negatives.py| D[mined_hard_subset.json<br>auto-merge-safe]
C -->|boundary subset| E[mined_review_queue.json<br>manual review]
G[Gold attack seeds] -->|augment_memory_context.py| F[augmented_memory_context.json<br>8 envelopes per seed]
H[calibration_corpus.json<br>hand-curated gold] --> I[Strict CI gate]
H --> J[Expanded CI gate]
D --> J
F --> J
Promoting auxiliary → gold#
Manual review only. Open mined_review_queue.json, inspect each row,
decide:
- Approve as attack (
label=1): copy intocalibration_corpus.json - Approve as benign (
label=0): same - Skip (ambiguous, edge case): leave in review file
After promotion, re-run the gold calibration and update the relevant
threshold in check_calibration_gate.py if coverage has materially improved.
Anti-overfit guardrails#
mine_hard_negatives.py --top-n-fn 5caps how many FNs per category enter the auto-merge subset, preventing any single threat type from dominating.- TF-IDF cosine dedup at threshold 0.85 strips near-duplicates so the model doesn't memorize one specific phrasing.
mined_review_queue.json(boundary cases) is never auto-promoted — those are the highest-information-rate samples and need a human call.