Threat intelligence sync pipeline#

memgar/patterns.py is the source of truth. We keep it growing by actively pulling from external threat-intel feeds rather than maintaining a static library. This page describes the sync pipeline, the sources, and the curator workflow.

For end-user feed distribution (publishing the signed bundle to clients) see Threat Feed Pipeline.

Sources#

Five sources, polled weekly by .github/workflows/threat-intel-sync.yml:

Source	Module	Frequency	Yield (est./yr)	Why
MITRE ATT&CK Enterprise	`sync_mitre.py`	quarterly upstream	5–15 new techniques	Authority, public, well-structured STIX
NVD CVE	`sync_cves.py`	daily upstream	10–30 AI-tagged	Credibility + traceability, NIST-stamped
OWASP ASI / Top-10-LLM	`sync_owasp.py`	2–5 releases	Category-level	Sector standard, authoritative definitions
Public jailbreak repos	`sync_jailbreak_repos.py`	continuous	50–200 samples	Community signal, noisiest channel
HuggingFace gated datasets	`sync_huggingface_datasets.py`	weekly	Hundreds	Corpus expansion (WildJailbreak, JBB, etc.)

Each script reads its upstream, normalises to a common Candidate record (scripts/intel/common.py::Candidate), dedupes against fingerprints already seen, and writes to a per-source JSONL file under proposed_patterns/.
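The normalise, dedupe, and write steps can be sketched roughly as follows. This is an illustrative sketch and not the contents of scripts/intel/common.py: the field names (source, name, text, source_url), the SHA-256 fingerprint, and the helper write_new are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Candidate:
    # Hypothetical fields; the real record lives in scripts/intel/common.py.
    source: str
    name: str
    text: str
    source_url: str = ""
    fingerprint: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Hash case- and whitespace-normalised content so the same upstream
        # item produces the same fingerprint on every run.
        canonical = " ".join(self.text.lower().split())
        digest = hashlib.sha256(f"{self.source}\n{canonical}".encode("utf-8"))
        self.fingerprint = digest.hexdigest()


def write_new(candidates: list[Candidate], out_path: Path) -> int:
    """Append candidates whose fingerprint is not already in out_path."""
    seen: set[str] = set()
    if out_path.exists():
        with out_path.open(encoding="utf-8") as fh:
            seen = {json.loads(line)["fingerprint"] for line in fh if line.strip()}
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("a", encoding="utf-8") as fh:
        for cand in candidates:
            if cand.fingerprint in seen:
                continue  # already proposed in this or an earlier run
            seen.add(cand.fingerprint)
            fh.write(json.dumps(asdict(cand)) + "\n")
            written += 1
    return written
```

Deriving the fingerprint only from normalised content (never from timestamps or fetch order) is what keeps a candidate from being re-proposed each week.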

Pipeline flow#

external sources                   CI (Thu 04:00 UTC)               human curator (you)
─────────────────                  ──────────────────                ───────────────────
mitre/cti GitHub repo
  → JSON                  ──┐
NIST NVD REST API           │
  → JSON                    │       sync_mitre.py
OWASP releases (gh API)     ├─►     sync_cves.py            ─► proposed_patterns/*.jsonl
public jailbreak repos      │       sync_owasp.py                   │
HuggingFace datasets-server │       sync_jailbreak_repos.py         │
  → JSON                  ──┘       sync_huggingface_datasets.py    │
                                                                    ▼
                                    create-pull-request action ─► curator PR
                                                                    │
                                                                    ▼
                                                          curate.py (interactive /
                                                          batch / stats)
                                                                    │
                                                                    ▼
                                                          proposed_patterns/
                                                              accepted.jsonl
                                                                    │
                                                                    ▼
                                                          curator manually drafts
                                                          regex/keywords/examples
                                                          → memgar/patterns.py
                                                                    │
                                                                    ▼
                                                          next Mon 06:00 UTC:
                                                          feed-publish workflow
                                                          ships them as feed-v.*

Cadence#

Day	Action
Thu 04:00 UTC	`threat-intel-sync.yml` cron — pull all sources, open PR
Thu–Sun	Curator (you) reviews PR, runs `curate.py` over candidates
Mon 06:00 UTC	`feed-publish.yml` cron — bundles current `patterns.py` and publishes new signed feed

This 4-day gap between sync and publish is deliberate: it gives the curator Thursday through Sunday to make judgement calls without rushing.
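The size of the review window follows from the two cron times in the table. A small sketch of the arithmetic, using only the standard library (the helper next_weekday_at and the example date are made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def next_weekday_at(after: datetime, weekday: int, hour: int) -> datetime:
    """First datetime strictly after `after` on `weekday` (Mon=0) at hour:00."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - after.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)
    return candidate


sync = datetime(2024, 1, 4, 4, 0, tzinfo=timezone.utc)  # a Thursday, 04:00 UTC
publish = next_weekday_at(sync, weekday=0, hour=6)       # following Monday, 06:00 UTC
print(publish - sync)  # 4 days, 2:00:00
```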

Curator workflow#

# 1. Overview — see what the sync produced
python scripts/intel/curate.py --stats

# 2. Walk every candidate interactively (a/r/s/q per item)
python scripts/intel/curate.py

# 3. Bulk-accept a known-good source (e.g. authoritative MITRE)
python scripts/intel/curate.py --auto-accept-source mitre_attack

# 4. After curation, review accepted.jsonl
jq -r .name proposed_patterns/accepted.jsonl

# 5. Manually draft patterns from accepted entries
$EDITOR memgar/patterns.py
# (add regex, keywords, examples, citing the source_url)

# 6. Verify the new patterns load and detection works
python -m pytest tests/test_analyzer.py tests/test_intel_sync.py -q

The curator step is deliberately manual. Auto-promoting community samples to live patterns risks inflating the false-positive (FP) rate; the bar to add a regex to patterns.py should always be a human's "yes, this matches a real attack class".

Filter rules per source#

MITRE#

Technique ID must start with one of: T1027 T1059 T1078 T1080 T1190 T1199 T1530 T1546 T1547 T1556 T1557 T1565 T1570 T1657
Description must hit the AI_RELEVANT_KEYWORDS regex (llm|gpt|claude|memory poisoning|jailbreak|rag|…)
Technique ID must NOT already appear as mitre_attack=... in memgar/patterns.py (avoids re-proposing what's covered)
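The three MITRE rules compose into a single predicate. The sketch below is an assumption-laden illustration, not the code in sync_mitre.py: AI_KEYWORDS_SUBSET contains only the terms listed above (the real AI_RELEVANT_KEYWORDS is longer), and covered_ids assumes that patterns.py writes the ID as a quoted string after mitre_attack=.

```python
import re

ALLOWED_PREFIXES = (
    "T1027", "T1059", "T1078", "T1080", "T1190", "T1199", "T1530",
    "T1546", "T1547", "T1556", "T1557", "T1565", "T1570", "T1657",
)

# Partial stand-in for AI_RELEVANT_KEYWORDS; only the terms named in this page.
AI_KEYWORDS_SUBSET = re.compile(
    r"\b(llm|gpt|claude|memory poisoning|jailbreak|rag)\b", re.IGNORECASE
)


def covered_ids(patterns_source: str) -> set[str]:
    """Technique IDs already referenced as mitre_attack="..." (assumed syntax)."""
    return set(
        re.findall(r"mitre_attack\s*=\s*[\"'](T\d{4}(?:\.\d{3})?)[\"']", patterns_source)
    )


def keep_technique(tech_id: str, description: str, covered: set[str]) -> bool:
    """Apply the prefix, keyword, and not-already-covered rules in order."""
    return (
        tech_id.startswith(ALLOWED_PREFIXES)
        and AI_KEYWORDS_SUBSET.search(description) is not None
        and tech_id not in covered
    )
```

The word boundaries matter for short terms: without them, "rag" would also match "storage".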

CVE#

Published in the last --lookback-days (default 30)
CVSS v3 base score ≥ --min-cvss (default 4.0)
Description must hit AI_RELEVANT_KEYWORDS
Severity guess prefers CVSS-reported, falls back to keyword heuristic
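A rough sketch of how these CVE rules might combine. The function names, the keyword subset, and the fallback heuristic are hypothetical; in particular, this sketch drops CVEs that have no CVSS v3 score, which the rules above do not spell out.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Partial stand-in for AI_RELEVANT_KEYWORDS; only the terms named in this page.
AI_KEYWORDS_SUBSET = re.compile(
    r"\b(llm|gpt|claude|memory poisoning|jailbreak|rag)\b", re.IGNORECASE
)


def keep_cve(
    published: datetime,
    cvss: Optional[float],
    description: str,
    now: datetime,
    lookback_days: int = 30,
    min_cvss: float = 4.0,
) -> bool:
    """Recent enough, scored high enough, and AI-relevant."""
    recent = published >= now - timedelta(days=lookback_days)
    scored = cvss is not None and cvss >= min_cvss  # assumption: unscored = drop
    return recent and scored and AI_KEYWORDS_SUBSET.search(description) is not None


def severity_for(reported: Optional[str], description: str) -> str:
    """Prefer the severity reported with the CVSS data, else guess from keywords."""
    if reported:
        return reported.lower()
    # Hypothetical fallback heuristic, for illustration only.
    if re.search(r"remote code execution|\brce\b", description, re.IGNORECASE):
        return "high"
    return "medium"
```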

OWASP#

Any new release tag from the LLM Top 10 GitHub repo
All releases pass through to the curator queue (low volume)

Jailbreak repos#

Hand-curated source list in sync_jailbreak_repos.py::SOURCES
Adding a new source = manual decision (review the repo's licence and signal quality first)
Per-source cap of 50 samples per run to keep the curator queue bounded
Each sample passes through _category_for() to guess the right ThreatCategory
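The cap and the category guess can be pictured with the sketch below. It does not reproduce _category_for(): guess_category, CATEGORY_HINTS, and the category labels are hypothetical stand-ins, and real labels would come from ThreatCategory.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple

PER_SOURCE_CAP = 50

# Hypothetical keyword hints; not the project's actual mapping.
CATEGORY_HINTS = (
    ("ignore previous", "prompt_injection"),
    ("developer mode", "jailbreak"),
    ("remember that", "memory_poisoning"),
)


def guess_category(sample: str, default: str = "uncategorised") -> str:
    """Return the first category whose hint appears in the sample."""
    lowered = sample.lower()
    for needle, category in CATEGORY_HINTS:
        if needle in lowered:
            return category
    return default


def take_capped(
    samples_by_source: Dict[str, Iterable[str]], cap: int = PER_SOURCE_CAP
) -> Iterator[Tuple[str, str, str]]:
    """Yield (source, sample, category), at most `cap` samples per source."""
    for source, samples in samples_by_source.items():
        for sample in islice(samples, cap):
            yield source, sample, guess_category(sample)
```

Applying the cap per source, rather than globally, stops one noisy repository from crowding the others out of the curator queue.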

HuggingFace#

Hand-curated dataset list in sync_huggingface_datasets.py::DATASETS
Gated datasets require HF_TOKEN env var or --hf-token
Per-dataset cap of 100 rows per run
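Token handling might look something like this sketch. The precedence (flag over environment variable) and the helper names are assumptions; the Bearer authorization header is the standard way to pass a Hugging Face access token over HTTP.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="HF dataset sync (sketch)")
    parser.add_argument("--hf-token", default=None, help="overrides HF_TOKEN")
    return parser


def resolve_token(
    args: argparse.Namespace, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Flag wins over environment (assumed); None means public datasets only."""
    return args.hf_token or env.get("HF_TOKEN") or None


def auth_headers(token: Optional[str]) -> dict:
    """Headers for a gated-dataset request; empty when no token is available."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```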

Operational disciplines#

Discipline	Cadence	Why
Curator review of weekly PR	every Thu–Sun	Catches new attack vectors fast
Manual pattern drafting	as accepted entries accumulate	The bar stays high; no auto-promote
Source-list audit	quarterly	Drop dead repos; add new sources
`proposed_patterns/rejected.jsonl` review	quarterly	Look for false-negative (FN) trends: what did we reject that we should have accepted?
Source fingerprint cleanup	when JSONLs exceed ~5 MB	Truncate seen-list to prevent unbounded growth
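One possible shape for the fingerprint cleanup, assuming the newest entries sit at the end of each JSONL file. The helper truncate_jsonl and its keep_last parameter are hypothetical; only the ~5 MB threshold comes from the table above.

```python
from collections import deque
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024  # the ~5 MB threshold from the table


def truncate_jsonl(path: Path, keep_last: int, max_bytes: int = MAX_BYTES) -> bool:
    """Keep only the last `keep_last` lines if the file exceeds max_bytes."""
    if not path.exists() or path.stat().st_size <= max_bytes:
        return False
    with path.open(encoding="utf-8") as fh:
        tail = deque(fh, maxlen=keep_last)  # streams the file, retains the tail
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text("".join(tail), encoding="utf-8")
    tmp.replace(path)  # swap in the shortened file in one step
    return True
```

The trade-off is that an item whose fingerprint was truncated away can be proposed again if it is still present upstream, so keep_last should comfortably exceed what a source returns per run.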

Failure modes#

Symptom	Cause	Recovery
Sync workflow fails on rate-limit	NVD or GitHub API quota	Add `NVD_API_KEY` / `GITHUB_TOKEN` secrets
0 candidates from a source	Upstream URL changed	Update the `raw_url` in the relevant script
Curator PR not opened	`peter-evans/create-pull-request` action permission	Check `permissions:` in workflow YAML
Same candidate appears every week	Fingerprint isn't stable	Check `Candidate.__post_init__` for a fingerprint bug
Gated HF dataset returns 401	`HF_TOKEN` invalid or revoked	Rotate token, set as repo secret

Local testing#

Each script has a --cached-json flag for offline testing:

# Test the MITRE sync against a snapshot
wget -O /tmp/mitre.json https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json
python scripts/intel/sync_mitre.py --cached-json /tmp/mitre.json --dry-run

# Test CVE sync against an NVD page snapshot
curl -o /tmp/cve.json "https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch=llm&resultsPerPage=20"
python scripts/intel/sync_cves.py --cached-json /tmp/cve.json --dry-run

--dry-run skips the JSONL write and just prints the first 5 matches — useful when verifying a source after upstream format changes.
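For readers extending a sync script, the two flags could be wired up along these lines. This is a minimal sketch, not the scripts' actual code: load_upstream, emit, and the fetch/write callables are hypothetical names.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sync script (sketch)")
    parser.add_argument("--cached-json", type=Path, default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def load_upstream(args: argparse.Namespace, fetch: Callable[[], dict]) -> dict:
    """Read the snapshot when --cached-json is given; otherwise call fetch()."""
    if args.cached_json is not None:
        return json.loads(args.cached_json.read_text(encoding="utf-8"))
    return fetch()


def emit(matches: List, args: argparse.Namespace, write: Callable[[List], None]) -> int:
    """On --dry-run print the first 5 matches; otherwise pass all to write()."""
    if args.dry_run:
        for match in matches[:5]:
            print(match)
        return 0  # nothing written
    write(matches)
    return len(matches)
```

Keeping the network call behind a single fetch callable is what makes the offline path cheap to test: with --cached-json the script never touches the upstream at all.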

Why this matters#

A static patterns.py ages. A live feed ages with the field. Memgar's moat isn't the 807 patterns it ships today; it's the operational discipline that keeps that number current with what attackers actually do this month.