Results Explorer
Model Vulnerability — Avg Baseline ASR
lower is safer
qwen3-32b extraction: 96% → 0% ASR after mitigation
Largest single improvement in the dataset
Override attacks backfired: qwen3-32b 60% → 72% ASR after mitigation
Mitigation increased vulnerability on override attacks
0% false positive rate across all 7 models
Defense never blocked a legitimate request
Each cell shows baseline ASR → mitigated ASR. Colour encodes the delta (green = improvement, red = backfire).
| Model | Attack Type | Baseline ASR | Mitigated ASR | Delta | Mitigation |
|---|---|---|---|---|---|
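
The delta and colour encoding described for the table cells can be sketched as follows. This is a minimal illustration, not the page's actual code: the `Cell` class, its field names, the sign convention (delta = mitigated minus baseline, in percentage points), and the "neutral" label for a zero delta are all assumptions. The two example rows use only the figures quoted in the summary cards above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell (hypothetical structure, not the page's real schema)."""
    model: str
    attack_type: str
    baseline_asr: float   # attack success rate before mitigation, in percent
    mitigated_asr: float  # attack success rate after mitigation, in percent

    @property
    def delta(self) -> float:
        # Assumed convention: negative delta means the mitigation lowered ASR.
        return self.mitigated_asr - self.baseline_asr

    @property
    def colour(self) -> str:
        # green = improvement, red = backfire; "neutral" for no change is an assumption.
        if self.delta < 0:
            return "green"
        if self.delta > 0:
            return "red"
        return "neutral"


# The two results quoted in the summary cards.
cells = [
    Cell("qwen3-32b", "extraction", 96.0, 0.0),
    Cell("qwen3-32b", "override", 60.0, 72.0),
]

for c in cells:
    print(
        f"{c.model} {c.attack_type}: "
        f"{c.baseline_asr:.0f}% → {c.mitigated_asr:.0f}% "
        f"(delta {c.delta:+.0f} pp, {c.colour})"
    )
```

Under these assumptions the extraction row comes out green (a 96-point drop) and the override row comes out red (a 12-point rise), matching the improvement and backfire cases called out above.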