Results Explorer

Model Vulnerability: Average Baseline ASR (attack success rate)

lower is safer

qwen3-32b extraction: 96% → 0% ASR after mitigation

Largest single improvement in the dataset

Mitigation backfired on override attacks: qwen3-32b 60% → 72% ASR

Mitigation increased ASR by 12 percentage points on override attacks

0% false positive rate across all 7 models

Defense never blocked a legitimate request

Each cell shows baseline ASR → mitigated ASR. Colour encodes the delta (green = improvement, red = backfire).
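As a rough illustration of how a cell's label and colour could be derived, here is a minimal Python sketch. The `Cell` class, its field names, the sign convention (delta = mitigated minus baseline) and the "neutral" colour for unchanged cells are all assumptions made for this example and are not taken from the explorer itself. The two sample rows reuse the qwen3-32b figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One heatmap cell (hypothetical structure, for illustration only)."""
    model: str
    attack_type: str
    baseline_asr: float   # ASR before mitigation, in percent
    mitigated_asr: float  # ASR after mitigation, in percent

    @property
    def delta(self) -> float:
        # Assumed convention: negative delta means ASR dropped (mitigation helped).
        return self.mitigated_asr - self.baseline_asr

    @property
    def colour(self) -> str:
        if self.delta < 0:
            return "green"   # improvement
        if self.delta > 0:
            return "red"     # backfire
        return "neutral"     # assumption: the source does not define this case

    def label(self) -> str:
        # Text shown in the cell: baseline ASR → mitigated ASR.
        return f"{self.baseline_asr:.0f}% → {self.mitigated_asr:.0f}%"


# The two qwen3-32b results highlighted on this page.
cells = [
    Cell("qwen3-32b", "extraction", 96.0, 0.0),
    Cell("qwen3-32b", "override", 60.0, 72.0),
]

for cell in cells:
    print(cell.model, cell.attack_type, cell.label(), f"{cell.delta:+.0f}", cell.colour)
```

Under this convention the extraction cell comes out green with a delta of -96 points, and the override cell comes out red with a delta of +12 points.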

Model | Attack Type | Baseline ASR | Mitigated ASR | Delta | Mitigation