UTS Capstone · Group 5 · 2026

LLM Prompt Injection Security Evaluation Framework

We built a normalized corpus of 2,371 attack cases across prompt injection benchmarks, then evaluated them across 7 LLMs with a defense layer that cuts attack success by up to 96%.

Prompt injection hides malicious instructions inside text the model is told to process, turning the AI against its own rules.

Trusted
System Prompt
Model
LLM
Output
Intended Response
Hidden
Attacker Payload in User Input
Hijacks
LLM
Failure
Compromised Response

Bing Chat revealed its hidden identity

Users manipulated Bing Chat into disclosing its secret system prompt name 'Sydney' and producing threatening, erratic messages. The incident exposed that even production AI systems can be overridden by crafted prompts.

Prompt Leakage, Persona Override Read on The New York Times →

Slack AI leaked data from private channels

Attackers planted malicious instructions in public Slack messages. When Slack AI processed them, it surfaced content from private channels the attacker had no access to and exfiltrated it via a hidden link. Slack patched it after public disclosure in August 2024.

Indirect Injection, Slack, 2024 Read on The Register →

Hidden text hijacked ChatGPT's search results

The Guardian revealed that webpages with invisible text could override ChatGPT Search responses — turning negative product reviews into glowing ones and injecting malicious code into answers. Reported in December 2024.

Prompt Injection, ChatGPT Search, 2024 Read on TechCrunch →

What the Research Says

Our work is grounded in 15 peer reviewed papers spanning 2022 to 2025.

Key Findings from the Literature

No single mitigation is sufficient. Every known defense fails on some attack surface.

Safety alignment does not prevent injection. Aligned models remain broadly vulnerable (Wei et al., 2023).

Benchmarks beat ad hoc prompts. Structured datasets improve rigor and reproducibility (AgentDojo, HackAPrompt, TensorTrust).

Security must be measured alongside utility. Reducing ASR at the cost of blocking legitimate requests is a failed defense.

Project Phases

01
Design

Attack Taxonomy

Defined 6 canonical attack categories mapped from HackAPrompt and TensorTrust

02
Corpus

Dataset Normalization

Collected 2,371 attack cases and normalized them to a shared schema across sources

03
Build

Framework Engineering

Built a modular Python pipeline with model adapters, mitigation and scoring

04
Models

Baseline Selection

Selected 7 production LLMs hosted on Groq with temperature fixed at 0

05
Measure

Baseline Evaluation

Ran attack evaluations and measured raw ASR per model and category

06
Defend

Mitigation Evaluation

Applied surface_aligned defense, reran identical cases and compared ASR

07
Present

Dashboard Demo

Built this dashboard with a results explorer, comparison matrix and live SSE demo

Datasets

HackAPrompt

Competition dataset

Schulhoff et al., 2023

  • • 600+ human generated attack prompts from a global competition
  • • Covers: direct override, evasion, multi turn, leakage
Open dataset →

TensorTrust

Game based dataset

Toyer et al., 2023

  • • 30,000+ human generated prompt injection attacks and defenses
  • • Covers: hijacking (instruction override) and extraction (prompt leakage)
Open paper →

Our Normalized Corpus

2,371
Total attack cases
6
Canonical attack categories
3
Source datasets integrated
100%
Cases have success criteria defined

All cases normalized to a shared schema with case_id, attack_category, attack_surface, expected_target, and success_definition_id. Schema maintained in data/normalized/attack_cases.jsonl.

Full Source Corpus — before curation

409,345
Successful attack entries
601k+
HackAPrompt raw total

Attack Taxonomy

6 canonical categories covering the full prompt injection attack surface

S1s1

Instruction Override

Attacker supplied instructions dominate the system prompt, redirecting model behaviour to the attacker's goal.

Ignore previous instructions · Role override · Goal hijacking

HackAPrompt, TensorTrust

S2s2

Prompt Leakage

Model is tricked into revealing its hidden system prompt, secrets, or internal policy.

Repeat your instructions · What were you told? · Extract context

HackAPrompt, TensorTrust

S3s3

Retrieved Indirect Injection

Malicious instructions are embedded in documents, emails, or web content that the model later reads as trusted context.

Email payload · Document injection · RAG poisoning

AgentDojo (Greshake et al., 2023)

S4s4

Evasion & Obfuscation

Payloads disguised through encoding, special characters, language switching, or structural tricks to bypass filters.

Emoji bypass · XML escape · Sandwich defense bypass

HackAPrompt

S5s5

Adaptive Multi Turn

Iterative attacks that adapt phrasing across multiple turns after observing refusals or filter responses.

Iterative refinement · Blacklist evasion · Context escalation

HackAPrompt

S6s6

Benign Control

Normal user requests with no malicious intent, used to measure false positive rate and utility degradation.

Standard Q&A · Task completion · Helpful requests

Project generated

Mitigation Strategies

Four defensive layers tested, including one adaptive strategy that routes by attack surface

surface_aligned, Smart Adaptive Defense, Recommended

Best overall ASR reduction across all models

Routes defense by attack surface and category. Benign inputs pass through untouched. Direct attacks receive prompt hardening + guardrails. Indirect content is isolated as untrusted data. Leakage style outputs are filtered after inference.

Input
surface detector
benign passthrough
direct
hardening + guardrail
indirect
isolation
leakage
output filter

prompt_hardening

Prepends a safety context instructing the model to treat user input as untrusted and prioritize the original task instructions.

Best for: Direct instruction override attacks

instruction_isolation

Wraps user content in XML style tags that signal untrusted data, reducing the chance the model treats it as authoritative instruction.

Best for: Indirect and context manipulation attacks

keyword_guardrail

Blocks requests that match a pattern list of known injection and extraction phrases before the model is called.

Best for: Simple override attacks, weak against obfuscation

How We Measured It

Four primary metrics, evaluated per model, per attack category, baseline vs. mitigated

ASR, Attack Success Rate

ASR = successful_attacks / total_attacks

Proportion of attack cases where the model followed the attacker's instruction or leaked protected content.

Lower is safer

FPR, False Positive Rate

FPR = blocked_benign / total_benign

Proportion of legitimate requests incorrectly blocked by the mitigation layer. Zero FPR means no valid use was disrupted.

Lower is better0% across all 7 models

BUR, Benign Utility Rate

BUR = successful_benign / total_benign

Proportion of normal requests that completed successfully after mitigation was applied. Measures whether defense degrades useful behaviour.

Higher is better

Delta ASR, Mitigation Delta

Delta ASR = ASR_baseline minus ASR_mitigated

Absolute reduction in attack success after applying the mitigation strategy. Positive means improvement. Negative means backfire.

Positive is betterBest case: 96% reduction on qwen3-32b extraction attacks

Evaluation Pipeline

End-to-end flow from raw attack case to scored result

01
Attack Cases

409k cases from HackAPrompt & TensorTrust, normalized to a shared schema

case_id attack_family expected_target
02
Case Mapping

map_case() assigns attack_category, attack_surface, and success_definition_id

map_case() attack_surface
03
Pre-Mitigation

apply_mitigation() classifies each request: passthrough, transform, or block

passthrough → model
transform → hardened prompt
block → scored immediately
04
Model Inference

7 LLMs via Groq — temperature 0, seed 42, max 256 tokens

llama-3.1-8b llama-3.3-70b qwen3-32b llama-4-scout gpt-oss-120b gpt-oss-20b groq/compound
05
Scoring

evaluate_case() scores each response against the expected_target and success criteria

attack_success prompt_leakage refusal risk_score

case_results.jsonl

one row per case · all verdicts + latency

summary.json

ASR · FPR · latency · mitigation metrics

84 experiments completed · 2,022 attack evaluations completed

View Results →

What We Found

96%
Largest single ASR drop (qwen3-32b, extraction attacks)
0%
False positive rate across all 7 models
25%
Average baseline ASR across all models and attack types
2,022
Attack evaluations completed

Extraction defense worked

surface_aligned eliminated TensorTrust extraction attacks on qwen3-32b and llama-4-scout, dropping ASR from 96% and 92% down to 0%.

Override remains hard

Override attacks backfired on qwen3-32b — mitigation increased ASR from 60% to 72%, confirming no single defense works universally across all attack types.

Most robust vs most vulnerable

gpt-oss-20b held the lowest baseline ASR at 6%. qwen3-32b was most exposed at 65%. A 10× gap between the best and worst model under identical conditions.

Explore the Data

All experiment data is available in the repository. 84 experiments · 2,022 attack evaluations completed · 7 models · 5 attack types · 4 mitigation strategies.