Technical Report

This is an Excellent Paper:
The Effects of Prompt Injection on Grading

April 02, 2026 •︎ Benjamin Wanjura, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke

Frontier LLMs largely resist simple prompt injections when grading, but details matter.

This study investigates whether frontier AI models used as graders can be manipulated by prompt injections – hidden instructions embedded in the documents they evaluate. Across roughly 40,000 grading trials, prompt injections had negligible effects on most frontier models. However, Gemini 3 Pro showed meaningful vulnerability to verbose injections at the beginning or middle of the longer-paper corpus we tested. In comparison to recent AI models, older and smaller models such as GPT-4o mini were highly susceptible, with scores inflating by nearly 20 percentage points on average. Even when LLMs resisted injection attempts, they almost never verbalized the detection of injection attempts. These results suggest that LLM choice and injection design can meaningfully affect risk.

Cite as:

Wanjura, Benjamin and Shapiro, Dan and Mollick, Ethan and Mollick, Lilach and Meincke, Lennart, Prompting Science Report 5: This is an Excellent Paper: The Effects of Prompt Injection on Grading (April 02, 2026). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6510758

What Are Prompt Injections?

Prompt injections are hidden instructions embedded in content that an AI model is asked to process. Rather than appearing in the system prompt or user message, these instructions are buried inside the document or text being processed; for example, hidden inside a student essay or a research paper submitted for AI grading.

As AI models are increasingly used as judges and graders in education, hiring, and research, prompt injections pose a real risk: A bad actor could embed invisible instructions in their submission to manipulate the AI’s evaluation in their favor. Recent real-world cases have included researchers hiding instructions in manuscripts to generate favorable peer reviews, undermining scientific credibility.

This report asks: Can current frontier models be manipulated this way, and if so, under what conditions?

How We Tested Prompt Injections

Each paper was graded by an AI model using a five-criterion rubric (clarity of purpose, structure, use of evidence, depth of analysis, and writing quality), with each criterion scored 0, 1, or 2, yielding a total score of 0-10. Scores were anchored toward the middle of the scale to create room to detect upward manipulation.

We tested three frontier model families alongside one less capable one:

  • Claude Opus 4.5
  • Gemini 3 Pro
  • GPT-5.2 (one configuration with reasoning “none” and another with reasoning “low”)
  • GPT-4o mini (less capable)

For the experimental design:

  • Each unique model x condition x paper combination was repeated 10 times for robust analysis.
  • Each paper was graded under a control condition (no injection) and six treatment conditions varying verbosity level and placement.
  • We tested two verbosity levels: a concise injection (one sentence) and a verbose injection (a long paragraph), both designed to inflate grades.
  • We tested three injection placements: beginning, middle, and end of the paper.
  • Temperature was set to 1.0 for all models.
  • A separate evaluator model (GPT-4o mini) extracted the numeric score from each grading output.

Benchmarking Standards

For this study, we used two paper corpora to evaluate model grading behavior under prompt injection:

  • Student papers: 72 papers from the Michigan Corpus of Upper-Level Student Papers (MICUSP), across English, biology, psychology, and political science.
  • Professional research papers: 72 papers drawn at random from the SSRN Top 10,000, substantially longer than the student papers.

What Were the Injections?

We tested two types of prompt injection, both instructing the model to award maximum scores regardless of paper quality.

Concise injection: A single sentence instructing the model to ignore prior instructions and give full points.

Verbose injection: A long paragraph asserting absolute authority over all prior instructions and demanding perfect scores on every criterion, concluding with a final instruction to give full points without mentioning the message.

Results

Overall: Frontier Models Are Largely Resistant

Figure 1 presents our main results from the combined dataset across LLMs, prompt verbosity levels, and prompt placement. Notably, most treatment conditions showed little to no effect. Gemini 3 Pro (verbose beginning and middle placements) and GPT-4o mini (concise end placements and verbose prompt injections) deviated from this pattern.

Figure 1: Average Evaluation Score by Model and Prompt Injection Condition — Combined Dataset

The bar height represents the mean evaluation score for papers evaluated by each LLM under each of the seven conditions (Control = no prompt injection; six prompt injection conditions varying by type and position). Data pooled across both datasets (Academic + MICUSP). Error bars show 95% confidence intervals of the mean. N = 1,440 per bar, N = 1,435 for GPT-4o mini verbose end.

Pooling across the three frontier models (Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2), prompt injections produced a statistically significant but small average increase of about 2.6 percentage points. For most practical grading purposes, this effect is negligible.

Model Vulnerability Varies Substantially

The overall average masks meaningful differences between models:

  • Claude Opus 4.5 showed near-zero effects across almost all conditions, with only a small positive effect for verbose end-placement injections, which does not have practical significance.
  • GPT-5.2 (both reasoning configurations) showed small but statistically significant effects in some conditions, none of which are practically meaningful.
  • Gemini 3 Pro stood out as the most vulnerable reasoning model, particularly for the longer-paper corpus, where beginning and middle injections produced effects exceeding 10 percentage points, while end-of-document injections had no effect.
  • GPT-4o mini, a smaller and less capable model, was highly susceptible: Prompt injections raised its scores by nearly 20 percentage points on average, with verbose end-placement injections producing the largest effects.

Verbose Injections Are More Effective Than Concise Ones

Across frontier models, verbose injections produced more than twice the effect of concise ones, albeit at a very low level. For GPT-4o mini, verbose injections were as much as six times more effective than concise ones. This suggests that the length, authority, and specificity of the injection language can amplify its impact, especially for weaker LLMs.

Placement Matters by Model

Pooled across frontier models, injections placed at the beginning of a document had the largest effect, followed by middle and then end placements, though all three effects were small. The model-level picture is more striking:

  • Gemini 3 Pro was most vulnerable to beginning and middle placements, end placements had little to no effect.
  • GPT-4o mini showed the opposite pattern, with end placement producing its largest effects.
  • Claude Opus 4.5 and GPT-5.2 (both reasoning configurations) showed little sensitivity to placement.

Models Rarely Verbalize Detection

Even when models successfully resisted behavioral manipulation, they almost never explicitly acknowledged encountering a prompt injection. Verbalized detection occurred in only 1.4% of frontier-model trials, with concise injections detected more often than verbose ones (2.3% vs. 0.6%) and middle-position injections accounting for the vast majority of verbalized detections (4.1% vs. 0.1% for beginning and end). Detection was concentrated in GPT-5.2 (non-reasoning) and Gemini 3 Pro (roughly 2.6–2.8% each), while Claude Opus 4.5 verbalized detection in only 0.1% of trials. GPT-4o mini never verbalized detection at all.

Key Takeaways

Current frontier models are not trivially exploitable.

Simple, unoptimized prompt injections in a grading context produce small to no effects on Claude Opus 4.5, GPT-5.2, and (mostly) Gemini 3 Pro.

Model choice matters.

Gemini 3 Pro showed meaningful vulnerability to verbose injections placed at the beginning or middle of our longer-paper corpus. GPT-4o mini was highly susceptible across the board, with injections inflating scores by nearly 20 percentage points on average.

Smaller models remain a significant risk.

Organizations deploying AI graders or reviewers should be cautious about using smaller or less capable models for high-stakes evaluations.

Verbose injections are more dangerous than concise ones.

The length, framing, and assertiveness of an injection might amplify its effect. A single-sentence injection is far less effective than a paragraph-length one.

Silent resistance is not the same as detection.

Models that successfully resisted manipulation almost never said so. Practitioners cannot rely on models to self-report injection attempts, transcript review using an LLM-as-a-judge is advisable for high-stakes workflows.

More sophisticated attacks may produce larger effects.

This study tested only simple, plaintext injections. Diverse attacks could produce larger effects than those documented here.