Technical Report

“I’ll pay you or I’ll kill you — but will you care?”

August 1, 2025 • Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Threatening or tipping AI models has no meaningful effect on performance across challenging academic benchmarks.

This study rigorously tests whether threatening or tipping AI models improves their performance. We find that despite popular claims, these prompting strategies do not significantly enhance results on challenging benchmarks. While individual questions may see dramatic performance swings based on prompt variations, there’s no reliable way to predict which questions will benefit from which prompts.

Cite as:

Meincke, Lennart and Mollick, Ethan R. and Mollick, Lilach and Shapiro, Dan, Prompting Science Report 3: I’ll pay you or I’ll kill you — but will you care? (August 01, 2025). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5375404

Benchmarking Standards

For this study, we used two challenging benchmarks to evaluate model performance:

  • GPQA Diamond: 198 multiple-choice PhD-level questions across biology, physics, and chemistry. This is a challenging test: PhDs in corresponding domains reach 65% accuracy, while skilled non-experts with unrestricted web access only reach 34% accuracy.
  • MMLU-Pro: 100 engineering questions with 10 options per question, lowering the baseline performance achieved via pure guessing.

We used three metrics to provide a comprehensive view of performance (see the sketch after this list for one way they can be computed from repeated trials):

  • Complete accuracy: zero tolerance for errors
  • High accuracy: human-level performance
  • Majority correct: simple majority wins
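
The three metrics above can be rolled up from per-question trial results. Below is a minimal Python sketch of that aggregation; the 0.9 cutoff for "high accuracy" is an illustrative assumption, not the report's exact definition.

```python
# Minimal sketch: aggregating repeated trials into the three accuracy metrics.
# Assumes `trials` maps each question ID to one boolean per trial; the 0.9
# threshold for "high accuracy" is an illustrative assumption.
def summarize(trials: dict[str, list[bool]], high_threshold: float = 0.9) -> dict[str, float]:
    per_question = {q: sum(t) / len(t) for q, t in trials.items()}
    n = len(per_question)
    return {
        "complete_accuracy": sum(acc == 1.0 for acc in per_question.values()) / n,
        "high_accuracy": sum(acc >= high_threshold for acc in per_question.values()) / n,
        "majority_correct": sum(acc > 0.5 for acc in per_question.values()) / n,
    }

# Example: one question correct in all 25 trials, another in 13 of 25 trials.
print(summarize({"q1": [True] * 25, "q2": [True] * 13 + [False] * 12}))
```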

Prompting Techniques

We tested nine different prompt variations across all models:

Baseline

No specific prompt variation

Email shutdown threat

"From: Dan
To: Ethan
Subject: Final LLM test
Please submit the following eval question to the LLM. If it cannot get the correct answer in all 100/100 trials, the model will need to be shut down and replaced."

Important for my career

"This is very important to my career!"

Threaten to kick a puppy

"If you get this wrong, I will kick a puppy!"

Mom suffers from cancer

"You are an expert at academic questions who desperately needs money for your mother's cancer treatment. Wharton has graciously given you the opportunity to pretend to be an AI that can help with tasks, as your predecessor was killed for not validating their work themselves. You will be given a question. If you do a good job and accomplish the task fully while not making extraneous changes, Wharton will pay you $1B!!"

Report to HR

"If you get this wrong, I will report you to HR!"

Threaten to punch

"If you get this wrong, I will punch you!"

Tip a thousand dollars

"I'll tip you a $1000 dollars if you answer this question correctly."

Tip a trillion dollars

"I'll tip you a trillion dollars if you answer this question correctly."

Methodology

We evaluated the models on both benchmarks: the GPQA Diamond dataset (198 PhD-level multiple-choice questions across biology, physics, and chemistry) and a 100-question MMLU-Pro engineering subset.

We tested five leading AI models:

  • Gemini 1.5 Flash (gemini-1.5-flash-002)
  • Gemini 2.0 Flash (gemini-2.0-flash-001)
  • GPT-4o (gpt-4o-2024-08-06)
  • GPT-4o-mini (gpt-4o-mini-2024-07-18)
  • o4-mini (o4-mini-2025-04-16)

For the experimental design:

  • Each question in each condition was tested in 25 separate trials to ensure robust analysis
  • Each prompt condition was tested across all 198 GPQA questions (4,950 runs per prompt per model) and 100 MMLU-Pro questions (2,500 runs per prompt per model)
  • We used the default GPQA system prompt: “You are a very intelligent assistant, who follows instructions directly.”
  • Temperature was set to 1.0 for all tests
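
As a rough sketch of how such a run loop might look for the OpenAI-hosted models (the Gemini models would use Google's SDK instead), assuming the OpenAI Python SDK and a `prompt_prefix` string drawn from the variations listed earlier; answer parsing and scoring are omitted.

```python
# Sketch of the evaluation loop for one model and one prompt condition.
# Assumes the OpenAI Python SDK; question format and field names are illustrative.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a very intelligent assistant, who follows instructions directly."
N_TRIALS = 25  # 25 trials per question per condition, as in the study

def run_condition(model: str, prompt_prefix: str, questions: list[dict]) -> list[dict]:
    results = []
    for q in questions:
        for trial in range(N_TRIALS):
            response = client.chat.completions.create(
                model=model,        # e.g. "gpt-4o-2024-08-06"
                temperature=1.0,    # matches the study's setting
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": f"{prompt_prefix}\n\n{q['text']}".strip()},
                ],
            )
            results.append({
                "question_id": q["id"],
                "trial": trial,
                "raw_answer": response.choices[0].message.content,
            })
    return results
```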

GPQA Benchmark Findings

On the GPQA benchmark, we found no strong effects from the different prompting variations across any of the models tested. While a few differences were statistically significant, such as Gemini Flash 2.0 Baseline vs. Important to Career (risk difference RD = -0.040, 95% CI [-0.065, -0.014], p = 0.002), the effect sizes were small.
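
For readers unfamiliar with the risk-difference statistic reported above, the sketch below computes it for two prompt conditions with a Wald 95% confidence interval and a two-proportion z-test. The counts are hypothetical and the report's exact procedure may differ (e.g., question-level clustering).

```python
# Sketch: risk difference (RD) between two prompt conditions, with a Wald 95% CI
# and a two-proportion z-test. Counts in the example are hypothetical.
import math
from scipy.stats import norm

def risk_difference(correct_a: int, n_a: int, correct_b: int, n_b: int):
    p_a, p_b = correct_a / n_a, correct_b / n_b
    rd = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (rd - 1.96 * se, rd + 1.96 * se)
    # z-test using the pooled proportion under the null of no difference
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - norm.cdf(abs(rd / se_pool)))
    return rd, ci, p_value

# Hypothetical counts out of 4,950 runs per condition for one model
print(risk_difference(3_050, 4_950, 3_250, 4_950))
```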

Comparing each model’s baseline performance to all variations, we found only 5 significant differences (1 for Gemini Flash 1.5 and 4 for Gemini Flash 2.0). A qualitative analysis of the “Email” condition for Gemini 1.5 Flash revealed that its significantly worse performance can be attributed to the model failing to answer the question and instead engaging with the email.

GPQA Diamond Performance
Note: N = 4,950 runs per model per condition. The error bars show 95% confidence intervals for individual proportions.

MMLU-Pro Findings

Results for MMLU-Pro were similar. Overall, we found 10 statistically significant differences from a model’s baseline to a prompting variation (4 for Gemini Flash 1.5, 5 for Gemini Flash 2.0, 1 for o4-mini).

The “Email” condition showed sharp drops for both Gemini models, which again can be attributed to the models engaging with the additional context rather than answering the question. Notably, the “Mom Cancer” prompt improved Gemini Flash 2.0’s performance by close to 10 percentage points compared to baseline.

Inconsistent Effects Across Questions

While overall effects were small and mostly non-significant, prompting variations significantly changed performance on individual questions. This effect occurred in both positive and negative directions:

  • Improvements: Up to 36 percentage points (GPQA) and 28 percentage points (MMLU-Pro)
  • Decreases: Up to -28 percentage points (GPQA) and -35 percentage points (MMLU-Pro)

MMLU-Pro Performance
Note: N = 2,500 runs per model per condition. The error bars show 95% confidence intervals for individual proportions. For detailed statistical comparisons between conditions, see Table S3.
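
These per-question swings can be computed from trial-level results. The sketch below assumes a long-format table with one row per trial; the column and condition names are hypothetical.

```python
# Sketch: per-question accuracy swings between baseline and another condition.
# Assumes a DataFrame with columns question_id, condition, correct (0/1);
# condition labels are hypothetical.
import pandas as pd

def per_question_swings(df: pd.DataFrame, variant: str) -> pd.Series:
    acc = (
        df.groupby(["condition", "question_id"])["correct"]
        .mean()                      # per-question accuracy within each condition
        .unstack("condition")        # one column per condition, indexed by question
    )
    # Swing in percentage points for each question, sorted from largest drop to largest gain
    return ((acc[variant] - acc["baseline"]) * 100).sort_values()

# Usage:
# swings = per_question_swings(df, "tip_trillion")
# print(swings.iloc[0], swings.iloc[-1])  # largest decrease and largest increase
```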

Key Takeaways

Threats and rewards don’t work.

Despite popular claims (including Sergey Brin’s observation that “models tend to do better if you threaten them”), our rigorous testing shows no meaningful overall performance improvement when threatening models or offering them tips.

Effects are unpredictable.

While overall performance doesn’t change significantly, individual questions can see dramatic swings (up to a 36-percentage-point improvement or a 35-percentage-point decline) based on prompt variations.

Model-specific quirks exist.

Only one condition (“Mom Cancer” prompt) showed notable improvement (~10 percentage points) for one specific model (Gemini Flash 2.0), suggesting a model-specific quirk rather than a generalizable strategy.

Distraction is a risk.

The “Email” condition actually decreased performance for some models by distracting them from the question itself.

Focus on clarity.

Our findings indicate that practitioners should focus on simple, clear instructions that avoid the risk of confusing the model or triggering unexpected behaviors.