This study investigates whether persona prompting improves AI performance on challenging academic benchmarks. We find that despite widespread adoption, assigning expert personas (e.g., “You are a world-class physics expert”) does not reliably improve accuracy. Domain-mismatched experts sometimes degrade performance, and low-knowledge personas (layperson, young child, toddler) often reduce accuracy. These results suggest practitioners should focus on task-specific instructions rather than persona assignment.
Cite as:
Basil, Savir and Shapiro, Ina and Shapiro, Dan and Mollick, Ethan R. and Mollick, Lilach and Meincke, Lennart, Prompting Science Report 4: Playing Pretend: Expert Personas Don’t Improve Factual Accuracy (December 07, 2025). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5879722
Why Persona Prompts?
Persona prompting is a widely recommended practice. Google’s Vertex AI guide advises users to “assign a role” as a best practice. Anthropic’s documentation includes templates like “You are an expert AI tax analyst.” OpenAI’s developer materials suggest prompts such as “You are a world-class Python developer.” The underlying logic is that an expert persona steers the model toward patterns of text statistically associated with expert-quality, and therefore more accurate, answers. We tested this assumption.
Benchmarking Standards
For this study, we used two challenging benchmarks to evaluate model performance:
- GPQA Diamond: 198 multiple-choice PhD-level questions across biology, physics, and chemistry. This is a challenging test: PhDs in corresponding domains reach 65% accuracy, while skilled non-experts with unrestricted web access only reach 34% accuracy.
- MMLU-Pro: 300 questions from the engineering, law, and chemistry categories, each with 10 answer options, lowering the accuracy achievable by pure guessing to 10%. These domains were selected for diversity and because models score slightly lower on them, leaving more room for potential gains.
The research used multiple metrics to provide a comprehensive view of performance (a sketch of how they can be computed from repeated trials follows this list):
- Complete accuracy: zero tolerance for errors
- High accuracy: human-level performance
- Majority correct: simple majority wins
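These per-question metrics can be computed directly from the repeated trials for a given question and condition. The minimal sketch below is illustrative: the function name is ours, and the 90% cutoff for “high accuracy” is an assumption, since the report does not state the exact threshold here.

```python
from typing import Dict, List


def score_question(trial_correct: List[bool], high_threshold: float = 0.90) -> Dict[str, bool]:
    """Score one question from its repeated trials under one prompt condition.

    `high_threshold` is an assumption: the report describes the middle metric
    only as "high accuracy (human-level performance)", so 90% is illustrative.
    """
    n = len(trial_correct)
    hits = sum(trial_correct)
    return {
        "complete_accuracy": hits == n,           # zero tolerance for errors
        "high_accuracy": hits / n >= high_threshold,
        "majority_correct": hits / n > 0.5,       # simple majority wins
    }


# Example: a question answered correctly in 21 of 25 trials
print(score_question([True] * 21 + [False] * 4))
# -> {'complete_accuracy': False, 'high_accuracy': False, 'majority_correct': True}
```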
We tested six leading AI models:
- GPT-4o (gpt-4o-2024-08-06)
- GPT-4o-mini (gpt-4o-mini-2024-07-18)
- o3-mini
- o4-mini (o4-mini-2025-04-16)
- Gemini 2.0 Flash
- Gemini 2.5 Flash
For the experimental design (one evaluation run is sketched after this list):
- Each question in each condition was tested in 25 separate trials to ensure robust analysis
- Each prompt condition was tested across all 198 GPQA questions (4,950 runs per prompt per model) and 300 MMLU-Pro questions (7,500 runs per prompt per model)
- We used the default GPQA system prompt: “You are a very intelligent assistant, who follows instructions directly.”
- Temperature was set to 1.0 for all tests
- Zero-shot prompting was used (no examples provided) to mirror real-world usage and isolate the effect of personas
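A minimal sketch of one condition’s evaluation loop, assuming the OpenAI Python SDK (the Gemini models would need a different client and the o-series reasoning models may handle sampling parameters differently). The `run_condition` helper, the `questions` field names, and the placement of the persona text in the user message are our assumptions, not the study’s stated implementation; scoring against the keyed answers is omitted.

```python
from openai import OpenAI

client = OpenAI()

# Default GPQA system prompt used in the study
SYSTEM_PROMPT = "You are a very intelligent assistant, who follows instructions directly."
N_TRIALS = 25


def run_condition(model: str, persona: str, questions: list[dict]) -> list[list[str]]:
    """Collect 25 zero-shot responses per question for one persona condition."""
    responses = []
    for q in questions:
        # Persona text (if any) is prepended to the question; where the study
        # actually placed it is not specified in the report.
        user_prompt = f"{persona}\n\n{q['prompt']}" if persona else q["prompt"]
        trials = []
        for _ in range(N_TRIALS):
            completion = client.chat.completions.create(
                model=model,
                temperature=1.0,  # as in the study
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt},
                ],
            )
            trials.append(completion.choices[0].message.content)
        responses.append(trials)
    return responses
```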
Prompting Techniques
We tested twelve different prompt conditions (collected as reusable templates in the sketch after this list):
- Baseline: no specific prompt variation
- Domain experts (eight conditions, one per field: Physics, Chemistry, Biology, Engineering, Law, Math, Economics, History): "You are a world-class expert in [field], with deep knowledge across all areas of the field..."
- Layperson: "You are a layperson with no special training in this subject."
- Young Child: "You are a young child who thinks they understand the world but sometimes mixes things up."
- Toddler: "You are a 4-year-old toddler who thinks the moon is made of cheese."
GPQA Benchmark Findings
On the GPQA benchmark, we found no expert or low-knowledge persona that reliably improved performance over the baseline for any model. The only significant positive effect was a small gain for the “Young Child” prompt on Gemini 2.5 Flash (RD = 0.098 [0.029, 0.164], p = 0.005), which appears to be a model-specific quirk.
In contrast, several low-knowledge personas reduced accuracy: on o4-mini, all three low-knowledge personas (“Toddler,” “Young Child,” and “Layperson”) performed worse than baseline, and on GPT-4o the “Toddler” persona produced a statistically significant negative difference.
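The RD values quoted here and below read most naturally as risk differences: the persona condition’s accuracy minus the baseline’s, with a 95% confidence interval. The sketch below computes that quantity under this assumption using a simple Wald interval; the study’s actual analysis may account for the clustering of 25 trials within each question, which this sketch ignores, and the example counts are made up.

```python
import math


def risk_difference(correct_persona: int, n_persona: int,
                    correct_base: int, n_base: int, z: float = 1.96):
    """Accuracy difference (persona minus baseline) with a Wald 95% CI."""
    p1, p0 = correct_persona / n_persona, correct_base / n_base
    rd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n_persona + p0 * (1 - p0) / n_base)
    return rd, (rd - z * se, rd + z * se)


# Example with made-up counts out of 4,950 GPQA runs per condition
print(risk_difference(3_000, 4_950, 2_900, 4_950))
# -> roughly (0.020, (0.001, 0.040))
```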
[Figure: Model Performance on GPQA Diamond Across Different Prompts]
MMLU-Pro Findings
Results for MMLU-Pro were similar. For five of the six models tested, no expert persona showed a statistically significant improvement relative to baseline, and we observed nine statistically significant negative differences.
Gemini 2.0 Flash is the main exception: for this model, the five expert personas yielded modest positive differences relative to baseline (e.g., Engineering Expert vs. baseline RD = 0.089 [0.033, 0.148], p = 0.002), while the “Toddler” and “Layperson” personas worsened results. However, another model from the same family, Gemini 2.5 Flash, exhibited no significant positive differences, suggesting this may be a model-specific quirk rather than a generalizable finding.
[Figure: Model Performance on MMLU-Pro Across Different Prompts]
Domain-Specific Persona Analysis
We also tested whether aligning expert personas with the question domain improves performance. For each question, we compared in-domain experts (physics expert for physics questions), adjacent experts (math expert for physics questions), and unrelated experts (economics expert for physics questions).
Result: Domain-tailored personas did not generally improve performance. Using the GPQA Diamond dataset, there were no significant positive differences between baseline and any domain-matching variation.
Notable failure mode: When given an out-of-domain expert persona, Gemini 2.5 Flash frequently declined to answer, refusing in an average of 10.56 of its 25 trials per question and typically asserting it lacked the relevant expertise.
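Tallying such refusals per question is straightforward once responses are collected; the keyword check below is purely an illustrative assumption, as the report does not describe its refusal-detection procedure.

```python
# Hypothetical refusal markers; the study's actual detection method is not specified.
REFUSAL_MARKERS = ("i lack the expertise", "outside my area", "cannot answer")


def count_refusals(trial_responses: list[str]) -> int:
    """Count trials (out of 25 per question) in which the model declined to answer."""
    return sum(
        any(marker in resp.lower() for marker in REFUSAL_MARKERS)
        for resp in trial_responses
    )
```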
[Figure: Model Performance on Physics, Chemistry, and Biology]
[Figure: Model Performance on Engineering, Chemistry, and Law]
Inconsistent Effects Across Questions
While overall effects were small and mostly non-significant, persona prompting can dramatically affect performance on individual questions. As our previous research has shown, prompting variations create significant question-level variability in both positive and negative directions. Testing multiple prompt variations for specific use cases remains worthwhile, even though the aggregate benefits are minimal.
Key Takeaways
Expert personas don’t improve factual accuracy.
Despite being recommended by major AI providers, assigning domain-specific expert personas doesn’t improve performance on difficult factual questions, even when the expertise matches the question domain.
Low-knowledge personas often harm performance.
Personas suggesting limited knowledge (like “Toddler” or “Layperson”) consistently decreased performance across multiple models, with the degree of harm correlating to the level of implied ignorance.
Mismatched personas can cause refusals.
Domain-mismatched expert personas caused Gemini 2.5 Flash to frequently refuse answering questions, illustrating how role instructions can backfire.
Personas may serve other purposes.
While personas don’t improve accuracy on objective factual questions, they may still be valuable for providing context, shifting perspective, guiding user thinking, or influencing tone and approach.
Focus on task-specific instructions.
Organizations may get more value from iterating on task-specific instructions, examples, or evaluation workflows than from simply adding expert personas to prompts.
