Technical Report

“The Decreasing Value of Chain of Thought in Prompting”

June 8, 2025 •︎ Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Modern AI models show diminishing returns from Chain-of-Thought prompting, with gains rarely worth the time cost.

This report investigates Chain-of-Thought (CoT) prompting, which encourages LLMs to ‘think step by step.’ We tested this common prompting approach and found that its effectiveness varies significantly by model type and task: non-reasoning models show modest average improvements but increased variability in answers, while reasoning models gain only marginal benefits despite substantial time costs (20-80% increase). These findings challenge the assumption that CoT is universally beneficial.

Download full report

Cite as:

Meincke, Lennart and Mollick, Ethan R. and Mollick, Lilach and Shapiro, Dan, Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting (June 08, 2025). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532

Benchmarking Standards

Perceived performance is critically affected by the benchmarking approach, as different correctness thresholds significantly transform assessment outcomes. For this study, each question was tested 25 times per condition, revealing inconsistencies that traditional one-time testing methods often mask.

The research utilized multiple metrics to provide a comprehensive view of performance:

Complete accuracy

Zero tolerance for errors

High accuracy

Human-level performance

Majority correct

Simple majority wins

What is Chain-of-Thought Prompting?

Chain-of-Thought prompting instructs models to “think step by step” before answering. Introduced by Wei et al. (2022), it mirrors human problem-solving by breaking down complex tasks. While often considered a best practice, this research shows its value varies considerably depending on the specific model and use case.

Research Methodology

The study used the GPQA Diamond dataset of 198 PhD-level multiple-choice questions across biology, physics, and chemistry. The researchers tested:

Models:

Non-reasoning: Sonnet 3.5, Gemini 2.0 Flash, GPT-4o-mini, GPT-4o, Gemini Pro 1.5
Reasoning: o3-mini, o4-mini, Flash 2.5

Prompt conditions:

Answer directly prompt (“Direct")

"Answer directly without any explanation or thinking. Just provide the answer."

We instruct the LLM to answer without generating additional reasoning tokens.

Think step by step (“Step by step”)

"Think step by step."

In this variation, we provide the LLM with a simple CoT prompt that instructs it to “think” step by step before providing an answer. While this is a simple version of CoT, in our tests, different CoT prompt variants had negligible effects.

No prompt (“Default”)

We provide no specific suffix to the question and also remove the formatting constraint (“Format your response as follows: ‘The correct answer is (insert answer here)’”), allowing the model to choose how to best reply to the question. In most cases, this produces a short CoT-like thinking output before answering the question.

Each question underwent 25 trials per condition, with performance measured using multiple metrics including 100% correct (perfect accuracy), 90% correct, 51% correct (majority), and average rating.

Findings for Non-Reasoning Models

CoT prompting generally improved average performance across non-reasoning models:

Strongest improvements: Gemini Flash 2.0 (13.5%) and Sonnet 3.5 (11.7%)
Smallest gain: GPT-4o-mini (4.4%, not statistically significant)

However, perfect accuracy (100% correct) showed mixed results:

Sonnet 3.5 improved by 10.1%
GPT-4o showed no significant change
Other models declined significantly, particularly Gemini Pro 1.5 (-17.2%)

This indicates that while CoT can improve performance on difficult questions, it can also introduce variability that causes errors on “easy” questions the model would otherwise answer correctly.

CoT requests took 35-600% (5-15 seconds) longer than direct requests, representing a significant increase in token usage and response time.

The researchers also compared CoT prompting against default model behavior (without specific prompting constraints). This revealed that many models perform CoT-like reasoning by default, even without explicit instructions.

GPQA Diamond performance across non-reasoning models comparing an unprompted answer with Chain-of-Thought.

Sonnet 3.5 - Default

Sonnet 3.5 - Step by Step

Gemini 2.0 Flash - Default

Gemini 2.0 Flash - Step by Step

GPT-4o-mini - Default

GPT-4o-mini - Step by Step

GPT-4o - Default

GPT-4o - Step by Step

The error bars show 95% confidence intervals for individual proportions. For detailed statistical comparisons between conditions, see Table S4.

Findings for Reasoning Models

GPQA Diamond performance across reasoning models comparing an immediate answer with Chain-of-Thought.

The error bars show 95% confidence intervals for individual proportions. For detailed statistical comparisons between conditions, see Table S5.

For models with built-in reasoning capabilities, CoT prompting produced minimal benefits:

Small average improvements for o3-mini (2.9%) and o4-mini (3.1%)
Performance decrease for Gemini Flash 2.5 (-3.3%)

Using threshold metrics showed few significant changes. The only notable effects were a small gain for o4-mini at the 51% threshold (5.6%) and performance decreases for Gemini Flash 2.5 at 100% and 90% thresholds (-13.1% and -7.1% respectively).

CoT requests required 20-80% (10-20 seconds) more time—a substantial cost for what are often negligible gains in accuracy.

Practical Implications

For non-reasoning models:

CoT can enhance average performance but may introduce inconsistency in answers
Consider that many models already perform reasoning by default
Requesting direct answers without explanation can harm performance by preventing natural reasoning
Weigh potential performance gains against increased token usage and response time

For reasoning models:

The marginal accuracy benefits may not justify the extra cost and time
Generic CoT prompts provide limited value compared to the models’ built-in reasoning

The decision tree below provides a practical guide for determining when to use Chain-of-Thought prompting based on your specific use case, model type, and priorities.

Conclusion

Chain-of-Thought prompting is not universally optimal. Its effectiveness depends significantly on model type and specific use case. For non-reasoning models, CoT may improve average performance but can introduce inconsistency. For reasoning models, the minimal accuracy gains rarely justify the increased response time.

When deciding whether to use CoT, consider the specific model’s characteristics, task requirements, and acceptable tradeoffs between accuracy, consistency, and response time.

Download full report