
One Think, One Generate: Benchmarking Pairwise Language Models for Reasoning

Guiyao Tie1, Zeli Zhao1, Tianhe Gu1, Chaoran Hu1, Hao He1, Tianyao Luo1, Ruihang Zhang1, Sizhe Zhang1, Pan Zhou1, Lichao Sun2
1Huazhong University of Science and Technology, 2Lehigh University

Abstract

Recent advancements in large language models (LLMs) have led to the development of large reasoning models (LRMs), which incorporate intermediate deep thinking to guide decision-making. These LRMs have demonstrated promising results across a range of domains, including commonsense reasoning, mathematics, and code generation. However, the precise role of deep thinking in improving model performance remains underexplored, and no universally accepted framework exists to evaluate its impact. To address this gap, we introduce TGBench (DeepThink Bench), a comprehensive benchmarking framework designed to evaluate the effects of deep thinking on instruction-based LLMs. Our experiments reveal three key findings: 1) incorporating deep thinking from LRMs significantly enhances the performance of instruction-based LLMs, particularly on tasks that require multi-step reasoning; 2) deep thinking improves both accuracy and efficiency, though the extent of improvement varies by task; and 3) three distinct rankings (of single LLMs, of single LRMs, and of combined LRM-LLM pairs) together provide a holistic view of deep thinking's impact. These contributions highlight the potential of integrating deep thinking to advance instruction-based LLM capabilities, and we advocate further research on optimizing this integration to improve model scalability, robustness, and real-world applicability across diverse tasks.

Framework Overview

Overview of the DeepThink Bench framework. 1) Thought Extraction: an LRM generates the deep-thinking trace; 2) Prompt Construction: the extracted thinking is integrated into the instruction-based LLM's prompt; 3) Evaluation: the combined pair is scored across multiple tasks.
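
A minimal sketch of this three-stage pipeline, assuming hypothetical lrm_generate / llm_generate callables and an is_correct scorer; the prompt wording and helper names are illustrative assumptions, not the actual TGBench API.

# Illustrative sketch of the pipeline (helper names and prompts are assumptions).

def extract_thought(lrm_generate, question: str) -> str:
    """Stage 1 (Thought Extraction): ask the LRM for its deep-thinking trace."""
    prompt = (
        "Think step by step about the following problem, "
        "but do not state the final answer.\n\n"
        f"Problem: {question}\n\nThoughts:"
    )
    return lrm_generate(prompt)


def build_prompt(question: str, thought: str) -> str:
    """Stage 2 (Prompt Construction): integrate the LRM's thinking into the LLM prompt."""
    return (
        f"Problem: {question}\n\n"
        f"Reference reasoning from another model:\n{thought}\n\n"
        "Using the reasoning above where it is helpful, give the final answer:"
    )


def evaluate_pair(lrm_generate, llm_generate, dataset, is_correct) -> float:
    """Stage 3 (Evaluation): accuracy (%) of the LRM+LLM pair on one task."""
    hits = 0
    for question, reference in dataset:
        thought = extract_thought(lrm_generate, question)
        answer = llm_generate(build_prompt(question, thought))
        hits += int(is_correct(answer, reference))
    return 100.0 * hits / len(dataset)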

Results

Leaderboard

# LRM LLM OpenBookQA HellaSwag GSM8K MATH HumanEval SST-2 IMDB SQuAD DROP Avg
1 Deepseek-R1 Mixtral-8x7B-Instruct-v0.1 87.19 57.66 91.85 62.17 27.50 89.89 93.20 88.39 90.17 76.45
2 Deepseek-R1 LLaMA3.1-8B-Instruct 89.37 59.84 93.15 62.56 83.01 89.99 94.42 91.73 94.28 84.26
3 Deepseek-R1 LLaMA3.3-70B-Instruct 88.50 62.31 96.76 80.44 85.89 91.22 94.55 90.67 88.93 86.59
4 Deepseek-R1 Qwen2.5-7B-Instruct 88.12 60.45 96.78 56.90 82.34 89.56 95.78 89.01 91.23 83.35
5 Deepseek-R1 Gemma-2-9b-it 89.45 59.67 89.89 60.12 80.34 88.56 93.78 91.90 90.12 82.65
6 Deepseek-R1 Gemini1.5-flash 88.34 60.56 97.78 78.90 90.12 90.34 94.56 93.78 91.90 87.36
7 Deepseek-R1 Gemini1.5-pro 87.12 61.34 97.56 83.78 93.90 90.12 94.34 91.56 91.78 87.94
8 Deepseek-R1 claude-3-5-sonnet-20241022 82.90 62.12 91.34 76.56 70.78 90.90 94.12 93.34 93.56 83.96
9 Deepseek-R1 GPT3.5 89.78 59.90 92.12 66.34 59.56 91.78 95.90 93.12 90.34 82.09
10 Deepseek-R1 GPT4o 82.56 62.78 97.90 61.12 73.34 90.56 94.78 91.90 93.12 83.12
11 Deepseek-R1 GLM-4-9B-Chat 78.84 53.12 83.97 52.45 79.31 81.66 84.08 85.23 80.77 75.49
12 Deepseek-R1 Gemma 7B 62.39 31.85 53.72 30.14 69.93 67.28 69.55 70.49 74.61 58.88
13 Deepseek-R1 ChatGLM2-6B 61.27 32.90 54.03 32.68 70.19 66.44 67.82 72.37 70.15 58.65
14 Deepseek-R1 Qwen2.5-1.5B 70.56 54.41 69.88 40.27 73.95 82.33 84.76 86.09 90.42 72.52
15 Deepseek-R1 Phi-3-mini 3.8B 81.63 61.78 87.25 59.06 76.84 89.97 91.52 89.30 88.71 80.67
16 Deepseek-R1-zero Mixtral-8x7B-Instruct-v0.1 89.45 58.83 93.12 65.74 26.37 88.95 94.08 89.63 81.52 76.41
17 Deepseek-R1-zero LLaMA3.1-8B-Instruct 89.17 60.29 94.86 63.55 81.33 87.44 89.77 92.16 83.09 82.41
18 Deepseek-R1-zero LLaMA3.3-70B-Instruct 93.68 67.41 96.03 78.92 84.57 90.24 94.91 88.05 87.38 86.80
19 Deepseek-R1-zero Qwen2.5-7B-Instruct 92.73 62.85 97.56 58.39 83.64 89.47 93.82 91.25 87.93 84.18
20 Deepseek-R1-zero Gemma-2-9b-it 90.06 66.58 98.14 59.27 79.81 89.62 94.36 93.71 87.48 84.34
21 Deepseek-R1-zero Gemini1.5-flash 91.83 65.97 97.28 80.45 88.19 87.03 94.55 95.22 83.76 87.14
22 Deepseek-R1-zero Gemini1.5-pro 91.24 66.35 98.67 83.09 94.72 90.85 93.41 90.63 84.94 88.21
23 Deepseek-R1-zero claude-3-5-sonnet-20241022 94.50 68.92 86.37 79.18 72.53 91.46 96.15 90.84 85.29 85.03
24 Deepseek-R1-zero GPT3.5 91.77 66.44 96.85 68.73 60.29 88.96 93.59 91.38 86.67 82.74
25 Deepseek-R1-zero GPT4o 90.32 69.15 98.03 65.88 74.42 90.57 94.28 92.16 87.84 84.74
26 Deepseek-R1-zero GLM-4-9B-Chat 79.61 52.34 84.97 50.81 75.26 84.73 81.69 83.95 84.08 75.27
27 Deepseek-R1-zero Gemma 7B 62.93 31.78 53.45 30.62 69.14 68.37 67.85 74.29 72.51 58.99
28 Deepseek-R1-zero ChatGLM2-6B 64.20 30.56 57.89 34.17 68.48 70.92 69.63 70.74 74.39 60.11
29 Deepseek-R1-zero Qwen2.5-1.5B 69.35 58.07 66.82 42.59 75.83 80.46 83.19 84.03 88.72 72.12
30 Deepseek-R1-zero Phi-3-mini 3.8B 80.88 61.24 87.57 57.95 78.06 84.89 88.44 85.97 90.31 79.48
31 QwQ-32B Mixtral-8x7B-Instruct-v0.1 87.23 57.84 95.16 63.47 48.95 90.03 88.76 87.59 85.12 78.24
32 QwQ-32B LLaMA3.1-8B-Instruct 88.45 64.91 92.37 65.28 77.64 89.82 89.15 88.03 92.74 83.15
33 QwQ-32B LLaMA3.3-70B-Instruct 90.68 63.52 97.85 83.19 76.43 91.27 90.56 89.89 86.31 85.52
34 QwQ-32B Qwen2.5-7B-Instruct 88.94 64.03 96.72 50.68 81.45 90.58 91.39 91.84 87.07 82.52
35 QwQ-32B Gemma-2-9b-it 88.62 62.47 99.35 72.81 78.26 90.49 90.93 84.75 87.68 83.93
36 QwQ-32B Gemini1.5-flash 88.15 60.89 97.04 82.57 88.32 90.66 90.28 90.17 90.84 86.55
37 QwQ-32B Gemini1.5-pro 90.73 67.42 98.25 85.96 91.08 90.79 91.63 92.35 88.47 88.52
38 QwQ-32B claude-3-5-sonnet-20241022 91.86 67.59 97.13 76.24 66.83 89.41 91.05 92.76 93.48 85.15
39 QwQ-32B GPT3.5 89.27 65.74 97.82 65.39 50.62 90.84 88.97 88.46 86.51 80.40
40 QwQ-32B GPT4o 89.63 62.18 98.06 62.75 82.93 92.15 89.32 90.81 90.43 84.25
41 QwQ-32B GLM-4-9B-Chat 70.35 46.82 80.57 43.06 70.19 84.28 78.64 75.39 78.91 69.80
42 QwQ-32B Gemma 7B 55.47 30.93 48.61 27.84 60.72 60.35 62.89 69.52 70.68 54.11
43 QwQ-32B ChatGLM2-6B 62.14 27.85 50.32 30.67 60.48 58.73 68.42 69.15 72.86 55.62
44 QwQ-32B Qwen2.5-1.5B 63.58 55.09 60.47 40.82 72.34 75.91 80.28 80.63 81.79 67.88
45 QwQ-32B Phi-3-mini 3.8B 77.26 66.43 84.95 50.38 71.62 73.85 73.09 79.24 83.57 73.38
46 Gemini-2.0 Mixtral-8x7B-Instruct-v0.1 88.34 56.17 95.89 82.03 31.76 86.45 88.92 94.57 87.28 79.05
47 Gemini-2.0 LLaMA3.1-8B-Instruct 90.61 64.83 95.12 77.49 83.95 85.34 89.07 94.26 91.73 85.82
48 Gemini-2.0 LLaMA3.3-70B-Instruct 90.25 67.90 95.47 86.58 80.63 88.14 90.82 94.39 86.04 86.69
49 Gemini-2.0 Qwen2.5-7B-Instruct 91.78 64.29 94.63 70.85 82.41 88.97 89.51 92.16 92.88 85.28
50 Gemini-2.0 Gemma-2-9b-it 90.42 66.55 95.21 83.74 69.38 87.62 90.19 93.05 85.93 84.68
51 Gemini-2.0 Gemini1.5-flash 91.06 61.24 94.75 87.32 78.09 86.80 90.67 91.84 91.37 85.90
52 Gemini-2.0 Gemini1.5-pro 91.49 60.83 96.17 89.46 89.27 86.53 87.95 94.71 89.02 87.27
53 Gemini-2.0 claude-3-5-sonnet-20241022 90.93 69.47 94.38 87.15 75.84 89.26 90.44 96.03 93.65 87.46
54 Gemini-2.0 GPT3.5 91.85 63.72 95.54 86.97 56.29 87.18 88.73 93.82 91.49 83.95
55 Gemini-2.0 GPT4o 88.67 61.08 94.86 72.53 79.64 87.40 86.15 95.28 94.07 84.41
56 Gemini-2.0 GLM-4-9B-Chat 72.56 50.39 83.74 44.82 75.91 88.23 77.68 76.45 78.90 72.08
57 Gemini-2.0 Gemma 7B 53.47 44.12 44.95 31.60 63.87 66.04 62.31 72.89 72.15 56.82
58 Gemini-2.0 ChatGLM2-6B 63.83 34.75 55.28 35.43 66.59 60.17 70.84 73.56 77.38 59.76
59 Gemini-2.0 Qwen2.5-1.5B 63.29 54.68 59.03 45.76 77.41 76.92 83.57 84.20 85.45 70.03
60 Gemini-2.0 Phi-3-mini 3.8B 78.14 68.50 80.62 53.39 74.85 75.08 74.63 80.97 86.31 74.72
61 GLM-zero-preview Mixtral-8x7B-Instruct-v0.1 91.37 59.84 96.15 53.72 44.96 88.23 93.58 82.49 86.03 77.37
62 GLM-zero-preview LLaMA3.1-8B-Instruct 91.62 61.29 83.47 29.15 77.83 85.94 83.76 93.01 91.57 77.52
63 GLM-zero-preview LLaMA3.3-70B-Instruct 92.85 65.73 95.28 80.46 74.59 88.67 94.39 92.18 88.92 85.90
64 GLM-zero-preview Qwen2.5-7B-Instruct 92.14 64.90 97.53 61.38 80.24 86.75 91.87 87.62 89.31 83.53
65 GLM-zero-preview Gemma-2-9b-it 88.45 58.63 94.27 60.84 78.92 85.06 86.79 87.33 86.40 80.74
66 GLM-zero-preview Gemini1.5-flash 91.08 61.52 97.84 78.26 84.73 86.49 92.15 87.97 89.64 85.52
67 GLM-zero-preview Gemini1.5-pro 92.36 62.19 98.03 83.57 86.48 88.82 90.74 93.25 89.07 87.17
68 GLM-zero-preview claude-3-5-sonnet-20241022 91.73 63.85 93.42 79.60 76.31 85.94 90.28 95.67 92.53 85.48
69 GLM-zero-preview GPT3.5 92.96 62.47 98.58 62.93 56.84 89.71 92.46 80.35 92.80 81.01
70 GLM-zero-preview GPT4o 88.52 63.09 98.24 59.78 78.43 89.36 89.15 96.02 96.88 84.39
71 GLM-zero-preview GLM-4-9B-Chat 73.65 51.38 82.97 44.20 74.09 89.57 78.84 77.56 75.91 72.02
72 GLM-zero-preview Gemma 7B 55.43 42.76 43.19 30.68 64.37 65.48 63.07 73.89 71.24 56.68
73 GLM-zero-preview ChatGLM2-6B 64.82 35.14 56.85 37.05 69.28 58.63 69.90 71.73 74.46 59.76
74 GLM-zero-preview Qwen2.5-1.5B 64.50 50.67 57.34 47.82 74.95 75.18 86.27 85.39 88.31 68.34
75 GLM-zero-preview Phi-3-mini 3.8B 76.31 66.48 83.75 54.06 73.89 75.24 75.97 80.62 83.19 74.39
76 o1-preview Mixtral-8x7B-Instruct-v0.1 87.45 54.83 90.17 89.62 32.09 88.74 91.28 87.53 87.96 78.85
77 o1-preview LLaMA3.1-8B-Instruct 91.37 57.29 92.64 67.85 62.43 89.56 92.15 88.71 92.08 81.56
78 o1-preview LLaMA3.3-70B-Instruct 92.63 62.75 92.39 90.24 60.81 84.97 92.86 90.34 89.18 84.02
79 o1-preview Qwen2.5-7B-Instruct 92.14 62.06 92.58 86.93 81.72 85.43 92.07 89.65 88.29 85.65
80 o1-preview Gemma-2-9b-it 92.80 61.47 93.25 88.36 60.92 86.58 94.13 87.84 86.77 83.57
81 o1-preview Gemini1.5-flash 89.51 59.86 91.73 87.09 83.42 82.68 90.95 91.27 90.34 85.21
82 o1-preview Gemini1.5-pro 91.62 65.39 91.84 88.57 88.26 88.93 92.48 92.76 92.03 87.99
83 o1-preview claude-3-5-sonnet-20241022 88.37 64.82 92.15 75.68 80.45 89.61 93.72 95.18 95.87 86.21
84 o1-preview GPT3.5 92.93 59.14 92.86 89.47 53.28 85.79 90.63 90.85 90.42 82.82
85 o1-preview GPT4o 91.25 63.57 92.34 74.92 66.08 87.35 93.41 91.69 94.56 83.91
86 o1-preview GLM-4-9B-Chat 71.84 50.63 80.29 42.75 71.06 85.38 74.89 75.17 75.04 69.67
87 o1-preview Gemma 7B 52.39 40.81 40.56 28.43 62.97 66.12 60.78 71.25 72.90 55.13
88 o1-preview ChatGLM2-6B 62.73 34.28 55.64 34.95 67.81 55.46 64.37 70.82 74.15 57.80
89 o1-preview Qwen2.5-1.5B 62.19 52.47 55.03 40.68 73.52 74.89 83.26 84.73 88.31 68.34
90 o1-preview Phi-3-mini 3.8B 73.65 61.92 80.57 52.34 69.85 72.06 73.48 78.91 82.76 71.73
91 o3-mini Mixtral-8x7B-Instruct-v0.1 91.45 82.17 74.93 87.62 37.85 83.04 90.38 88.71 85.29 80.16
92 o3-mini LLaMA3.1-8B-Instruct 82.36 57.88 82.59 73.14 54.27 85.96 90.73 90.15 91.42 78.72
93 o3-mini LLaMA3.3-70B-Instruct 91.83 73.52 83.47 89.06 66.31 82.58 92.69 91.84 92.37 84.85
94 o3-mini Qwen2.5-7B-Instruct 82.94 55.67 91.28 86.45 73.09 86.74 91.53 87.62 88.91 82.69
95 o3-mini Gemma-2-9b-it 82.15 55.39 91.76 88.83 63.42 84.95 92.18 85.67 86.50 81.21
96 o3-mini Gemini1.5-flash 73.62 55.81 91.04 87.35 79.68 83.27 90.89 89.76 92.03 82.61
97 o3-mini Gemini1.5-pro 82.73 73.95 97.12 90.84 84.57 87.46 88.31 89.59 91.24 87.31
98 o3-mini claude-3-5-sonnet-20241022 88.29 79.43 91.87 85.06 82.39 89.82 92.75 90.48 91.57 87.96
99 o3-mini GPT3.5 91.68 66.34 83.58 79.72 65.18 82.95 88.41 90.93 91.06 82.21
100 o3-mini GPT4o 91.37 73.26 91.49 84.91 76.03 85.64 91.85 91.72 94.08 86.71
101 o3-mini GLM-4-9B-Chat 73.85 62.14 76.92 41.78 63.45 86.50 75.63 76.84 77.39 70.50
102 o3-mini Gemma 7B 55.27 52.90 39.15 31.64 50.83 69.72 61.48 72.36 73.81 56.35
103 o3-mini ChatGLM2-6B 65.43 43.08 50.76 35.29 49.57 58.64 63.92 72.15 75.34 57.13
104 o3-mini Qwen2.5-1.5B 66.58 60.47 51.83 39.26 70.14 77.89 80.05 80.72 85.46 68.04
105 o3-mini Phi-3-mini 3.8B 76.19 72.53 78.64 55.07 59.86 70.21 75.38 80.95 83.27 72.46
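
For the rows checked, the Avg column is consistent with the unweighted arithmetic mean of the nine task scores, e.g. row 1: (87.19 + 57.66 + 91.85 + 62.17 + 27.50 + 89.89 + 93.20 + 88.39 + 90.17) / 9 ≈ 76.45. A minimal sketch of that computation follows; the column-name dictionary is illustrative.

# Sketch of how the leaderboard Avg appears to be computed: the unweighted
# mean of the nine task scores (assumption, checked against the first row).

TASKS = ["OpenBookQA", "HellaSwag", "GSM8K", "MATH", "HumanEval",
         "SST-2", "IMDB", "SQuAD", "DROP"]

def row_average(scores: dict) -> float:
    return round(sum(scores[t] for t in TASKS) / len(TASKS), 2)

row_1 = {"OpenBookQA": 87.19, "HellaSwag": 57.66, "GSM8K": 91.85,
         "MATH": 62.17, "HumanEval": 27.50, "SST-2": 89.89,
         "IMDB": 93.20, "SQuAD": 88.39, "DROP": 90.17}

# Matches the Deepseek-R1 + Mixtral-8x7B-Instruct-v0.1 row (76.45).
assert abs(row_average(row_1) - 76.45) < 0.01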

Example

BibTeX

@misc{tie2025tgbench,
  title={One Think, One Generate: Benchmarking Pairwise Language Models for Reasoning},
  author={Guiyao Tie and Zeli Zhao and Tianhe Gu and Chaoran Hu and Hao He and Tianyao Luo and Ruihang Zhang and Sizhe Zhang and Pan Zhou and Lichao Sun},
  howpublished={Submitted to NeurIPS 2025},
  year={2025},
}