HKU Business School Releases Latest Report on AI’s Advanced Reasoning Capabilities
15 Oct 2025
HKU Business School today released the "Large Language Model (LLM) Advanced Reasoning Capability Evaluation Report in Chinese-Language Contexts," revealing the current advanced reasoning capabilities of selected AI LLMs. The report shows that US models generally lead in this area, while Chinese models have achieved breakthroughs in certain domains but still have significant room for improvement on complex reasoning tasks.
Since the start of 2025, AI has been evolving rapidly, with LLMs shifting from ‘chatting’ to ‘reasoning’. Nevertheless, AI performance varies considerably in scenarios that require sophisticated reasoning, such as integrating and analysing cross-modal information (for example, images and text) and reasoning innovatively about unconventional, complex questions. Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School, leads the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), which developed an integrated evaluation system for multimodal and Olympiad-level reasoning. The study assessed 37 LLMs released in China and the United States up to October 2025, comprising 14 reasoning models, 20 general-purpose models, and 3 integrated systems, on both multimodal and Olympiad-level reasoning.
Evaluation Results
- In multimodal reasoning, OpenAI’s GPT series continued to dominate; China’s Doubao 1.5 Pro (Thinking) also reached the global top tier.
- In Olympiad-level reasoning, US models dominated, with GPT-5 (Thinking) leading by a decisive margin.
- Overall, in advanced reasoning evaluations, reasoning models stand out, while general-purpose models lag behind.
- This tiered differentiation aligns closely with industry trends, revealing a pivotal shift in AI from pursuing broad, all-scenario coverage to targeted breakthroughs and efficiency optimisation in specialised domains. It signals a transition from a phase of breadth expansion to one of depth-focused refinement.
Professor Jiang remarked: “Advanced reasoning capability is vital for expanding AI applications across education, scientific research, business, and decision-making. This research offers valuable insights into the current landscape of advanced AI reasoning capabilities, enabling the industry to precisely identify technical bottlenecks and accelerate the deployment of general AI in high-demand fields. We should target to transform AI from a ‘dialogue assistant’ to a more sophisticated ‘intelligent partner’.”
Evaluation Methodology
Based on the two core capabilities required for advanced reasoning, the study assessed LLMs’ multimodal reasoning capability and Olympiad-level reasoning capability.
- Multimodal Reasoning Capability refers to a model’s ability to integrate multiple modalities of information, such as text, images, and charts, and perform cross-modal analysis and logical inference. In the context of education, it can help students connect textbook explanations with diagrams to grasp abstract concepts. This capability is essential for AI to effectively handle complex real-world tasks.
- Olympiad-level Reasoning Capability evaluates models’ performance on high-difficulty problems from competitions like the International Mathematical Olympiad (IMO). These problems require complex logical structures, multi-step derivations, and innovative thinking. They often lack a single ‘correct’ answer, instead testing whether AI can ‘think outside the box’ and find optimal solutions. Olympiad-level reasoning is a stringent test of whether a model possesses genuine ‘intelligence’.
Multimodal Reasoning Capability Performance and Rankings
The distribution of scores reveals a distinctly tiered landscape, underscoring sharp disparities in multimodal reasoning capability. The GPT family claims four of the top five spots, while Doubao 1.5 Pro (Thinking Mode) is the only Chinese model among the top five, with negligible differences between its general and thinking modes, indicating that its native multimodal reasoning capability has reached an internationally leading standard.
| Ranking | Model Name | Accuracy |
|---|---|---|
| 1 | GPT-5 (Thinking) | 91 |
| 2 | GPT-4.1 | 90 |
| 3 | GPT-o3 | 87 |
| 4 | Doubao 1.5 Pro (Thinking) | 85 |
| 4 | GPT-5 (Auto) | 85 |
| 6 | GPT-4o | 84 |
| 7 | Claude 4 Opus (Thinking) | 83 |
| 8 | Doubao 1.5 Pro | 82 |
| 8 | Grok 3 (Thinking) | 82 |
| 10 | Qwen 3 | 81 |
| 11 | Kimi-k1.5 | 80 |
| 11 | SenseChat V6 (Thinking) | 80 |
| 11 | Step R1-V-Mini | 80 |
| 14 | Grok 4 | 79 |
| 14 | GPT-4o mini | 79 |
| 14 | Hunyuan-T1 | 79 |
| 17 | GLM-4-plus | 78 |
| 17 | Qwen 3 (Thinking) | 78 |
| 19 | Gemini 2.5 Flash | 77 |
| 19 | GLM-Z1-Air | 77 |
| 21 | Llama 3.3 70B | 76 |
| 22 | SenseChat V6 Pro | 75 |
| 22 | Gemini 2.5 Pro | 75 |
| 24 | Ernie 4.5-Turbo | 74 |
| 25 | Step 2 | 73 |
| 26 | Hunyuan-TurboS | 71 |
| 26 | Claude 4 Opus | 71 |
| 28 | Spark 4.0 Ultra | 68 |
| 28 | MiniMax-01 | 68 |
| 30 | Baichuan4-Turbo | 67 |
| 31 | Grok 3 | 66 |
| 32 | Kimi | 63 |

*Note: Scores are rounded to the nearest integer.

Table 1: Ranking of Multimodal Reasoning Capability
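As the rounding note suggests, the tied positions in Table 1 follow standard competition ("1224") ranking over the rounded scores: tied models share a rank, and the next distinct score resumes at its true position. A minimal sketch of that convention (illustrative only, not the report's own code), using a few scores from the top of the table:

```python
# Illustrative: derive competition ("1224") ranks from rounded accuracy
# scores, as seen in Table 1 where two models share rank 4 and the next
# model is ranked 6. Scores below are taken from the table.
scores = {
    "GPT-5 (Thinking)": 91,
    "GPT-4.1": 90,
    "GPT-o3": 87,
    "Doubao 1.5 Pro (Thinking)": 85,
    "GPT-5 (Auto)": 85,
    "GPT-4o": 84,
}

ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
ranks = {}
for position, (model, score) in enumerate(ordered, start=1):
    prev_model, prev_score = ordered[position - 2] if position > 1 else (None, None)
    if score == prev_score:
        # Tied score: share the rank of the previous model.
        ranks[model] = ranks[prev_model]
    else:
        # New distinct score: rank resumes at the true list position.
        ranks[model] = position

print(ranks)
```

Under this convention, rounding to integers is what creates the visible ties; the underlying unrounded scores may differ slightly.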
Olympiad-level Reasoning Capability Performance and Rankings
Based on the evaluation results, US LLMs demonstrate multi-dimensional leadership in accuracy, logical coherence, methodological innovation, and puzzle-solving reasoning ability. GPT-5 (Thinking Mode) and Gemini 2.5 Pro lead the rankings by a significant margin, with GPT-o3 and Claude 4 Opus (Thinking Mode) ranking third and fourth, respectively. Among the Chinese models, only Qwen 3 (Thinking Mode) and Step R1-V-Mini perform relatively well, highlighting considerable room for improvement in complex reasoning for these models.
Additionally, when comparing the same company’s general-purpose and reasoning model versions, the models operating in Thinking Mode generally perform better across all dimensions of Olympiad-level reasoning.
| Ranking | Model Name | Correctness | Logical Coherence | Methodological Innovation | Overall |
|---|---|---|---|---|---|
| 1 | GPT-5 (Thinking) | 48 | 47 | 44 | 48 |
| 2 | Gemini 2.5 Pro | 48 | 39 | 36 | 44 |
| 3 | GPT-o3 | 36 | 42 | 39 | 38 |
| 4 | Claude 4 Opus (Thinking) | 30 | 36 | 39 | 33 |
| 5 | Gemini 2.5 Flash | 35 | 28 | 31 | 32 |
| 5 | GPT-o4 mini | 32 | 33 | 33 | 32 |
| 7 | Qwen 3 (Thinking) | 29 | 25 | 28 | 28 |
| 7 | Step R1-V-Mini | 26 | 33 | 22 | 28 |
| 9 | GLM-Z1-Air | 27 | 31 | 22 | 27 |
| 9 | SenseChat V6 (Thinking) | 27 | 28 | 22 | 27 |
| 11 | Qwen 3 | 25 | 31 | 17 | 26 |
| 12 | Ernie 4.5-Turbo | 25 | 25 | 19 | 24 |
| 13 | Grok 3 (Thinking) | 21 | 28 | 25 | 23 |
| 14 | GPT-5 (Auto) | 22 | 22 | 28 | 22 |
| 14 | DeepSeek-V3 | 26 | 14 | 22 | 22 |
| 16 | Claude 4 Opus | 22 | 17 | 31 | 21 |
| 17 | Doubao 1.5 Pro (Thinking) | 22 | 17 | 22 | 20 |
| 17 | DeepSeek-R1 | 17 | 25 | 22 | 20 |
| 19 | Grok 3 | 20 | 19 | 17 | 19 |
| 19 | Grok 4 | 19 | 17 | 25 | 19 |
| 21 | Ernie X1-Turbo | 17 | 19 | 14 | 17 |
| 21 | Hunyuan-T1 | 17 | 17 | 19 | 17 |
| 21 | Hunyuan-TurboS | 17 | 17 | 19 | 17 |
| 21 | Kimi-k1.5 | 17 | 19 | 11 | 17 |
| 25 | Doubao 1.5 Pro | 16 | 17 | 19 | 16 |
| 26 | GLM-4-plus | 12 | 17 | 8 | 13 |
| 27 | GPT-4o | 13 | 8 | 19 | 12 |
| 27 | Spark 4.0 Ultra | 13 | 11 | 14 | 12 |
| 29 | Baichuan4-Turbo | 8 | 19 | 11 | 11 |
| 29 | GPT-4.1 | 11 | 8 | 17 | 11 |
| 31 | Kimi | 6 | 14 | 17 | 9 |
| 31 | Llama 3.3 70B | 7 | 14 | 6 | 9 |
| 33 | Yi-Lightning | 6 | 11 | 14 | 8 |
| 33 | SenseChat V6 Pro | 8 | 8 | 6 | 8 |
| 35 | MiniMax-01 | 5 | 11 | 8 | 7 |
| 35 | Step 2 | 6 | 8 | 8 | 7 |
| 35 | 360 Zhinao 2-o1 | 7 | 6 | 8 | 7 |

*Note: Scores are rounded to the nearest integer.

Table 2: Olympiad-level Reasoning Capability Ranking
Click here to view the complete report.
Overall, this evaluation offers valuable insights into the current landscape of advanced AI reasoning capabilities. US-developed models maintain a clear advantage in this domain, consistently excelling in both multimodal and Olympiad-level reasoning, while Chinese-developed models still need to close a critical gap in scenarios requiring deep contextual understanding, intricate inference chains, or creative problem-solving. A distinct pattern also emerges: models specifically optimised for reasoning tasks outperform general-purpose ones by a significant margin.
Looking ahead, AI must continue to make breakthroughs in multimodal integration and in creative problem-solving under conditions of extreme complexity. Chinese-developed models, leveraging their advantage in local context understanding, have the opportunity to strategically address weaknesses in advanced reasoning and drive AI closer to ‘true intelligence’ in broader and more impactful applications.
Hi-res photos are available here.
For media enquiries, please contact:
HKU Business School

Viva LIU
Communications and Public Relations
Tel: +852 3910 3307
Email: changlbs@hku.hk

Ran ELFASSY
Communications and Public Relations
Tel: +852 3917 0714
Email: relfassy@hku.hk