
Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of this human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with a range of open-source and closed-source MLLMs, we find that Gemini-2.0-pro-flash performs best, achieving an overall score of 65.3. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
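The leaderboard below reports three metrics for each type of geometric principle: GPI (geometric principles identification), GPA (geometric principles application), and ACC (final answer accuracy). As a rough sketch of how such per-problem scores could be computed (the matching rules, `Annotation`, and `score_response` below are illustrative assumptions, not GeoSense's actual evaluation code):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    principles: set[str]  # geometric principles the problem requires
    answer: str           # ground-truth final answer

def score_response(ann: Annotation, identified: set[str],
                   applied: set[str], final_answer: str) -> dict[str, float]:
    """Hypothetical per-problem scoring of the three GeoSense aspects."""
    # GPI: fraction of required principles the model identifies.
    gpi = len(ann.principles & identified) / len(ann.principles)
    # GPA: fraction of required principles the model also applies correctly.
    gpa = len(ann.principles & applied) / len(ann.principles)
    # ACC: exact match of the final answer (a real evaluator would be more tolerant).
    acc = float(final_answer.strip() == ann.answer.strip())
    return {"GPI": gpi, "GPA": gpa, "ACC": acc}

# Example: two required principles, one identified and applied, wrong answer.
ann = Annotation(principles={"Pythagorean theorem", "circle area formula"},
                 answer="25")
print(score_response(ann, {"Pythagorean theorem"}, {"Pythagorean theorem"}, "24"))
# -> {'GPI': 0.5, 'GPA': 0.5, 'ACC': 0.0}
```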
Scores are grouped by geometric-principle type: Definitions (Def.), Theorems (Thm.), Formulas (Form.), and the All aggregate; each group reports GPI, GPA, and ACC.

**Closed-Source MLLMs**

| # | Model | Def. GPI | Def. GPA | Def. ACC | Thm. GPI | Thm. GPA | Thm. ACC | Form. GPI | Form. GPA | Form. ACC | All GPI | All GPA | All ACC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.0-pro-flash (Google) 🥇 | 64.2 | 47.0 | 73.3 | 72.7 | 59.0 | 72.4 | 87.4 | 60.0 | 77.9 | 72.1 | 49.7 | 74.1 | 65.3 |
| 2 | Claude-3.7-Sonnet (Anthropic) 🥈 | 62.0 | 46.7 | 54.3 | 60.2 | 50.0 | 46.5 | 92.4 | 56.1 | 67.9 | 68.7 | 45.2 | 57.6 | 57.2 |
| 3 | Gemini-1.5-pro-flash (Google) 🥉 | 60.2 | 43.8 | 53.0 | 58.7 | 51.5 | 45.6 | 85.9 | 55.3 | 56.1 | 67.9 | 44.9 | 55.7 | 56.2 |
| 4 | GPT-4o (OpenAI) | 56.3 | 46.3 | 48.0 | 54.1 | 49.3 | 37.4 | 90.8 | 58.3 | 61.1 | 64.4 | 45.3 | 51.7 | 53.8 |
| 5 | Claude-3.5-Sonnet (Anthropic) | 56.5 | 41.2 | 41.9 | 54.9 | 46.8 | 33.8 | 82.8 | 52.5 | 52.9 | 63.2 | 40.8 | 46.1 | 50.0 |

**Open-Source MLLMs**

| # | Model | Def. GPI | Def. GPA | Def. ACC | Thm. GPI | Thm. GPA | Thm. ACC | Form. GPI | Form. GPA | Form. ACC | All GPI | All GPA | All ACC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-VL-72B (Alibaba) 🥇 | 61.5 | 47.5 | 61.5 | 65.1 | 54.8 | 57.5 | 89.7 | 61.5 | 63.8 | 68.5 | 48.1 | 63.8 | 60.1 |
| 2 | QVQ-72B-Preview† (Alibaba) 🥈 | 68.2 | 56.0 | 53.1 | 63.6 | 58.3 | 49.6 | 85.1 | 58.4 | 54.2 | 72.3 | 53.5 | 54.3 | 60.0 |
| 3 | Qwen2-VL-72B (Alibaba) 🥉 | 57.2 | 44.2 | 46.6 | 57.7 | 44.2 | 46.6 | 85.5 | 52.0 | 50.4 | 64.0 | 43.4 | 49.2 | 52.2 |
| 4 | Qwen2.5-VL-7B (Alibaba) | 57.7 | 45.6 | 43.6 | 57.4 | 51.2 | 37.5 | 85.9 | 60.4 | 53.1 | 63.1 | 44.6 | 46.3 | 51.3 |
| 5 | Qwen2.5-VL-3B (Alibaba) | 50.5 | 39.9 | 33.5 | 48.8 | 47.0 | 27.7 | 74.8 | 45.0 | 41.2 | 55.2 | 36.5 | 34.9 | 42.2 |
| 6 | LLaVA-OneVision-72B (Microsoft) | 47.9 | 39.0 | 33.7 | 49.6 | 44.8 | 36.4 | 68.3 | 55.9 | 43.1 | 52.5 | 33.2 | 37.2 | 41.0 |
| 7 | DeepSeek-VL2 (DeepSeek) | 40.1 | 37.8 | 33.1 | 40.6 | 39.6 | 26.0 | 76.3 | 52.8 | 42.4 | 48.4 | 33.4 | 35.7 | 39.2 |
| 8 | InternVL2.5-78B (Shanghai AI Lab) | 49.0 | 45.2 | 29.8 | 48.6 | 46.8 | 32.0 | 80.2 | 30.5 | 18.3 | 53.7 | 32.9 | 28.7 | 38.4 |
| 9 | InternVL2.5-38B-MPO† (Shanghai AI Lab) | 50.7 | 44.6 | 29.7 | 48.2 | 46.4 | 30.0 | 75.6 | 29.3 | 16.0 | 53.9 | 33.6 | 27.7 | 38.4 |
| 10 | Llama-vision-90B (Meta) | 49.1 | 39.2 | 27.3 | 42.0 | 36.0 | 21.2 | 78.2 | 43.6 | 37.0 | 52.9 | 31.4 | 29.8 | 38.0 |
| 11 | InternVL2.5-38B (Shanghai AI Lab) | 48.7 | 40.6 | 28.9 | 44.5 | 43.9 | 29.8 | 74.8 | 26.4 | 16.0 | 52.7 | 31.1 | 27.3 | 37.0 |
| 12 | Llama-vision-11B (Meta) | 43.2 | 36.1 | 22.6 | 37.9 | 35.6 | 18.7 | 74.8 | 37.5 | 29.8 | 47.9 | 29.2 | 24.8 | 34.0 |
| 13 | LLaVA-OneVision-7B (Microsoft) | 36.3 | 38.0 | 22.7 | 39.2 | 39.2 | 22.7 | 72.9 | 40.6 | 42.6 | 41.4 | 26.0 | 22.8 | 30.1 |
| 14 | DeepSeek-VL2-small (DeepSeek) | 25.6 | 35.7 | 23.3 | 26.7 | 36.1 | 19.5 | 67.9 | 48.1 | 30.2 | 34.2 | 23.8 | 26.3 | 28.1 |
† Specially trained for reasoning tasks.
🥇🥈🥉 Rankings within each category are based on the AVG score.
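The AVG column is consistent with an unweighted mean of the three All sub-scores (GPI, GPA, ACC). A minimal sketch of that aggregation, assuming equal weighting (inferred from the table, not stated explicitly):

```python
def overall_average(gpi: float, gpa: float, acc: float) -> float:
    """Overall AVG as the unweighted mean of the All-column sub-scores."""
    return round((gpi + gpa + acc) / 3, 1)

# Spot-checks against the table above:
assert overall_average(72.1, 49.7, 74.1) == 65.3  # Gemini-2.0-pro-flash
assert overall_average(68.5, 48.1, 63.8) == 60.1  # Qwen2.5-VL-72B
```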
Key statistics of our GeoSense benchmark.
Diagram of the top three levels of geometric principles (five levels in total).
Illustration of the GeoSense evaluation strategy. MLLMs are assessed on three aspects: identification of geometric principles (GPI), application of geometric principles (GPA), and final answer accuracy (ACC).
Mathematical Evaluation on Different Subjects in GeoSense. GPI = Geometric Principles Identification, GPA = Geometric Principles Application, CSF = Calculation of Solid Figures, USF = Understanding of Solid Figures, TMPF = Transformation and Motion of Plane Figures, CPF = Calculation of Plane Figures, UPF = Understanding of Plane Figures.
The performance of MLLMs in different subjects across (a) GPI, (b) GPA, and (c) ACC.
Error Analysis of Leading Closed-Source and Open-Source MLLMs. For each problem, we identify the critical errors in their reasoning process and categorize them into four types: geometric principles identification (GPI) errors, geometric principles application (GPA) errors, calculation errors (CAL), and hallucinations (HAL).
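A minimal sketch of this error taxonomy as a tallying helper (the `CriticalError` enum and `error_distribution` are illustrative; the per-problem labeling itself comes from manual analysis, not code):

```python
from collections import Counter
from enum import Enum

class CriticalError(Enum):
    GPI = "geometric principles identification error"
    GPA = "geometric principles application error"
    CAL = "calculation error"
    HAL = "hallucination"

def error_distribution(labels: list[CriticalError]) -> dict[str, float]:
    """Share of problems whose critical error falls into each category."""
    counts = Counter(labels)
    return {e.name: counts[e] / len(labels) for e in CriticalError}

# Example with hypothetical labels for four problems:
labels = [CriticalError.GPI, CriticalError.GPI, CriticalError.CAL, CriticalError.HAL]
print(error_distribution(labels))
# -> {'GPI': 0.5, 'GPA': 0.0, 'CAL': 0.25, 'HAL': 0.25}
```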
GPI is the primary bottleneck for complex problem-solving
Key Limitations:
Recommendations:
Examples of GeoSense-Chinese.
Examples of GeoSense-English.
Model response examples of GeoSense.