F1 scores on VerifierBench across different domains.

| # | Model | Type | AVG | General Reasoning | Knowledge | Math | Science |
|---|-------|------|-----|-------------------|-----------|------|---------|
| 1 | CompassVerifier-32B 🥇 | Verifier | 87.7 | 90.3 | 94.8 | 80.8 | 84.7 |
| 2 | CompassVerifier-7B 🥈 | Verifier | 83.4 | 87.7 | 92.6 | 74.8 | 78.5 |
| 3 | CompassVerifier-3B 🥉 | Verifier | 80.4 | 85.9 | 87.7 | 71.0 | 77.1 |
| 4 | xVerify-8B-I | Verifier | 70.4 | 78.9 | 85.1 | 42.6 | 74.9 |
| 5 | DeepSeek-V3-0324 | LLM | 70.3 | 76.6 | 81.2 | 54.7 | 68.5 |
| 6 | xVerify-0.5B-I | Verifier | 70.0 | 78.5 | 86.2 | 42.6 | 72.6 |
| 7 | GPT-4.1-2025-04-14 | LLM | 69.8 | 79.5 | 82.9 | 42.0 | 75.0 |
| 8 | xVerify-9B-C | Verifier | 69.1 | 77.0 | 81.7 | 48.0 | 69.8 |
| 9 | Tencent-Qwen2.5-7B-RLVR | Verifier | 67.1 | 73.8 | 76.8 | 55.3 | 62.6 |
| 10 | Qwen3-235B | LLM | 62.7 | 73.7 | 73.1 | 53.9 | 50.0 |
| 11 | Qwen3-32B | LLM | 61.8 | 70.3 | 69.5 | 54.6 | 52.8 |
| 12 | GPT-4o-2024-08-06 | LLM | 59.1 | 68.2 | 78.3 | 34.9 | 54.9 |
| 13 | Qwen3-8B | LLM | 56.4 | 61.8 | 69.4 | 51.6 | 42.9 |
| 14 | Qwen2.5-72B-Instruct | LLM | 53.9 | 49.0 | 68.5 | 37.5 | 60.5 |
| 15 | Qwen2.5-32B-Instruct | LLM | 42.2 | 42.2 | 46.4 | 31.6 | 48.8 |
| 16 | Qwen2.5-7B-Instruct | LLM | 42.1 | 51.1 | 50.7 | 30.0 | 36.6 |

Type: "Verifier" denotes a specialized verifier model; "LLM" denotes a general-purpose large language model.
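
The leaderboard does not say how the AVG column is derived. As a quick sanity check, the sketch below assumes AVG is the unweighted mean of the four domain F1 scores and compares that against the reported value for the top three rows of the table above. The helper name `macro_average` and the 0.1 tolerance are illustrative choices, not part of VerifierBench.

```python
# Reported AVG and per-domain F1 scores, copied from the leaderboard table.
# Domain order: General Reasoning, Knowledge, Math, Science.
rows = {
    "CompassVerifier-32B": (87.7, [90.3, 94.8, 80.8, 84.7]),
    "CompassVerifier-7B": (83.4, [87.7, 92.6, 74.8, 78.5]),
    "CompassVerifier-3B": (80.4, [85.9, 87.7, 71.0, 77.1]),
}


def macro_average(scores):
    """Unweighted mean of the domain scores (assumed definition of AVG)."""
    return sum(scores) / len(scores)


def matches_reported(reported, scores, tolerance=0.1):
    """True if the computed mean agrees with the reported AVG up to rounding."""
    return abs(macro_average(scores) - reported) < tolerance


for model, (reported_avg, domain_scores) in rows.items():
    computed = macro_average(domain_scores)
    status = "ok" if matches_reported(reported_avg, domain_scores) else "mismatch"
    print(f"{model}: reported {reported_avg:.1f}, computed {computed:.2f} ({status})")
```

If the assumption holds, every row prints `ok`; a `mismatch` would suggest that AVG is weighted by domain size or computed over pooled samples instead.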
🚨 To submit your results to the leaderboard, please contact us at liushudong@pjlab.org.cn and liuhongwei@pjlab.org.cn.