F1 scores on VerifierBench across different domains.
| # | Model | Type | AVG | General Reasoning | Knowledge | Math | Science |
|---|-------|------|-----|-------------------|-----------|------|---------|
| 1 | CompassVerifier-32B 🥇 | Verifier | 87.7 | 90.3 | 94.8 | 80.8 | 84.7 |
| 2 | CompassVerifier-7B 🥈 | Verifier | 83.4 | 87.7 | 92.6 | 74.8 | 78.5 |
| 3 | CompassVerifier-3B 🥉 | Verifier | 80.4 | 85.9 | 87.7 | 71.0 | 77.1 |
| 4 | xVerify-8B-I | Verifier | 70.4 | 78.9 | 85.1 | 42.6 | 74.9 |
| 5 | DeepSeek-V3-0324 | LLM | 70.3 | 76.6 | 81.2 | 54.7 | 68.5 |
| 6 | xVerify-0.5B-I | Verifier | 70.0 | 78.5 | 86.2 | 42.6 | 72.6 |
| 7 | GPT-4.1-2025-04-14 | LLM | 69.8 | 79.5 | 82.9 | 42.0 | 75.0 |
| 8 | xVerify-9B-C | Verifier | 69.1 | 77.0 | 81.7 | 48.0 | 69.8 |
| 9 | Tencent-Qwen2.5-7B-RLVR | Verifier | 67.1 | 73.8 | 76.8 | 55.3 | 62.6 |
| 10 | Qwen3-235B | LLM | 62.7 | 73.7 | 73.1 | 53.9 | 50.0 |
| 11 | Qwen3-32B | LLM | 61.8 | 70.3 | 69.5 | 54.6 | 52.8 |
| 12 | GPT-4o-2024-08-06 | LLM | 59.1 | 68.2 | 78.3 | 34.9 | 54.9 |
| 13 | Qwen3-8B | LLM | 56.4 | 61.8 | 69.4 | 51.6 | 42.9 |
| 14 | Qwen2.5-72B-Instruct | LLM | 53.9 | 49.0 | 68.5 | 37.5 | 60.5 |
| 15 | Qwen2.5-32B-Instruct | LLM | 42.2 | 42.2 | 46.4 | 31.6 | 48.8 |
| 16 | Qwen2.5-7B-Instruct | LLM | 42.1 | 51.1 | 50.7 | 30.0 | 36.6 |
Type — Verifier: specialized verifier model; LLM: general-purpose large language model.
🚨 To submit your results to the leaderboard, please contact us at liushudong@pjlab.org.cn or liuhongwei@pjlab.org.cn.