CompassVerifier

A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

¹Shanghai AI Laboratory, ²University of Macau
*Equal Contribution, Project Lead

Introduction

CompassVerifier is an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, and can handle a wide range of answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal or invalid responses.

We introduce VerifierBench, a benchmark of questions and model outputs drawn from roughly one million predictions produced by a variety of commonly used models and datasets, with labels assigned by human experts. Our verifier models are available in three sizes: CompassVerifier-3B, CompassVerifier-7B, and CompassVerifier-32B.

The verifier outputs a simple judgment: A (correct), B (incorrect), or C (quality problems, e.g., abnormal or invalid responses). We also provide a CoT (chain-of-thought) mode that gives a detailed analysis for complex problems. CompassVerifier achieves state-of-the-art performance across multiple domains, particularly excelling in mathematical reasoning.
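As a rough illustration, the snippet below shows how a CompassVerifier checkpoint could be loaded with Hugging Face transformers and prompted for an A/B/C verdict. This is a minimal sketch: the repository ID and the prompt wording are illustrative assumptions rather than the released usage, so please refer to the official model cards for the exact template.

```python
# Minimal usage sketch, not the official interface: the repository ID and
# prompt wording below are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "opencompass/CompassVerifier-3B"  # assumed repo name; check the released model cards

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical verification prompt: question, gold answer, and a candidate response.
prompt = (
    "Judge whether the candidate answer is correct.\n"
    "Question: What is 12 * 13?\n"
    "Gold answer: 156\n"
    "Candidate answer: The product is 156.\n"
    "Reply with A (correct), B (incorrect), or C (invalid/abnormal response)."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=8, do_sample=False)

# Decode only the newly generated tokens to read the verdict.
verdict = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(verdict)  # expected: "A" for this example
```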

Model Comparison

Performance comparison of CompassVerifier with other models across different domains on VerifierBench.

Leaderboard on VerifierBench

F1 scores on VerifierBench across different domains.

| # | Model | Type | AVG | General Reasoning | Knowledge | Math | Science |
|---|-------|------|-----|-------------------|-----------|------|---------|
| 1 | CompassVerifier-32B 🥇 | Verifier | 87.7 | 90.3 | 94.8 | 80.8 | 84.7 |
| 2 | CompassVerifier-7B 🥈 | Verifier | 83.4 | 87.7 | 92.6 | 74.8 | 78.5 |
| 3 | CompassVerifier-3B 🥉 | Verifier | 80.4 | 85.9 | 87.7 | 71.0 | 77.1 |
| 4 | xVerify-8B-I | Verifier | 70.4 | 78.9 | 85.1 | 42.6 | 74.9 |
| 5 | DeepSeek-V3-0324 | LLM | 70.3 | 76.6 | 81.2 | 54.7 | 68.5 |
| 6 | xVerify-0.5B-I | Verifier | 70.0 | 78.5 | 86.2 | 42.6 | 72.6 |
| 7 | GPT-4.1-2025-04-14 | LLM | 69.8 | 79.5 | 82.9 | 42.0 | 75.0 |
| 8 | xVerify-9B-C | Verifier | 69.1 | 77.0 | 81.7 | 48.0 | 69.8 |
| 9 | Tencent-Qwen2.5-7B-RLVR | Verifier | 67.1 | 73.8 | 76.8 | 55.3 | 62.6 |
| 10 | Qwen3-235B | LLM | 62.7 | 73.7 | 73.1 | 53.9 | 50.0 |
| 11 | Qwen3-32B | LLM | 61.8 | 70.3 | 69.5 | 54.6 | 52.8 |
| 12 | GPT-4o-2024-08-06 | LLM | 59.1 | 68.2 | 78.3 | 34.9 | 54.9 |
| 13 | Qwen3-8B | LLM | 56.4 | 61.8 | 69.4 | 51.6 | 42.9 |
| 14 | Qwen2.5-72B-Instruct | LLM | 53.9 | 49.0 | 68.5 | 37.5 | 60.5 |
| 15 | Qwen2.5-32B-Instruct | LLM | 42.2 | 42.2 | 46.4 | 31.6 | 48.8 |
| 16 | Qwen2.5-7B-Instruct | LLM | 42.1 | 51.1 | 50.7 | 30.0 | 36.6 |

Type: "Verifier" denotes a specialized verifier model; "LLM" denotes a general-purpose large language model.
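For reference, each domain score above can be read as an F1 of the verifier's judgments against expert labels, computed per domain and combined into AVG. The sketch below illustrates one such computation, assuming macro-averaged F1 over the A/B/C classes on toy data; the benchmark's exact averaging scheme may differ.

```python
# Minimal sketch: per-domain F1 of verifier judgments against human labels.
# Assumes macro-averaged F1 over the A/B/C classes on toy records; the
# benchmark's exact averaging scheme may differ.
from collections import defaultdict
from sklearn.metrics import f1_score

# Toy records: (domain, human_label, verifier_prediction)
records = [
    ("Math", "A", "A"),
    ("Math", "B", "A"),
    ("Knowledge", "B", "B"),
    ("Knowledge", "C", "C"),
    ("Science", "A", "B"),
]

# Group gold labels and predictions by domain.
by_domain = defaultdict(lambda: ([], []))
for domain, gold, pred in records:
    by_domain[domain][0].append(gold)
    by_domain[domain][1].append(pred)

# Macro F1 per domain, then a simple mean across domains.
domain_f1 = {
    domain: f1_score(gold, pred, average="macro", labels=["A", "B", "C"], zero_division=0)
    for domain, (gold, pred) in by_domain.items()
}
avg_f1 = sum(domain_f1.values()) / len(domain_f1)
print(domain_f1, avg_f1)
```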

🚨 To submit your results to the leaderboard, please contact us at liushudong@pjlab.org.cn and liuhongwei@pjlab.org.cn.

BibTeX

@article{liu2025compassverifier,
      title={CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward},
      author={Liu, Shudong and Liu, Hongwei and Liu, Junnan and Xiao, Linchen and Gao, Songyang and Lyu, Chengqi and Gu, Yuzhe and Zhang, Wenwei and Wong, Derek F and Zhang, Songyang and Chen, Kai},
      journal={arXiv preprint arXiv:2508.03686},
      year={2025}
}