BrowseComp-Plus Leaderboard
BrowseComp-Plus
BrowseComp-Plus is a deep-research evaluation benchmark built on top of BrowseComp.
It features a fixed, carefully curated corpus of web documents with human-verified positive documents and mined hard negatives.
With BrowseComp-Plus, you can evaluate and compare the individual components of a deep-research system:
- LLM Agent Comparison – Measure how different LLM agents perform as deep-research agents when paired with the same retrieval system.
- Retriever Evaluation – Assess how different retrievers impact the performance of deep-research agents.
For more details about the dataset, see the BrowseComp-Plus page on Hugging Face, the paper, and the GitHub repo.
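Since the dataset is hosted on Hugging Face, it can be loaded with the `datasets` library. The sketch below is a minimal example; the repository id and split name are assumptions, so check the dataset card for the exact identifiers.

```python
# Minimal sketch: load BrowseComp-Plus from the Hugging Face Hub.
# The repo id and split name are assumptions -- confirm them on the dataset card.
from datasets import load_dataset

queries = load_dataset("Tevatron/browsecomp-plus", split="test")  # hypothetical id/split

for example in queries.select(range(3)):
    # Inspect the available fields (query, labeled documents, etc.).
    print(example.keys())
```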
Leaderboards
This page contains two leaderboards:
- Agents: Evaluates the effectiveness of LLM agents paired with different retrievers. Accuracy compares each agent's generated answer against the ground-truth answer.
- Retrieval: Evaluates the effectiveness of retrievers in isolation. Metrics are computed against the human labels for evidence documents and gold documents; a minimal recall@k sketch follows this list.
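Because the Retrieval leaderboard scores ranked lists against human-labeled relevant documents, recall@k is the natural metric to reproduce locally. The function below is an illustrative sketch, not the benchmark's official scorer; the field and variable names are assumptions.

```python
# Sketch of a Retrieval-leaderboard-style metric: recall@k of human-labeled
# relevant documents (evidence or gold) within a ranked result list.
def recall_at_k(ranked_doc_ids: list[str], relevant_doc_ids: set[str], k: int) -> float:
    """Fraction of labeled relevant documents found in the top-k results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & relevant_doc_ids) / len(relevant_doc_ids)

# Example: 2 of 3 labeled documents appear in the top 5 -> recall@5 = 0.667.
print(recall_at_k(["d1", "d7", "d3", "d9", "d2"], {"d1", "d3", "d4"}, k=5))
```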
Example entry from the Retrieval leaderboard: Qwen3-Embedding-0.6B | 71.69 | 70.12 | 78.98 | 21.74 | 19.68 | 1000 | Sep 28, 2025 | BrowseComp-Plus