BrowseComp-Plus Leaderboard

BrowseComp-Plus

BrowseComp-Plus is a new deep-research evaluation benchmark built on top of BrowseComp.
It features a fixed, carefully curated corpus of web documents with human-verified positives and mined hard negatives.

With BrowseComp-Plus, you can thoroughly evaluate and compare the effectiveness of different components in a deep research system:

  1. LLM Agent Comparison – Measure how different LLM agents perform as deep-research agents when paired with the same retrieval system.
  2. Retriever Evaluation – Assess how different retrievers affect the performance of deep-research agents.

For more details about the dataset, please see the BrowseComp-Plus dataset page on Hugging Face, the paper, and the GitHub repository.
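Since the dataset is hosted on Hugging Face, the query set can be loaded directly with the `datasets` library. The sketch below is illustrative only; the repository ID and field names are assumptions, so check the dataset card for the exact identifier and schema.

```python
# Minimal sketch: loading BrowseComp-Plus queries from Hugging Face.
# The repository ID and field names are assumptions; consult the dataset
# card on Hugging Face for the exact identifier and schema.
from datasets import load_dataset

queries = load_dataset("Tevatron/browsecomp-plus", split="test")  # hypothetical ID

for example in queries.select(range(3)):
    print(example)  # inspect available fields (query, answer, gold document IDs, ...)
```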

Leaderboards

This page contains two leaderboards:

  1. Agents: Evaluates the effectiveness of LLM agents paired with different retrievers. Accuracy is measured by comparing each agent's generated answer against the ground-truth answer.
  2. Retrieval: Evaluates the effectiveness of retrievers in isolation. Metrics are computed against the human-labeled evidence and gold documents. A rough scoring sketch for both leaderboards follows this list.
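As a rough illustration of how the two leaderboards score a run, the sketch below computes answer accuracy against the ground-truth answers and recall of gold documents in a retrieved ranking. The field names, matching rule, and cutoff are assumptions, not the leaderboard's official evaluator.

```python
# Rough sketch of the two leaderboard metrics; the exact matching rule and
# cutoff are assumptions, not the official evaluation code.
from typing import Dict, List


def answer_accuracy(predictions: Dict[str, str], answers: Dict[str, str]) -> float:
    """Fraction of queries whose predicted answer matches the ground-truth answer."""
    correct = sum(
        predictions.get(qid, "").strip().lower() == gold.strip().lower()
        for qid, gold in answers.items()
    )
    return correct / len(answers)


def gold_recall(retrieved: Dict[str, List[str]], gold_docs: Dict[str, List[str]], k: int = 100) -> float:
    """Average fraction of gold documents that appear in the top-k retrieved list."""
    recalls = []
    for qid, gold in gold_docs.items():
        top_k = set(retrieved.get(qid, [])[:k])
        recalls.append(sum(doc_id in top_k for doc_id in gold) / len(gold))
    return sum(recalls) / len(recalls)
```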