MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
Abstract
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user-proxy agents. We present MirrorBench, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MirrorBench features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, Yule's K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MirrorBench yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source, ships with a simple command-line interface for running experiments, managing configurations and caching, and generating reports, and is available at https://github.com/SAP/mirrorbench.
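The lexical-diversity metrics named in the abstract are standard, well-defined statistics. To make two of them concrete, here is a minimal Python sketch of MATTR (moving-average type-token ratio) and Yule's K; this is an illustrative reimplementation under common definitions of these metrics, not MirrorBench's actual code, and the window size of 50 is an arbitrary choice for the example:

```python
from collections import Counter

def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average type-token ratio: average the TTR of every
    sliding window of fixed length. Assumes a non-empty token list."""
    if len(tokens) < window:
        # Text shorter than the window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2, where V_m is the
    number of types occurring exactly m times and N is the token count.
    Lower K indicates a more diverse (less repetitive) vocabulary."""
    n = len(tokens)
    freq = Counter(tokens)            # type -> frequency
    v = Counter(freq.values())        # frequency m -> number of types V_m
    return 10_000 * (sum(m * m * vm for m, vm in v.items()) - n) / (n * n)
```

For a text where every token is distinct, `yules_k` returns 0 (maximal diversity), while heavy repetition drives K upward; the two metrics thus probe complementary aspects of how "human-like" a proxy's word choice is.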
Community
MirrorBench is an automated, extensible framework for evaluating user-proxy agents for human-likeness. It provides a modular architecture for benchmarking different user-proxy agents against a variety of realism metrics. MirrorBench is designed to be extensible, allowing researchers and developers to bring their own agents and metrics into the framework (see the sketch after the repository link below).
GitHub Repo: https://github.com/SAP/mirrorbench
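As a rough illustration of what such a plug-in surface can look like, below is a hypothetical sketch of a registry-based design in Python. All names and signatures here are invented for illustration and are not MirrorBench's actual API; consult the repository above for the real interfaces:

```python
from abc import ABC, abstractmethod

class UserProxy(ABC):
    """Hypothetical interface for a user-proxy agent (illustrative only)."""

    @abstractmethod
    def respond(self, dialogue_history: list[dict[str, str]]) -> str:
        """Produce the next user utterance given the dialogue so far."""

class Metric(ABC):
    """Hypothetical interface for a realism metric (illustrative only)."""

    @abstractmethod
    def score(self, utterance: str, reference: str | None = None) -> float:
        """Return a human-likeness score for one generated utterance."""

# Metadata-driven registries: components are looked up by name at run time.
PROXIES: dict[str, type[UserProxy]] = {}
METRICS: dict[str, type[Metric]] = {}

def register_proxy(name: str):
    """Decorator that adds a UserProxy subclass to the proxy registry."""
    def decorator(cls: type[UserProxy]) -> type[UserProxy]:
        PROXIES[name] = cls
        return cls
    return decorator

@register_proxy("naive-prompted")
class NaivePromptedProxy(UserProxy):
    def respond(self, dialogue_history: list[dict[str, str]]) -> str:
        # A real proxy would call an LLM backend here; this stub just echoes
        # a canned reply so the example stays self-contained and runnable.
        return "Sounds good, can you tell me more?"
```

The registry pattern shown here is one common way to support "bring your own agent" designs: new components are registered by name and can then be selected from configuration files without touching the harness code.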
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts (2026)
- ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation (2026)
- Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity (2026)
- Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases (2025)
- ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning (2025)
- EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation (2026)
- Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks (2026)