EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Abstract
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples that are most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge model to resolve the labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier LLM performance at a fraction of the inference cost.
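The annotation framework described in the abstract can be sketched as a simple routing loop: two strong annotators label each sample, and only conflicting cases are escalated to a judge. This is a minimal illustration, not the paper's implementation; the annotator/judge callables are hypothetical stand-ins for frontier-LLM API calls, and the three level names are assumed, not the paper's exact label set.

```python
from typing import Callable, List, Tuple

# Assumed names for the three evasion levels (the paper does not list them here).
LABELS = ["direct", "partially_evasive", "fully_evasive"]


def mine_boundary_cases(
    samples: List[str],
    annotator_a: Callable[[str], str],
    annotator_b: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> List[Tuple[str, str, bool]]:
    """Label each Q&A sample; route annotator disagreements to the judge.

    Returns (sample, final_label, was_disagreement) triples, so that
    judge-resolved "hard" boundary cases can be tracked separately from
    easy consensus cases when building the training set.
    """
    labeled = []
    for qa_pair in samples:
        label_a = annotator_a(qa_pair)
        label_b = annotator_b(qa_pair)
        if label_a == label_b:
            # Consensus between the two strong annotators: keep the label.
            labeled.append((qa_pair, label_a, False))
        else:
            # Conflict signals a hard example; the judge resolves the label.
            labeled.append((qa_pair, judge(qa_pair, label_a, label_b), True))
    return labeled
```

The `was_disagreement` flag makes it easy to oversample or separately weight the judge-resolved examples, which the abstract reports generalize better despite higher training loss.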
Community
Thanks for featuring our work! EvasionBench aims to bridge the gap in financial transparency. We've released the Eva-4B model and the 1k human-annotated test set.
Paper: https://arxiv.org/abs/2601.09142
Model: https://huggingface.co/FutureMa/Eva-4B
Demo: https://huggingface.co/spaces/FutureMa/financial-evasion-detection
Feel free to ask any questions!
I'm sharing our latest work on detecting evasive answers in earnings calls.
Key Highlights:
- EvasionBench: A large-scale benchmark (30k training / 1k human test).
- Disagreement Mining: A novel annotation framework where LLM disagreement identifies high-value training samples.
- Eva-4B: A lightweight model that achieves 81.3% accuracy, outperforming many closed-source frontier models.
We have open-sourced the model and demo. Happy to answer any questions about the labeling protocol or the financial NLP aspects!
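For questions about the labeling protocol: the inter-annotator agreement reported in the abstract (Cohen's Kappa 0.835) is the standard chance-corrected agreement statistic. A minimal, dependency-free sketch of the textbook formula (not code from the paper) looks like this:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

In practice a library implementation such as `sklearn.metrics.cohen_kappa_score` gives the same result; the sketch just makes the chance-correction explicit.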