EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Abstract
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples that are most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge model to resolve the labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier LLM performance at a fraction of the inference cost.
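The annotation framework described in the abstract can be sketched as a simple routing loop: two strong annotators label each sample, and only conflicting cases are escalated to a judge. This is a minimal illustration, not the paper's implementation; the annotator/judge callables are hypothetical stand-ins for frontier-LLM API calls, and the three level names are assumed, not the paper's exact label set.

```python
from typing import Callable, List, Tuple

# Assumed names for the three evasion levels (the paper does not list them here).
LABELS = ["direct", "partially_evasive", "fully_evasive"]


def mine_boundary_cases(
    samples: List[str],
    annotator_a: Callable[[str], str],
    annotator_b: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> List[Tuple[str, str, bool]]:
    """Label each Q&A sample; route annotator disagreements to the judge.

    Returns (sample, final_label, was_disagreement) triples, so that
    judge-resolved "hard" boundary cases can be tracked separately from
    easy consensus cases when building the training set.
    """
    labeled = []
    for qa_pair in samples:
        label_a = annotator_a(qa_pair)
        label_b = annotator_b(qa_pair)
        if label_a == label_b:
            # Consensus between the two strong annotators: keep the label.
            labeled.append((qa_pair, label_a, False))
        else:
            # Conflict signals a hard example; the judge resolves the label.
            labeled.append((qa_pair, judge(qa_pair, label_a, label_b), True))
    return labeled
```

The `was_disagreement` flag makes it easy to oversample or separately weight the judge-resolved examples, which the abstract reports generalize better despite higher training loss.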
Community
Thanks for featuring our work! EvasionBench aims to bridge the gap in financial transparency. We've released the Eva-4B model and the 1k human-annotated test set.
Paper: https://arxiv.org/abs/2601.09142
Model: https://huggingface.co/FutureMa/Eva-4B
Demo: https://huggingface.co/spaces/FutureMa/financial-evasion-detection
Feel free to ask any questions!
I'm sharing our latest work on detecting evasive answers in earnings calls.
Key Highlights:
- EvasionBench: A large-scale benchmark (30k training / 1k human test).
- Disagreement Mining: A novel annotation framework where LLM disagreement identifies high-value training samples.
- Eva-4B: A lightweight model that achieves 81.3% accuracy, outperforming many closed-source frontier models.
We have open-sourced the model and demo. Happy to answer any questions about the labeling protocol or the financial NLP aspects!
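For questions about the labeling protocol: the inter-annotator agreement reported in the abstract (Cohen's Kappa 0.835) is the standard chance-corrected agreement statistic. A minimal, dependency-free sketch of the textbook formula (not code from the paper) looks like this:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

In practice a library implementation such as `sklearn.metrics.cohen_kappa_score` gives the same result; the sketch just makes the chance-correction explicit.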