arxiv:2601.11077

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Published on Jan 16 · Submitted by yangjie on Jan 20
#1 Paper of the day

Abstract

AI-generated summary: ABC-Bench evaluates LLM agents on realistic backend coding tasks requiring full development lifecycle management from repository exploration to containerized service deployment and API testing.

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services and passing external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.

Community

Paper author · Paper submitter

Hi everyone, I'm one of the authors of ABC-Bench (arXiv:2601.11077).

While building Code Agents, we realized that current benchmarks often stop at "generating correct code snippets." But as developers, we know that real-world backend engineering is much more than that: it's about exploring unfamiliar repos, configuring environments, writing Dockerfiles, and actually deploying a live service. That's why we created ABC-Bench.
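
To make that concrete, here's a rough, illustrative sketch of the containerize-and-launch step (this is not our actual evaluation harness; the repo path, image tag, and port are placeholders):

```python
# Illustrative sketch only: build the Dockerfile the agent wrote, then launch
# the service. REPO_DIR, IMAGE, and PORT are hypothetical placeholders.
import subprocess

REPO_DIR = "./task-repo"      # checkout the agent has edited (placeholder)
IMAGE = "abc-task-service"    # image tag (placeholder)
PORT = 8080                   # port the service listens on (placeholder)

# Build an image from the agent-written Dockerfile at the repo root.
subprocess.run(["docker", "build", "-t", IMAGE, REPO_DIR], check=True)

# Run the container in the background, exposing the service port on the host.
subprocess.run(
    ["docker", "run", "-d", "--rm", "-p", f"{PORT}:{PORT}", IMAGE],
    check=True,
)
```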

✨ Key Features:

  • Full Lifecycle: We evaluate everything from code editing to containerization and service launch.
  • Real Integration Testing: We validate agents by sending actual HTTP requests to the service they deploy (see the sketch after this list).
  • Diverse Stack: 224 tasks from real-world repos (8 languages, 19 frameworks).
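
For the curious, here is roughly what such an external end-to-end check could look like. This is illustrative only: the endpoints, payload, and expected status are made up, and the real per-task test suites live in the ABC-Bench repo.

```python
# Illustrative sketch only: black-box HTTP checks against the deployed service.
# BASE_URL and the /health and /items routes are hypothetical placeholders.
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"

# Simple readiness poll: wait until the service answers at all.
for _ in range(30):
    try:
        urllib.request.urlopen(f"{BASE_URL}/health", timeout=2)
        break
    except OSError:
        time.sleep(1)

# Exercise one API route and verify the response, purely over HTTP.
req = urllib.request.Request(
    f"{BASE_URL}/items",
    data=json.dumps({"name": "demo"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=5) as resp:
    body = json.loads(resp.read())
    assert resp.status == 201 and body.get("name") == "demo"
```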

🤗 Open Source on HF:
We've released the full dataset and fine-tuned models.

Hope this serves as a useful testbed for the community! 🚀

arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/abc-bench-benchmarking-agentic-backend-coding-in-real-world-development-2363-988cad6f

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 2

Datasets citing this paper 1

Collections including this paper 2