M2.1: Multilingual and Multi-Task Coding with Strong Generalization
MiniMax-M2.1 achieves a significant leap in coding capability over the previous generation, matching or surpassing global top-tier models on multiple internal and external benchmarks. As an open-source model optimized specifically for agentic scenarios, M2.1 delivers strong performance in code generation, tool usage, instruction following, and long-range planning. Here, we share some insights and practical experience gained while enhancing its coding capabilities for real-world scenarios.
The Gap Between SWE-Bench and Real-World Coding
In 2025, SWE-Bench has become the most authoritative benchmark for code generation. In this evaluation, LLMs face bugs from real GitHub repositories and must fix them through multiple rounds of code reading and testing. The core value of SWE-Bench is that its tasks closely mirror a programmer's daily work, and results can be objectively verified via test cases — a property that is particularly valuable for reinforcement learning training. We can use the test pass rate directly as a reward signal and continuously optimize the model in a real code environment, without the noise introduced by human labeling or model-based evaluation.
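As a concrete illustration of this setup, here is a minimal sketch of such a verifiable reward, assuming a hypothetical sandbox helper `run_tests` and the SWE-Bench convention of fail-to-pass (target) and pass-to-pass (regression) test sets:

```python
# Minimal sketch of a verifiable reward for SWE-Bench-style RL training.
# `run_tests` is a hypothetical sandbox helper; fail_to_pass / pass_to_pass
# follow the SWE-Bench convention of target vs. regression tests.
from dataclasses import dataclass


@dataclass
class TestReport:
    passed: set[str]  # names of tests that passed after applying the patch
    failed: set[str]  # names of tests that failed after applying the patch


def run_tests(repo_dir: str, patch: str) -> TestReport:
    """Apply `patch` inside an isolated sandbox and run the test suite (stubbed)."""
    raise NotImplementedError


def reward(report: TestReport, fail_to_pass: set[str], pass_to_pass: set[str]) -> float:
    """Binary reward: all target tests fixed and no previously passing test broken."""
    fixed = fail_to_pass <= report.passed
    no_regressions = pass_to_pass.isdisjoint(report.failed)
    return 1.0 if fixed and no_regressions else 0.0
```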
However, like all evaluation standards, SWE-Bench is not perfect. For a coding agent to be usable in real-world scenarios, there are more capability dimensions beyond SWE-Bench that need attention:
- Limited Language Coverage: SWE-Bench currently only covers Python. In real development, developers must handle multiple languages such as Java, Go, TypeScript, Rust, and C++, often mixing several of them within the same project.
- Restricted Task Types: SWE-Bench only covers bug-fixing tasks. Other real-world work, such as implementing new features, generating test cases, refactoring projects, reviewing code, optimizing performance, and configuring CI/CD, is not evaluated.
- Scaffold Binding: SWE-Bench results are usually reported on a single, specific scaffold, so the model's generalization to other scaffolds cannot be observed. Meanwhile, different agent scaffolds adopt different context management strategies, and the model needs to adapt to these differences.
How to Fill These Gaps
1. Environment Scaling
We often see developers complain that current coding agents perform well on languages like Python and JavaScript but fall short in more demanding enterprise-level development; when a task requires complex project understanding, performance degrades further.
To address this, during the MiniMax-M2.1 training cycle we built a comprehensive data pipeline covering more than ten mainstream programming languages. We retrieved a massive number of Issues, PRs, and corresponding test cases from GitHub, then applied strict filtering, cleaning, and rewriting to this raw data to ensure the quality of the post-training data. A coding agent is naturally suited to mass-producing this kind of training environment. In the process, we found that for both M2 and other frontier models, the success rate of constructing multi-language environments was lower than for Python, for several distinct reasons:
- Environmental Complexity of Compiled Languages: Python, as an interpreted language, is relatively simple to configure. Compiled languages like Java, Go, Rust, and C++, however, require handling complex compilation toolchains, version compatibility, and cross-compilation issues. A Java project might depend on a specific JDK version, Maven/Gradle, and numerous third-party libraries; an error at any step can cause the build to fail.
- Diverse Test Frameworks: In the Python ecosystem, pytest dominates, but test frameworks in other languages are more fragmented. Java has JUnit and TestNG; JavaScript has Jest, Mocha, and Vitest; Go has the built-in testing package plus extensions like testify; Rust has built-in tests and criterion. We need specialized test execution and result-parsing logic for each framework (a minimal sketch of this per-framework dispatch follows this list).
- Dependency Management & Project Structure: Package managers for different languages differ vastly in dependency resolution, version locking, and private repository support. The nested structure of npm's node_modules, Maven's central repository mechanism, and Cargo's semantic versioning all require targeted handling. At the same time, project structure conventions vary: Python layouts are flexible, but Java projects usually follow strict Maven/Gradle directory conventions; Go projects have GOPATH and Go Modules modes; Rust projects have the concept of a workspace. Understanding these dependency management mechanisms and project structures is crucial for correctly locating code and running tests.
- Difficulty in Parsing Error Messages: Error message formats produced by different languages and toolchains vary widely; compile errors, link errors, and runtime errors also manifest differently. We need to train the model to understand these diverse error messages and extract useful debugging clues from them.
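To make the framework fragmentation concrete, the sketch below shows one way to register a per-ecosystem test command and log parser. The commands and regexes are simplified assumptions for illustration, not our production pipeline:

```python
# Illustrative per-ecosystem registry: each language maps to a test command
# and a log parser. Commands and regexes are simplified assumptions.
import re
from typing import Callable, NamedTuple


class TestOutcome(NamedTuple):
    passed: int
    failed: int


def parse_pytest(log: str) -> TestOutcome:
    p = re.search(r"(\d+) passed", log)
    f = re.search(r"(\d+) failed", log)
    return TestOutcome(int(p.group(1)) if p else 0, int(f.group(1)) if f else 0)


def parse_go_test(log: str) -> TestOutcome:
    # `go test -v` prints one "--- PASS"/"--- FAIL" line per test case
    return TestOutcome(log.count("--- PASS"), log.count("--- FAIL"))


def parse_cargo_test(log: str) -> TestOutcome:
    # cargo prints a summary line like "test result: ok. 12 passed; 0 failed; ..."
    m = re.search(r"(\d+) passed; (\d+) failed", log)
    return TestOutcome(int(m.group(1)), int(m.group(2))) if m else TestOutcome(0, 0)


RUNNERS: dict[str, tuple[str, Callable[[str], TestOutcome]]] = {
    "python": ("pytest -q", parse_pytest),
    "go": ("go test -v ./...", parse_go_test),
    "rust": ("cargo test", parse_cargo_test),
}
```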
Ultimately, we built a multi-language training system covering over ten languages including JS, TS, HTML, CSS, Python, Java, Go, C++, Kotlin, C, and Rust. We obtained over 100,000 environments usable for training and evaluation from real GitHub repositories, with each environment containing complete Issues, code, and test cases. To support such massive Environment Scaling and RL training, we built a high-concurrency sandbox infrastructure capable of launching over 5,000 isolated execution environments within 10 seconds, while supporting the concurrent operation of tens of thousands of environments. This infrastructure allows us to efficiently conduct large-scale multi-language coding agent training.
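The launch pattern itself is conceptually simple; below is a minimal sketch of bounded-concurrency environment launches, with a hypothetical `launch_sandbox` helper standing in for the real container backend:

```python
# Illustrative bounded-concurrency launch pattern; `launch_sandbox` is a
# hypothetical async helper standing in for the real container backend.
import asyncio

MAX_INFLIGHT = 512  # cap in-flight launch requests to avoid overloading the host fleet


async def launch_sandbox(spec: dict) -> str:
    """Start one isolated execution environment and return its id (stubbed)."""
    await asyncio.sleep(0.01)  # placeholder for the actual container start
    return spec["env_id"]


async def launch_batch(specs: list[dict]) -> list[str]:
    sem = asyncio.Semaphore(MAX_INFLIGHT)

    async def guarded(spec: dict) -> str:
        async with sem:
            return await launch_sandbox(spec)

    return await asyncio.gather(*(guarded(s) for s in specs))


# e.g. asyncio.run(launch_batch([{"env_id": f"env-{i}"} for i in range(5000)]))
```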
2. Beyond Bug Fixing: Multi-Task Capabilities
Real software development is far more than fixing bugs. A programmer's daily routine includes writing tests, reviewing code, optimizing performance, and more. In training MiniMax-M2.1, we also performed targeted optimization for these scenarios, which meant acquiring high-quality problems and designing corresponding reward signals:
- Test Generation Capability: Early in the R&D of M1, we found that the ability to write tests was a major bottleneck limiting the accuracy of model-generated code. In the Agentless framework, the model generates multiple candidate fixes in parallel and then uses its own generated test code to select the final one (see the sketch after this list). However, due to flawed reward design in M1's RL process, the model consistently wrote overly simple tests, causing many incorrect fixes to be selected. Generating high-quality test cases requires the model to deeply understand code logic, boundary conditions, and potential failure scenarios. For MiniMax-M2.1, we synthesized a large volume of training samples based on GitHub PRs and self-generated code patches to strengthen this ability, eventually matching Claude Sonnet 4.5 on SWT-bench, which evaluates test-writing capability.
- Code Performance Optimization: Besides implementation correctness, execution efficiency matters in real development. The model needs to understand algorithmic complexity, memory usage, and concurrency, while also mastering best practices for the specific APIs it uses. During training, MiniMax-M2.1 was encouraged to write more efficient code and subsequently made significant progress on SWE-Perf, with an average performance gain of 3.1%. Going forward, we will apply corresponding optimization methods to other performance-sensitive scenarios such as kernel optimization and database query optimization.
- Code Review Capability: Based on the SWE framework, we built an internal benchmark called SWE-Review, covering multiple languages and scenarios to evaluate the recall rate and hallucination rate of code-defect detection. A review counts as correct only if it accurately identifies the target defect without producing any false positives, which places high demands on the model's precision.
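For reference, here is a minimal sketch of the test-based patch selection described in the first item above, with a hypothetical `apply_and_run` helper standing in for real sandbox execution:

```python
# Minimal sketch of Agentless-style selection: vote for the candidate patch
# that passes the most model-written tests. `apply_and_run` is hypothetical.
from collections import Counter


def apply_and_run(repo_dir: str, patch: str, test_code: str) -> bool:
    """Apply `patch`, add `test_code`, run it in a sandbox, return True on pass (stubbed)."""
    raise NotImplementedError


def select_patch(repo_dir: str, patches: list[str], generated_tests: list[str]) -> str:
    votes: Counter = Counter()
    for patch in patches:
        for test in generated_tests:
            if apply_and_run(repo_dir, patch, test):
                votes[patch] += 1
    # If the generated tests are too weak, many candidates tie and incorrect
    # fixes slip through -- the failure mode we observed with M1's reward design.
    return max(patches, key=lambda p: votes[p])
```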
3. Generalization on OOD Scaffolds
Generalization to OOD scaffolds is vital for a coding agent. Developers use different scaffolds — some use Claude Code, some use Cursor, and others use proprietary agent frameworks. If a model is optimized only for a specific scaffold, its performance degrades sharply in other environments, severely limiting its usefulness in real development. For MiniMax-M2.1, we believe scaffold generalization primarily tests two things: the model's long-range instruction following and its adaptability to context management strategies:
- Long-Range Instruction Following: Complex development scenarios require the model to integrate and execute "composite instruction constraints" from multiple sources, including the System Prompt, User Query, Memory, Tool Schema, and various specification files (such as Agents.md, Claude.md, and Skill.md). Developers constrain the model's expected behavior by designing these specifications, and if the agent fails to meet any requirement at any step during inference, end-to-end results can degrade severely.
- Adaptability to Context Management: Shortly after M2's release, the community did not fully understand its Interleaved Thinking design, and in many scaffolds the results fell short of the model's actual capabilities. We found that some popular scaffolds discard part of the historical thinking content in multi-turn conversations, which caused M2's performance to drop by varying degrees across evaluation sets. In MiniMax-M2.1, we still recommend that developers use Interleaved Thinking to unlock the model's full potential; at the same time, we designed corresponding training methods so that the model stays sharp even when users employ all sorts of creative context management strategies, as sketched below.
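For illustration, the sketch below contrasts the two context-management strategies; the `reasoning` field name is an assumption, as scaffolds and APIs expose thinking content in different ways (dedicated fields, think blocks, etc.):

```python
# Two context-management strategies a scaffold might apply between turns.
# The `reasoning` field name is an assumption, not a specific API's schema.

def strip_thinking(history: list[dict]) -> list[dict]:
    """A scaffold that drops historical thinking content before the next turn."""
    return [{k: v for k, v in msg.items() if k != "reasoning"} for msg in history]


def keep_interleaved_thinking(history: list[dict]) -> list[dict]:
    """Recommended: pass prior assistant turns back unchanged, so tool calls
    stay interleaved with the reasoning that produced them."""
    return history


history = [
    {"role": "user", "content": "Fix the failing test in utils.py"},
    {"role": "assistant",
     "reasoning": "The bug looks like an off-by-one in the slice bounds...",
     "content": "",
     "tool_calls": [{"name": "read_file", "arguments": {"path": "utils.py"}}]},
    {"role": "tool", "content": "def head(xs, n): return xs[:n - 1]"},
]
```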
To verify MiniMax-M2.1's scaffold generalization, we tested SWE-Bench performance directly on different scaffolds and also built a test set closer to real-world usage to observe whether the model satisfies various scaffold instruction constraints. We found that MiniMax-M2.1 maintains an SWE-Bench score above 67 on mini-swe-agent, Droid, and Claude Code.
| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|
| SWE-bench Verified (Claude Code) | 74 | 69.2 | 77.2 | 73.1 |
| SWE-bench Verified (Droid) | 71.3 | 68.1 | 72.3 | 67 |
| SWE-bench Verified (mini-swe-agent) | 67 | 61 | 70.6 | 60 |
| OctoCodingBench | 26.1 | 13.3 | 22.8 | 26 |
Compared to M2, MiniMax-M2.1 shows significant improvement across different OOD scaffolds. On OctoCodingBench, M2.1 improved from M2's 13.3 to 26.1, demonstrating strong compliance with scaffold instruction constraints.
2026 TODOs
We believe the development of coding agents still has a long way to go. Therefore, this year we will explore several interesting directions:
- Defining the Reward Signal for Developer Experience: Beyond the optimization directions above, we hope to further quantify and optimize developer experience. Current evaluation standards mainly measure whether a task is ultimately completed, ignoring the user experience along the way. We plan to explore richer reward dimensions: code quality (readability, modularity, comment completeness); interaction experience (response latency, information transparency, interpretability of intermediate states); and engineering standards (commit message quality, PR description completeness, code style consistency). Although these metrics are difficult to evaluate fully automatically, we are exploring hybrid solutions that combine static analysis tools, Agent-as-a-Verifier, and human preference learning, so that the coding agent not only completes tasks but also delivers high-quality code like an excellent human engineer.
- Improving Problem-Solving Efficiency: MiniMax-M2.1 still has some issues with over-exploration, such as repeatedly reading the same file or executing redundant tests. We plan to optimize efficiency from multiple angles: reducing trial-and-error through better planning capabilities; reducing unnecessary file reads through more precise code localization; avoiding repetitive exploration through better memory mechanisms; and responding quickly to simple tasks through adaptive thinking depth.
- RL Scaling: Reinforcement learning scaling still holds huge potential for coding agents. We have verified the positive correlation between environment count, training steps, and model capability, but we are far from convergence. We plan to keep pushing along three dimensions: compute (increasing concurrent environment count and training iterations), data (building a larger and more diverse training task pool), and algorithms (more efficient exploration strategies, more stable training objectives, and better reward shaping). At the same time, we are researching how to make RL training itself more efficient, including better curriculum learning designs, smarter sample reuse strategies, and cross-task knowledge transfer.
- Coding World Model & User Simulator: As mentioned earlier, the training of this generation of coding agents (M2.1) relies heavily on execution in real environments, which brings massive computational overhead and environment construction costs. We are exploring building a World Model capable of predicting code execution results: given a piece of code and environment state, the model can predict whether tests pass, what error messages will be produced, and how the program will behave. This will enable us to perform large-scale rollout and policy optimization without actually executing code. Meanwhile, we are also building a user behavior simulator to model the patterns of interaction between real developers and the agent—including vague requirement descriptions, mid-stream requirement changes, and feedback on intermediate results—allowing the model to adapt to various user behavior patterns in real scenarios during the training phase.
- Extremely Efficient Data Pipeline: Building a data pipeline capable of automatically discovering, filtering, and generating harder, longer-range tasks to continuously raise the model's ceiling. High-quality training data is a key bottleneck for coding agent progress. We are building an automated data flywheel: automatically discovering high-quality Issues and PRs from GitHub; using models to assess task difficulty and perform stratification; automatically augmenting tasks that the current model can easily solve to make them more challenging; and analyzing failure causes for failed cases to generate targeted training data. The ideal state here is to build an "inexhaustible" source of high-quality tasks, keeping training data difficulty slightly above the model's current capability to maintain optimal learning efficiency. We are also exploring how to automatically generate ultra-long-range tasks that require hours or even days to complete, pushing the model's capability boundaries in complex project understanding and long-term planning.
- More Scenario Coverage: Expanding to more specialized fields such as GPU Kernel development, compiler development, smart contracts, and machine learning. Each field has its own knowledge system, toolchain, and best practices, along with real application scenarios and commercial value. We plan to gradually build training environments and evaluation systems for these fields, enabling the coding agent to handle more specialized, higher-value development tasks. Looking further ahead, we believe the paradigm of "Define Problem - Define Reward - Environment Construction - Model Training" demonstrated in coding agent training can transfer to more scenarios requiring complex reasoning and execution feedback.