Paper: [REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression](https://arxiv.org/abs/2510.13999)
A 50% REAP-pruned and W4A16-quantized version of MiniMax-M2.1.

| Property | Value |
|---|---|
| Base Model | MiniMax/MiniMax-M2.1 |
| Pruning | 50% experts removed via REAP |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | AutoRound (auto_gptq packing) |
| Size | ~59GB |

| Dependency | Version |
|---|---|
| vLLM | 0.13.0 |
| torch | 2.9.0+cu128 |
| transformers | 4.57.3 |
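A sketch of one way to pin this stack (standard PyPI package names assumed; the `+cu128` torch build would come from the PyTorch CUDA 12.8 wheel index):

```bash
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
pip install vllm==0.13.0 transformers==4.57.3
```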
Serve with vLLM:

```bash
vllm serve 0xSero/MiniMax-M2.1-REAP-50-W4A16 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name minimax-m2.1-reap-50 \
    --tensor-parallel-size 8 \
    --max-model-len 64000 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --tool-call-parser minimax_m2 \
    --enable-auto-tool-choice
```
Note: Requires 8x GPUs with ~24GB VRAM each (e.g., 8x RTX 3090).
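Once serving, vLLM exposes an OpenAI-compatible API; a minimal client call against it (the empty API key is the usual placeholder for a local, unauthenticated server):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint at the host/port configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="minimax-m2.1-reap-50",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```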
`config.json`:

```json
{
"architectures": ["MiniMaxM2ForCausalLM"],
"attention_dropout": 0.0,
"attn_type_list": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"auto_map": {
"AutoConfig": "configuration_minimax_m2.MiniMaxM2Config",
"AutoModelForCausalLM": "modeling_minimax_m2.MiniMaxM2ForCausalLM"
},
"bos_token_id": 1,
"dtype": "bfloat16",
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 3072,
"initializer_range": 0.02,
"intermediate_size": 1536,
"max_position_embeddings": 196608,
"model_type": "minimax_m2",
"mtp_transformer_layers": 1,
"num_attention_heads": 48,
"num_experts_per_tok": 8,
"num_hidden_layers": 62,
"num_key_value_heads": 8,
"num_local_experts": 128,
"num_mtp_modules": 3,
"output_router_logits": false,
"partial_rotary_factor": 0.5,
"qk_norm_type": "per_layer",
"quantization_config": {
"autoround_version": "0.9.4",
"bits": 4,
"damp_percent": 0.01,
"data_type": "int",
"desc_act": false,
"group_size": 128,
"iters": 256,
"lm_head": false,
"nsamples": 256,
"provider": "auto-round",
"quant_method": "auto-round",
"sym": true,
"true_sequential": false,
"packing_format": "auto_round:auto_gptq"
},
"rms_norm_eps": 1e-06,
"rope_theta": 5000000,
"rotary_dim": 64,
"router_aux_loss_coef": 0.001,
"router_jitter_noise": 0.0,
"scoring_func": "sigmoid",
"shared_intermediate_size": 0,
"sliding_window": null,
"tie_word_embeddings": false,
"transformers_version": "4.57.3",
"use_cache": true,
"use_mtp": true,
"use_qk_norm": true,
"use_routing_bias": true,
"vocab_size": 200064,
"torch_dtype": "float16"
}
```
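The pruning is visible directly in these fields; a quick inspection (the figure of 256 original experts is my reading of the upstream MiniMax-M2 config, not stated in this card):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True
)
print(cfg.num_local_experts)    # 128 routed experts per layer remain after pruning
print(cfg.num_experts_per_tok)  # top-8 routing is unchanged
print(cfg.num_hidden_layers)    # 62 transformer layers
```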
The repo also ships a standalone quantization config; note that `quant_method` appears as `gptq` here for loader compatibility:

```json
{
"bits": 4,
"group_size": 128,
"sym": true,
"data_type": "int",
"iters": 256,
"nsamples": 256,
"autoround_version": "0.9.4",
"lm_head": false,
"provider": "auto-round",
"quant_method": "gptq",
"desc_act": false,
"true_sequential": false,
"damp_percent": 0.01
}
```
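These settings imply an effective storage cost of about 4.125 bits per weight: 4-bit ints plus one 16-bit scale per group of 128 weights, with no stored zero-points since quantization is symmetric. A back-of-the-envelope check against the ~59GB size above (the ~115B remaining-parameter count is a rough estimate, not a figure from this card):

```python
effective_bits = 4 + 16 / 128        # 4-bit weights + one fp16 scale per 128-weight group
params = 115e9                       # assumed parameters remaining after 50% REAP pruning
print(f"{params * effective_bits / 8 / 1e9:.0f} GB")  # -> 59 GB, consistent with the table
```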
REAP (Router-weighted Expert Activation Pruning), introduced by Cerebras in the paper linked above, selects experts to prune by their saliency on calibration data: each expert is scored by its router gate weight multiplied by the norm of its output, averaged over the tokens routed to it, and the lowest-scoring experts are removed. A minimal sketch of this criterion follows.
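This is a sketch only; the tensor names and the dense `[tokens, experts, hidden]` layout are illustrative assumptions, not the Cerebras implementation:

```python
import torch

def reap_saliency(gates: torch.Tensor, expert_outs: torch.Tensor) -> torch.Tensor:
    """Score experts by router-weighted activation norms (REAP-style criterion).

    gates:       [tokens, experts] routing weights, 0 where an expert
                 was not selected for a token.
    expert_outs: [tokens, experts, hidden] expert outputs on a calibration
                 batch (dense here for clarity; real MoE inference only
                 computes the selected experts).
    """
    out_norms = expert_outs.norm(dim=-1)          # [tokens, experts]
    weighted = gates * out_norms                  # gate-weighted activation magnitude
    routed = (gates > 0).sum(dim=0).clamp(min=1)  # tokens routed to each expert
    return weighted.sum(dim=0) / routed           # mean over routed tokens

# Keeping 50% of experts means keeping the top half by saliency:
# keep_idx = reap_saliency(gates, expert_outs).topk(num_experts // 2).indices
```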
Loading with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the custom MiniMaxM2 modeling code
# referenced by auto_map in config.json.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",          # shard across available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/MiniMax-M2.1-REAP-50-W4A16")
```
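From there, a quick smoke test (prompt and decoding settings are arbitrary; the chat template is assumed to be inherited from the MiniMax-M2 tokenizer):

```python
messages = [{"role": "user", "content": "Summarize what expert pruning does in an MoE model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```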