Paper: [REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression](https://arxiv.org/abs/2510.13999)
A 50% REAP-pruned and W4A16-quantized version of MiniMax-M2.1.

| Property | Value |
|---|---|
| Base Model | MiniMax/MiniMax-M2.1 |
| Pruning | 50% experts removed via REAP |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | AutoRound (auto_gptq packing) |
| Size | ~59GB |

| Dependency | Version |
|---|---|
| vLLM | 0.13.0 |
| torch | 2.9.0+cu128 |
| transformers | 4.57.3 |
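A sketch of one way to pin this stack (standard PyPI package names assumed; the `+cu128` torch build would come from the PyTorch CUDA 12.8 wheel index):

```bash
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
pip install vllm==0.13.0 transformers==4.57.3
```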
Serve with vLLM:

```bash
vllm serve 0xSero/MiniMax-M2.1-REAP-50-W4A16 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name minimax-m2.1-reap-50 \
    --tensor-parallel-size 8 \
    --max-model-len 64000 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --tool-call-parser minimax_m2 \
    --enable-auto-tool-choice
```
Note: Requires 8x GPUs with ~24GB VRAM each (e.g., 8x RTX 3090).
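Once serving, vLLM exposes an OpenAI-compatible API; a minimal client call against it (the empty API key is the usual placeholder for a local, unauthenticated server):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint at the host/port configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="minimax-m2.1-reap-50",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```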
`config.json`:

```json
{
"architectures": ["MiniMaxM2ForCausalLM"],
"attention_dropout": 0.0,
"attn_type_list": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"auto_map": {
"AutoConfig": "configuration_minimax_m2.MiniMaxM2Config",
"AutoModelForCausalLM": "modeling_minimax_m2.MiniMaxM2ForCausalLM"
},
"bos_token_id": 1,
"dtype": "bfloat16",
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 3072,
"initializer_range": 0.02,
"intermediate_size": 1536,
"max_position_embeddings": 196608,
"model_type": "minimax_m2",
"mtp_transformer_layers": 1,
"num_attention_heads": 48,
"num_experts_per_tok": 8,
"num_hidden_layers": 62,
"num_key_value_heads": 8,
"num_local_experts": 128,
"num_mtp_modules": 3,
"output_router_logits": false,
"partial_rotary_factor": 0.5,
"qk_norm_type": "per_layer",
"quantization_config": {
"autoround_version": "0.9.4",
"bits": 4,
"damp_percent": 0.01,
"data_type": "int",
"desc_act": false,
"group_size": 128,
"iters": 256,
"lm_head": false,
"nsamples": 256,
"provider": "auto-round",
"quant_method": "auto-round",
"sym": true,
"true_sequential": false,
"packing_format": "auto_round:auto_gptq"
},
"rms_norm_eps": 1e-06,
"rope_theta": 5000000,
"rotary_dim": 64,
"router_aux_loss_coef": 0.001,
"router_jitter_noise": 0.0,
"scoring_func": "sigmoid",
"shared_intermediate_size": 0,
"sliding_window": null,
"tie_word_embeddings": false,
"transformers_version": "4.57.3",
"use_cache": true,
"use_mtp": true,
"use_qk_norm": true,
"use_routing_bias": true,
"vocab_size": 200064,
"torch_dtype": "float16"
}
```
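The pruning is visible directly in these fields; a quick inspection (the figure of 256 original experts is my reading of the upstream MiniMax-M2 config, not stated in this card):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True
)
print(cfg.num_local_experts)    # 128 routed experts per layer remain after pruning
print(cfg.num_experts_per_tok)  # top-8 routing is unchanged
print(cfg.num_hidden_layers)    # 62 transformer layers
```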
The repo also ships a standalone quantization config; note that `quant_method` appears as `gptq` here for loader compatibility:

```json
{
"bits": 4,
"group_size": 128,
"sym": true,
"data_type": "int",
"iters": 256,
"nsamples": 256,
"autoround_version": "0.9.4",
"lm_head": false,
"provider": "auto-round",
"quant_method": "gptq",
"desc_act": false,
"true_sequential": false,
"damp_percent": 0.01
}
```
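These settings imply an effective storage cost of about 4.125 bits per weight: 4-bit ints plus one 16-bit scale per group of 128 weights, with no stored zero-points since quantization is symmetric. A back-of-the-envelope check against the ~59GB size above (the ~115B remaining-parameter count is a rough estimate, not a figure from this card):

```python
effective_bits = 4 + 16 / 128        # 4-bit weights + one fp16 scale per 128-weight group
params = 115e9                       # assumed parameters remaining after 50% REAP pruning
print(f"{params * effective_bits / 8 / 1e9:.0f} GB")  # -> 59 GB, consistent with the table
```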
REAP (Router-weighted Expert Activation Pruning), introduced by Cerebras in the paper linked above, selects experts to prune by their saliency on calibration data: each expert is scored by its router gate weight multiplied by the norm of its output, averaged over the tokens routed to it, and the lowest-scoring experts are removed. A minimal sketch of this criterion follows.
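This is a sketch only; the tensor names and the dense `[tokens, experts, hidden]` layout are illustrative assumptions, not the Cerebras implementation:

```python
import torch

def reap_saliency(gates: torch.Tensor, expert_outs: torch.Tensor) -> torch.Tensor:
    """Score experts by router-weighted activation norms (REAP-style criterion).

    gates:       [tokens, experts] routing weights, 0 where an expert
                 was not selected for a token.
    expert_outs: [tokens, experts, hidden] expert outputs on a calibration
                 batch (dense here for clarity; real MoE inference only
                 computes the selected experts).
    """
    out_norms = expert_outs.norm(dim=-1)          # [tokens, experts]
    weighted = gates * out_norms                  # gate-weighted activation magnitude
    routed = (gates > 0).sum(dim=0).clamp(min=1)  # tokens routed to each expert
    return weighted.sum(dim=0) / routed           # mean over routed tokens

# Keeping 50% of experts means keeping the top half by saliency:
# keep_idx = reap_saliency(gates, expert_outs).topk(num_experts // 2).indices
```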
Loading with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the custom MiniMaxM2 modeling code
# referenced by auto_map in config.json.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",          # shard across available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/MiniMax-M2.1-REAP-50-W4A16")
```
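From there, a quick smoke test (prompt and decoding settings are arbitrary; the chat template is assumed to be inherited from the MiniMax-M2 tokenizer):

```python
messages = [{"role": "user", "content": "Summarize what expert pruning does in an MoE model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```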