MiniMax-M2.1-REAP-50-W4A16

A 50% REAP-pruned and W4A16-quantized version of MiniMax/MiniMax-M2.1: half of the experts are removed with REAP, and the remaining weights are quantized to 4-bit (16-bit activations) with AutoRound.

Model Details

| Property | Value |
|---|---|
| Base Model | MiniMax/MiniMax-M2.1 |
| Pruning | 50% experts removed via REAP |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | AutoRound (auto_gptq packing) |
| Size | ~59 GB |

Tested Configuration

| Dependency | Version |
|---|---|
| vLLM | 0.13.0 |
| torch | 2.9.0+cu128 |
| transformers | 4.57.3 |
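
A quick sanity check that your environment matches the tested versions:

import torch
import transformers
import vllm

# Print installed versions to compare against the table above.
print("vLLM:", vllm.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)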

vLLM Usage

vllm serve 0xSero/MiniMax-M2.1-REAP-50-W4A16 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name minimax-m2.1-reap-50 \
    --tensor-parallel-size 8 \
    --max-model-len 64000 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --tool-call-parser minimax_m2 \
    --enable-auto-tool-choice

Note: With --tensor-parallel-size 8, serving requires 8 GPUs with ~24 GB VRAM each (e.g., 8x RTX 3090).
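
Once the server is up, it exposes an OpenAI-compatible API. A minimal query sketch with the openai Python client, assuming the host, port, and served model name from the command above:

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="minimax-m2.1-reap-50",  # matches --served-model-name
    messages=[{"role": "user", "content": "Summarize REAP pruning in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)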

Configuration

config.json
{
  "architectures": ["MiniMaxM2ForCausalLM"],
  "attention_dropout": 0.0,
  "attn_type_list": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
  "auto_map": {
    "AutoConfig": "configuration_minimax_m2.MiniMaxM2Config",
    "AutoModelForCausalLM": "modeling_minimax_m2.MiniMaxM2ForCausalLM"
  },
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 196608,
  "model_type": "minimax_m2",
  "mtp_transformer_layers": 1,
  "num_attention_heads": 48,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 62,
  "num_key_value_heads": 8,
  "num_local_experts": 128,
  "num_mtp_modules": 3,
  "output_router_logits": false,
  "partial_rotary_factor": 0.5,
  "qk_norm_type": "per_layer",
  "quantization_config": {
    "autoround_version": "0.9.4",
    "bits": 4,
    "damp_percent": 0.01,
    "data_type": "int",
    "desc_act": false,
    "group_size": 128,
    "iters": 256,
    "lm_head": false,
    "nsamples": 256,
    "provider": "auto-round",
    "quant_method": "auto-round",
    "sym": true,
    "true_sequential": false,
    "packing_format": "auto_round:auto_gptq"
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 5000000,
  "rotary_dim": 64,
  "router_aux_loss_coef": 0.001,
  "router_jitter_noise": 0.0,
  "scoring_func": "sigmoid",
  "shared_intermediate_size": 0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.3",
  "use_cache": true,
  "use_mtp": true,
  "use_qk_norm": true,
  "use_routing_bias": true,
  "vocab_size": 200064,
  "torch_dtype": "float16"
}
quantization_config.json
{
  "bits": 4,
  "group_size": 128,
  "sym": true,
  "data_type": "int",
  "iters": 256,
  "nsamples": 256,
  "autoround_version": "0.9.4",
  "lm_head": false,
  "provider": "auto-round",
  "quant_method": "gptq",
  "desc_act": false,
  "true_sequential": false,
  "damp_percent": 0.01
}
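
For reference, a minimal sketch of how a quantization with these settings could be produced with AutoRound; the input path is hypothetical and the exact recipe used for this checkpoint is not reproduced here:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical path to the REAP-pruned (still bf16) checkpoint to quantize.
model_path = "path/to/MiniMax-M2.1-REAP-50-bf16"

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Settings mirror quantization_config.json: 4-bit symmetric int weights,
# group size 128, 256 tuning iterations, 256 calibration samples.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=256,
    nsamples=256,
)
autoround.quantize_and_save("MiniMax-M2.1-REAP-50-W4A16", format="auto_gptq")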

REAP Pruning

REAP (Router-weighted Expert Activation Pruning) by Cerebras selects which experts to prune based on the following signals (a schematic scoring sketch follows the list):

  • Router assignment frequency
  • Activation magnitude weighted by routing probability
  • Angular distance clustering
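
As a rough illustration of the core idea (a schematic sketch, not the Cerebras implementation; all tensors and sizes below are hypothetical), each expert can be scored by its routing-probability-weighted activation magnitude over a calibration set, and the lowest-scoring half is dropped:

import torch

def reap_expert_scores(router_probs, expert_out_norms):
    # router_probs:     [num_tokens, num_experts] routing probabilities
    # expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's
    #                   output per token (0 where the expert was not routed)
    # Weight each expert's activation magnitude by how strongly the router
    # selected it, then average over the calibration tokens.
    return (router_probs * expert_out_norms).mean(dim=0)

# Hypothetical calibration statistics for one 128-expert MoE layer.
num_tokens, num_experts = 4096, 128
router_probs = torch.rand(num_tokens, num_experts).softmax(dim=-1)
expert_out_norms = torch.rand(num_tokens, num_experts)

scores = reap_expert_scores(router_probs, expert_out_norms)
keep = torch.topk(scores, k=num_experts // 2).indices  # retain the top 50%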

Transformers Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required: the config points at the custom
# MiniMaxM2Config / MiniMaxM2ForCausalLM classes shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",  # shard across all available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/MiniMax-M2.1-REAP-50-W4A16")
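
A short generation sketch using the model's chat template (sampling parameters are illustrative):

messages = [{"role": "user", "content": "Explain expert pruning in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))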

Acknowledgments

REAP pruning method by Cerebras; quantization performed with AutoRound; base model by MiniMax.
