
This guide covers local deployment of MiniMax-M2.7 using three inference frameworks: vLLM and SGLang for Linux GPU servers, and MLX for Apple Silicon Mac Studio. For the MiniMax API service, see Text Generation.

GPU Server Deployment (vLLM / SGLang)

System Requirements

  • OS: Linux
  • Python: 3.10 – 3.12
  • GPU: Compute capability 7.0 or higher

| Configuration | Total KV Cache Capacity |
| --- | --- |
| 96 GB × 4 GPUs | Up to 400K tokens |
| 144 GB × 8 GPUs | Up to 3M tokens |
The values above are the aggregate hardware KV cache capacity across all GPUs in the configuration. The maximum context length per individual sequence remains 196K tokens.

Deploy with vLLM

vLLM is a high-performance LLM inference and serving library that uses PagedAttention for efficient KV cache management, combined with continuous batching and prefix caching for state-of-the-art serving throughput.

Installation

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

Start the Server

vLLM will automatically download and cache the model from Hugging Face.
SAFETENSORS_FAST_GPU=1 vllm serve \
    MiniMaxAI/MiniMax-M2.7 --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think

Deploy with SGLang

SGLang is a high-performance serving framework for large language and multimodal models. It leverages RadixAttention for efficient prefix caching and scheduling, delivering low-latency, high-throughput inference.

Installation

uv venv
source .venv/bin/activate
uv pip install sglang

Start the Server

SGLang will automatically download and cache the model from Hugging Face.
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.7 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --mem-fraction-static 0.85

Verify Deployment

Both vLLM and SGLang expose an OpenAI-compatible API. After startup, call it via curl or the OpenAI SDK:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M2.7",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
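
The same request can also be issued through the OpenAI Python SDK by pointing the client at the local server. A minimal sketch, assuming the server runs on localhost:8000 as configured above (the API key value is a placeholder, since local servers usually ignore it):
from openai import OpenAI

# Point the SDK at the local vLLM / SGLang endpoint; the key is required by the client but not checked locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(response.choices[0].message.content)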

Mac Studio Deployment (MLX)

Thanks to Apple Silicon’s unified memory architecture, Mac Studio can load the full model into memory and run inference entirely on-device — no GPU clusters required, with complete data privacy.

System Requirements

  • Mac Studio with Apple Silicon (M4 Max or M3 Ultra)
  • macOS 15.0 (Sequoia) or later
  • Python 3.10+
  • Sufficient unified memory — see Model Selection Guide below

Model Selection Guide

All MLX model variants are available from the mlx-community organization on Hugging Face. Higher-bit quantizations preserve more model quality but require more memory.

Available Model Variants

| Model Variant | Quantization | Model Size | Min Memory |
| --- | --- | --- | --- |
| MiniMax-M2.7 | BF16 (full precision) | 457 GB | 512 GB |
| MiniMax-M2.7-8bit-gs32 | 8-bit (group size 32) | 257 GB | 288 GB |
| MiniMax-M2.7-8bit | 8-bit | 243 GB | 256 GB |
| MiniMax-M2.7-6bit | 6-bit | 186 GB | 200 GB |
| MiniMax-M2.7-5bit | 5-bit | 157 GB | 172 GB |
| MiniMax-M2.7-4bit | 4-bit | 129 GB | 140 GB |
| MiniMax-M2.7-nvfp4 | NVFP4 | 129 GB | 140 GB |
| MiniMax-M2.7-4bit-mxfp4 | 4-bit MXFP4 | 122 GB | 136 GB |
| MiniMax-M2.7-3bit | 3-bit | 100 GB | 112 GB |
“Min Memory” = model size + ~10–15 GB headroom for KV cache, activations, and OS overhead. Actual memory usage grows with context length.
The smallest MLX variant (3-bit) requires ~112 GB of memory. Mac Studio configurations with less than 128 GB unified memory (M4 Max 36/48/64 GB, M3 Ultra 96 GB) cannot run MiniMax-M2.7 via MLX, but may still work with inference frameworks that support more aggressive quantization such as llama.cpp.
| Mac Studio Configuration | Recommended Variant | Notes |
| --- | --- | --- |
| M4 Max (128 GB) | 3-bit (100 GB) or 4-bit-mxfp4 (122 GB) | 3-bit fits comfortably; 4-bit-mxfp4 is tight with limited context |
| M3 Ultra (256 GB) | 6-bit (186 GB) | 6-bit recommended; 8-bit (243 GB) leaves only 13 GB headroom, risking swap or jetsam kills |
| M3 Ultra (512 GB) | 8-bit-gs32 (257 GB) | 8-bit-gs32 recommended for near-lossless quality with ample headroom; BF16 (457 GB) leaves only ~55 GB, limiting context length |
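
As a rough illustration of this selection logic, the sketch below reads the machine's unified memory via sysctl and reports the largest variant whose "Min Memory" figure fits. The variant list is copied from the table above; the script itself is illustrative and not part of mlx-lm:
import subprocess

# Min Memory requirements (GB) from the table above, largest first.
VARIANTS = [
    ("MiniMax-M2.7 (BF16)", 512),
    ("MiniMax-M2.7-8bit-gs32", 288),
    ("MiniMax-M2.7-8bit", 256),
    ("MiniMax-M2.7-6bit", 200),
    ("MiniMax-M2.7-5bit", 172),
    ("MiniMax-M2.7-4bit", 140),
    ("MiniMax-M2.7-nvfp4", 140),
    ("MiniMax-M2.7-4bit-mxfp4", 136),
    ("MiniMax-M2.7-3bit", 112),
]

# macOS reports total unified memory in bytes via sysctl.
mem_bytes = int(subprocess.run(["sysctl", "-n", "hw.memsize"],
                               capture_output=True, text=True, check=True).stdout)
mem_gb = mem_bytes / 1024**3

fitting = [name for name, required in VARIANTS if mem_gb >= required]
print(f"Unified memory: {mem_gb:.0f} GB")
print("Largest variant that fits:", fitting[0] if fitting else "none (consider llama.cpp)")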

Quick Start

Installation

pip install -U mlx-lm

Usage

Generate text directly from the terminal or via a Python script:
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-6bit \
  --prompt "Explain the concept of Mixture of Experts in neural networks." \
  --max-tokens 8192 \
  --temp 1.0
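
A Python equivalent uses the mlx_lm load/generate API. This is a minimal sketch following the standard mlx-lm usage pattern; sampling options such as temperature are configured differently across mlx-lm releases, so check the version you have installed:
from mlx_lm import load, generate

# Download (or reuse the cached copy of) the 6-bit variant and its tokenizer.
model, tokenizer = load("mlx-community/MiniMax-M2.7-6bit")

# Apply the chat template so the model sees the prompt format it was trained with.
messages = [{"role": "user", "content": "Explain the concept of Mixture of Experts in neural networks."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)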

API Server Deployment

For serving MiniMax-M2.7 as a local API (compatible with OpenAI SDK), you can use mlx_lm.server or third-party tools like LM Studio.

mlx_lm.server

Launch an OpenAI-compatible API server:
mlx_lm.server \
  --model mlx-community/MiniMax-M2.7-6bit \
  --port 8080
mlx_lm.server automatically separates the model’s thinking process (<think> tag content) into a reasoning field, while content contains only the final response:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "\n\nHello! Nice to meet you!",
      "reasoning": "The user greeted me, I should respond in a friendly manner."
    }
  }]
}
Once the server is running, call it via curl or the OpenAI SDK:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/MiniMax-M2.7-6bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 8192
  }'
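
To read the separated reasoning field programmatically, a plain HTTP client is the simplest option, since the stock OpenAI SDK response types do not expose a reasoning attribute. A minimal sketch against the server started above:
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mlx-community/MiniMax-M2.7-6bit",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 8192,
    },
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning"))  # the model's <think> content
print("content:", message["content"])          # the final response only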

LM Studio

LM Studio provides a desktop GUI for Mac with built-in MLX support:
  1. Download and install LM Studio
  2. Search for MiniMax-M2.7 in the model browser
  3. Select a quantization variant that fits your memory
  4. Click Load and start chatting
LM Studio also exposes an OpenAI-compatible local server for integration with other tools.

Fine-Tuning (LoRA)

mlx-lm supports parameter-efficient fine-tuning (PEFT) using LoRA / QLoRA. For an MoE model like MiniMax-M2.7, use a quantized base model + QLoRA to keep memory requirements manageable.
Fine-tuning MiniMax-M2.7 on Mac Studio is extremely memory-intensive. Even with QLoRA, you need at least 192 GB unified memory (M3 Ultra) and very conservative training settings. Configurations with 128 GB or less are not recommended for fine-tuning.

Install Training Dependencies

pip install "mlx-lm[train]"

Prepare Training Data

Training data uses JSONL format. Create train.jsonl, valid.jsonl, and (optionally) test.jsonl files in your data directory. Chat format (recommended):
{"messages": [{"role": "user", "content": "Your question"}, {"role": "assistant", "content": "Expected answer"}]}
Completions format:
{"prompt": "Input text", "completion": "Expected output"}

Start Fine-Tuning

Run QLoRA fine-tuning with a quantized model and memory optimization flags:
mlx_lm.lora \
  --model mlx-community/MiniMax-M2.7-4bit \
  --train \
  --data ./your-data-dir \
  --batch-size 1 \
  --num-layers 4 \
  --grad-checkpoint \
  --iters 600 \
  --adapter-path ./adapters
Key parameter guidelines:
| Parameter | Recommended | Notes |
| --- | --- | --- |
| --batch-size | 1 | Must use the minimum batch size for large models |
| --num-layers | 4 | Fine-tune only a few layers to save memory |
| --grad-checkpoint | enabled | Trades compute for memory, reduces peak usage |
| --iters | 600 | Adjust based on your dataset size |

Use the Fine-Tuned Model

Generate with adapters:
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path ./adapters \
  --prompt "Your prompt here" \
  --max-tokens 2048
Fuse adapters into the model (optional):
mlx_lm.fuse \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path ./adapters
After fusing, you can load the merged model directly without specifying an adapter path.
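
The Python API mirrors these commands: mlx_lm.load accepts an adapter_path argument for the un-fused adapters, and fused weights load like any other model. The fused-model path below is a placeholder for whatever --save-path you used with mlx_lm.fuse:
from mlx_lm import load, generate

# Base model plus the LoRA adapters produced by mlx_lm.lora.
model, tokenizer = load("mlx-community/MiniMax-M2.7-4bit", adapter_path="./adapters")

# After mlx_lm.fuse, load the merged weights directly instead (no adapter path needed):
# model, tokenizer = load("./fused_model")

print(generate(model, tokenizer, prompt="Your prompt here", max_tokens=2048))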

Recommended Inference Parameters

The following sampling parameters are recommended for MiniMax-M2.7 and apply to all inference frameworks:
| Parameter | Recommended Value |
| --- | --- |
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
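
With OpenAI-compatible servers such as vLLM and SGLang, temperature and top_p map onto the standard request fields, while top_k is a server-specific extension. The sketch below forwards it via extra_body, which only works if the server accepts that field, so verify against your deployment:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # non-standard field, forwarded to servers that support it
)
print(response.choices[0].message.content)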

Resources

  • MiniMax-M2.7 on Hugging Face: official model weights, documentation, and benchmarks
  • MiniMax-M2.7 MLX Models: all quantization variants from the MLX Community
  • vLLM Documentation: official vLLM documentation and deployment guides
  • SGLang Documentation: official SGLang documentation and deployment guides