This guide covers local deployment of MiniMax-M2.7 using three inference frameworks: vLLM and SGLang for Linux GPU servers, and MLX for Apple Silicon Mac Studio. For the MiniMax API service, see Text Generation.
GPU Server Deployment (vLLM / SGLang)
System Requirements
- OS: Linux
- Python: 3.10 – 3.12
- GPU: Compute capability 7.0 or higher
Recommended Configurations
| GPU Configuration | Total KV Cache Capacity |
|---|---|
| 4 × 96 GB GPUs | Up to 400K tokens |
| 8 × 144 GB GPUs | Up to 3M tokens |
Deploy with vLLM
vLLM is a high-performance LLM inference and serving library that uses PagedAttention for efficient KV cache management, combined with continuous batching and prefix caching for state-of-the-art serving throughput.
Installation
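A minimal installation sketch, assuming a recent vLLM release with MiniMax-M2.7 support is published on PyPI; pin whichever version matches your CUDA stack:

```bash
# Create an isolated environment (Python 3.10 - 3.12 per the requirements above)
python3 -m venv vllm-env && source vllm-env/bin/activate

# Install vLLM from PyPI; pick a release that supports MiniMax-M2.7
pip install --upgrade vllm
```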
Start the Server
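A hedged launch sketch. The Hugging Face repo ID (MiniMaxAI/MiniMax-M2.7) and the tensor-parallel degree (4 GPUs, matching the 4 × 96 GB configuration above) are assumptions; adjust both to your hardware and the model card:

```bash
# Serve MiniMax-M2.7 with tensor parallelism across 4 GPUs
# (repo ID and flag values are illustrative; check the model card)
vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --port 8000
```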
vLLM will automatically download and cache the model from Hugging Face.
Deploy with SGLang
SGLang is a high-performance serving framework for large language and multimodal models. It leverages RadixAttention for efficient prefix caching and scheduling, delivering low-latency, high-throughput inference.
Installation
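A minimal installation sketch; depending on your CUDA version you may need the platform-specific wheel or extra index listed in the SGLang docs:

```bash
# Install SGLang with its optional serving dependencies
pip install --upgrade "sglang[all]"
```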
Start the Server
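A hedged launch sketch; the repo ID and the tensor-parallel size (8 GPUs, matching the 8 × 144 GB configuration above) are assumptions:

```bash
# Launch the SGLang server with tensor parallelism across 8 GPUs
# (repo ID and flag values are illustrative; check the model card)
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```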
SGLang will automatically download and cache the model from Hugging Face.
Verify Deployment
Both vLLM and SGLang expose an OpenAI-compatible API. After startup, call it via curl or the OpenAI SDK:
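A quick smoke test with curl, assuming the vLLM server on port 8000 (use port 30000 for the SGLang command above) and the illustrative repo ID from the launch sketches:

```bash
# Send a single chat completion request to the local OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```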
Mac Studio Deployment (MLX)
Thanks to Apple Silicon’s unified memory architecture, Mac Studio can load the full model into memory and run inference entirely on-device, with no GPU clusters required and complete data privacy.
System Requirements
- Mac Studio with Apple Silicon (M4 Max or M3 Ultra)
- macOS 15.0 (Sequoia) or later
- Python 3.10+
- Sufficient unified memory — see Model Selection Guide below
Model Selection Guide
All MLX model variants are available from the mlx-community on Hugging Face. Higher-bit quantization preserves more model quality but requires more memory.
Available Model Variants
| Model Variant | Quantization | Model Size | Min Memory |
|---|---|---|---|
| MiniMax-M2.7 | BF16 (full precision) | 457 GB | 512 GB |
| MiniMax-M2.7-8bit-gs32 | 8-bit (group size 32) | 257 GB | 288 GB |
| MiniMax-M2.7-8bit | 8-bit | 243 GB | 256 GB |
| MiniMax-M2.7-6bit | 6-bit | 186 GB | 200 GB |
| MiniMax-M2.7-5bit | 5-bit | 157 GB | 172 GB |
| MiniMax-M2.7-4bit | 4-bit | 129 GB | 140 GB |
| MiniMax-M2.7-nvfp4 | NVFP4 | 129 GB | 140 GB |
| MiniMax-M2.7-4bit-mxfp4 | 4-bit MXFP4 | 122 GB | 136 GB |
| MiniMax-M2.7-3bit | 3-bit | 100 GB | 112 GB |
Recommended Configurations
The smallest MLX variant (3-bit) requires ~112 GB of memory. Mac Studio configurations with less than 128 GB unified memory (M4 Max 36/48/64 GB, M3 Ultra 96 GB) cannot run MiniMax-M2.7 via MLX, but may still work with inference frameworks that support more aggressive quantization such as llama.cpp.
| Mac Studio Configuration | Recommended Variant | Notes |
|---|---|---|
| M4 Max — 128 GB | 3-bit (100 GB) or 4-bit-mxfp4 (122 GB) | 3-bit fits comfortably; 4-bit-mxfp4 is tight with limited context |
| M3 Ultra — 256 GB | 6-bit (186 GB) | 6-bit recommended; 8-bit (243 GB) leaves only 13 GB headroom, risking swap or jetsam kills |
| M3 Ultra — 512 GB | 8-bit-gs32 (257 GB) | 8-bit-gs32 recommended for near-lossless quality with ample headroom; BF16 (457 GB) leaves only ~55 GB, limiting context length |
Quick Start
Installation
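A minimal sketch, assuming the standard mlx-lm package from PyPI (it pulls in MLX itself as a dependency):

```bash
# Install the MLX LM utilities (requires Apple Silicon and macOS 15+)
pip install --upgrade mlx-lm
```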
Usage
Generate text directly from the terminal or via a Python script:
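A terminal sketch using the mlx_lm.generate CLI. The 4-bit community variant shown here is an assumption; pick whichever quantization fits your memory from the table above:

```bash
# One-off generation from the terminal; the model is downloaded and cached on first run
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --prompt "Explain unified memory on Apple Silicon in two sentences." \
  --max-tokens 256
```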
API Server Deployment
For serving MiniMax-M2.7 as a local API (compatible with OpenAI SDK), you can use mlx_lm.server or third-party tools like LM Studio.
mlx_lm.server
Launch an OpenAI-compatible API server:
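A hedged launch sketch; the model ID and port are illustrative:

```bash
# Start a local OpenAI-compatible server backed by MLX
mlx_lm.server \
  --model mlx-community/MiniMax-M2.7-4bit \
  --port 8080
```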
mlx_lm.server automatically separates the model’s thinking process (<think> tag content) into a reasoning field, while content contains only the final response:
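A request sketch against the server started above; per the behavior described here, the assistant message in the response carries the chain of thought in reasoning and the final answer in content:

```bash
# Query the local server; the returned assistant message exposes the
# <think> content as `reasoning` and the final answer as `content`
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/MiniMax-M2.7-4bit",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 512
  }'
```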
LM Studio
LM Studio provides a desktop GUI for Mac with built-in MLX support:
- Download and install LM Studio
- Search for MiniMax-M2.7 in the model browser
- Select a quantization variant that fits your memory
- Click Load and start chatting
Fine-Tuning (LoRA)
mlx-lm supports parameter-efficient fine-tuning (PEFT) using LoRA / QLoRA. For an MoE model like MiniMax-M2.7, use a quantized base model + QLoRA to keep memory requirements manageable.
Install Training Dependencies
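A minimal sketch; the mlx_lm.lora entry point used below ships with mlx-lm, so an up-to-date install is normally all that is required:

```bash
# mlx-lm bundles the LoRA / QLoRA training entry point (mlx_lm.lora)
pip install --upgrade mlx-lm
```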
Prepare Training Data
Training data uses JSONL format. Create train.jsonl, valid.jsonl, and (optionally) test.jsonl files in your data directory.
Chat format (recommended):
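An illustrative train.jsonl with one record per line, following the chat schema (a messages list of role/content pairs) that mlx-lm's trainer accepts; the file path and sample content are placeholders. valid.jsonl and the optional test.jsonl use the same per-line schema.

```bash
# Write two toy examples into data/train.jsonl (one JSON object per line)
mkdir -p data
cat > data/train.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "Summarize what MiniMax-M2.7 is."}, {"role": "assistant", "content": "MiniMax-M2.7 is a large language model that can be served locally with vLLM, SGLang, or MLX."}]}
{"messages": [{"role": "system", "content": "You are a concise assistant."}, {"role": "user", "content": "Name one benefit of on-device inference."}, {"role": "assistant", "content": "Your data never leaves the machine."}]}
EOF
```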
Start Fine-Tuning
Run QLoRA fine-tuning with a quantized model and memory optimization flags:
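A QLoRA sketch using the flags from the table below; the quantized base model ID and the data directory are assumptions:

```bash
# QLoRA fine-tuning on a quantized base model with memory-saving flags
mlx_lm.lora \
  --model mlx-community/MiniMax-M2.7-4bit \
  --train \
  --data data \
  --batch-size 1 \
  --num-layers 4 \
  --grad-checkpoint \
  --iters 600
```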
| Parameter | Recommended | Notes |
|---|---|---|
| --batch-size | 1 | Must use minimum batch size for large models |
| --num-layers | 4 | Fine-tune only a few layers to save memory |
| --grad-checkpoint | — | Trades compute for memory, reduces peak usage |
| --iters | 600 | Adjust based on your dataset size |
Use the Fine-Tuned Model
Generate with adapters:
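A generation sketch that loads the trained adapters on top of the same quantized base model; the adapters/ directory is mlx_lm.lora's default output location, so adjust the path if you saved them elsewhere:

```bash
# Reuse the quantized base model and apply the LoRA adapters saved during training
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path adapters \
  --prompt "Answer in the style learned during fine-tuning: what is MiniMax-M2.7?" \
  --max-tokens 256
```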
Recommended Parameters
The following parameters are recommended for MiniMax-M2.7 and apply to all inference frameworks:
| Parameter | Recommended Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
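A hedged request sketch passing these values through the OpenAI-compatible endpoint. temperature and top_p are standard request fields, while top_k is an extension that vLLM and SGLang generally accept in the request body; check your server version if it is rejected:

```bash
# Chat completion with the recommended sampling parameters
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [{"role": "user", "content": "Write a haiku about unified memory."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40
  }'
```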
Related Links
MiniMax-M2.7 on Hugging Face
Official model weights, documentation, and benchmarks
MiniMax-M2.7 MLX Models
All quantization variants on MLX Community
vLLM Documentation
Official vLLM documentation and deployment guides
SGLang Documentation
Official SGLang documentation and deployment guides