This guide covers local deployment of MiniMax-M2.7 using three inference frameworks: vLLM and SGLang for Linux GPU servers, and MLX for Apple Silicon Mac Studio. For the MiniMax API service, see Text Generation.
GPU Server Deployment (vLLM / SGLang)
System Requirements
- OS: Linux
- Python: 3.10 – 3.12
- GPU: Compute capability 7.0 or higher
Recommended Configurations
| GPU Configuration | Total KV Cache Capacity |
|---|---|
| 4 × 96 GB GPUs | Up to 400K tokens |
| 8 × 144 GB GPUs | Up to 3M tokens |
Deploy with vLLM
vLLM is a high-performance LLM inference and serving library that uses PagedAttention for efficient KV cache management, combined with continuous batching and prefix caching for state-of-the-art serving throughput.
Installation
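A minimal installation sketch, assuming a recent vLLM release with MiniMax-M2.7 support is published on PyPI; pin whichever version matches your CUDA stack:

```bash
# Create an isolated environment (Python 3.10 - 3.12 per the requirements above)
python3 -m venv vllm-env && source vllm-env/bin/activate

# Install vLLM from PyPI; pick a release that supports MiniMax-M2.7
pip install --upgrade vllm
```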
Start the Server
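A hedged launch sketch. The Hugging Face repo ID (MiniMaxAI/MiniMax-M2.7) and the tensor-parallel degree (4 GPUs, matching the 4 × 96 GB configuration above) are assumptions; adjust both to your hardware and the model card:

```bash
# Serve MiniMax-M2.7 with tensor parallelism across 4 GPUs
# (repo ID and flag values are illustrative; check the model card)
vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --port 8000
```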
vLLM will automatically download and cache the model from Hugging Face.
Deploy with SGLang
SGLang is a high-performance serving framework for large language and multimodal models. It leverages RadixAttention for efficient prefix caching and scheduling, delivering low-latency, high-throughput inference.
Installation
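A minimal installation sketch; depending on your CUDA version you may need the platform-specific wheel or extra index listed in the SGLang docs:

```bash
# Install SGLang with its optional serving dependencies
pip install --upgrade "sglang[all]"
```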
Start the Server
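A hedged launch sketch; the repo ID and the tensor-parallel size (8 GPUs, matching the 8 × 144 GB configuration above) are assumptions:

```bash
# Launch the SGLang server with tensor parallelism across 8 GPUs
# (repo ID and flag values are illustrative; check the model card)
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```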
SGLang will automatically download and cache the model from Hugging Face.
Verify Deployment
Both vLLM and SGLang expose an OpenAI-compatible API. After startup, call it via curl or the OpenAI SDK:
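A quick smoke test with curl, assuming the vLLM server on port 8000 (use port 30000 for the SGLang command above) and the illustrative repo ID from the launch sketches:

```bash
# Send a single chat completion request to the local OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```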
Mac Studio Deployment (MLX)
Thanks to Apple Silicon’s unified memory architecture, Mac Studio can load the full model into memory and run inference entirely on-device, with no GPU clusters required and complete data privacy.
System Requirements
- Mac Studio with Apple Silicon (M4 Max or M3 Ultra)
- macOS 15.0 (Sequoia) or later
- Python 3.10+
- Sufficient unified memory — see Model Selection Guide below
Model Selection Guide
All MLX model variants are available from the mlx-community on Hugging Face. Higher-bit quantization preserves more model quality but requires more memory.
Available Model Variants
| Model Variant | Quantization | Model Size | Min Memory |
|---|---|---|---|
| MiniMax-M2.7 | BF16 (full precision) | 457 GB | 512 GB |
| MiniMax-M2.7-8bit-gs32 | 8-bit (group size 32) | 257 GB | 288 GB |
| MiniMax-M2.7-8bit | 8-bit | 243 GB | 256 GB |
| MiniMax-M2.7-6bit | 6-bit | 186 GB | 200 GB |
| MiniMax-M2.7-5bit | 5-bit | 157 GB | 172 GB |
| MiniMax-M2.7-4bit | 4-bit | 129 GB | 140 GB |
| MiniMax-M2.7-nvfp4 | NVFP4 | 129 GB | 140 GB |
| MiniMax-M2.7-4bit-mxfp4 | 4-bit MXFP4 | 122 GB | 136 GB |
| MiniMax-M2.7-3bit | 3-bit | 100 GB | 112 GB |
Recommended Configurations
The smallest MLX variant (3-bit) requires ~112 GB of memory. Mac Studio configurations with less than 128 GB unified memory (M4 Max 36/48/64 GB, M3 Ultra 96 GB) cannot run MiniMax-M2.7 via MLX, but may still work with inference frameworks that support more aggressive quantization such as llama.cpp.
| Mac Studio Configuration | Recommended Variant | Notes |
|---|---|---|
| M4 Max — 128 GB | 3-bit (100 GB) or 4-bit-mxfp4 (122 GB) | 3-bit fits comfortably; 4-bit-mxfp4 is tight with limited context |
| M3 Ultra — 256 GB | 6-bit (186 GB) | 6-bit recommended; 8-bit (243 GB) leaves only 13 GB headroom, risking swap or jetsam kills |
| M3 Ultra — 512 GB | 8-bit-gs32 (257 GB) | 8-bit-gs32 recommended for near-lossless quality with ample headroom; BF16 (457 GB) leaves only ~55 GB, limiting context length |
Quick Start
Installation
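A minimal sketch, assuming the standard mlx-lm package from PyPI (it pulls in MLX itself as a dependency):

```bash
# Install the MLX LM utilities (requires Apple Silicon and macOS 15+)
pip install --upgrade mlx-lm
```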
Usage
Generate text directly from the terminal or via a Python script:
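A terminal sketch using the mlx_lm.generate CLI. The 4-bit community variant shown here is an assumption; pick whichever quantization fits your memory from the table above:

```bash
# One-off generation from the terminal; the model is downloaded and cached on first run
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --prompt "Explain unified memory on Apple Silicon in two sentences." \
  --max-tokens 256
```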
API Server Deployment
For serving MiniMax-M2.7 as a local API (compatible with OpenAI SDK), you can use mlx_lm.server or third-party tools like LM Studio.
mlx_lm.server
Launch an OpenAI-compatible API server:
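A hedged launch sketch; the model ID and port are illustrative:

```bash
# Start a local OpenAI-compatible server backed by MLX
mlx_lm.server \
  --model mlx-community/MiniMax-M2.7-4bit \
  --port 8080
```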
mlx_lm.server automatically separates the model’s thinking process (<think> tag content) into a reasoning field, while content contains only the final response:
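A request sketch against the server started above; per the behavior described here, the assistant message in the response carries the chain of thought in reasoning and the final answer in content:

```bash
# Query the local server; the returned assistant message exposes the
# <think> content as `reasoning` and the final answer as `content`
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/MiniMax-M2.7-4bit",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 512
  }'
```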
LM Studio
LM Studio provides a desktop GUI for Mac with built-in MLX support:
- Download and install LM Studio
- Search for MiniMax-M2.7 in the model browser
- Select a quantization variant that fits your memory
- Click Load and start chatting
Fine-Tuning (LoRA)
mlx-lm supports parameter-efficient fine-tuning (PEFT) using LoRA / QLoRA. For an MoE model like MiniMax-M2.7, use a quantized base model + QLoRA to keep memory requirements manageable.
Install Training Dependencies
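A minimal sketch; the mlx_lm.lora entry point used below ships with mlx-lm, so an up-to-date install is normally all that is required:

```bash
# mlx-lm bundles the LoRA / QLoRA training entry point (mlx_lm.lora)
pip install --upgrade mlx-lm
```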
Prepare Training Data
Training data uses JSONL format. Create train.jsonl, valid.jsonl, and (optionally) test.jsonl files in your data directory.
Chat format (recommended):
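An illustrative train.jsonl with one record per line, following the chat schema (a messages list of role/content pairs) that mlx-lm's trainer accepts; the file path and sample content are placeholders. valid.jsonl and the optional test.jsonl use the same per-line schema.

```bash
# Write two toy examples into data/train.jsonl (one JSON object per line)
mkdir -p data
cat > data/train.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "Summarize what MiniMax-M2.7 is."}, {"role": "assistant", "content": "MiniMax-M2.7 is a large language model that can be served locally with vLLM, SGLang, or MLX."}]}
{"messages": [{"role": "system", "content": "You are a concise assistant."}, {"role": "user", "content": "Name one benefit of on-device inference."}, {"role": "assistant", "content": "Your data never leaves the machine."}]}
EOF
```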
Start Fine-Tuning
Run QLoRA fine-tuning with a quantized model and memory optimization flags:
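A QLoRA sketch using the flags from the table below; the quantized base model ID and the data directory are assumptions:

```bash
# QLoRA fine-tuning on a quantized base model with memory-saving flags
mlx_lm.lora \
  --model mlx-community/MiniMax-M2.7-4bit \
  --train \
  --data data \
  --batch-size 1 \
  --num-layers 4 \
  --grad-checkpoint \
  --iters 600
```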
| Parameter | Recommended | Notes |
|---|---|---|
| --batch-size | 1 | Must use minimum batch size for large models |
| --num-layers | 4 | Fine-tune only a few layers to save memory |
| --grad-checkpoint | — | Trades compute for memory, reduces peak usage |
| --iters | 600 | Adjust based on your dataset size |
Use the Fine-Tuned Model
Generate with adapters:
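A generation sketch that loads the trained adapters on top of the same quantized base model; the adapters/ directory is mlx_lm.lora's default output location, so adjust the path if you saved them elsewhere:

```bash
# Reuse the quantized base model and apply the LoRA adapters saved during training
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path adapters \
  --prompt "Answer in the style learned during fine-tuning: what is MiniMax-M2.7?" \
  --max-tokens 256
```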
Recommended Parameters
The following parameters are recommended for MiniMax-M2.7 and apply to all inference frameworks:
| Parameter | Recommended Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
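A hedged request sketch passing these values through the OpenAI-compatible endpoint. temperature and top_p are standard request fields, while top_k is an extension that vLLM and SGLang generally accept in the request body; check your server version if it is rejected:

```bash
# Chat completion with the recommended sampling parameters
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [{"role": "user", "content": "Write a haiku about unified memory."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40
  }'
```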
Related Links
MiniMax-M2.7 on Hugging Face
Official model weights, documentation, and benchmarks
MiniMax-M2.7 MLX Models
All quantization variants on MLX Community
vLLM Documentation
Official vLLM documentation and deployment guides
SGLang Documentation
Official SGLang documentation and deployment guides