# Local Deployment Guide

> Deploy MiniMax-M2.7 locally using vLLM, SGLang (Linux GPU), or MLX (Mac Studio), with hardware-specific configuration guides.

<Note>
  This guide covers local deployment of MiniMax-M2.7 using three inference frameworks: [vLLM](https://docs.vllm.ai/en/stable/) and [SGLang](https://github.com/sgl-project/sglang) for Linux GPU servers, and [MLX](https://github.com/ml-explore/mlx) for Apple Silicon Mac Studio. For the MiniMax API service, see [LLM](/guides/text-generation).
</Note>

***

## GPU Server Deployment (vLLM / SGLang)

### System Requirements

* **OS**: Linux
* **Python**: 3.10 – 3.12
* **GPU**: Compute capability 7.0 or higher

#### Recommended Configurations

| GPU Configuration   | Total KV Cache Capacity |
| :------------------ | :---------------------- |
| **4 × 96 GB** GPUs  | Up to 400K tokens       |
| **8 × 144 GB** GPUs | Up to 3M tokens         |

<Tip>
  The values above represent the total aggregate hardware KV Cache capacity. The maximum context length per individual sequence remains **196K** tokens.
</Tip>
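
To make the distinction concrete, the rough arithmetic below estimates how many full-length sequences could sit in the aggregate cache at the same time. It is a back-of-the-envelope sketch, not a capacity guarantee: it takes the table's "up to" figures at face value, reads 196K as 196,000 tokens (an assumption), and ignores prefix sharing and block granularity.

```python
# Per-sequence limit from the tip above; 196K is read as 196,000 tokens (assumption).
MAX_CONTEXT_TOKENS = 196_000


def max_full_length_sequences(total_kv_tokens: int,
                              context_tokens: int = MAX_CONTEXT_TOKENS) -> int:
    """Number of sequences at full context length that fit in the aggregate KV cache."""
    return total_kv_tokens // context_tokens


print(max_full_length_sequences(400_000))    # 4 x 96 GB configuration
print(max_full_length_sequences(3_000_000))  # 8 x 144 GB configuration
```

Shorter requests use proportionally less cache, so real concurrency is usually much higher than this worst-case count.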

### Deploy with vLLM

vLLM is a high-performance LLM inference and serving library. It uses PagedAttention for efficient KV cache management and combines it with continuous batching and prefix caching to increase serving throughput.

#### Installation

```bash theme={null}
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```

#### Start the Server

vLLM will automatically download and cache the model from Hugging Face.

<CodeGroup>
  ```bash 4-GPU theme={null}
  SAFETENSORS_FAST_GPU=1 vllm serve \
      MiniMaxAI/MiniMax-M2.7 --trust-remote-code \
      --tensor-parallel-size 4 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think
  ```

  ```bash 8-GPU theme={null}
  SAFETENSORS_FAST_GPU=1 vllm serve \
      MiniMaxAI/MiniMax-M2.7 --trust-remote-code \
      --enable-expert-parallel --tensor-parallel-size 8 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think
  ```
</CodeGroup>

### Deploy with SGLang

SGLang is a high-performance serving framework for large language and multimodal models. It leverages RadixAttention for efficient prefix caching and scheduling, delivering low-latency, high-throughput inference.

#### Installation

```bash theme={null}
uv venv
source .venv/bin/activate
uv pip install sglang
```

#### Start the Server

SGLang will automatically download and cache the model from Hugging Face.

<CodeGroup>
  ```bash 4-GPU theme={null}
  python -m sglang.launch_server \
      --model-path MiniMaxAI/MiniMax-M2.7 \
      --tp-size 4 \
      --tool-call-parser minimax-m2 \
      --reasoning-parser minimax-append-think \
      --host 0.0.0.0 \
      --trust-remote-code \
      --port 8000 \
      --mem-fraction-static 0.85
  ```

  ```bash 8-GPU theme={null}
  python -m sglang.launch_server \
      --model-path MiniMaxAI/MiniMax-M2.7 \
      --tp-size 8 \
      --ep-size 8 \
      --tool-call-parser minimax-m2 \
      --reasoning-parser minimax-append-think \
      --host 0.0.0.0 \
      --trust-remote-code \
      --port 8000 \
      --mem-fraction-static 0.85
  ```
</CodeGroup>

### Verify Deployment

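Both vLLM and SGLang expose an OpenAI-compatible API. The examples below assume the server listens on port 8000, which is vLLM's default and the value passed via `--port` in the SGLang commands above. After startup, call it via curl or the OpenAI SDK: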

<CodeGroup>
  ```bash curl theme={null}
  curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "MiniMaxAI/MiniMax-M2.7",
          "messages": [
              {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
              {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
          ]
      }'
  ```

  ```python Python theme={null}
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",
      api_key="not-needed",
  )

  response = client.chat.completions.create(
      model="MiniMaxAI/MiniMax-M2.7",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Who won the world series in 2020?"},
      ],
      temperature=1.0,
      top_p=0.95,
      max_tokens=8192,
  )

  print(response.choices[0].message.content.strip())
  ```
</CodeGroup>
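
Loading a model of this size can take a while, so a script may want to wait until the server answers before sending requests. The sketch below polls the `/v1/models` endpoint using only the standard library. The helper names (`list_models`, `wait_for_model`), the retry counts, and the assumption that the server lists the model under the ID used in the launch command are illustrative choices, not part of either framework.

```python
import json
import time
import urllib.request
from typing import Callable, List


def list_models(base_url: str, timeout_s: float = 5.0) -> List[str]:
    """Return the model IDs reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_model(
    model_id: str,
    base_url: str = "http://localhost:8000/v1",
    attempts: int = 60,
    delay_s: float = 10.0,
    fetch: Callable[[str], List[str]] = list_models,
) -> bool:
    """Poll until model_id is listed; return False after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if model_id in fetch(base_url):
                return True
        except OSError:
            pass  # connection refused or timed out: server is not ready yet
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A deployment script could call `wait_for_model("MiniMaxAI/MiniMax-M2.7")` and only proceed to the chat request when it returns `True`. The `fetch` parameter exists so the polling logic can be exercised without a running server.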

***

## Mac Studio Deployment (MLX)

Thanks to Apple Silicon's unified memory architecture, a Mac Studio can load the full model into memory and run inference entirely on-device. No GPU cluster is required, and prompts and outputs never leave the machine.

### System Requirements

* **Mac Studio** with Apple Silicon (M4 Max or M3 Ultra)
* **macOS 15.0** (Sequoia) or later
* **Python 3.10+**
* **Sufficient unified memory** — see [Model Selection Guide](#model-selection-guide) below

### Model Selection Guide

All MLX model variants are available from the [mlx-community](https://huggingface.co/mlx-community/models?search=minimax-m2.7) organization on Hugging Face. Higher-bit quantization preserves more model quality but requires more memory.

#### Available Model Variants

| Model Variant                                                                           | Quantization          | Model Size | Min Memory |
| :-------------------------------------------------------------------------------------- | :-------------------- | :--------- | :--------- |
| [MiniMax-M2.7](https://huggingface.co/mlx-community/MiniMax-M2.7)                       | BF16 (full precision) | 457 GB     | 512 GB     |
| [MiniMax-M2.7-8bit-gs32](https://huggingface.co/mlx-community/MiniMax-M2.7-8bit-gs32)   | 8-bit (group size 32) | 257 GB     | 288 GB     |
| [MiniMax-M2.7-8bit](https://huggingface.co/mlx-community/MiniMax-M2.7-8bit)             | 8-bit                 | 243 GB     | 256 GB     |
| [MiniMax-M2.7-6bit](https://huggingface.co/mlx-community/MiniMax-M2.7-6bit)             | 6-bit                 | 186 GB     | 200 GB     |
| [MiniMax-M2.7-5bit](https://huggingface.co/mlx-community/MiniMax-M2.7-5bit)             | 5-bit                 | 157 GB     | 172 GB     |
| [MiniMax-M2.7-4bit](https://huggingface.co/mlx-community/MiniMax-M2.7-4bit)             | 4-bit                 | 129 GB     | 140 GB     |
| [MiniMax-M2.7-nvfp4](https://huggingface.co/mlx-community/MiniMax-M2.7-nvfp4)           | NVFP4                 | 129 GB     | 140 GB     |
| [MiniMax-M2.7-4bit-mxfp4](https://huggingface.co/mlx-community/MiniMax-M2.7-4bit-mxfp4) | 4-bit MXFP4           | 122 GB     | 136 GB     |
| [MiniMax-M2.7-3bit](https://huggingface.co/mlx-community/MiniMax-M2.7-3bit)             | 3-bit                 | 100 GB     | 112 GB     |

<Tip>
  "Min Memory" is the model size plus headroom for KV cache, activations, and OS overhead: roughly 10–15 GB for most variants, and more for the two largest (BF16 and 8-bit-gs32). Actual memory usage grows with context length.
</Tip>
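
As a quick way to read the table, the sketch below filters the variants by the "Min Memory" column for a given amount of unified memory. The numbers are copied from the table; the function name `variants_that_fit` is a hypothetical helper. Treat the result as a strict floor rather than a recommendation, since a variant that only just fits leaves little room for context.

```python
# (variant name, "Min Memory" in GB), copied from the table above, largest first
VARIANTS = [
    ("MiniMax-M2.7", 512),
    ("MiniMax-M2.7-8bit-gs32", 288),
    ("MiniMax-M2.7-8bit", 256),
    ("MiniMax-M2.7-6bit", 200),
    ("MiniMax-M2.7-5bit", 172),
    ("MiniMax-M2.7-4bit", 140),
    ("MiniMax-M2.7-nvfp4", 140),
    ("MiniMax-M2.7-4bit-mxfp4", 136),
    ("MiniMax-M2.7-3bit", 112),
]


def variants_that_fit(unified_memory_gb: int) -> list[str]:
    """Variants whose Min Memory does not exceed the given unified memory."""
    return [name for name, min_gb in VARIANTS if min_gb <= unified_memory_gb]


for memory_gb in (96, 128, 256, 512):
    print(memory_gb, "GB:", variants_that_fit(memory_gb))
```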

#### Recommended Configurations

The smallest MLX variant (3-bit) requires \~112 GB of memory. Mac Studio configurations with less than 128 GB unified memory (M4 Max 36/48/64 GB, M3 Ultra 96 GB) cannot run MiniMax-M2.7 via MLX, but may still work with inference frameworks that support more aggressive quantization such as [llama.cpp](https://github.com/ggerganov/llama.cpp).

| Mac Studio Configuration | Recommended Variant                    | Notes                                                                                                                                |
| :----------------------- | :------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------- |
| M4 Max — 128 GB          | 3-bit (100 GB) or 4-bit-mxfp4 (122 GB) | 3-bit fits comfortably; 4-bit-mxfp4 is tight with limited context                                                                    |
| M3 Ultra — 256 GB        | 6-bit (186 GB)                         | **6-bit recommended**; 8-bit (243 GB) leaves only 13 GB headroom, risking swap or jetsam kills                                       |
| M3 Ultra — 512 GB        | 8-bit-gs32 (257 GB)                    | **8-bit-gs32 recommended** for near-lossless quality with ample headroom; BF16 (457 GB) leaves only \~55 GB, limiting context length |

### Quick Start

#### Installation

```bash theme={null}
pip install -U mlx-lm
```

#### Usage

Generate text directly from the terminal or via a Python script:

<CodeGroup>
  ```bash CLI theme={null}
  mlx_lm.generate \
    --model mlx-community/MiniMax-M2.7-6bit \
    --prompt "Explain the concept of Mixture of Experts in neural networks." \
    --max-tokens 8192 \
    --temp 1.0
  ```

  ```python Python theme={null}
  from mlx_lm import load, generate
  from mlx_lm.sample_utils import make_sampler

  # Choose the variant that fits your memory — see Model Selection Guide above
  model, tokenizer = load("mlx-community/MiniMax-M2.7-6bit")

  sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

  messages = [{"role": "user", "content": "Write a Python function to find prime numbers."}]
  prompt = tokenizer.apply_chat_template(
      messages,
      tokenize=False,
      add_generation_prompt=True,
  )

  response = generate(
      model,
      tokenizer,
      prompt=prompt,
      max_tokens=8192,
      sampler=sampler,
      verbose=True,
  )
  print(response)
  ```
</CodeGroup>
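
When generating directly like this, the returned text may contain the model's thinking as well as the final answer. The sketch below separates the two, assuming the thinking ends with a closing `</think>` tag and that the opening `<think>` tag may or may not appear in the output (it could be supplied by the chat template instead). Both points are assumptions about the raw output format, and `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    Everything before the first closing tag is treated as reasoning.
    If no closing tag is present, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1)  # drop the opening tag if present
    return reasoning.strip(), tail.strip()


raw = "<think>\nThe user wants a prime sieve.\n</think>\n\nHere is the function."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

In the script above, this could be applied as `reasoning, answer = split_reasoning(response)`.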

### API Server Deployment

To serve MiniMax-M2.7 as a local API that is compatible with the OpenAI SDK, you can use **mlx\_lm.server** or third-party tools like **LM Studio**.

#### mlx\_lm.server

Launch an OpenAI-compatible API server:

```bash theme={null}
mlx_lm.server \
  --model mlx-community/MiniMax-M2.7-6bit \
  --port 8080
```

`mlx_lm.server` automatically separates the model's thinking process (`<think>` tag content) into a `reasoning` field, while `content` contains only the final response:

```json theme={null}
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "\n\nHello! Nice to meet you!",
      "reasoning": "The user greeted me, I should respond in a friendly manner."
    }
  }]
}
```

Once the server is running, call it via curl or the OpenAI SDK:

<CodeGroup>
  ```bash curl theme={null}
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/MiniMax-M2.7-6bit",
      "messages": [{"role": "user", "content": "Hello!"}],
      "temperature": 1.0,
      "top_p": 0.95,
      "max_tokens": 8192
    }'
  ```

  ```python Python theme={null}
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8080/v1",
      api_key="not-needed",
  )

  response = client.chat.completions.create(
      model="mlx-community/MiniMax-M2.7-6bit",
      messages=[{"role": "user", "content": "Hello!"}],
      temperature=1.0,
      top_p=0.95,
      max_tokens=8192,
  )

  # content is auto-separated; strip() removes leading newlines
  print(response.choices[0].message.content.strip())
  ```
</CodeGroup>
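
To read the separated thinking as well as the answer, a client can pull both fields out of the response body. The sketch below does this on the raw JSON with the standard library, using a sample that mirrors the response shape shown above. The helper name `reasoning_and_content` is hypothetical, and the fallback to `None` when `reasoning` is missing is an assumption about how other servers might respond.

```python
import json
from typing import Optional, Tuple


def reasoning_and_content(response_json: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, content) from a chat completion response body."""
    message = json.loads(response_json)["choices"][0]["message"]
    reasoning = message.get("reasoning")  # None if the server does not split it out
    content = (message.get("content") or "").strip()
    return reasoning, content


# Sample body mirroring the response shape shown above
sample = r'''
{"choices": [{"message": {
    "role": "assistant",
    "content": "\n\nHello! Nice to meet you!",
    "reasoning": "The user greeted me, I should respond in a friendly manner."
}}]}
'''
reasoning, content = reasoning_and_content(sample)
print("reasoning:", reasoning)
print("content:", content)
```

With the OpenAI SDK, the same field can likely be read with `getattr(response.choices[0].message, "reasoning", None)`, since `reasoning` is not part of the SDK's typed message model.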

#### LM Studio

[LM Studio](https://lmstudio.ai/) provides a desktop GUI for Mac with built-in MLX support:

1. Download and install [LM Studio](https://lmstudio.ai/)
2. Search for `MiniMax-M2.7` in the model browser
3. Select a quantization variant that fits your memory
4. Click **Load** and start chatting

LM Studio also exposes an OpenAI-compatible local server for integration with other tools.

### Fine-Tuning (LoRA)

`mlx-lm` supports parameter-efficient fine-tuning (PEFT) using LoRA / QLoRA. For an MoE model like MiniMax-M2.7, use a **quantized base model + QLoRA** to keep memory requirements manageable.

<Warning>
  Fine-tuning MiniMax-M2.7 on Mac Studio is extremely memory-intensive. Even with QLoRA, you need at least **192 GB unified memory** and very conservative training settings; among the Mac Studio configurations listed above, that means an M3 Ultra with 256 GB or 512 GB. Configurations with 128 GB or less are not recommended for fine-tuning.
</Warning>

#### Install Training Dependencies

```bash theme={null}
pip install "mlx-lm[train]"
```

#### Prepare Training Data

Training data uses JSONL format, with one JSON object per line. Create `train.jsonl`, `valid.jsonl`, and (optionally) `test.jsonl` files in your data directory.

**Chat format (recommended):**

```json theme={null}
{"messages": [{"role": "user", "content": "Your question"}, {"role": "assistant", "content": "Expected answer"}]}
```

**Completions format:**

```json theme={null}
{"prompt": "Input text", "completion": "Expected output"}
```
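
If your examples start out as question/answer pairs in Python, a small script can write them in the chat format. The sketch below is one possible way to do that with the standard library. The function name `write_chat_dataset`, the 10% validation split, and the fixed shuffle seed are illustrative assumptions, not requirements of `mlx-lm`.

```python
import json
import random
from pathlib import Path


def write_chat_dataset(pairs, data_dir, valid_fraction=0.1, seed=0):
    """Write (question, answer) pairs as chat-format train.jsonl and valid.jsonl."""
    records = [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]
    random.Random(seed).shuffle(records)  # deterministic shuffle before splitting
    n_valid = max(1, int(len(records) * valid_fraction))
    splits = {"valid.jsonl": records[:n_valid], "train.jsonl": records[n_valid:]}

    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, rows in splits.items():
        with open(out_dir / filename, "w", encoding="utf-8") as f:
            for row in rows:
                # One JSON object per line, as JSONL requires
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

For example, `write_chat_dataset(pairs, "./your-data-dir")` returns the number of records written to each file.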

#### Start Fine-Tuning

Run QLoRA fine-tuning with a quantized model and memory optimization flags:

```bash theme={null}
mlx_lm.lora \
  --model mlx-community/MiniMax-M2.7-4bit \
  --train \
  --data ./your-data-dir \
  --batch-size 1 \
  --num-layers 4 \
  --grad-checkpoint \
  --iters 600 \
  --adapter-path ./adapters
```

Key parameter guidelines:

| Parameter           | Recommended | Notes                                         |
| :------------------ | :---------- | :-------------------------------------------- |
| `--batch-size`      | `1`         | Use the minimum batch size for large models   |
| `--num-layers`      | `4`         | Fine-tune only a few layers to save memory    |
| `--grad-checkpoint` | —           | Trades compute for memory, reduces peak usage |
| `--iters`           | `600`       | Adjust based on your dataset size             |
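
One way to adjust `--iters` to your dataset is to decide how many passes over the training set you want and convert that into iterations. The sketch below assumes each iteration consumes one batch of `--batch-size` examples. The helper name `iters_for_epochs` is hypothetical, and the right number of passes depends on your data.

```python
import math


def iters_for_epochs(num_examples: int, epochs: float, batch_size: int = 1) -> int:
    """Iterations needed to pass over the training set `epochs` times."""
    return math.ceil(num_examples * epochs / batch_size)


# With batch size 1, 600 iterations corresponds to 3 passes over 200 examples.
print(iters_for_epochs(200, 3))
```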

#### Use the Fine-Tuned Model

**Generate with adapters:**

```bash theme={null}
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path ./adapters \
  --prompt "Your prompt here" \
  --max-tokens 2048
```

**Fuse adapters into the model (optional):**

```bash theme={null}
mlx_lm.fuse \
  --model mlx-community/MiniMax-M2.7-4bit \
  --adapter-path ./adapters
```

After fusing, you can load the merged model directly without specifying an adapter path.

***

## Recommended Parameters

The following parameters are recommended for MiniMax-M2.7 and apply to all inference frameworks:

| Parameter     | Recommended Value |
| :------------ | :---------------- |
| `temperature` | `1.0`             |
| `top_p`       | `0.95`            |
| `top_k`       | `40`              |
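
To avoid repeating these values in every request, they can be kept in one place and merged into the request body. The sketch below builds a `/v1/chat/completions` JSON body with the recommended values as defaults. The helper name `chat_request_body` is hypothetical. Note that `top_k` is not a standard OpenAI Chat Completions field, so whether a given server honors it in the request body is an assumption to verify for your framework and version.

```python
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def chat_request_body(model: str, messages: list, **overrides) -> dict:
    """Build a chat completion body; explicit overrides win over the defaults."""
    return {"model": model, "messages": messages, **RECOMMENDED_SAMPLING, **overrides}


body = chat_request_body(
    "MiniMaxAI/MiniMax-M2.7",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=8192,
)
print(body)
```

With the OpenAI Python SDK, non-standard fields such as `top_k` are passed through `extra_body`, for example `extra_body={"top_k": 40}`, rather than as a named argument.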

***

## Related Links

<CardGroup cols={2}>
  <Card title="MiniMax-M2.7 on Hugging Face" icon="file-text" href="https://huggingface.co/MiniMaxAI/MiniMax-M2.7">
    Official model weights, documentation, and benchmarks
  </Card>

  <Card title="MiniMax-M2.7 MLX Models" icon="box" href="https://huggingface.co/mlx-community/models?search=minimax-m2.7">
    All quantization variants on MLX Community
  </Card>

  <Card title="vLLM Documentation" icon="book-open" href="https://docs.vllm.ai/en/stable/">
    Official vLLM documentation and deployment guides
  </Card>

  <Card title="SGLang Documentation" icon="book-open" href="https://docs.sglang.io/">
    Official SGLang documentation and deployment guides
  </Card>
</CardGroup>
