> ## Documentation Index
> Fetch the complete documentation index at: https://platform.minimax.io/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Control Robot Arm with Conversation: Make Robots Understand You

> <Note> This tutorial will guide you to use MiniMax LLM & MCP visual understanding to build an intelligent robot that can understand natural language instructions and perform complex robotic arm manipulation tasks. </Note>

<div style={{display: 'flex', alignItems: 'center', gap: '12px', marginBottom: '16px'}}>
  <div style={{width: '40px', height: '40px', borderRadius: '50%', background: 'linear-gradient(135deg, #3b82f6, #8b5cf6)', display: 'flex', alignItems: 'center', justifyContent: 'center', color: 'white', fontWeight: 'bold'}}>D</div>

  <div>
    <div style={{fontWeight: 500}}>Devin</div>
    <div style={{fontSize: '0.875rem', color: '#9ca3af'}}>January 26, 2026</div>
  </div>
</div>

<a href="https://github.com/MiniMax-OpenPlatform/MiniMax-Agent-VLA-Demo" target="_blank" style={{display: 'inline-flex', alignItems: 'center', gap: '8px', padding: '8px 16px', borderRadius: '8px', border: '1px solid #e5e7eb', textDecoration: 'none', color: 'inherit', marginBottom: '32px'}}>
  <svg height="20" width="20" viewBox="0 0 16 16" fill="currentColor">
    <path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z" />
  </svg>

  Download from GitHub
</a>

# Project Overview

**Robot Agent** is built on **MiniMax M2.1** LLM and **Pi05 VLA** (Vision-Language-Action) model, enabling natural language-driven robotic arm manipulation in the LIBERO simulation environment.

<img src="https://filecdn.minimax.chat/public/8eaae90f-5762-424a-9b32-732a58678d77.webp" alt="Demo" style={{borderRadius: '8px', marginBottom: '16px', maxWidth: '100%'}} />

<AccordionGroup>
  <Accordion title=" MiniMax M2.1 Task Planning">
    Understands user's natural language instructions, decomposes complex tasks into executable steps, and coordinates multi-step task execution.
  </Accordion>

  <Accordion title=" MiniMax MCP Visual Understanding">
    Uses MCP to invoke visual understanding capabilities, analyze scene images, verify task execution results, and enable closed-loop feedback control.
  </Accordion>

  <Accordion title=" Pi05 VLA Action Execution">
    A vision-language-action model based on PaliGemma that generates precise robotic arm control actions based on scene images and task instructions.
  </Accordion>

  <Accordion title=" LIBERO Simulation Environment">
    Executes various robot manipulation tasks in a simulation environment powered by the MuJoCo physics engine.
  </Accordion>
</AccordionGroup>

***

## System Architecture

```
User Command → MiniMax LLM (Task Planning) → Pi05 VLA (Action Execution) → LIBERO Sim
                    ↑                                                          ↓
             MCP Visual Understanding ← ────────── Scene Image ←───────────────┘
```

| Module               | Technology      | Description                             |
| -------------------- | --------------- | --------------------------------------- |
| Task Planning        | MiniMax M2.1    | Understand user intent, decompose tasks |
| Visual Understanding | MiniMax MCP     | Scene analysis, result verification     |
| Action Execution     | Pi05 VLA        | Vision-Language-Action model            |
| Simulation           | LIBERO / MuJoCo | Robot manipulation simulation           |

***

## Quick Start

<Steps>
  <Step title="Clone Repository">
    ```bash theme={null}
    git clone https://github.com/MiniMax-OpenPlatform/MiniMax-Agent-VLA-Demo.git
    cd MiniMax-Agent-VLA-Demo
    ```
  </Step>

  <Step title="Configure API Keys">
    ```bash theme={null}
    # MiniMax API Key (for LLM and MCP visual understanding)
    # Get it at: https://platform.minimax.io/
    export ANTHROPIC_API_KEY="your-minimax-api-key"

    # HuggingFace Token (for downloading Pi05 model)
    # Get it at: https://huggingface.co/settings/tokens
    export HF_TOKEN="your-huggingface-token"
    ```
  </Step>

  <Step title="Download Pi05 Model">
    ```bash theme={null}
    # Install huggingface_hub
    pip install huggingface_hub

    # Login to HuggingFace
    huggingface-cli login

    # Download Pi05 LIBERO fine-tuned model
    python -c "
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id='lerobot/pi05_libero',
        local_dir='./models/pi05_libero_finetuned'
    )
    "
    ```

    <Tip>
      Default model path: `./models/pi05_libero_finetuned`. To change it, edit the `MODEL_PATH` variable in `agent_mode.py`.
    </Tip>
  </Step>

  <Step title="Install Dependencies">
    ```bash theme={null}
    # Create virtual environment
    python -m venv .venv
    source .venv/bin/activate

    # Install all dependencies
    pip install -r requirements.txt
    ```

    **Dependencies:**

    | Dependency | Description                             |
    | ---------- | --------------------------------------- |
    | LeRobot    | HuggingFace robotics library (included) |
    | MuJoCo     | DeepMind physics simulation engine      |
    | LIBERO     | Robot manipulation simulation benchmark |
    | MCP        | Model Context Protocol client           |
  </Step>

  <Step title="Run the Agent">
    ```bash theme={null}
    # Set display (VNC environment)
    export DISPLAY=:2

    # Run Agent
    python agent_mode.py
    ```

    Select a task scenario after startup:

    * `libero_object` - Object generalization
    * `libero_spatial` - Spatial relationship understanding
    * `libero_goal` - Different action goals (recommended)
  </Step>
</Steps>

***

## Supported Tasks

In the LIBERO Goal scenario, the Agent supports the following 10 manipulation tasks:

| #  | Task Command                                  | Description                                 |
| -- | --------------------------------------------- | ------------------------------------------- |
| 1  | `open the middle drawer of the cabinet`       | Open the middle drawer of the cabinet       |
| 2  | `put the bowl on the stove`                   | Place the bowl on the stove                 |
| 3  | `put the wine bottle on top of the cabinet`   | Place the wine bottle on top of the cabinet |
| 4  | `open the top drawer and put the bowl inside` | Open the top drawer and put the bowl inside |
| 5  | `put the bowl on top of the cabinet`          | Place the bowl on top of the cabinet        |
| 6  | `push the plate to the front of the stove`    | Push the plate to the front of the stove    |
| 7  | `put the cream cheese in the bowl`            | Put the cream cheese in the bowl            |
| 8  | `turn on the stove`                           | Turn on the stove                           |
| 9  | `put the bowl on the plate`                   | Place the bowl on the plate                 |
| 10 | `put the wine bottle on the rack`             | Place the wine bottle on the rack           |

***

## Core Code Analysis

### Agent Tool Definitions

The Agent interacts with the environment through two core tools:

```python theme={null}
AGENT_TOOLS = [
    {
        "name": "execute_task",
        "description": "Execute a manipulation task using the robot arm.",
        "input_schema": {
            "type": "object",
            "properties": {
                "task": {
                    "type": "string",
                    "description": "A manipulation task instruction"
                }
            },
            "required": ["task"]
        }
    },
    {
        "name": "get_scene_info",
        "description": "Capture camera image and use VLM to analyze the current scene.",
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": []
        }
    }
]
```

### MiniMax M2.1 Task Planning

Using Anthropic-compatible interface to call MiniMax M2.1:

```python theme={null}
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.minimax.io/anthropic",
    api_key=api_key,
    default_headers={"Authorization": f"Bearer {api_key}"}
)

response = client.messages.create(
    model="MiniMax-M2.1",
    max_tokens=4096,
    system=system_prompt,
    messages=conversation_history,
    tools=AGENT_TOOLS
)
```

### MCP Visual Understanding

Using MCP to invoke MiniMax visual understanding for task verification:

```python theme={null}
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="uvx",
    args=["minimax-coding-plan-mcp", "-y"],
    env={
        "MINIMAX_API_KEY": api_key,
        "MINIMAX_API_HOST": "https://api.minimax.io"
    }
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        result = await session.call_tool("understand_image", {
            "image_source": image_path,
            "prompt": "Verify if the task was completed successfully."
        })
```

***

## Technical Details

### Pi05 Model Parameters

| Parameter         | Value                                                                    |
| ----------------- | ------------------------------------------------------------------------ |
| Base Model        | PaliGemma                                                                |
| Input             | 2 camera images + robot arm state + language instruction                 |
| Output            | 7-dim action (end-effector position delta + orientation delta + gripper) |
| Control Frequency | 10Hz                                                                     |
| Max Steps         | 280 steps/task                                                           |

### Agent Workflow

1. **User Input**: Receive natural language instructions
2. **Task Planning**: MiniMax M2.1 understands intent and maps to supported tasks
3. **Action Execution**: Pi05 VLA generates robotic arm control sequences
4. **Result Verification**: MCP visual understanding analyzes the scene to confirm task completion
5. **Feedback Loop**: If verification fails, automatically retry the task

***

## FAQ

<AccordionGroup>
  <Accordion title="API call error 'Invalid API Key'">
    Check if `ANTHROPIC_API_KEY` is correctly set to your MiniMax API Key.
  </Accordion>

  <Accordion title="Model loading failed">
    1. Check if `HF_TOKEN` is configured
    2. Check if `MODEL_PATH` path is correct
  </Accordion>

  <Accordion title="MCP visual understanding error">
    Make sure you have installed: `pip install mcp` and the `uvx` command is available.
  </Accordion>

  <Accordion title="Visualization window not showing">
    Set `export DISPLAY=:2` (VNC) or ensure you have an X11 environment.
  </Accordion>
</AccordionGroup>

***

## Application Extensions

Based on the current architecture, developers can explore the following directions:

* **Multi-task Chaining**: Implement automatic decomposition and sequential execution of complex tasks
* **Failure Recovery**: Enhance error detection and automatic recovery capabilities
* **Real-world Deployment**: Transfer simulation policies to physical robotic arms
* **Multi-modal Interaction**: Combine speech recognition to enable voice-controlled robots

***

## Summary

In this tutorial, we demonstrated how to build an intelligent robot using **MiniMax M2.1** and **MCP visual understanding**:

* **MiniMax M2.1** understands user intent and converts natural language instructions into specific manipulation tasks
* **MiniMax MCP** provides visual understanding capabilities to verify task execution results and enable closed-loop control
* **Pi05 VLA** serves as the underlying executor, generating precise robotic arm actions based on visual input
* **LIBERO/MuJoCo** provides a realistic physics simulation environment

This LLM + VLM + VLA collaborative architecture demonstrates the potential of large models in the field of robot control.

***

## Related Resources

<Columns cols={3}>
  <Card title="Anthropic API" icon="book-open" href="/api-reference/text-anthropic-api">
    MiniMax M2.1 Integration Guide
  </Card>

  <Card title="MCP Guide" icon="plug" href="/token-plan/mcp-guide">
    MCP Tool Configuration
  </Card>

  <Card title="LeRobot" icon="github" href="https://github.com/huggingface/lerobot">
    HuggingFace Robotics Library
  </Card>
</Columns>
