# Control Robot Arm with Conversation: Make Robots Understand You

> This tutorial shows how to use the MiniMax LLM and MCP visual understanding to build an intelligent robot that understands natural language instructions and performs complex robotic arm manipulation tasks.

Devin

January 26, 2026

# Project Overview

**Robot Agent** is built on the **MiniMax M2.1** LLM and the **Pi05 VLA** (Vision-Language-Action) model, enabling natural language-driven robotic arm manipulation in the LIBERO simulation environment.

* **MiniMax M2.1 (task planning)**: Understands the user's natural language instructions, decomposes complex tasks into executable steps, and coordinates multi-step task execution.
* **MiniMax MCP (visual understanding)**: Uses MCP to invoke visual understanding capabilities, analyze scene images, verify task execution results, and enable closed-loop feedback control.
* **Pi05 VLA (action execution)**: A vision-language-action model based on PaliGemma that generates precise robotic arm control actions from scene images and task instructions.
* **LIBERO (simulation)**: Executes various robot manipulation tasks in a simulation environment powered by the MuJoCo physics engine.

***

## System Architecture

```
User Command → MiniMax LLM (Task Planning) → Pi05 VLA (Action Execution) → LIBERO Sim
                         ↑                                                     ↓
             MCP Visual Understanding ← ────────── Scene Image ←───────────────┘
```

| Module               | Technology      | Description                             |
| -------------------- | --------------- | --------------------------------------- |
| Task Planning        | MiniMax M2.1    | Understand user intent, decompose tasks |
| Visual Understanding | MiniMax MCP     | Scene analysis, result verification     |
| Action Execution     | Pi05 VLA        | Vision-Language-Action model            |
| Simulation           | LIBERO / MuJoCo | Robot manipulation simulation           |

***

## Quick Start

Clone the repository:

```bash
git clone https://github.com/MiniMax-OpenPlatform/MiniMax-Agent-VLA-Demo.git
cd MiniMax-Agent-VLA-Demo
```

Set the environment variables:

```bash
# MiniMax API Key (for LLM and MCP visual understanding)
# Get it at: https://platform.minimax.io/
export ANTHROPIC_API_KEY="your-minimax-api-key"

# HuggingFace Token (for downloading Pi05 model)
# Get it at: https://huggingface.co/settings/tokens
export HF_TOKEN="your-huggingface-token"
```

Download the Pi05 model:

```bash
# Install huggingface_hub
pip install huggingface_hub

# Login to HuggingFace
huggingface-cli login

# Download Pi05 LIBERO fine-tuned model
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='lerobot/pi05_libero',
    local_dir='./models/pi05_libero_finetuned'
)
"
```

Default model path: `./models/pi05_libero_finetuned`.
To change it, edit the `MODEL_PATH` variable in `agent_mode.py`.

Install the dependencies:

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install all dependencies
pip install -r requirements.txt
```

**Dependencies:**

| Dependency | Description                             |
| ---------- | --------------------------------------- |
| LeRobot    | HuggingFace robotics library (included) |
| MuJoCo     | DeepMind physics simulation engine      |
| LIBERO     | Robot manipulation simulation benchmark |
| MCP        | Model Context Protocol client           |

Run the Agent:

```bash
# Set display (VNC environment)
export DISPLAY=:2

# Run Agent
python agent_mode.py
```

Select a task scenario after startup:

* `libero_object` - Object generalization
* `libero_spatial` - Spatial relationship understanding
* `libero_goal` - Different action goals (recommended)

***

## Supported Tasks

In the LIBERO Goal scenario, the Agent supports the following 10 manipulation tasks:

| #  | Task Command                                  | Description                                 |
| -- | --------------------------------------------- | ------------------------------------------- |
| 1  | `open the middle drawer of the cabinet`       | Open the middle drawer of the cabinet       |
| 2  | `put the bowl on the stove`                   | Place the bowl on the stove                 |
| 3  | `put the wine bottle on top of the cabinet`   | Place the wine bottle on top of the cabinet |
| 4  | `open the top drawer and put the bowl inside` | Open the top drawer and put the bowl inside |
| 5  | `put the bowl on top of the cabinet`          | Place the bowl on top of the cabinet        |
| 6  | `push the plate to the front of the stove`    | Push the plate to the front of the stove    |
| 7  | `put the cream cheese in the bowl`            | Put the cream cheese in the bowl            |
| 8  | `turn on the stove`                           | Turn on the stove                           |
| 9  | `put the bowl on the plate`                   | Place the bowl on the plate                 |
| 10 | `put the wine bottle on the rack`             | Place the wine bottle on the rack           |

***

## Core Code Analysis

### Agent Tool Definitions

The Agent interacts with the environment through two core tools:

```python
AGENT_TOOLS = [
    {
        # Runs one manipulation instruction on the robot arm
        "name": "execute_task",
        "description": "Execute a manipulation task using the robot arm.",
        "input_schema": {
            "type": "object",
            "properties": {
                "task": {
                    "type": "string",
                    "description": "A manipulation task instruction"
                }
            },
            "required": ["task"]
        }
    },
    {
        # Takes no arguments: captures an image and describes the scene
        "name": "get_scene_info",
        "description": "Capture camera image and use VLM to analyze the current scene.",
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": []
        }
    }
]
```

### MiniMax M2.1 Task Planning

MiniMax M2.1 is called through the Anthropic-compatible interface:

```python
import os

from anthropic import Anthropic

api_key = os.environ["ANTHROPIC_API_KEY"]  # MiniMax API key from Quick Start

# Point the Anthropic SDK at the MiniMax endpoint
client = Anthropic(
    base_url="https://api.minimax.io/anthropic",
    api_key=api_key,
    default_headers={"Authorization": f"Bearer {api_key}"}
)

# system_prompt and conversation_history are assumed to be defined
# by the surrounding agent code
response = client.messages.create(
    model="MiniMax-M2.1",
    max_tokens=4096,
    system=system_prompt,
    messages=conversation_history,
    tools=AGENT_TOOLS
)
```

### MCP Visual Understanding

MCP is used to invoke MiniMax visual understanding for task verification:

```python
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

api_key = os.environ["ANTHROPIC_API_KEY"]  # MiniMax API key from Quick Start

# Launch the MiniMax MCP server as a subprocess via uvx
server_params = StdioServerParameters(
    command="uvx",
    args=["minimax-coding-plan-mcp", "-y"],
    env={
        "MINIMAX_API_KEY": api_key,
        "MINIMAX_API_HOST": "https://api.minimax.io"
    }
)


# `async with` must run inside a coroutine; the wrapper name is illustrative
async def verify_task(image_path):
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Ask the visual understanding tool to check the captured scene image
            result = await session.call_tool("understand_image", {
                "image_source": image_path,
                "prompt": "Verify if the task was completed successfully."
            })
            return result
```

***

## Technical Details

### Pi05 Model Parameters

| Parameter         | Value                                                                    |
| ----------------- | ------------------------------------------------------------------------ |
| Base Model        | PaliGemma                                                                |
| Input             | 2 camera images + robot arm state + language instruction                 |
| Output            | 7-dim action (end-effector position delta + orientation delta + gripper) |
| Control Frequency | 10 Hz                                                                    |
| Max Steps         | 280 steps/task                                                           |

### Agent Workflow

1. **User Input**: Receive natural language instructions
2. **Task Planning**: MiniMax M2.1 understands the intent and maps it to supported tasks
3. **Action Execution**: Pi05 VLA generates robotic arm control sequences
4. **Result Verification**: MCP visual understanding analyzes the scene to confirm task completion
5. **Feedback Loop**: If verification fails, automatically retry the task

***

## FAQ

* **API authentication**: Check that `ANTHROPIC_API_KEY` is correctly set to your MiniMax API Key.
* **Pi05 model loading**:
  1. Check that `HF_TOKEN` is configured.
  2. Check that the `MODEL_PATH` path is correct.
* **MCP visual understanding**: Make sure you have installed `mcp` (`pip install mcp`) and that the `uvx` command is available.
* **Display**: Set `export DISPLAY=:2` (VNC) or ensure you have an X11 environment.
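The FAQ checks above can be run in one pass before launching the agent. The following is a minimal sketch and not part of the repository: the helper names `preflight_problems` and `_module_installed` and their parameters are hypothetical, and the default model path is the one given in Quick Start. It uses only the Python standard library.

```python
import importlib.util
import os
import shutil
from pathlib import Path


def _module_installed(name):
    """True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def preflight_problems(env, model_path="./models/pi05_libero_finetuned",
                       which=shutil.which, has_module=_module_installed):
    """Hypothetical helper: list every FAQ problem found; empty list means OK."""
    problems = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set (use your MiniMax API key)")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed to download the Pi05 model)")
    if not Path(model_path).is_dir():
        problems.append(f"model directory not found: {model_path}")
    if not has_module("mcp"):
        problems.append("Python package 'mcp' is missing (pip install mcp)")
    if which("uvx") is None:
        problems.append("'uvx' is not on PATH (it launches the MCP server)")
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set (e.g. export DISPLAY=:2 under VNC)")
    return problems


if __name__ == "__main__":
    # Print one warning per failed check; no output means all checks passed
    for problem in preflight_problems(os.environ):
        print("WARNING:", problem)
```

The `which` and `has_module` parameters exist only so the checks can be exercised without touching the real environment. In normal use, passing `os.environ` is enough. If you changed `MODEL_PATH` in `agent_mode.py`, pass the same value as `model_path`.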
***

## Application Extensions

Based on the current architecture, developers can explore the following directions:

* **Multi-task Chaining**: Implement automatic decomposition and sequential execution of complex tasks
* **Failure Recovery**: Enhance error detection and automatic recovery capabilities
* **Real-world Deployment**: Transfer simulation policies to physical robotic arms
* **Multi-modal Interaction**: Combine speech recognition to enable voice-controlled robots

***

## Summary

In this tutorial, we demonstrated how to build an intelligent robot using **MiniMax M2.1** and **MCP visual understanding**:

* **MiniMax M2.1** understands user intent and converts natural language instructions into specific manipulation tasks
* **MiniMax MCP** provides visual understanding capabilities to verify task execution results and enable closed-loop control
* **Pi05 VLA** serves as the underlying executor, generating precise robotic arm actions based on visual input
* **LIBERO/MuJoCo** provides a realistic physics simulation environment

This LLM + VLM + VLA collaborative architecture demonstrates the potential of large models in the field of robot control.

***

## Related Resources

* MiniMax M2.1 Integration Guide
* MCP Tool Configuration
* HuggingFace Robotics Library