API Overview

Get API Key

Pay-as-you-go：Visit API Keys > Create new secret key to get your API Key
Pay-as-you-go supports all modality models, including Text, Video, Speech, and Image.
Coding Plan：Visit API Keys > Create Coding Plan Key to get your API Key
Coding Plan only supports MiniMax text models. See Coding Plan Overview for details.

Text Generation

The text generation API uses MiniMax M2.5, MiniMax M2.5 highspeed, MiniMax M2.1, MiniMax M2.1 highspeed, MiniMax M2 to generate conversational content and trigger tool calls based on the provided context. It can be accessed via HTTP requests, the Anthropic SDK (Recommended), or the OpenAI SDK.

Supported Models

Model Name	Context Window	Description
MiniMax-M2.5	204,800	Peak Performance. Ultimate Value. Master the Complex (output speed approximately 60 tps)
MiniMax-M2.5-highspeed	204,800	M2.5 highspeed: Same performance, faster and more agile (output speed approximately 100 tps)
MiniMax-M2.1	204,800	Powerful Multi-Language Programming Capabilities with Comprehensively Enhanced Programming Experience (output speed approximately 60 tps)
MiniMax-M2.1-highspeed	204,800	Faster and More Agile (output speed approximately 100 tps)
MiniMax-M2	204,800	Agentic capabilities, Advanced reasoning

Please note: The maximum token count refers to the total number of input and output tokens.

Anthropic API Compatible (Recommended)

Use Anthropic SDK with MiniMax models

OpenAI API Compatible

Use OpenAI SDK with MiniMax models

Text to Speech (T2A)

This API provides synchronous text-to-speech (T2A) generation, supporting up to 10,000 characters per request. The interface is stateless: each call only processes the provided input without involving business logic, and the model does not store any user data. Key Features

Access to 300+ system voices and custom cloned voices.
Adjustable volume, pitch, speed, and output formats.
Support for proportional audio mixing.
Configurable fixed time intervals.
Multiple audio formats and specifications supported: mp3, pcm, flac, wav (wav is supported only in non-streaming mode).
Support for streaming output.

Typical Use Cases: short text generation, voice chat, online social interactions.

Supported Models

Model	Description
speech-2.8-hd	Latest HD model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.8-turbo	Latest Turbo model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.6-hd	HD model with outstanding prosody and excellent cloning similarity.
speech-2.6-turbo	Turbo model with support for 40 languages.
speech-02-hd	Superior rhythm and stability, with outstanding performance in replication similarity and sound quality.
speech-02-turbo	Superior rhythm and stability, with enhanced multilingual capabilities and excellent performance.

Available Interfaces

Synchronous speech synthesis provides two interfaces. Choose based on your needs:

HTTP T2A API
WebSocket T2A API

Supported Languages

MiniMax speech synthesis models offer robust multilingual capability, supporting 40 widely used languages worldwide.

Support Languages
1. Chinese	15. Turkish	28. Malay
2. Cantonese	16. Dutch	29. Persian
3. English	17. Ukrainian	30. Slovak
4. Spanish	18. Thai	31. Swedish
5. French	19. Polish	32. Croatian
6. Russian	20. Romanian	33. Filipino
7. German	21. Greek	34. Hungarian
8. Portuguese	22. Czech	35. Norwegian
9. Arabic	23. Finnish	36. Slovenian
10. Italian	24. Hindi	37. Catalan
11. Japanese	25. Bulgarian	38. Nynorsk
12. Korean	26. Danish	39. Tamil
13. Indonesian	27. Hebrew	40. Afrikaans
14. Vietnamese

HTTP T2A API

Synchronous speech synthesis via HTTP

WebSocket T2A API

Streaming speech synthesis via WebSocket

Asynchronous Long-Text Speech Generation (T2A Async)

This API supports asynchronous text-to-speech generation. Each request can handle up to 1 million characters, and the resulting audio can be retrieved asynchronously. Features supported:

Choose from 100+ system voices and cloned voices.
Customize pitch, speed, volume, bitrate, sample rate, and output format.
Retrieve audio metadata, such as duration and file size.
Retrieve precise sentence-level timestamps (subtitles).
Input text directly as a string or via file_id after uploading a text file.
Detect illegal characters:
- If illegal characters are ≤10%, audio is generated normally, with the ratio returned.
- If illegal characters are >10%, no audio will be generated (an error code will be returned).

Note: The returned audio URL is valid for 9 hours (32,400 seconds) from the time it is issued. After expiration, the URL becomes invalid and the generated data will be lost. Use Case: Converting entire books or other long texts into audio.

Supported Models

Model	Description
speech-2.8-hd	Latest HD model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.8-turbo	Latest Turbo model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.6-hd	HD model with outstanding prosody and excellent cloning similarity.
speech-2.6-turbo	Turbo model with support for 40 languages.
speech-02-hd	Superior rhythm and stability, with outstanding performance in replication similarity and sound quality.
speech-02-turbo	Superior rhythm and stability, with enhanced multilingual capabilities and excellent performance.

This feature includes two APIs:

Create a speech generation task (returns task_id).
Query the speech generation task status using task_id.
If the task succeeds, use the returned file_id with the File API to view and download the result.

Create Async Task

Create a long-text speech generation task

Query Task Status

Query speech generation task status

Voice Cloning

This API supports cloning voices from user-uploaded audio files along with optional sample audio to enhance cloning quality. Use cases: fast replication of a target timbre (IP voice recreation, voice cloning) where you need to quickly clone a specific voice. The API supports cloning from mono or stereo audio and can rapidly reproduce speech that matches the timbre of a provided reference file.

Supported Models

Model	Description
speech-2.8-hd	Latest HD model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.8-turbo	Latest Turbo model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.6-hd	HD model with real-time response, intelligent parsing, fluent LoRA voice
speech-2.6-turbo	Turbo model. Ultimate Value, 40 Languages
speech-02-hd	Superior rhythm and stability, with outstanding performance in replication similarity and sound quality.
speech-02-turbo	Superior rhythm and stability, with enhanced multilingual capabilities and excellent performance.

Notes

Using this API to clone a voice does not immediately incur a cloning fee. The fee is charged the first time you synthesize speech with the cloned voice in a T2A synthesis API.
Voices produced via this rapid cloning API are temporary. To keep a cloned voice permanently, call any T2A speech synthesis API with that voice within 168 hours (7 days).

Upload Clone Audio

Upload audio file to clone

Clone Voice

Execute voice cloning

Voice Design

This API supports generating personalized custom voices based on user-provided voice description prompts. The generated voices (voice_id) can then be used in the T2A API and the T2A Async API for speech generation.

Supported Models

It is recommended to use speech-02-hd for the best results.

Model	Description
speech-2.8-hd	Latest HD model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.8-turbo	Latest Turbo model. Perfecting Tonal Nuances. Maximizing Timbre Similarity.
speech-2.6-hd	HD model with real-time response, intelligent parsing, fluent LoRA voice
speech-2.6-turbo	Turbo model. Ultimate Value, 40 Languages
speech-02-hd	Superior rhythm and stability, with outstanding performance in replication similarity and sound quality.
speech-02-turbo	Superior rhythm and stability, with enhanced multilingual capabilities and excellent performance.

Notes

Using this API to generate a voice does not immediately incur a fee. The generation fee will be charged upon the first use of the generated voice in speech synthesis.

Voices generated through this API are temporary. If you wish to keep a voice permanently, you must use it in any speech synthesis API within 168 hours (7 days).

Voice Design API

Generate personalized voices from descriptions

Video Generation

This API supports generating videos based on user-provided text, images (including first frame, last frame, or reference images).

Supported Models

Model	Description
MiniMax-Hailuo-2.3	New video generation model, breakthroughs in body movement, facial expressions, physical realism, and prompt adherence.
MiniMax-Hailuo-2.3-Fast	New Image-to-video model, for value and efficiency.
MiniMax-Hailuo-02	Video generation model supporting higher resolution (1080P), longer duration (10s), and stronger adherence to prompts.

API Usage Guide

Video generation is asynchronous and consists of three APIs: Create Video Generation Task, Query Video Generation Task Status, and File Management. Steps are as follows:

Use the Create Video Generation Task API to start a task. On success, it will return a task_id.
Use the Query Video Generation Task Status API with the task_id to check progress. When the status is success, a file ID (file_id) will be returned.
Use the Download the Video File API with the file_id to view and download the generated video.

Text to Video

Generate video from text description

Image to Video

Generate video from image

Video Generation Agent

This API supports video generation tasks based on user-selected video agent templates and inputs.

Overview

The Video Agent API works asynchronously and includes two endpoints: Create Video Agent Task and Query Video Agent Task Status. Usage steps:

Use the Create Video Agent Task API to create a task and obtain a task_id.
Use the Query Video Agent Task Status API with the task_id to check the task status. Once the status is Success, you can retrieve the corresponding file download URL.

Template List

For details and examples, refer to the Video Agent Template List.

Template ID	Template Name	Description	media_inputs	text_inputs
392747428568649728	Diving	Upload a picture to generate a video of the subject in the picture completing a perfect dive	Required	/
393769180141805569	Run for Life	Upload a photo of your pet and enter a type of wild beast to generate a survival video of your pet in the wilderness.	Required	Required
397087679467597833	Transformers	Upload a photo of a car to generate a transforming car mecha video.	Required	/
393881433990066176	Still rings routine	Upload your photo to generate a video of the subject performing a perfect still rings routine.	Required	/
393498001241890824	Weightlifting	Upload a photo of your pet to generate a video where the subject performs a perfect weightlifting move.	Required	/
393488336655310850	Climbing	Upload a picture to generate a video of the subject in the picture completing a perfect sport climbing	Required	/

Create Video Agent Task

Create a video agent task

Query Task Status

Query video agent task status

Image Generation

This API supports images generations from text or references, allowing custom aspect ratios and resolutions for diverse needs.

API Description

You can generate images by creating an image generation task using text prompts and/or reference images.

Model List

Model	Description
image-01	A high-quality image generation model that produces fine-grained details. Supports both text-to-image and image-to-image generation (with subject reference for people).

Text to Image

Generate image from text description

Image to Image

Generate image from reference image

Music Generation

This API generates a vocal song based on a music description (prompt) and lyrics.

Models

Model	Usage
music-2.0	The latest music generation model. Supports user-provided musical inspiration and lyrics to create AI-generated music.

Music Generation API

Generate music from description and lyrics

File Management

This API is for file management and is used with other MiniMax APIs.

API Description

This API includes 5 endpoints: Upload, List, Retrieve, Retrieve Content, Delete.

Supported File Formats

Type	Format
Document	`pdf`, `docx`, `txt`, `jsonl`
Audio	`mp3`, `m4a`, `wav`

Capacity and Limits

Item	Limit
Total Capacity	100GB
Single Document Size	512MB

Upload File

Upload files to the platform

List Files

Get list of uploaded files

Official MCP

MiniMax provides official Model Context Protocol (MCP) server implementations:

Both support speech synthesis, voice cloning, video generation, and music generation. For details, refer to the MiniMax MCP User Guide.

Using the API

Text

Speech

Video

Image

Music

File

​Get API Key

​Text Generation

​Supported Models

Anthropic API Compatible (Recommended)

OpenAI API Compatible

​Text to Speech (T2A)

​Supported Models

​Available Interfaces

​Supported Languages

HTTP T2A API

WebSocket T2A API

​Asynchronous Long-Text Speech Generation (T2A Async)

​Supported Models

​API Overview

Create Async Task

Query Task Status

​Voice Cloning

​Supported Models

​Notes

Upload Clone Audio

Clone Voice

​Voice Design

​Supported Models

​Notes

Voice Design API

​Video Generation

​Supported Models

​API Usage Guide

Text to Video

Image to Video

​Video Generation Agent

​Overview

​Template List

Create Video Agent Task

Query Task Status

​Image Generation

​API Description

​Model List

Text to Image

Image to Image

​Music Generation

​Models

Music Generation API

​File Management

​API Description

​Supported File Formats

​Capacity and Limits

Upload File

List Files

​Official MCP

Get API Key

Text Generation

Supported Models

Text to Speech (T2A)

Supported Models

Available Interfaces

Supported Languages

Asynchronous Long-Text Speech Generation (T2A Async)

Supported Models

API Overview

Voice Cloning

Supported Models

Notes

Voice Design

Supported Models

Notes

Video Generation

Supported Models

API Usage Guide

Video Generation Agent

Overview

Template List

Image Generation

API Description

Model List

Music Generation

Models

File Management

API Description

Supported File Formats

Capacity and Limits

Official MCP