Voice Clone
Use this API for rapid voice cloning. If a cloned voice is not used within 7 days, the system will delete it.
curl --request POST \
  --url https://api.minimax.io/v1/voice_clone \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "file_id": 123456789,
  "voice_id": "<voice_id>",
  "clone_prompt": {
    "prompt_audio": 987654321,
    "prompt_text": "This voice sounds natural and pleasant."
  },
  "text": "A gentle breeze sweeps across the soft grass(breath), carrying the fresh scent along with the songs of birds.",
  "model": "speech-2.8-hd",
  "text_validation": "This voice sounds natural and pleasant.",
  "accuracy": 0.7,
  "need_noise_reduction": false,
  "need_volume_normalization": false,
  "aigc_watermark": false
}
'

Example response:
{
  "input_sensitive": false,
  "input_sensitive_type": 0,
  "demo_audio": "",
  "extra_info": {
    "audio_length": 11124,
    "audio_sample_rate": 32000,
    "audio_size": 179926,
    "bitrate": 128000,
    "word_count": 18,
    "usage_characters": 18
  },
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}

Authorizations
HTTP: Bearer Auth
- Security Scheme Type: http
- HTTP Authorization Scheme: Bearer
The API_key can be found in Account Management > API Keys.
Headers
Content-Type
The media type of the request body. Must be set to application/json so that the data is sent in JSON format.
Available options: application/json

Body
Voice clone request parameters
file_id
The file_id of the audio to be cloned, obtained through the File Upload API.
Uploaded files must comply with the following rules:
- Accepted audio formats: mp3, m4a, wav
- Audio duration: at least 10 seconds, no longer than 5 minutes
- File size: no larger than 20 MB
- If clone_prompt is used, both of its child attributes (prompt_audio, prompt_text) are required
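The upload rules above can be checked locally before calling the File Upload API. The following is a minimal sketch, not part of the API: the helper name check_clone_source is hypothetical, the duration is assumed to be measured by the caller, and "20 MB" is assumed to mean 20 * 1024 * 1024 bytes, which the documentation does not specify.

```python
from pathlib import Path

ACCEPTED_FORMATS = {".mp3", ".m4a", ".wav"}
MIN_DURATION_S = 10
MAX_DURATION_S = 5 * 60
# Assumption: "20 MB" is read as 20 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 20 * 1024 * 1024


def check_clone_source(filename: str, size_bytes: int, duration_s: float) -> list:
    """Return a list of rule violations (empty if the file looks acceptable)."""
    problems = []
    if Path(filename).suffix.lower() not in ACCEPTED_FORMATS:
        problems.append("format must be mp3, m4a, or wav")
    if not MIN_DURATION_S <= duration_s <= MAX_DURATION_S:
        problems.append("duration must be between 10 seconds and 5 minutes")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file must be no larger than 20 MB")
    return problems
```

The check relies on the file extension only; it does not inspect the actual audio encoding.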
voice_id
The voice_id of the cloned voice. Example: "MiniMax001". When defining a custom voice_id, note the following rules:
- Length range: [8, 256]
- Must start with an English letter
- Can contain letters, digits, -, and _
- Cannot end with - or _
- Must not duplicate an existing voice_id, otherwise an error will occur
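The format rules for a custom voice_id can be expressed as a single pattern plus a length check. The sketch below assumes that "letters" means ASCII letters; the function name is_valid_voice_id is hypothetical, and uniqueness can only be verified by the service itself.

```python
import re

# Starts with an English letter, continues with letters, digits, "-" or "_",
# and ends with a letter or digit (so never with "-" or "_").
_VOICE_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*[A-Za-z0-9]")


def is_valid_voice_id(voice_id: str) -> bool:
    """Check the documented format rules for a custom voice_id (not uniqueness)."""
    return 8 <= len(voice_id) <= 256 and _VOICE_ID_RE.fullmatch(voice_id) is not None
```

For example, is_valid_voice_id("MiniMax001") passes, while "MiniMax_" fails because it ends with an underscore.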
clone_prompt
Voice cloning parameters. Providing this field helps improve the similarity and stability of the synthesized voice. If used, you must also upload a short sample audio clip (less than 8 seconds; supported formats: mp3, m4a, wav) along with its corresponding transcript, passed as the child attributes prompt_audio and prompt_text.
text
Optional preview text, up to 1000 characters. The cloned voice is used to read the text, and an audio preview link is returned. Note: Preview requests are charged based on character count, consistent with T2A pricing.
- Interjection tags: Only supported when using the speech-2.8-hd or speech-2.8-turbo models. Supported interjections: (laughs), (chuckle), (coughs), (clear-throat), (groans), (breath), (pant), (inhale), (exhale), (gasps), (sniffs), (sighs), (snorts), (burps), (lip-smacking), (humming), (hissing), (emm), (whistles), (sneezes), (crying), (applause).
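The constraints on the preview text can be pre-checked on the client. This is a rough sketch under stated assumptions: the helper name check_preview_text is hypothetical, the character limit is approximated with Python's len (the service may count characters differently), and any lowercase "(word)" group that matches the list above is treated as an interjection tag.

```python
import re

SUPPORTED_INTERJECTIONS = {
    "laughs", "chuckle", "coughs", "clear-throat", "groans", "breath",
    "pant", "inhale", "exhale", "gasps", "sniffs", "sighs", "snorts",
    "burps", "lip-smacking", "humming", "hissing", "emm", "whistles",
    "sneezes", "crying", "applause",
}
INTERJECTION_MODELS = {"speech-2.8-hd", "speech-2.8-turbo"}
MAX_PREVIEW_CHARS = 1000


def check_preview_text(text: str, model: str) -> list:
    """Return a list of problems found in the preview text (empty if none)."""
    problems = []
    if len(text) > MAX_PREVIEW_CHARS:
        problems.append("text exceeds 1000 characters")
    # Heuristic: collect "(tag)" groups and keep those that are known interjections.
    candidates = re.findall(r"\(([a-z-]+)\)", text)
    used = [tag for tag in candidates if tag in SUPPORTED_INTERJECTIONS]
    if used and model not in INTERJECTION_MODELS:
        problems.append(
            f"interjection tags require speech-2.8-hd or speech-2.8-turbo: {used}"
        )
    return problems
```

Parenthesized words that are not in the supported list are ignored here, since they may be ordinary text rather than mistyped tags.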
model
Specifies which voice synthesis model to use for generating the preview audio. Required when the text field is provided.
Available options: speech-2.8-hd, speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo
language_boost
Controls whether recognition for specific minority languages and dialects is enhanced. Default is null. If the language type is unknown, set to "auto" and the model will automatically detect it.
Available options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto
text_validation
Optional. The expected transcript of the cloning sample audio (matching the content of file_id or clone_prompt.prompt_audio). When provided, the audio is sent to ASR and the recognized text is compared against text_validation. If the similarity is below accuracy, the request is rejected with status code 1043 (The asr similarity check failed). Maximum length: 200 characters.
accuracy
Optional. Similarity threshold used by the ASR validation triggered by text_validation. Valid range: 0 <= x <= 1. When omitted or set to 0, defaults to 0.7.
need_noise_reduction
Indicates whether to enable noise reduction.
need_volume_normalization
Indicates whether to enable volume normalization.
aigc_watermark
Indicates whether to append an AIGC watermark tone to the end of the synthesized preview audio.
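Several of the body parameters above depend on each other: model is required when text is provided, text_validation is limited to 200 characters, and accuracy must lie in [0, 1]. The sketch below builds, but does not send, a request that enforces these rules using only the Python standard library. The function name build_voice_clone_request is hypothetical, and only a subset of the optional fields is covered.

```python
import json
import urllib.request

API_URL = "https://api.minimax.io/v1/voice_clone"


def build_voice_clone_request(api_key, file_id, voice_id, *, text=None,
                              model=None, text_validation=None, accuracy=None):
    """Build (but do not send) a POST request for the voice clone endpoint."""
    if text is not None and model is None:
        raise ValueError("model is required when text is provided")
    if text_validation is not None and len(text_validation) > 200:
        raise ValueError("text_validation is limited to 200 characters")
    if accuracy is not None and not 0 <= accuracy <= 1:
        raise ValueError("accuracy must be within [0, 1]")

    payload = {"file_id": file_id, "voice_id": voice_id}
    optional = {"text": text, "model": model,
                "text_validation": text_validation, "accuracy": accuracy}
    # Only include optional fields that the caller actually set.
    payload.update({k: v for k, v in optional.items() if v is not None})

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen to perform the call; keeping construction separate from sending makes the validation easy to test without network access.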
Response
Successful response
input_sensitive
Content safety check result.
demo_audio
If both text and model are provided, this field returns a URL to the preview audio. Otherwise, it is empty.
extra_info
Preview audio metadata and billing information. Returned only when text is provided (that is, when preview synthesis happened and was billed). The field shape matches that of /v1/t2a_v2.
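A client typically needs to check the outcome and pick up the preview URL from the response. The following sketch works on an already parsed JSON response. It assumes that a status_code of 0 in base_resp indicates success, as in the example response, and that error codes such as 1043 are reported in the same field; the function name extract_preview_url is hypothetical.

```python
ASR_CHECK_FAILED = 1043  # documented code for a failed ASR similarity check


def extract_preview_url(response: dict):
    """Return the preview audio URL, or None when no preview was generated.

    Raises RuntimeError when base_resp reports a non-zero status_code.
    """
    base = response.get("base_resp", {})
    code = base.get("status_code")
    if code != 0:
        hint = " (ASR similarity check failed)" if code == ASR_CHECK_FAILED else ""
        raise RuntimeError(f"voice clone failed: {code} {base.get('status_msg')}{hint}")
    # demo_audio is an empty string when text and model were not both provided.
    return response.get("demo_audio") or None
```

Returning None for an empty demo_audio lets callers distinguish "clone succeeded without a preview" from an actual URL.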