Build Next-Generation Customer Engagement with MiniMax Text & Speech API

This case study shows how leading enterprises use the MiniMax Text-01 and Speech-02 APIs to automate outbound calling: intent-based scripts are generated as text and then rendered as realistic, emotionally expressive voice conversations. By combining chatbot text generation with human-like speech synthesis, sales and customer service teams can boost outreach efficiency, standardize brand messaging, and communicate with customers at scale without sacrificing quality.

Pain Points

High Labor Costs

Manual outbound calls require large teams, leading to high recruitment and training expenses.

Inconsistent Service Quality

Differences in tone, vocabulary, and customer handling among agents lead to inconsistent brand experiences.

Poor Scalability

Human agents struggle to maintain stable performance during peak hours, and teams cannot be scaled up quickly to absorb spikes in call volume.

Insufficient Personalization

Fixed scripts cannot flexibly adapt to real-time customer responses.

Core Objectives

Ensure Script Consistency

Maintain a unified tone and brand image across all outbound calls.

Achieve Dynamic Script Generation

Automatically generate natural, contextually appropriate dialogue based on customer segments and campaign goals.

Enhance Operational Efficiency

Replace repetitive manual work with automated, high-quality AI outbound calls, supporting millions of calls.

Solution

Text-01 Intelligent Script Generation

  1. Generate customized outbound call scripts for different industries and customer intents.
  2. Include personalized greetings, needs assessment, objection handling, and closing remarks in each script.
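
As a rough illustration of how the requirements above might be turned into a Text-01 request, the sketch below builds a chat payload for one industry and customer intent. The helper name build_script_request, the SCRIPT_SECTIONS constant, and the prompt wording are assumptions made for this example; only the model name and the messages layout follow the usage example on this page.

```python
import json

# Hypothetical constant: the script sections named in the list above.
SCRIPT_SECTIONS = [
    "personalized greeting",
    "needs assessment",
    "objection handling",
    "closing remarks",
]


def build_script_request(industry: str, intent: str, customer_name: str) -> dict:
    """Build a chat payload that asks Text-01 for an outbound call script.

    The prompt wording is illustrative, not an official template.
    """
    system_prompt = (
        "You write outbound call scripts. Structure every script into these "
        "sections: " + ", ".join(SCRIPT_SECTIONS) + "."
    )
    user_prompt = (
        f"Industry: {industry}\n"
        f"Customer intent: {intent}\n"
        f"Customer name: {customer_name}\n"
        "Write the script in a professional, friendly tone."
    )
    return {
        "model": "MiniMax-Text-01",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": 1024,
    }


if __name__ == "__main__":
    payload = build_script_request("insurance", "policy renewal", "Ms. Chen")
    print(json.dumps(payload, indent=2))
```

Keeping the section list in one place means every generated script is asked for the same structure, which is one simple way to support the script-consistency objective.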

Speech-02 Voice Agent Creation

  1. Upload target voice samples to clone a brand-exclusive agent voice.
  2. Achieve ultra-human-like TTS synthesis with natural intonation, emotional expression, and rhythm control.

Real-time Linkage and Playback

  1. Directly connect Text-01 output to Speech-02 for real-time dialogue playback.
  2. Dynamically adjust speech rate, tone, and emphasis based on call content.
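
If text generation and speech synthesis are driven as two separate calls (an assumption; the combined interface shown later on this page handles this server-side), a common approach is to buffer streamed text until a sentence is complete and hand each sentence to the speech step. The sketch below shows only that buffering logic. The function name sentences_from_deltas is hypothetical, and the punctuation rule is deliberately simple: it will also split on abbreviations such as "Ms." and on decimal points.

```python
import re
from typing import Iterable, Iterator

# One or more sentence-ending marks (ASCII or CJK), plus trailing whitespace.
_SENTENCE_END = re.compile(r"[.!?。！？]+\s*")


def sentences_from_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        # Emit every complete sentence currently held in the buffer.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hi there", "! Is now a good", " time? I am calling about", " your renewal"]
    for sentence in sentences_from_deltas(stream):
        print(sentence)
```

Sentence-level chunks tend to give the speech step enough context for natural intonation while keeping the delay before the first audio short.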

Business Value

Significantly Reduced Costs

Outbound call center labor costs reduced by up to 80%.

Unified Brand Voice

Every outbound call conveys the same professional, friendly image.

Large-scale Personalization

Provide customized communication experiences for millions of potential customers.

Rapid Campaign Launch

From planning to outbound call launch in hours, not weeks.

Core API Capabilities

  1. MiniMax-Text-01 Intelligent Script Generation:
    • Functionality: Automatically generates complete scripts for different industries (e.g., insurance renewals, recruitment invitations) and customer intents, including personalized greetings, needs assessment, objection handling, and closing remarks.
  2. Speech-02 Brand-Exclusive Voice Agent Creation:
    • Functionality: Supports uploading specific voice samples (e.g., a top-performing employee as a “voice model”) to clone a unique, brand-exclusive AI agent voice. Human-like TTS synthesis provides natural intonation, emotional expression, and rhythm control, eliminating rigid “robot voices.”
  3. Real-time Linkage and Dynamic Response:
    • Functionality: Streams Text-01 generated text directly to Speech-02, enabling a low-latency “generate-as-you-play” dialogue experience. The system can dynamically adjust speech rate, tone, and emphasis based on real-time call content, achieving highly human-like interaction.
    • API Integration Example (Text-to-Speech Streaming Interface): The usage example calls the chatcompletion_v2 interface with streaming enabled, sets speech_output to true under stream_options, and selects a voice through voice_id under voice_setting. The API returns both text (content) and hex-encoded audio data (audio_content) as a server-sent events (SSE) stream, so the frontend can receive and play the audio in real time for seamless dialogue.
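
To make the stream format concrete, here is a minimal sketch of parsing such SSE lines offline, without any network call. It assumes the event shape used by the usage example on this page (choices[0].delta with content and hex-encoded audio_content). The function name parse_sse_lines, the sample events, and the "[DONE]" end marker are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse_lines(lines: Iterable[str]) -> Iterator[Tuple[str, bytes]]:
    """Yield (text, audio_bytes) pairs from SSE "data:" lines."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and other SSE fields.
        body = line[len("data:"):].strip()
        if not body or body == "[DONE]":  # Assumed end-of-stream marker.
            continue
        event = json.loads(body)
        choices = event.get("choices") or [{}]
        delta = choices[0].get("delta") or {}
        text = delta.get("content") or ""
        audio_hex = delta.get("audio_content") or ""
        if text or audio_hex:
            # Audio arrives as a hex string; decode it to raw bytes.
            yield text, bytes.fromhex(audio_hex)


# Hand-written sample events, not captured API output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"role": "assistant", "audio_content": "494433"}}]}',
    "data: [DONE]",
]

if __name__ == "__main__":
    for text, audio in parse_sse_lines(sample):
        print(repr(text), audio)
```

Separating parsing from playback like this makes the stream handling easy to test before connecting it to a real audio player.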

Usage Example

import requests
import json
import subprocess
from typing import Iterator
from datetime import datetime
import logging

# Log configuration
logging.basicConfig(level=logging.INFO, format='[%(asctime)s] - [%(levelname)s]  %(message)s')

# Streaming audio playback requires the mpv player to be installed (Linux/macOS)
mpv_command = ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"]
mpv_process = subprocess.Popen(
    mpv_command,
    stdin=subprocess.PIPE,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

# Get your API key from the MiniMax developer platform:
# https://platform.minimax.io/user-center/basic-information/interface-key
token = 'YOUR_MINIMAX_API_KEY'

# Send the text to the chat interface and yield hex-encoded audio chunks as they arrive
def ccv2_audio_stream(text: str) -> Iterator[str]:
    payload = {
        "model": "MiniMax-Text-01",
        "messages": [
            {
                "role": "system",
                "name": "MM Smart Assistant",
                "content": "MM Smart Assistant is an intelligent helper",
            },
            {
                "role": "user",
                "name": "User",
                "content": text,
            },
        ],
        "stream": True,
        "tools": [{"type": "web_search"}],
        "tool_choice": "auto",
        "max_tokens": 1024,
        "stream_options": {"speech_output": True},  # Enable voice output
        "voice_setting": {
            "model": "speech-01-turbo-240228",
            "voice_id": "female-tianmei",
        },
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }

    logging.info(f"【Text Input】{text}")

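    # Use HTTPS so the bearer token is not sent in clear text
    response = requests.post("https://api.minimax.io/v1/chat/completions", headers=headers, json=payload, stream=True)
    response.raise_for_status()  # Fail early instead of parsing an error body as SSE
    response.encoding = "utf-8"  # SSE is UTF-8; set it explicitly so iter_lines decodes non-ASCII text correctly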

    logging.info(f"Get response, trace-id: {response.headers.get('Trace-Id')}")
    i = 0
    for line in response.iter_lines(decode_unicode=True):
        if not line.startswith("data:"):
            continue
        i += 1
        logging.debug(f"[sse] data chunk-{i}")
        # Slice off the "data:" prefix; str.strip("data:") removes a set of characters, not a prefix
        data = line[len("data:"):].strip()
        if not data or data == "[DONE]":  # Skip empty events and an end-of-stream marker, if one is sent
            continue
        resp = json.loads(data)
        if resp.get("choices") and resp["choices"][0].get("delta"):
            delta = resp["choices"][0]["delta"]
            if delta.get("role") == "assistant":  # AI assistant reply
                if delta.get("content"):
                    logging.info(f"【Text Output】 {delta['content']}")
                if delta.get("audio_content"):
                    yield delta["audio_content"]  # Hex-encoded audio chunk
                if delta.get("tool_calls"):
                    logging.info("【Searching】...")


# Play the audio stream through mpv and save a copy to a local file
def audio_play(audio_stream: Iterator[str]):
    audio = b""
    for chunk in audio_stream:
        if chunk is not None and chunk != '\n':
            decoded_hex = bytes.fromhex(chunk)
            mpv_process.stdin.write(decoded_hex)  # type: ignore
            mpv_process.stdin.flush()
            audio += decoded_hex

    # Close stdin so mpv sees end-of-stream, then wait for playback to finish
    mpv_process.stdin.close()  # type: ignore
    mpv_process.wait()

    if not audio:
        return

    now = datetime.now().strftime('%Y%m%d-%H%M%S')
    file_name = f'ccv2_audio_{now}.mp3'
    with open(file_name, 'wb') as file:
        file.write(audio)
    logging.info(f"Audio file saved successfully: {file_name}")

if __name__ == '__main__':
    audio_play(ccv2_audio_stream("Please introduce yourself"))