Testing and Evaluation Standards - Speech Models

Reference Success Criteria

By defining clear, feasible objectives and combining objective data with user experience assessment, evaluation can be made more rigorous and less biased. The following criteria are reference standards for speech models and may be adapted to specific use cases.

Task Completion: Assesses the model’s ability to accurately convert input text into speech. Word Error Rate (WER) is a key metric: the synthesized speech is transcribed with Automatic Speech Recognition (ASR), and the transcript is compared against the reference text to count substitution, insertion, and deletion errors.
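The WER comparison described above can be sketched as a word-level edit distance. This is a minimal illustration, not the evaluation code behind any reported results: the function name `word_error_rate` is hypothetical, the ASR transcript is assumed to be available already as a string, and whitespace tokenization with lowercasing is an assumption (languages without word delimiters, such as Chinese, are usually scored at the character level instead).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count.

    `hypothesis` is assumed to be the ASR transcript of the synthesized audio.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the transcript contains many inserted words.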
Voice Similarity: Assesses how closely the synthesized audio matches a reference recording in speaker characteristics. Similarity (SIM) is calculated by extracting speaker embeddings from both the synthesized and reference audio, then computing the cosine similarity between the two embeddings.
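The cosine-similarity step of SIM can be illustrated as follows. This sketch assumes the embeddings have already been extracted by some speaker-verification model, which is not specified here; the function name `cosine_similarity` and the example vectors are placeholders for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder vectors standing in for real speaker embeddings.
    reference_embedding = [0.12, -0.40, 0.88, 0.05]
    synthesized_embedding = [0.10, -0.35, 0.90, 0.00]
    print(cosine_similarity(reference_embedding, synthesized_embedding))
```

Values closer to 1 indicate that the two recordings are closer in the embedding space of the chosen speaker model, so scores are only comparable when the same embedding model is used throughout.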
Perceptual Quality: Measures the perceived quality of synthesized audio. PESQ (Perceptual Evaluation of Speech Quality) is a standard objective metric that compares the synthesized audio against a high-quality reference to approximate human auditory perception.
Intelligibility: Evaluates the degree to which synthesized speech can be understood. STOI (Short-Time Objective Intelligibility) is an established objective metric that estimates how well listeners can comprehend sentence-level content.
Subjective Preference: Captures user perception of synthesized audio. Common approaches include Elo rating and Comparative Mean Opinion Score (CMOS):
- Elo Rating: Listeners choose the preferred sample in pairwise A/B tests. Scores are updated using the Elo formula to reflect relative preference across models.
- CMOS: Listeners score the quality difference between two samples in A/B tests. The average score across participants indicates relative performance.
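Both subjective measures reduce to simple arithmetic once listener judgments are collected. The sketch below uses the standard Elo update with a logistic expected score; the K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration, as is the CMOS score scale, which is left to the test design.

```python
from statistics import mean
from typing import Sequence, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one pairwise A/B comparison.

    k is the update step size (assumed value; tune for the number of votes).
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b


def cmos(scores: Sequence[float]) -> float:
    """Average of per-listener difference scores (positive favors sample A)."""
    return mean(scores)


if __name__ == "__main__":
    print(update_elo(1500.0, 1500.0, a_wins=True))
    print(cmos([1, 0, -1, 2]))
```

Each Elo update moves the same number of points from the loser to the winner, so the combined rating of a pair is preserved; a win over an equally rated model therefore shifts exactly k/2 points.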
Instruction Compliance: Assesses whether the model follows input constraints when generating speech, including emotion control and timbre specification. Evaluation can involve comparing features of the output with the target instructions or subjective A/B testing.
Cost: Evaluates the economic feasibility of using the model, considering per-call costs and expected usage frequency.
Latency: Measures the time from input to audio output. For streaming synthesis, first-packet latency is a key metric, defined as the interval from receiving the full input to generating the first playable audio frame.
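First-packet latency as defined above can be measured by timing how long a streaming call takes to yield its first audio chunk. The sketch below is an assumption-laden illustration: `first_packet_latency` is a hypothetical helper, `fake_stream` is a stand-in generator rather than a real synthesis API, and "first non-empty chunk" is used as a proxy for "first playable audio frame".

```python
import time
from typing import Callable, Iterable, Iterator


def first_packet_latency(synthesize: Callable[[str], Iterable[bytes]],
                         text: str) -> float:
    """Seconds from submitting the full input to receiving the first audio chunk."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:  # skip empty keep-alive chunks, if any
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def fake_stream(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesis client."""
    time.sleep(0.05)       # simulated time to first packet
    yield b"\x00" * 320
    time.sleep(0.05)       # simulated gap before the next packet
    yield b"\x00" * 320


if __name__ == "__main__":
    latency = first_packet_latency(fake_stream, "Hello, world.")
    print(f"first-packet latency: {latency * 1000:.1f} ms")
```

Measured from the client side, this figure also includes network round-trip time, so repeated runs and a percentile summary are usually more informative than a single measurement.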

Evaluation Scenarios

Speech models can be applied across various scenarios. To comprehensively evaluate performance, testing should consider behavior across different use cases.

Voice Cloning

Voice cloning is the core of personalized speech generation. This scenario tests how accurately and consistently the model replicates a target speaker’s timbre, intonation, and speaking style under zero-shot or few-shot conditions.

Multilingual Generation

This scenario evaluates whether the model performs consistently across different languages, capturing and synthesizing the pronunciation and prosodic features specific to each language.

Cross-Lingual Synthesis

This scenario tests the model’s ability to transfer a speaker’s voice to other languages, preserving timbre while generating fluent, natural, and accent-appropriate audio in the target language.

Emotion Control

Emotional expression significantly affects speech engagement and user experience. This scenario evaluates whether the model can synthesize speech in specified emotional tones (e.g., anger, happiness, sadness) while maintaining naturalness and intelligibility.

Text-Driven Voice Creation

Beyond cloning existing voices, models can also generate entirely new timbres based on natural language descriptions. This scenario evaluates whether the model can create personalized voices aligned with textual prompts.

Results and Example Test Cases

MiniMax Speech-02 Objective Results

Voice Cloning
The MiniMax Speech-02 model achieves high-quality voice cloning with short reference samples and transcripts. Results in both Chinese and English show low WER and high SIM, indicating strong cloning fidelity.

Multilingual Synthesis
The MiniMax Speech-02 model supports 32 languages with high accuracy and strong similarity preservation.

Cross-Lingual Synthesis
The MiniMax Speech-02 model demonstrates strong cross-lingual ability, generating speech in other languages from short audio clips. Experiments with Chinese as the source language show that zero-shot cloning achieves higher accuracy, while one-shot cloning yields better timbre similarity.

MiniMax Speech-02 Test Cases

Voice Cloning
- Description: A Compelling and Persuasive Speaker Voice
  - Source Audio
  - Generated Audio
- Description: A Robotic Voice with Rich Bass Resonance and Spatial Presence
  - Source Audio
  - Generated Audio

Multilingual Capabilities
- Thai
  - Source Audio
  - Generated Audio
- Vietnamese
  - Source Audio
  - Generated Audio

Cross-Lingual Capabilities
- English + Spanish
  - Source Audio (English)
  - Generated Audio
- Japanese + Korean
  - Source Audio (Japanese)
  - Generated Audio

Emotion Control
- Surprised
  - Source Audio
  - Generated Audio
- Sad
  - Source Audio
  - Generated Audio

Voice Design
- Example 1
  - Prompt: A stereotypical, larger-than-life gruff pirate captain, characterized by a deep, extremely gravelly, and raspy timbre with rough, stylized articulation that includes piratical clichés and exaggerated ‘R’ sounds. His speech is loud, boisterous, and declamatory, delivered with a swaggering, rolling cadence, a low, rumbling pitch, and punctuated by hearty exclamations or growls, conveying an intimidating, adventurous, and fiercely independent persona ideal for boasting, issuing gruff commands, or demanding treasure.
  - Generated Audio
- Example 2
  - Prompt: Whispering sultry adult female, reminiscent of a femme fatale ASMR artist, characterized by soft, slightly breathy articulation and a slow, deliberate pace. Her low to mid-range pitch features a seductive, meandering intonation with downward glides and lingering vowels, while her warm, breathy, smooth, and husky timbre, often close-miked, creates an alluring, intimate, and mysterious atmosphere designed to entice and charm the listener.
  - Generated Audio
- Example 3
  - Prompt: A classic, high-energy male announcer voice. The pace is rapid and almost breathless, with a loud, projecting delivery designed to grab attention. The pitch is dynamic and generally in a higher range, using exaggerated, rising intonation to build excitement, urgency, and a highly persuasive, hard-sell tone.
  - Generated Audio
- Example 4
  - Prompt: An English-speaking man terrified of going insane. His voice is generally low-pitched, but with a wide and unnatural range of variation. The overall pace of his speech is slow, yet highly variable, punctuated by frequent pauses of inconsistent lengths that create a sense of urgency. His voice is torn and hoarse, trembling with fear.
  - Generated Audio