speech-2.5-hd-preview

Speech 2.5 HD Preview is a text-to-speech model from MiniMax that offers studio-grade 48 kHz audio and native expressive prosody. It supports zero-shot voice cloning and sub-200 ms latency, which suits real-time applications and professional content creation at scale.

$ 0

$ 60

$ 100

text

audio

Input

Text*

Voice_id*

Speed

Volume

Pitch

Emotion

English_normalization

Enables English text normalization, which improves how numbers are read aloud.

Sample_rate

Bitrate

Channel

Format

Language_boost

Enable_base64_output

If enabled, the output will be encoded into a BASE64 string instead of a URL. This property is only available through the API.

Enable_sync_mode

If set to true, the request waits until the result has been generated and uploaded, then returns it directly in the response. This property is only available through the API.
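The parameters above could be assembled into a request body roughly as follows. This is a minimal sketch, not the documented schema: the lowercase field names simply mirror the labels in the list, and value ranges are not validated because this page does not state them. The helper names `build_payload` and `decode_base64_audio` are hypothetical.

```python
import base64

# Assumed field names: the labels listed above, lowercased. The exact
# spelling expected by the API is not confirmed on this page.
OPTIONAL_FIELDS = (
    "speed", "volume", "pitch", "emotion", "english_normalization",
    "sample_rate", "bitrate", "channel", "format", "language_boost",
    "enable_base64_output", "enable_sync_mode",
)


def build_payload(text, voice_id, **options):
    """Return a request body with the two required fields plus any options."""
    if not text or not voice_id:
        raise ValueError("text and voice_id are required")
    unknown = set(options) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"text": text, "voice_id": voice_id}
    payload.update(options)
    return payload


def decode_base64_audio(encoded):
    """Decode the BASE64 string returned when enable_base64_output is set."""
    return base64.b64decode(encoded)
```

Rejecting unknown keys early catches typos such as `sampel_rate` before a request is sent.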

Core Speech 2.5 Features

Technical highlights that make the 2.5 HD preview model a leader in synthetic voice technology.

Zero-Shot Voice Cloning

Replicate a target voice with roughly 94% similarity from a sample as short as 5 seconds, with support for cross-lingual synthesis.

Studio-Grade 48kHz Audio

High-definition output designed for broadcasting, providing superior clarity compared to 24kHz standard models.

Sub-200ms Latency

An optimized processing pipeline keeps time to first byte (TTFB) under 200 ms, which suits conversational AI applications.

Native Expressive Prosody

The model automatically interprets text context to add natural pauses, sighs, and emotional depth without manual SSML tags.

Build with speech 2.5 hd preview in Minutes

Follow these simple steps to set up your account, get credits, and start sending API requests to speech 2.5 hd preview via GPT Proto.

Create your account

Create your free GPT Proto account to begin. You can set up an organization for your team at any time.

Top up

Your balance can be used across all models on the platform, including speech 2.5 hd preview, giving you the flexibility to experiment and scale as needed.

Generate your API key

In your dashboard, create an API key — you'll need it to authenticate when making requests to speech 2.5 hd preview.

Make your first API call

Use your API key with our sample code to send a request to speech 2.5 hd preview via GPT Proto and see instant AI-powered results.
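A first request might look like the sketch below. The endpoint URL and the `Bearer` authorization scheme are assumptions made for illustration (the real values are shown with your API key in the dashboard), and `make_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL shown in your dashboard.
API_URL = "https://api.example.com/v1/speech-2.5-hd-preview"


def make_request(api_key, text, voice_id):
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send(make_request(api_key, "Hello, world.", "your-voice-id"))` performs the request. Reading `api_key` from an environment variable keeps it out of source control.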

Speech 2.5 FAQ

What is the typical latency for Speech 2.5 requests?

The model is optimized for high-speed performance, delivering a time to first byte (TTFB) of approximately 180 ms. This makes Speech 2.5 significantly faster than OpenAI TTS-1-HD and well suited to real-time voice assistants and interactive applications that require immediate verbal feedback.
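To check the latency figure against your own network conditions, you can time how long the first audio chunk takes to arrive. The sketch below is generic and assumes only that the response body is consumed as an iterator of byte chunks; `time_to_first_chunk` is a hypothetical helper, not part of any SDK.

```python
import time


def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrived, that chunk).

    `chunks` is any iterator of byte chunks, such as a streamed HTTP
    body. Pass a lazy iterator (for example a generator that sends the
    request when first advanced) so the measurement covers the request.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first
```

The measured value includes network round-trip time, so it will normally be higher than a server-side figure.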

How long of an audio sample is needed for voice cloning?

High-fidelity zero-shot voice cloning needs only a 5- to 10-second audio reference. Speech 2.5 uses this sample to extract vocal characteristics, so the cloned voice can read any text input with high similarity (94.1%) across 30+ supported languages while preserving the original speaker's identity.
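Before uploading a reference clip, it can help to confirm that it is long enough. The following sketch uses Python's standard `wave` module and therefore handles uncompressed WAV files only; the 5-second default comes from the answer above, and the function names are hypothetical.

```python
import wave


def wav_duration_seconds(wav_file):
    """Return the duration of a WAV file given a path or file-like object."""
    with wave.open(wav_file, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_long_enough(wav_file, min_seconds=5.0):
    """True if the clip is at least `min_seconds` long."""
    return wav_duration_seconds(wav_file) >= min_seconds
```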

Does Speech 2.5 support high-definition output?

Yes, it supports studio-grade 48kHz output in various formats including MP3, WAV, PCM, and Opus. This resolution is superior to the 24kHz industry standard, ensuring that synthesized speech sounds professional and clear, which is essential for podcasts, audiobooks, and high-end digital human animations.

What are the input limits for text synthesis?

Each individual request can process text up to 4,096 characters. For longer documents or extensive scripts, we recommend chunking the text into smaller segments. This prevents processing timeouts and ensures the neural engine maintains consistent prosody and emotional inflection throughout the entire audio generation process.
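One way to chunk long text is to pack whole sentences into segments that stay within the 4,096-character limit. The sketch below is one possible approach, not a method prescribed by the platform; the sentence-boundary regex is a simple heuristic, and `chunk_text` is a hypothetical name.

```python
import re

MAX_CHARS = 4096  # per-request limit stated above


def chunk_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Breaks fall on sentence boundaries where possible; a single sentence
    longer than the limit is split at hard character boundaries.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        while len(sentence) > limit:  # oversized sentence: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio segments concatenated in order.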

Is my text data used to train the underlying models?

Privacy is a core priority. Any text or audio samples sent through our API to the Speech 2.5 HD Preview model are excluded from MiniMax's training datasets by default. Your proprietary scripts and voice cloning data remain secure and are used solely to generate the output for your requests.

How does pricing work for HD and cloned voices?

Standard voice synthesis is priced at $15.00 per 1 million characters. For high-definition or cloned voices, the rate is $30.00 per 1 million characters. GPTProto.com provides unified billing, allowing you to access these advanced features without managing multiple separate enterprise contracts with individual vendors.
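The per-character rates above translate into a simple cost estimate. This sketch assumes billing is strictly proportional to character count, with no minimum charge or rounding rule, since none is stated here; `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# USD per 1 million characters, as quoted above.
RATE_STANDARD = Decimal("15.00")
RATE_HD_OR_CLONED = Decimal("30.00")


def estimate_cost(num_chars, hd_or_cloned=False):
    """Estimated charge in USD for synthesizing `num_chars` characters."""
    rate = RATE_HD_OR_CLONED if hd_or_cloned else RATE_STANDARD
    return rate * num_chars / Decimal(1_000_000)
```

For example, 250,000 characters at the HD or cloned rate comes to $7.50, and the same text at the standard rate to $3.75.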
