NeuTTS Air
On-Device Text-to-Speech with Instant Voice Cloning
NeuTTS Air is the first super-realistic, on-device text-to-speech model with instant voice cloning capabilities. Built on a 0.5B LLM backbone, it brings natural-sounding speech, real-time performance, built-in security, and speaker cloning to your local device.
What is NeuTTS Air?
NeuTTS Air is an offline text-to-speech (TTS) AI model developed by Neuphonic that runs entirely on a local device without internet access. It converts text into natural-sounding speech, ensures full data privacy, and can instantly clone a person’s voice using just a few seconds of reference audio.

Unlike traditional voice AI systems that depend on cloud servers, NeuTTS Air is designed to be lightweight, fast, and efficient, which makes high-quality voice generation possible even on personal computers. This makes it ideal for creators, developers, and users who want powerful voice capabilities without sacrificing speed, control, or privacy.
Model Overview
Specification | Details |
---|---|
Model Name | NeuTTS Air |
Developer | Neuphonic |
Base Architecture | Qwen 0.5B LLM |
Audio Codec | NeuCodec (50Hz neural codec) |
License | Apache 2.0 (Open Source) |
Supported Languages | English |
Context Window | 2048 tokens (~30 seconds audio) |
Format | GGML for on-device inference |
Inference Speed | Real-time on mid-range devices |
Voice Cloning | 3-15 seconds reference audio |
Key Features
Natural Voice Quality
NeuTTS Air produces ultra-realistic voices that sound genuinely human. The model achieves best-in-class realism for its size, delivering natural intonation, proper pacing, and emotional nuance that makes synthesized speech nearly indistinguishable from real human voice recordings.
On-Device Deployment
The model is optimized to run directly on your hardware without requiring cloud connectivity. Available in GGML format, NeuTTS Air can run on phones, laptops, or even Raspberry Pi devices. This ensures your data stays private, reduces latency, and eliminates ongoing API costs.
Instant Voice Cloning
Create custom speakers with minimal reference audio. NeuTTS Air can clone a voice using just 3-15 seconds of audio input, capturing the unique characteristics, tone, and style of the speaker. This enables personalized applications while maintaining quality and naturalness.
Efficient Architecture
Built on a simple LM + codec architecture with a 0.5B backbone, NeuTTS Air represents the optimal balance between speed, model size, and output quality for real-world applications. The NeuCodec audio codec achieves exceptional quality at low bitrates using a single codebook at 50Hz.
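To see why a single codebook at 50Hz implies a low bitrate, the arithmetic can be sketched as follows. The 50Hz frame rate and the single codebook come from the description above; the codebook size of 65,536 entries is an assumption made for illustration and is not stated here.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bits per second for a codec that emits one index per codebook per frame."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# 50 Hz and one codebook are from the text; 65_536 entries is an ASSUMED size.
print(codec_bitrate_bps(50, 65_536))  # 800.0
```

Under that assumption the token stream would carry 800 bits per second, which is far below typical compressed audio formats.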
Real-Time Performance
Experience speech generation in real-time on mid-range devices. The model is optimized for both speed and quality, making it suitable for interactive applications like voice assistants, live narration, and real-time translation systems.
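One common way to make "real-time" measurable is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value at or below 1.0 means audio is generated at least as fast as it plays. The helper names below are hypothetical and not part of NeuTTS Air; this is only a sketch of how such a measurement might be taken.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF <= 1.0 means synthesis keeps up with playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: 4 s of compute for 10 s of audio.
print(real_time_factor(4.0, 10.0))  # 0.4
```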
Built-in Watermarking
Every audio file generated by NeuTTS Air is watermarked with the Perth (Perceptual Threshold) Watermarker to support responsible AI usage. The watermark allows generated content to be identified and traced, promoting ethical use of voice synthesis technology.
How NeuTTS Air Works
NeuTTS Air employs a sophisticated yet efficient pipeline to transform text into natural-sounding speech. Understanding this process helps you make the most of the model's capabilities.
Step 1: Reference Audio Processing
The model begins by analyzing your reference audio sample, which should be between 3 and 15 seconds long. The audio should be clean, mono, sampled at 16-44 kHz, and saved as a WAV file. The model extracts key voice characteristics including tone, pitch, speaking style, and other unique features that define the speaker's voice.
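The requirements above can be checked before a file is handed to the model. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and treating "44 kHz" as 44,100 Hz is an assumption. It cannot judge whether a recording is clean, only its format and length.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 15.0
MIN_RATE_HZ, MAX_RATE_HZ = 16_000, 44_100  # ASSUMPTION: "44 kHz" means 44.1 kHz


def check_reference(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    # wave.open raises wave.Error if the file is not a readable PCM WAV.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is outside 16-44.1 kHz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside 3-15 s")
    return problems
```

Calling `check_reference("reference.wav")` on a conforming file would return an empty list; the file name here is only a placeholder.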
Step 2: Text Understanding
Your input text is processed through the Qwen 0.5B language model backbone. This lightweight yet capable model handles text understanding and generation, ensuring proper pronunciation, emphasis, and natural flow. The 2048-token context window is shared by the prompt and the generated speech, which corresponds to approximately 30 seconds of audio including the prompt.
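The relationship between the 2048-token window and the roughly 30-second figure can be sketched with simple arithmetic. The window size and the 50Hz codec rate come from this page; the assumption that reference audio, text tokens, and generated speech all share one window, and the example token counts, are illustrative only.

```python
CONTEXT_TOKENS = 2048  # from the text
CODEC_RATE_HZ = 50     # NeuCodec tokens per second of audio, from the text


def max_audio_seconds(context_tokens: int = CONTEXT_TOKENS, rate_hz: int = CODEC_RATE_HZ) -> float:
    """Upper bound if every token in the window were an audio token."""
    return context_tokens / rate_hz


def remaining_seconds(reference_seconds: float, text_tokens: int,
                      context_tokens: int = CONTEXT_TOKENS, rate_hz: int = CODEC_RATE_HZ) -> float:
    """Seconds of new speech that still fit, ASSUMING one shared window."""
    used = reference_seconds * rate_hz + text_tokens
    return max(0.0, (context_tokens - used) / rate_hz)


print(max_audio_seconds())         # 40.96
print(remaining_seconds(10, 300))  # 10 s reference + 300 hypothetical text tokens
```

The pure-audio ceiling is about 41 seconds, so once text tokens take their share, a total of around 30 seconds of audio is a plausible practical limit.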
Step 3: Speech Synthesis
The model combines the voice characteristics from your reference audio with the text content to generate speech codes. NeuCodec, the specialized 50Hz neural audio codec, converts these codes into high-quality audio at low bitrates using a single codebook. This efficient encoding ensures excellent quality while maintaining fast processing speeds.
Step 4: Audio Output
The final audio is generated in real-time and automatically includes the Perth watermark for responsible usage tracking. The output maintains the voice characteristics of your reference while speaking the new text with natural intonation and pacing. The entire process happens on your local device, ensuring privacy and eliminating network dependencies.
Applications and Use Cases
NeuTTS Air unlocks a new category of voice-enabled applications that run entirely on-device, ensuring privacy and reducing costs.
Voice Assistants
Build embedded voice agents and personal assistants that work offline with personalized voices, maintaining complete privacy without cloud dependencies.
Accessibility Tools
Create screen readers and communication aids with natural-sounding voices customized to user preferences, enabling better accessibility for those with visual or speech impairments.
Content Creation
Generate voiceovers for videos, podcasts, and audiobooks with consistent voice quality. Clone voices for character dialogue or narrative storytelling.
Educational Applications
Develop language learning tools, interactive tutorials, and educational content with clear, natural narration in customizable voices.
Smart Toys
Power interactive toys and games with natural speech capabilities that work without internet connectivity, ensuring child safety and privacy.
Compliance-Safe Apps
Build applications for industries with strict data privacy requirements, where voice data cannot leave the device due to regulatory constraints.
Technical Specifications
Model Architecture
- Backbone: Qwen 0.5B LLM
- Audio Codec: NeuCodec 50Hz
- Parameters: 0.5 Billion
- Format: GGML / GGUF
Audio Requirements
- Reference Length: 3-15 seconds
- Sample Rate: 16-44 kHz
- Format: WAV (mono)
- Quality: Clean audio
Performance Metrics
- Inference Speed: Real-time
- Context Window: 2048 tokens
- Audio Duration: ~30 seconds
- Device Support: CPU/GPU
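Because a single pass covers only about 30 seconds of audio, longer text generally has to be split and synthesized piece by piece. The sketch below groups sentences under a per-chunk time budget. The function name, the 15-second default budget, and the speaking rate of 2.5 words per second are assumptions chosen for illustration, not values from NeuTTS Air.

```python
import re


def chunk_text(text: str, max_seconds: float = 15.0, words_per_second: float = 2.5) -> list[str]:
    """Group whole sentences so each chunk stays under an estimated duration."""
    max_words = int(max_seconds * words_per_second)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget is kept as its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than at a fixed word count keeps each chunk a natural unit of speech, which tends to matter for intonation.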
System Support
- Python Version: 3.11+
- Platforms: Linux/Mac/Windows
- Hardware: Phones to PCs
- Acceleration: CUDA/MPS
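Choosing between CUDA, MPS, and CPU at startup might look like the sketch below. It assumes inference runs through PyTorch, which this page does not state, and the function name is hypothetical. If PyTorch is not installed, it simply falls back to the CPU.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch  # ASSUMPTION: a PyTorch-based runtime
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```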
Voice Cloning Best Practices
To achieve the best results with NeuTTS Air's voice cloning capability, follow these guidelines when preparing your reference audio samples.
Audio Quality Requirements
- Mono Channel: Use single-channel audio for consistent results. Stereo files should be converted to mono before use.
- Sample Rate: 16-44 kHz is ideal. Higher rates work but don't significantly improve quality while increasing processing time.
- Duration: 3-15 seconds provides optimal results. Less than 3 seconds may not capture enough characteristics; more than 15 seconds adds no benefit.
- File Format: WAV format is required. Convert from other formats (MP3, AAC) as needed.
- Clean Recording: Minimal background noise is crucial. Avoid echo, music, or environmental sounds.
- Natural Speech: Continuous speaking works best, like a monologue or conversation. Avoid long pauses or unnatural delivery.
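Converting a stereo recording to mono, as the first guideline suggests, can be done with the standard library alone. This is a minimal sketch that handles only 16-bit PCM stereo WAV files and averages the two channels; the function name is hypothetical, and other bit depths or compressed formats would need a dedicated audio tool.

```python
import array
import sys
import wave


def stereo_to_mono(src: str, dst: str) -> None:
    """Average left and right channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()

    # Samples are interleaved as L, R, L, R, ...
    mono = array.array(
        "h", ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2))
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mono.tobytes())
```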
Recording Tips
- Record in a quiet environment with minimal echo
- Maintain consistent volume throughout the recording
- Speak naturally with normal pacing and intonation
- Include variety in tone to capture the speaker's range
- Avoid extreme emotions or unusual speaking patterns unless that's the target voice style
- Test multiple reference samples to find the best match for your use case
Ready to Get Started?
NeuTTS Air brings professional-quality text-to-speech with voice cloning to your local device. Start building voice-enabled applications with complete privacy and control.