NeuTTS Air

On-Device Text-to-Speech with Instant Voice Cloning

Neuphonic presents NeuTTS Air as the first super-realistic, on-device text-to-speech model with instant voice cloning. Built on a 0.5B LLM backbone, it brings natural-sounding speech, real-time performance, built-in watermarking, and speaker cloning to your local device.

What is NeuTTS Air?

NeuTTS Air is an offline text-to-speech (TTS) AI model developed by Neuphonic that runs entirely on a local device without internet access. It converts text into natural-sounding speech, ensures full data privacy, and can instantly clone a person’s voice using just a few seconds of reference audio.

Unlike traditional voice AI systems that depend on cloud servers, NeuTTS Air is designed to be lightweight, fast, and efficient, which makes high-quality voice generation possible even on personal computers. This makes it a good fit for creators, developers, and users who want powerful voice capabilities without sacrificing speed, control, or privacy.

Model Overview

Specification        Details
Model Name           NeuTTS Air
Developer            Neuphonic
Base Architecture    Qwen 0.5B LLM
Audio Codec          NeuCodec (50Hz neural codec)
License              Apache 2.0 (Open Source)
Supported Languages  English
Context Window       2048 tokens (~30 seconds of audio)
Format               GGML for on-device inference
Inference Speed      Real-time on mid-range devices
Voice Cloning        3-15 seconds of reference audio

Key Features

Natural Voice Quality

NeuTTS Air produces ultra-realistic voices that sound genuinely human. The model achieves best-in-class realism for its size, delivering natural intonation, proper pacing, and emotional nuance that make synthesized speech nearly indistinguishable from recordings of a real human voice.

On-Device Deployment

The model is optimized to run directly on your hardware without requiring cloud connectivity. Available in GGML format, NeuTTS Air can run on phones, laptops, or even Raspberry Pi devices. This ensures your data stays private, reduces latency, and eliminates ongoing API costs.

Instant Voice Cloning

Create custom speakers with minimal reference audio. NeuTTS Air can clone a voice using just 3-15 seconds of audio input, capturing the unique characteristics, tone, and style of the speaker. This enables personalized applications while maintaining quality and naturalness.

Efficient Architecture

Built on a simple LM + codec architecture with a 0.5B backbone, NeuTTS Air aims to balance speed, model size, and output quality for real-world applications. The NeuCodec audio codec achieves high quality at low bitrates using a single codebook at 50Hz.

Real-Time Performance

Experience speech generation in real-time on mid-range devices. The model is optimized for both speed and quality, making it suitable for interactive applications like voice assistants, live narration, and real-time translation systems.

Built-in Watermarking

Every audio file generated by NeuTTS Air is marked with the Perth (Perceptual Threshold) Watermarker to support responsible AI usage. The watermark allows generated content to be identified and traced, promoting ethical use of voice synthesis technology.

How NeuTTS Air Works

NeuTTS Air employs a sophisticated yet efficient pipeline to transform text into natural-sounding speech. Understanding this process helps you make the most of the model's capabilities.

Step 1: Reference Audio Processing

The model begins by analyzing your reference audio sample. For best results, the sample should be between 3 and 15 seconds long. The audio should be clean, mono, sampled at 16-44 kHz, and saved as a WAV file. The model extracts key voice characteristics, including tone, pitch, speaking style, and other features that define the speaker's voice.
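The requirements above can be checked before a sample is used. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_reference` is hypothetical, and reading "16-44 kHz" as 16,000 to 44,100 Hz is an assumption. It checks format properties only and cannot judge how clean the recording is.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 15.0
# Assumption: "16-44 kHz" is read as 16 kHz up to the common 44.1 kHz rate.
MIN_RATE_HZ, MAX_RATE_HZ = 16_000, 44_100


def check_reference(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is outside 16-44.1 kHz")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s is outside 3-15 s")
    return problems
```

Calling `check_reference("speaker.wav")` on a file that meets all three requirements returns an empty list; otherwise each entry describes one failed check.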

Step 2: Text Understanding

Your input text is processed through the Qwen 0.5B language model backbone. This lightweight yet capable model handles text understanding and generation, ensuring proper pronunciation, emphasis, and natural flow. The 2048 token context window allows processing of substantial text segments, approximately 30 seconds of audio including the prompt.
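To see where the roughly 30-second figure comes from, it helps to work through the token budget. The sketch below assumes one speech token per codec frame (a reading of "single codebook at 50Hz") and that reference audio, text, and generated speech all share the same 2048-token window. The `text_tokens` value is a hypothetical count, since the actual tokenizer overhead is not stated here.

```python
CONTEXT_TOKENS = 2048     # context window stated above
CODEC_FRAME_RATE_HZ = 50  # NeuCodec frame rate; assumed one token per frame


def audio_tokens(seconds: float) -> int:
    """Number of speech tokens needed to represent `seconds` of audio."""
    return round(seconds * CODEC_FRAME_RATE_HZ)


def remaining_audio_seconds(reference_seconds: float, text_tokens: int) -> float:
    """Seconds of new speech that still fit after the prompt is accounted for."""
    used = audio_tokens(reference_seconds) + text_tokens
    return max(0, CONTEXT_TOKENS - used) / CODEC_FRAME_RATE_HZ


# Upper bound if the whole window held nothing but speech tokens.
print(CONTEXT_TOKENS / CODEC_FRAME_RATE_HZ)  # 40.96

# A 10-second reference plus a hypothetical 300 text tokens.
print(remaining_audio_seconds(10, 300))      # 24.96
```

Under these assumptions, a pure-audio window tops out near 41 seconds, so a practical limit of about 30 seconds once text and the reference prompt are included is plausible.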

Step 3: Speech Synthesis

The model combines the voice characteristics from your reference audio with the text content to generate speech codes. NeuCodec, the specialized 50Hz neural audio codec, converts these codes into high-quality audio at low bitrates using a single codebook. This efficient encoding ensures excellent quality while maintaining fast processing speeds.
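The link between frame rate, codebook count, and bitrate is simple arithmetic: each frame emits one token per codebook, and each token carries log2(codebook size) bits. The sketch below illustrates that relationship only. The codebook size of 65,536 entries is an illustrative value and not a figure taken from this page.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, codebook_size: int,
                      num_codebooks: int = 1) -> float:
    """Bits per second produced by a discrete neural codec."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# 50 frames per second, one codebook, illustrative 65,536-entry codebook.
print(codec_bitrate_bps(50, 65_536))  # 800.0
```

For comparison, uncompressed 16-bit mono PCM at 16 kHz is 256,000 bits per second, which is why a single low-rate codebook keeps the language model's output sequence short.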

Step 4: Audio Output

The final audio is generated in real-time and automatically includes the Perth watermark for responsible usage tracking. The output maintains the voice characteristics of your reference while speaking the new text with natural intonation and pacing. The entire process happens on your local device, ensuring privacy and eliminating network dependencies.

Try NeuTTS Air

Experience NeuTTS Air in action with our interactive demo. Upload a reference audio sample and enter text to hear the instant voice cloning capabilities.

Applications and Use Cases

NeuTTS Air unlocks a new category of voice-enabled applications that run entirely on-device, ensuring privacy and reducing costs.

Voice Assistants

Build embedded voice agents and personal assistants that work offline with personalized voices, maintaining complete privacy without cloud dependencies.

Accessibility Tools

Create screen readers and communication aids with natural-sounding voices customized to user preferences, enabling better accessibility for those with visual or speech impairments.

Content Creation

Generate voiceovers for videos, podcasts, and audiobooks with consistent voice quality. Clone voices for character dialogue or narrative storytelling.

Educational Applications

Develop language learning tools, interactive tutorials, and educational content with clear, natural narration in customizable voices.

Smart Toys

Power interactive toys and games with natural speech capabilities that work without internet connectivity, ensuring child safety and privacy.

Compliance-Safe Apps

Build applications for industries with strict data privacy requirements, where voice data cannot leave the device due to regulatory constraints.

Technical Specifications

Model Architecture

  • Backbone: Qwen 0.5B LLM
  • Audio Codec: NeuCodec 50Hz
  • Parameters: 0.5 Billion
  • Format: GGML / GGUF

Audio Requirements

  • Reference Length: 3-15 seconds
  • Sample Rate: 16-44 kHz
  • Format: WAV (mono)
  • Quality: Clean audio

Performance Metrics

  • Inference Speed: Real-time
  • Context Window: 2048 tokens
  • Audio Duration: ~30 seconds
  • Device Support: CPU/GPU
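Because a single pass covers only about 30 seconds of audio, longer scripts need to be split and synthesized piece by piece. The following is a minimal sketch of sentence-based chunking. The function name and the `max_chars` limit are hypothetical, and a character count is only a rough proxy for spoken duration, so tune it against your own material.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately with the same reference voice and the resulting audio segments concatenated.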

System Support

  • Python Version: 3.11+
  • Platforms: Linux/Mac/Windows
  • Hardware: Phones to PCs
  • Acceleration: CUDA/MPS
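A quick environment check along these lines can confirm the interpreter version and choose an accelerator. This sketch assumes PyTorch is the runtime in use, which this page does not state explicitly, and falls back to the CPU when PyTorch is not installed. The function names are hypothetical.

```python
import sys


def check_python(minimum: tuple[int, int] = (3, 11)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch  # assumption: PyTorch is the inference runtime
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("Python OK:", check_python())
    print("Device:", pick_device())
```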

Voice Cloning Best Practices

To achieve the best results with NeuTTS Air's voice cloning capability, follow these guidelines when preparing your reference audio samples.

Audio Quality Requirements

  • Mono Channel: Use single-channel audio for consistent results. Stereo files should be converted to mono before use.
  • Sample Rate: 16-44 kHz is ideal. Higher rates work but don't significantly improve quality while increasing processing time.
  • Duration: 3-15 seconds provides optimal results. Less than 3 seconds may not capture enough characteristics; more than 15 seconds adds no benefit.
  • File Format: WAV format is required. Convert from other formats (MP3, AAC) as needed.
  • Clean Recording: Minimal background noise is crucial. Avoid echo, music, or environmental sounds.
  • Natural Speech: Continuous speaking works best, like a monologue or conversation. Avoid long pauses or unnatural delivery.
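If your recording is stereo, it can be downmixed by averaging the channels of each frame. Here is a minimal standard-library sketch, assuming 16-bit PCM WAV input. The function name `downmix_to_mono` is hypothetical, and other sample widths or compressed formats would need a dedicated audio tool.

```python
import array
import wave


def downmix_to_mono(src_path: str, dst_path: str) -> None:
    """Average all channels of a 16-bit PCM WAV file into one mono channel."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        rate = src.getframerate()
        if src.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        # Samples are interleaved: L, R, L, R, ... for stereo input.
        samples = array.array("h", src.readframes(src.getnframes()))

    # One output sample per frame: the integer mean of that frame's channels.
    mono = array.array("h", (
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples), channels)
    ))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

The sample rate is left unchanged, so a file that was already within the 16-44 kHz range stays within it after conversion.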

Recording Tips

  • Record in a quiet environment with minimal echo
  • Maintain consistent volume throughout the recording
  • Speak naturally with normal pacing and intonation
  • Include variety in tone to capture the speaker's range
  • Avoid extreme emotions or unusual speaking patterns unless that's the target voice style
  • Test multiple reference samples to find the best match for your use case

Ready to Get Started?

NeuTTS Air brings professional-quality text-to-speech with voice cloning to your local device. Start building voice-enabled applications with complete privacy and control.