# Installation Guide
This guide walks you through installing and setting up NeuTTS Air on your system. The model runs on Linux, macOS, and Windows with Python 3.11 or higher.
Note: NeuTTS Air runs entirely on your local device. No cloud API keys or internet connection required after installation.
## Prerequisites
- Python 3.11 or higher installed on your system
- Git for cloning the repository
- Sufficient disk space for model weights (approximately 2-3 GB)
- Optional: CUDA-capable GPU for faster inference (CPU works fine too)
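To confirm the interpreter requirement before going further, a quick check along these lines can help (plain Python, no project code involved):

```python
import sys

def python_ok(version=sys.version_info[:2]):
    """Return True when the interpreter meets the Python 3.11 minimum."""
    return tuple(version) >= (3, 11)

if not python_ok():
    print(f"Python {sys.version.split()[0]} is too old; NeuTTS Air needs 3.11+")
```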
## Step 1: Clone the Repository
Open your terminal and run the following commands:
```bash
git clone https://github.com/neuphonic/neutts-air.git
cd neutts-air
```
## Step 2: Install eSpeak (Required Dependency)
eSpeak is a required dependency for phoneme conversion. Install it based on your operating system:
### macOS

```bash
brew install espeak
```
Mac users may need to set the library path. Add these lines at the top of `neutts.py`:

```python
from phonemizer.backend.espeak.wrapper import EspeakWrapper

_ESPEAK_LIBRARY = '/opt/homebrew/Cellar/espeak/1.48.04_1/lib/libespeak.1.1.48.dylib'
EspeakWrapper.set_library(_ESPEAK_LIBRARY)
```
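The Cellar path is specific to one eSpeak version, so it breaks whenever Homebrew upgrades the package. A small helper can locate the library dynamically instead; this is a sketch assuming a standard Homebrew layout (Apple Silicon or Intel prefix):

```python
import glob

def find_espeak_library():
    """Search common Homebrew locations for the eSpeak shared library.

    The exact Cellar path changes with the installed version, so glob for it
    instead of hard-coding one. Returns None when nothing is found.
    """
    patterns = [
        "/opt/homebrew/Cellar/espeak*/*/lib/libespeak*.dylib",  # Apple Silicon
        "/usr/local/Cellar/espeak*/*/lib/libespeak*.dylib",     # Intel Macs
    ]
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[-1]  # newest version sorts last
    return None
```

If it returns a path, pass that to `EspeakWrapper.set_library` instead of the hard-coded string.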
### Ubuntu/Debian Linux

```bash
sudo apt install espeak
```
### Windows

Download and install eSpeak NG from the official repository, then set environment variables in PowerShell:

```powershell
$env:PHONEMIZER_ESPEAK_LIBRARY = "c:\Program Files\eSpeak NG\libespeak-ng.dll"
$env:PHONEMIZER_ESPEAK_PATH = "c:\Program Files\eSpeak NG"
setx PHONEMIZER_ESPEAK_LIBRARY "c:\Program Files\eSpeak NG\libespeak-ng.dll"
setx PHONEMIZER_ESPEAK_PATH "c:\Program Files\eSpeak NG"
```
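The `$env:` assignments apply only to the current PowerShell session, while `setx` persists the values for future sessions. As an alternative, the same variables can be set from Python before `phonemizer` is first imported; the paths below are the default eSpeak NG install locations used in the PowerShell commands:

```python
import os

# Set the phonemizer variables from Python, before phonemizer is imported.
# setdefault keeps any value already configured in the environment.
os.environ.setdefault(
    "PHONEMIZER_ESPEAK_LIBRARY",
    r"c:\Program Files\eSpeak NG\libespeak-ng.dll",
)
os.environ.setdefault(
    "PHONEMIZER_ESPEAK_PATH",
    r"c:\Program Files\eSpeak NG",
)
```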
## Step 3: Install Python Dependencies
Install the required Python packages:
```bash
pip install -r requirements.txt
```
The requirements file covers running the model with PyTorch. If you use the ONNX decoder or a GGUF model instead, some of these dependencies, such as PyTorch, may not be required.
## Step 4: Optional - Install Additional Components
### For GGUF Models (Recommended for Performance)
Install llama-cpp-python for optimized GGUF model support:
```bash
pip install llama-cpp-python
```
For GPU acceleration with CUDA or MPS support, refer to the llama-cpp-python documentation for platform-specific installation instructions.
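For example, a CUDA-enabled build can be requested through CMake flags at install time. This sketch uses the flag documented by recent llama-cpp-python releases; older releases used `-DLLAMA_CUBLAS=on`, so check the documentation for your installed version:

```bash
# Rebuild llama-cpp-python with CUDA kernels enabled
# (flag name per recent releases; older ones used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --no-cache-dir
```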
### For ONNX Decoder
If you want to use the ONNX decoder for additional performance:
```bash
pip install onnxruntime
```
## Step 5: Running Your First Synthesis
Test the installation with the basic example:
```bash
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio samples/dave.wav \
  --ref_text samples/dave.txt
```
This command will synthesize speech using the provided reference audio and text. The output will be saved as an audio file in your working directory.
## Using NeuTTS Air in Your Code
Here's a simple example to get started with NeuTTS Air in your Python projects:
```python
from neuttsair.neutts import NeuTTSAir
import soundfile as sf

# Initialize the model
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",  # or 'neutts-air-q4-gguf'
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

# Your input text
input_text = "My name is Dave, and um, I'm from London."

# Reference files
ref_text_path = "samples/dave.txt"
ref_audio_path = "samples/dave.wav"

# Load reference text
with open(ref_text_path, "r") as f:
    ref_text = f.read().strip()

# Encode reference audio
ref_codes = tts.encode_reference(ref_audio_path)

# Generate speech
wav = tts.infer(input_text, ref_codes, ref_text)

# Save output at 24 kHz
sf.write("output.wav", wav, 24000)
```
## Model Variants
NeuTTS Air is available in several formats to suit different performance needs:
| Model Variant | Description | Use Case |
|---|---|---|
| neuphonic/neutts-air | Standard PyTorch model | Full-featured, best quality |
| neutts-air-q8-gguf | 8-bit quantized GGUF | Balanced speed and quality |
| neutts-air-q4-gguf | 4-bit quantized GGUF | Maximum speed, lower memory |
## Optimizing Performance
For the best performance on your device:
- Use GGUF models: These quantized versions run faster with minimal quality loss
- Pre-encode references: Encode your reference audio once and reuse the codes for multiple generations
- Use ONNX decoder: The ONNX codec decoder can provide additional speed improvements
- Enable GPU acceleration: If you have a CUDA-compatible GPU or Apple Silicon, enable GPU support for faster inference
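The pre-encoding advice can be wrapped in a small cache so each reference file is encoded exactly once per process. This is a generic sketch, not part of the NeuTTS Air API; `encode` stands in for a function like `tts.encode_reference`:

```python
class ReferenceCache:
    """Memoize reference encodings so each file is encoded only once."""

    def __init__(self, encode):
        self._encode = encode  # e.g. tts.encode_reference
        self._codes = {}

    def get(self, audio_path):
        # Encode on first request, then serve the cached codes thereafter
        if audio_path not in self._codes:
            self._codes[audio_path] = self._encode(audio_path)
        return self._codes[audio_path]
```

Construct it once with `cache = ReferenceCache(tts.encode_reference)` and call `cache.get(path)` wherever the codes are needed across multiple generations.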
## Preparing Reference Audio
For optimal voice cloning results, your reference audio should meet these criteria:
- Mono channel - Single audio channel
- 16-44 kHz sample rate - Standard quality range
- 3-15 seconds duration - Optimal length for voice capture
- WAV format - Uncompressed audio file
- Clean recording - Minimal background noise
- Natural speech - Continuous speaking with few pauses
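The measurable criteria can be checked automatically before cloning. Here is a minimal validator using only the standard-library `wave` module, which handles the plain PCM WAV files the format criterion asks for:

```python
import wave

def check_reference(path):
    """Return a list of problems with a reference WAV (empty list = OK).

    Thresholds follow the criteria listed above.
    """
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not 16_000 <= rate <= 44_100:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        problems.append(f"duration {duration:.1f} s outside 3-15 s")
    return problems
```

Cleanliness and naturalness still need a human ear; this only catches the mechanical issues.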
## Troubleshooting
### eSpeak Library Not Found
If you get an error about eSpeak library not being found, ensure the library path is correctly set in your environment variables or at the top of the script as shown in Step 2.
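When the cause isn't obvious, a short diagnostic can show whether the override variable is set and whether the system loader can see an eSpeak library at all (library names and search rules vary by platform):

```python
import ctypes.util
import os

# Is the phonemizer override set, and can the loader find eSpeak?
print("PHONEMIZER_ESPEAK_LIBRARY =", os.environ.get("PHONEMIZER_ESPEAK_LIBRARY"))
for name in ("espeak-ng", "espeak"):
    print(f"find_library({name!r}) =", ctypes.util.find_library(name))
```

If both lookups print `None` and the variable is unset, point `PHONEMIZER_ESPEAK_LIBRARY` at the installed shared library as shown in Step 2.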
### CUDA/GPU Issues
If GPU acceleration is not working, verify that your CUDA installation matches your PyTorch version. You can always fall back to CPU inference by setting `backbone_device="cpu"` and `codec_device="cpu"`.
### Poor Voice Quality
Check your reference audio quality. Ensure it's clean, properly formatted, and within the recommended duration range. Better reference audio produces better results.
## Advanced Usage
For streaming output, batch processing, and other advanced features, check the examples folder in the repository. Additional Jupyter notebooks are available demonstrating various use cases.
## Developer Contributions
If you want to contribute to the NeuTTS Air project, install the pre-commit hooks:
```bash
pip install pre-commit
pre-commit install
```
## Installation Complete
You're now ready to use NeuTTS Air for text-to-speech synthesis with voice cloning. Experiment with different reference voices and text inputs to explore the model's capabilities.
Need Help? For additional examples and detailed documentation, visit the official repository or check the examples folder included with your installation.