Nvidia open-sources AI text-to-sound model, capable of generating 30 seconds of sound in just 3.7 seconds.

Jan 13, 2025#AI199

TangoFlux is a text-to-audio model used mainly to generate sound effects, such as wind and rain, silver needles hitting the ground, or the roar of an airplane taking off.

Technical Features#

Efficient Generation Capability:
TangoFlux can generate up to 30 seconds of 44.1 kHz audio in 3.7 seconds on a single A40 GPU, roughly eight times faster than real time. Compared with other models, this is a significant speed advantage, and the output remains high quality.
Flow Matching and Rectified Flows:
The model is trained with a flow matching framework, specifically rectified flows, which learn a straight-line path from noise to the target distribution. Because the path is straight, sampling needs fewer steps to preserve audio quality, which makes generation more efficient and stable and lowers the compute required.
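To make the "fewer sampling steps" point concrete, here is a minimal sketch of Euler sampling along a rectified flow. It is not TangoFlux's code: `euler_sample`, `toy_velocity`, and `TARGET` are hypothetical names, and the learned velocity network is replaced by the ideal velocity field for a target that is a single point. In that idealized case the path is exactly straight, so one Euler step lands where fifty do.

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    using fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so (1 - t) below stays positive
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target: a single point standing in for "the data".
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    """Ideal velocity for a point target (an assumption of this toy):
    it points straight from the current position to the target."""
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, TARGET)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]

one_step = euler_sample(toy_velocity, noise, num_steps=1)
many_steps = euler_sample(toy_velocity, noise, num_steps=50)
```

With a real learned velocity field the paths are only approximately straight, so a few more steps are usually still needed; the sketch only shows why straighter paths reduce the step count.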
CLAP-Ranked Preference Optimization (CRPO):
TangoFlux introduces CRPO, which uses the CLAP model as a proxy reward model: the model iteratively generates audio, the outputs are ranked with CLAP to build preference data, and the model is then optimized on that data. This improves how closely the generated audio matches the text description, so the result is better aligned with what the user intended.
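The data-building half of this loop can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it assumes the best- and worst-scoring candidates per prompt form the preference pair, and `build_preference_pairs`, `toy_generate`, `toy_score`, and `TOY_SCORES` are hypothetical stand-ins for the audio generator and the CLAP text-audio similarity score.

```python
def build_preference_pairs(prompts, generate, score, num_candidates=5):
    """For each prompt, sample several candidates, rank them with a proxy
    reward, and keep the best and worst as a (chosen, rejected) pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, seed) for seed in range(num_candidates)]
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs


# Hypothetical stand-ins: a real pipeline would call the audio model here
# and score each clip against the prompt with CLAP.
TOY_SCORES = [0.21, 0.48, 0.35, 0.10, 0.42]


def toy_generate(prompt, seed):
    return {"prompt": prompt, "seed": seed}


def toy_score(prompt, candidate):
    return TOY_SCORES[candidate["seed"]]


pairs = build_preference_pairs(["rain on a tin roof"], toy_generate, toy_score)
```

The resulting pairs would then feed a preference-optimization loss, and the generate, rank, and optimize cycle would repeat with the updated model.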
Multimodal Diffusion Transformer Architecture:
The model is built from Multimodal Diffusion Transformer (MMDiT) and Diffusion Transformer (DiT) blocks and is conditioned on both the text prompt and a duration embedding, so it can generate richly detailed audio of varying lengths. This architecture improves the model's handling of complex text descriptions and the diversity of the audio it produces.
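One way to picture "combining text prompts and duration embeddings" is to turn the requested duration into a vector and append it to the text-token embeddings as an extra conditioning token. The sketch below assumes a sinusoidal encoding purely for illustration; the post does not say how TangoFlux encodes duration, and `duration_embedding` and `build_conditioning` are hypothetical names.

```python
import math


def duration_embedding(seconds, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a duration in seconds (dim must be even).
    The encoding scheme is an assumption made for this sketch."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(seconds * f) for f in freqs] + [math.cos(seconds * f) for f in freqs]


def build_conditioning(text_tokens, seconds):
    """Append the duration embedding as one extra token to the text tokens."""
    dim = len(text_tokens[0])
    return text_tokens + [duration_embedding(seconds, dim)]


# Three placeholder text-token embeddings of width 8 (not real encoder output).
text_tokens = [[0.1] * 8, [0.2] * 8, [0.3] * 8]
conditioning = build_conditioning(text_tokens, seconds=12.5)
```

The transformer blocks would then attend over this conditioning sequence alongside the audio latents, which is how a single model can be steered toward different clip lengths.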

Project Link#

GitHub Project Link

Try It Out Link#

Hugging Face Try It Out Link

Paper Link#

Paper Link