What comes after building generative AI technology for image and code generation? For Stability AI, the answer is text-to-audio generation. The organization, best known for its Stable Diffusion text-to-image model, has announced the initial public release of Stable Audio, which lets users generate short audio clips from text prompts. After its work in image and code generation, Stability AI is now moving into music and audio.
Stability AI’s VP of Audio, Ed Newton-Rex, shared insights into the inspiration behind Stable Audio. Newton-Rex, who previously built his own startup called Jukedeck, which was later acquired by TikTok, recognized the potential of computer-generated music. However, the roots of Stable Audio do not lie in Jukedeck but instead in Stability AI’s internal music generation research studio called Harmonai. Zach Evans, the creator of Harmonai, explained that the technology takes concepts from image generation and applies them to audio.
While it has long been possible to generate basic audio tracks using symbolic generation techniques, Stable Audio takes a different route. Symbolic generation commonly works with MIDI files, which can represent musical elements such as a drum roll as note data. Stable Audio instead works directly with raw audio samples, which lets users create new music that goes beyond repetitive notes and MIDI files and is intended to yield higher-quality output. The model was trained on a large amount of licensed music from the audio library AudioSparx, giving it a broad, high-quality training set.
While image generation models allow users to create images in the style of specific artists, Stable Audio takes a different approach. Users cannot ask the model to generate music that sounds like a classic Beatles tune, for example. Instead, the focus is on helping musicians be more creative. Ed Newton-Rex pointed out that, in his experience, most musicians prefer to start a new piece from a blank canvas rather than aim for the style of a specific musical group.
Stable Audio is a diffusion model with approximately 1.2 billion parameters, roughly on par with the original release of Stable Diffusion for image generation. Stability AI also built and trained its own text model for audio generation. That model uses Contrastive Language-Audio Pretraining (CLAP), a technique that learns to associate text descriptions with audio so that generated clips match the prompt more closely. To help users get the audio they want, Stability AI is also releasing a prompt guide.
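To make the idea behind contrastive language-audio pretraining more concrete, here is a minimal sketch of a symmetric contrastive loss over paired text and audio embeddings. It is not Stability AI's implementation: the two-dimensional embeddings are made-up toy values, the function names are hypothetical, and the temperature of 0.07 is an assumed default commonly seen in contrastive setups. The point is only that matched text/audio pairs should score a lower loss than mismatched ones.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, audio_embs):
    """Cosine similarity between every text embedding and every audio embedding."""
    texts = [normalize(v) for v in text_embs]
    audios = [normalize(v) for v in audio_embs]
    return [[sum(x * y for x, y in zip(t, a)) for a in audios] for t in texts]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, audio_embs, temperature=0.07):
    """Symmetric contrastive loss: text-to-audio plus audio-to-text, averaged.

    The temperature value is an assumption for illustration only.
    """
    sims = similarity_matrix(text_embs, audio_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy embeddings (invented values): item i of each list describes the same clip.
TEXT_EMBS = [[1.0, 0.1], [0.1, 1.0]]
AUDIO_EMBS_MATCHED = [[0.9, 0.2], [0.2, 0.9]]
AUDIO_EMBS_SWAPPED = [[0.2, 0.9], [0.9, 0.2]]

if __name__ == "__main__":
    print("matched pairs loss:", contrastive_loss(TEXT_EMBS, AUDIO_EMBS_MATCHED))
    print("swapped pairs loss:", contrastive_loss(TEXT_EMBS, AUDIO_EMBS_SWAPPED))
```

Training pushes the encoders toward the low-loss case, so that a prompt's text embedding ends up close to the embeddings of audio that fits it.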
Stable Audio will be accessible to users through two options: a free version and a $12/month Pro plan. The free version allows for 20 generations per month, with each track limited to 20 seconds. On the other hand, the Pro version offers greater flexibility, allowing for 500 generations per month and extending the track length to 90 seconds. The goal for Stability AI is to make Stable Audio available to everyone, encouraging experimentation and creativity.
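As a quick worked example of what those limits add up to, the sketch below computes the maximum amount of audio each plan could produce in a month, assuming every generation is used at the full track length. The `Plan` class and its field names are hypothetical and exist only for this illustration; the numbers are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical container for the plan limits quoted in the article."""
    name: str
    monthly_price_usd: float
    generations_per_month: int
    max_track_seconds: int

    def max_audio_seconds(self) -> int:
        """Upper bound on audio per month if every track uses the full length."""
        return self.generations_per_month * self.max_track_seconds


FREE = Plan("Free", 0.0, 20, 20)
PRO = Plan("Pro", 12.0, 500, 90)


def summarize(plan: Plan) -> str:
    """Format the monthly ceiling in seconds and minutes."""
    total = plan.max_audio_seconds()
    return f"{plan.name}: up to {total} s (~{total / 60:.1f} min) of audio per month"


if __name__ == "__main__":
    for plan in (FREE, PRO):
        print(summarize(plan))
```

Under these assumptions the free tier tops out at 400 seconds of audio per month (under seven minutes), while the Pro plan allows up to 45,000 seconds, or 750 minutes.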
Stability AI’s launch of Stable Audio marks a notable step for text-prompted generative AI. Having started with image and code generation, Stability AI now lets users create music and audio clips from a text description. By applying diffusion models to a large body of licensed audio data, Stable Audio opens up new possibilities for musicians and audio enthusiasts. With a focus on creativity and accessibility, Stability AI aims to inspire and support the exploration of new soundscapes.