Voice cloning, fueled by rapid advances in generative AI, has become a prominent field of study. It involves technologically reproducing a person’s unique vocal characteristics, including pitch, timbre, rhythm, mannerisms, and even pronunciation. While startups such as ElevenLabs have received substantial funding for their work in this domain, Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and Oculus VR, has gone a step further by introducing its own voice cloning program, Audiobox. There is, however, a catch.
Meta Platforms’ website today showcased the release of Audiobox, presented by researchers from the Facebook AI Research (FAIR) lab. Meta describes Audiobox as a “new foundation research model for audio generation,” building on the company’s previous voice cloning work, particularly Voicebox. According to the webpage, Audiobox can generate voices and sound effects by combining voice inputs with natural language text prompts, making it easy to create customized audio for a wide range of applications. Users can type a sentence they want a cloned voice to say or describe a sound they want generated, and Audiobox handles the rest. They can also record their own voices and have Audiobox clone them. Notably, Meta says it has developed a “family of models”: one focuses on speech mimicry, while another generates ambient sounds and sound effects such as dogs barking, sirens, or children playing. All of these models are built on a shared self-supervised model, Audiobox SSL.
Meta Platforms’ researchers adopted self-supervised learning (SSL) to train Audiobox. SSL is a deep learning technique in which an algorithm generates its own training labels from unlabeled data, in contrast to supervised learning, which relies on pre-labeled data. In a scientific paper, the researchers explained that this strategy was motivated by the scarcity and uneven quality of labeled data, and by their belief that data scaling is crucial for generalization. To that end, they trained the foundation model on audio without any supervision, such as transcripts, captions, or attribute labels.
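To make the idea concrete, here is a minimal sketch of one common self-supervised objective for audio: mask some frames of an unlabeled signal and score a model on reconstructing them from context, so the training signal comes from the data itself rather than from human labels. This is an illustrative toy, not Meta’s implementation; all function names, the masking scheme, and the “model” below are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(audio_frames, mask_ratio=0.3):
    """Randomly mask a fraction of frames; the training target is to
    reconstruct the masked frames from the surrounding context."""
    mask = rng.random(len(audio_frames)) < mask_ratio
    corrupted = audio_frames.copy()
    corrupted[mask] = 0.0  # zero out the masked positions
    return corrupted, mask

def reconstruction_loss(predicted, original, mask):
    """Mean squared error over masked positions only -- the
    self-supervised signal, with no human-provided labels."""
    if not mask.any():
        return 0.0
    return float(np.mean((predicted[mask] - original[mask]) ** 2))

# Toy example: 100 frames of "audio" features
frames = rng.standard_normal(100)
corrupted, mask = mask_frames(frames)

# A trivial placeholder "model" that predicts zeros; a real model
# would infer the masked content from the unmasked context.
prediction = np.zeros_like(frames)
loss = reconstruction_loss(prediction, frames, mask)
```

Because the label (the original masked frame) is derived from the audio itself, this style of objective scales to large unlabeled corpora, which is the property the researchers cite as motivating their approach.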
It is worth noting that most leading generative AI models rely heavily on human-generated data for training, and Audiobox is no exception. The FAIR researchers used approximately “160K hours of speech (primarily English), 20K hours of music, and 6K hours of sound samples.” The speech portion spans a wide range of sources, including audiobooks, podcasts, conversations, and in-the-wild recordings, providing diversity across languages and countries. However, the research paper does not explicitly say whether this data came from the public domain or was obtained through other means. Given the ongoing legal challenges AI companies face over training on copyrighted material without consent, clarifying the data’s origins is crucial.
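A quick back-of-the-envelope calculation on the figures quoted above shows how heavily speech dominates the training mix:

```python
# Hours of training data as quoted from the FAIR paper
hours = {"speech": 160_000, "music": 20_000, "sound": 6_000}

total = sum(hours.values())                       # 186,000 hours in all
shares = {k: v / total for k, v in hours.items()}  # fraction per modality
# speech alone accounts for roughly 86% of the corpus
```

This imbalance is unsurprising for a model whose headline capability is voice cloning, but it underlines why the provenance question centers on the speech recordings.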
To demonstrate Audiobox’s capabilities, Meta Platforms has released several interactive demos. One allows users to record themselves speaking a sentence, then replicates their voice: they can input text they want their cloned voice to say and hear it spoken back in their own voice. The AI-generated clone is remarkably similar to the real thing, as confirmed by family members of those who tried it. Meta also lets users create new voices from text descriptions, such as a “deep feminine voice” or a “high-pitched masculine speaker from the U.S.” They can furthermore restyle their recorded voices or generate entirely new sounds from text prompts. For example, typing “dogs barking” produced audio nearly indistinguishable from real dog barks.
For all its capabilities, Audiobox comes with notable limitations and restrictions. Meta Platforms includes a disclaimer stating that the interactive demos are purely for research purposes and cannot be used for commercial endeavors. Usage is also blocked in Illinois and Texas because of state laws governing the collection of audio data. And although Meta has championed open source elsewhere, notably with the release of its Llama 2 family of large language models (LLMs), Audiobox remains a proprietary tool. We have reached out to a Meta spokesperson for more information on whether Audiobox will be open-sourced and will update accordingly.
For now, Audiobox cannot be used commercially and is inaccessible to residents of Illinois and Texas. But with AI progressing at a rapid pace, it is reasonable to expect commercial voice cloning solutions to emerge in the near future. While Meta’s involvement in this field is significant, other companies will likely develop their own voice cloning technology as well. As the technology matures, it presents both exciting possibilities and ethical concerns: the ability to replicate anyone’s voice raises questions about consent, privacy, and copyright. It remains crucial for companies and researchers to address these concerns as voice cloning becomes more widely accessible.