AudioGen: Textually Guided Audio Generation

Felix Kreuk1, Gabriel Synnaeve1, Adam Polyak1, Uriel Singer1, Alexandre Défossez1
Jade Copet1, Devi Parikh1, Yaniv Taigman1, Yossi Adi1,2

1FAIR Team, Meta AI

2The Hebrew University of Jerusalem

[paper] [code]

Abstract

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise and reverberation). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate these challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen performs better on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally.

Samples


Architecture

Samples: comparison to prior work

desc | AudioGen-large with mixing | AudioGen-base with mixing | AudioGen-base no mixing | DiffSound | Ground Truth
a man speaks as birds chirp and dogs bark
whistling with wind blowing
male speaking with many people cheering in background
a man is speaking while typing on a keyboard
sirens and a humming engine approach and pass
male speech with horns honking in the background
drums and music playing with a man speaking
beep then male speaking multiple times
a cat meowing and young female speaking
a man speaking followed by another man speaking in the background as a motorcycle engine runs idle
a duck quacking as birds chirp and a pigeon cooing
a baby continuously crying
continuous laughter and chuckling
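
The "with mixing" and "no mixing" columns refer to the augmentation described in the abstract, which mixes different audio samples during training. A minimal sketch of that idea follows, assuming two waveforms are overlaid at a chosen signal-to-noise ratio and their captions are joined with "and". The function names, the SNR parameter, and the caption-joining rule are illustrative assumptions, not taken from the AudioGen code.

```python
import math


def rms(wav):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in wav) / len(wav))


def mix_pair(wav_a, text_a, wav_b, text_b, snr_db):
    """Overlay wav_b on wav_a at the requested SNR and merge the captions.

    Hypothetical helper: wav_a is treated as the reference signal, and wav_b
    is rescaled so that rms(wav_a) / rms(scaled wav_b) matches snr_db.
    """
    # Trim both clips to a common length so they can be summed sample-wise.
    n = min(len(wav_a), len(wav_b))
    a, b = wav_a[:n], wav_b[:n]

    # Gain applied to b so the mixture has the requested SNR (in dB).
    gain = rms(a) / (rms(b) * 10 ** (snr_db / 20) + 1e-8)
    mixed = [x + gain * y for x, y in zip(a, b)]

    # Assumed caption rule: describe both sources in one sentence.
    return mixed, f"{text_a} and {text_b}"


wav, caption = mix_pair(
    [1.0, -1.0, 1.0, -1.0], "a dog barks",
    [0.5, 0.5, -0.5, -0.5], "rain falls",
    snr_db=0.0,
)
print(caption, len(wav))
```

Training on such mixtures exposes the model to captions that name several simultaneous sources, which is the setting most of the descriptions in the table above fall into.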

Classifier-free guidance

An ablation study on the effect of the guidance scale parameter.
desc | gamma=1 | gamma=2 | gamma=3 | gamma=4 | gamma=5
a man speaks as birds chirp and dogs bark
whistling with wind blowing
male speaking with many people cheering in background
a man is speaking while typing on a keyboard
sirens and a humming engine approach and pass
male speech with horns honking in the background
drums and music playing with a man speaking
beep then male speaking multiple times
a cat meowing and young female speaking
a man speaking followed by another man speaking in the background as a motorcycle engine runs idle
a duck quacking as birds chirp and a pigeon cooing
a baby continuously crying
continuous laughter and chuckling
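
To make the role of gamma concrete, here is a minimal sketch of one common formulation of classifier-free guidance for an auto-regressive model over discrete tokens: the next-token log-probabilities with and without the text condition are combined as gamma * log p(token | text) + (1 - gamma) * log p(token | no text). The function names and toy logits are illustrative assumptions, not the AudioGen implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional next-token distributions.

    gamma = 1 recovers the purely text-conditional distribution; larger
    values push probability mass toward tokens the text condition favours.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(cond, uncond)]
    # Renormalise so the result is a valid log-probability distribution.
    return log_softmax(mixed)


# Toy example with a 3-token vocabulary: the text condition prefers token 0.
cond_logits = [2.0, 0.0, 0.0]
uncond_logits = [0.0, 0.0, 0.0]
for gamma in (1, 2, 3, 4, 5):
    p0 = math.exp(guided_log_probs(cond_logits, uncond_logits, gamma)[0])
    print(gamma, round(p0, 3))
```

In this toy setting the probability of the text-preferred token grows with gamma, which mirrors the trade-off the ablation explores: stronger adherence to the description at the cost of less diverse samples.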

Audio continuation: 1 second audio prompt + text/no-text

Examples of audio continuation given 1-second audio prompts and various text conditioning settings. Under the 'random audio prompt + text condition' column, we condition the model on a random audio prompt together with the text condition under the 'desc' column.
desc | no text | text condition | random audio prompt + text condition
thundering sounds while rain pours
a man is speaking while typing on a keyboard
speech and a goat bleating
subway train blowing its horn.
a baby continuously crying
a bird cawing followed by an infant crying
a crowd applauds followed by a woman and a man speaking
a faint siren followed by a man speaking and wooing

Multi-stream modeling

desc | 1 stream | 2 streams | 4 streams
typing on a typewriter
the siren of an emergency vehicle sounds
the rhythmic and repeated ticktock of a clock
several gunshots firing followed by two men talking then music playing
railroad crossing signal followed by a train passing and blowing horn
pigeons coo with some rustling
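
The abstract states that multi-stream modeling allows shorter sequences at a similar bitrate. The arithmetic behind that trade-off can be sketched as follows, assuming each stream emits one token per frame from a codebook of fixed size. The frame rates, codebook size, and clip duration below are hypothetical values chosen for illustration, not figures reported for AudioGen.

```python
import math


def bitrate_bps(frame_rate_hz, n_streams, codebook_size):
    """Bits per second when each stream emits one token per frame."""
    return frame_rate_hz * n_streams * math.log2(codebook_size)


def sequence_length(duration_s, frame_rate_hz):
    """Number of time steps the auto-regressive model must generate."""
    return int(duration_s * frame_rate_hz)


# Hypothetical settings: halving the frame rate each time the number of
# parallel streams doubles keeps the bitrate fixed.
for n_streams, frame_rate in ((1, 50.0), (2, 25.0), (4, 12.5)):
    print(
        n_streams,
        bitrate_bps(frame_rate, n_streams, codebook_size=1024),
        sequence_length(10, frame_rate),
    )
```

Under these assumptions all three settings carry the same number of bits per second, while the number of generation steps for a 10-second clip drops in proportion to the number of streams.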