AudioGen: Textually Guided Audio Generation

Felix Kreuk¹, Gabriel Synnaeve¹, Adam Polyak¹, Uriel Singer¹, Alexandre Défossez¹
Jade Copet¹, Devi Parikh¹, Yaniv Taigman¹, Yossi Adi¹,²

¹FAIR Team, Meta AI

²The Hebrew University of Jerusalem

[paper] [code]

Abstract

We tackle the problem of generating audio samples conditioned on descriptive text captions. We propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs and operates on a learnt discrete audio representation. Text-to-audio generation poses multiple challenges. Because of the way audio travels through a medium, differentiating "objects" can be difficult (e.g., separating multiple people speaking simultaneously). Real-world recording conditions (e.g., background noise and reverberation) complicate this further. Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate these challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. To handle the scarcity of text-audio data points, we curated 10 datasets containing different types of audio and text annotations. For faster inference, we explore multi-stream modeling, which allows shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to the text. Compared to the evaluated baselines, AudioGen performs better on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally.
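The mixing augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says that different audio samples are mixed; the signal-to-noise scaling, the truncation to the shorter clip, and the joining of the two captions below are assumptions made for illustration, and `mix_examples` is a hypothetical helper name.

```python
import math

def rms(signal):
    """Root-mean-square level of a non-empty list of float samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_examples(audio_a, text_a, audio_b, text_b, snr_db=0.0):
    """Mix two (audio, caption) pairs into one training example.

    Assumptions (not taken from the page): both clips are truncated to the
    shorter length, audio_b is rescaled so that audio_a sits snr_db decibels
    above it, and the two captions are joined with a space.
    """
    n = min(len(audio_a), len(audio_b))
    if n == 0:
        raise ValueError("both clips must contain at least one sample")
    a, b = audio_a[:n], audio_b[:n]
    rms_a, rms_b = rms(a), rms(b)
    # Gain chosen so that rms(a) / rms(gain * b) == 10 ** (snr_db / 20).
    gain = rms_a / (rms_b * 10 ** (snr_db / 20)) if rms_b > 0 else 0.0
    mixed = [x + gain * y for x, y in zip(a, b)]
    return mixed, f"{text_a} {text_b}"
```

Training on such mixtures exposes the model to captions that describe several simultaneous sources, which is the stated motivation for the augmentation.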

Samples: comparison to prior work

Text description | AudioGen-large with mixing | AudioGen-base with mixing | AudioGen-base no mixing | DiffSound | Ground Truth

a man speaks as birds chirp and dogs bark

whistling with wind blowing

male speaking with many people cheering in background

a man is speaking while typing on a keyboard

sirens and a humming engine approach and pass

male speech with horns honking in the background

drums and music playing with a man speaking

beep then male speaking multiple times

a cat meowing and young female speaking

a man speaking followed by another man speaking in the background as a motorcycle engine runs idle

a duck quacking as birds chirp and a pigeon cooing

a baby continuously crying

continuous laughter and chuckling

Classifier-free guidance

Text description | gamma=1 | gamma=2 | gamma=3 | gamma=4 | gamma=5

a man speaks as birds chirp and dogs bark

whistling with wind blowing

male speaking with many people cheering in background

a man is speaking while typing on a keyboard

sirens and a humming engine approach and pass

male speech with horns honking in the background

drums and music playing with a man speaking

beep then male speaking multiple times

a cat meowing and young female speaking

a man speaking followed by another man speaking in the background as a motorcycle engine runs idle

a duck quacking as birds chirp and a pigeon cooing

a baby continuously crying

continuous laughter and chuckling
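The effect of the gamma values compared above can be sketched with a common formulation of classifier-free guidance for auto-regressive models, in which the next-token log-probabilities with and without the text condition are combined linearly. The exact rule used by AudioGen is not given on this page, so the formula and the function names below are assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def guided_distribution(cond_logits, uncond_logits, gamma):
    """Next-token probabilities under classifier-free guidance.

    Assumed rule: gamma * log p(token | text) + (1 - gamma) * log p(token),
    renormalized. gamma = 1 recovers the text-conditional distribution;
    larger gamma pushes probability toward tokens the text makes more likely.
    """
    log_p_cond = log_softmax(cond_logits)
    log_p_uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u
             for c, u in zip(log_p_cond, log_p_uncond)]
    return [math.exp(lp) for lp in log_softmax(mixed)]
```

For example, calling `guided_distribution([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], gamma)` with gamma set to 1 and then 3 shows the first token's probability increasing as gamma grows.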

Audio continuation: 1-second audio prompt + text/no-text

Text description | no text | text condition | random audio prompt + text condition

thundering sounds while rain pours

a man is speaking while typing on a keyboard

speech and a goat bleating

subway train blowing its horn.

a baby continuously crying

a bird cawing followed by an infant crying

a crowd applauds followed by a woman and a man speaking

a faint siren followed by a man speaking and wooing
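The continuation settings compared above can be sketched as an auto-regressive loop over discrete audio tokens that starts from the tokens of a short prompt. This is an illustrative sketch only: `next_token` is a hypothetical stand-in for a trained model, `toy_next_token` exists only to make the example runnable, and the number of tokens per second is an assumed parameter.

```python
def first_second(tokens, tokens_per_second):
    """Keep only the tokens covering the first second of a clip."""
    return list(tokens[:tokens_per_second])

def continue_audio(prompt_tokens, next_token, total_len, text=None):
    """Extend prompt_tokens to total_len tokens, one token at a time.

    next_token(history, text) is a hypothetical model call. Passing
    text=None corresponds to the "no text" setting; passing a caption
    corresponds to the text-conditioned setting.
    """
    tokens = list(prompt_tokens)
    while len(tokens) < total_len:
        tokens.append(next_token(tokens, text))
    return tokens

def toy_next_token(history, text):
    """Toy stand-in for a model: repeats the last token without text,
    otherwise emits a token derived from the caption length."""
    if text is None:
        return history[-1]
    return len(text) % 10
```

Under this sketch, the "random audio prompt + text condition" setting is the same call with a prompt taken from an unrelated clip.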

Multi-stream modeling

Text description | 1 stream | 2 streams | 4 streams

typing on a typewriter

the siren of an emergency vehicle sounds

the rhythmic and repeated ticktock of a clock

several gunshots firing followed by two men talking then music playing

railroad crossing signal followed by a train passing and blowing horn

pigeons coo with some rustling
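The trade-off behind multi-stream modeling, shorter sequences at a similar bitrate, can be illustrated with simple arithmetic. The sketch below assumes that every token carries log2(codebook size) bits and that the streams are predicted in parallel at each step; the clip duration, bitrate, and codebook size in the usage note are illustrative values, not figures from this page.

```python
import math

def sequence_length(duration_s, bitrate_bps, codebook_size, num_streams):
    """Number of auto-regressive steps needed at a fixed bitrate.

    Assumption: each token carries log2(codebook_size) bits, and
    num_streams tokens are emitted in parallel per step, so the step
    count shrinks by a factor of num_streams.
    """
    bits_per_token = math.log2(codebook_size)
    tokens_per_second = bitrate_bps / bits_per_token
    steps_per_second = tokens_per_second / num_streams
    return round(duration_s * steps_per_second)
```

With an assumed 10-second clip, a 500 bit/s budget, and a codebook of 1024 entries, `sequence_length` gives 500 steps for 1 stream, 250 for 2 streams, and 125 for 4 streams.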
