MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank
Google Research
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
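To make the phrase "hierarchical sequence-to-sequence modeling" concrete, the sketch below illustrates the general idea of staged, conditional token generation: a coarse stage is sampled from a text conditioning signal, and a finer stage is sampled conditioned on both the text and the coarse tokens. This is only a structural toy, not the authors' implementation: the function names (`embed_text`, `sample_autoregressive`), vocabulary sizes, sequence lengths, and the random stub scorers are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (assumed, not from the paper): the fine stage
# runs at a higher token rate than the coarse stage.
SEMANTIC_VOCAB, ACOUSTIC_VOCAB = 1024, 4096
SEMANTIC_LEN, ACOUSTIC_LEN = 50, 200


def embed_text(caption: str, dim: int = 128) -> np.ndarray:
    """Stand-in for a text encoder producing a conditioning embedding."""
    seed = abs(hash(caption)) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)


def sample_autoregressive(score_fn, length: int, vocab: int) -> list[int]:
    """Generic next-token sampling loop shared by both stages."""
    tokens: list[int] = []
    for _ in range(length):
        logits = score_fn(tokens)          # stub scorer; a real model would attend to the prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab, p=probs)))
    return tokens


def generate(caption: str) -> list[int]:
    text_emb = embed_text(caption)

    # Stage 1: coarse tokens conditioned on the text embedding only.
    semantic = sample_autoregressive(
        lambda prefix: rng.normal(size=SEMANTIC_VOCAB) + 0.01 * text_emb.sum(),
        SEMANTIC_LEN, SEMANTIC_VOCAB,
    )

    # Stage 2: fine tokens conditioned on the text embedding and the coarse
    # tokens; these would then be decoded to a 24 kHz waveform by a neural codec.
    acoustic = sample_autoregressive(
        lambda prefix: rng.normal(size=ACOUSTIC_VOCAB) + 0.01 * (sum(semantic) % 7),
        ACOUSTIC_LEN, ACOUSTIC_VOCAB,
    )
    return acoustic


if __name__ == "__main__":
    tokens = generate("a calming violin melody backed by a distorted guitar riff")
    print(f"generated {len(tokens)} fine-stage tokens")
```

The point of the hierarchy is that each stage only has to model one level of structure, so long-range consistency (minutes of coherent music) can be handled at the coarse stage while acoustic detail is filled in afterwards.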