Benchmarking Generative Latent Variable Models for Speech

Jakob D. Havtorn, Lasse Borgholt, Søren Hauberg, Jes Frellsen, Lars Maaløe

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


Stochastic latent variable models (LVMs) achieve state-of-the-art performance on natural image generation but are still inferior to deterministic models on speech. In this paper, we develop a speech benchmark of popular temporal LVMs and compare them against state-of-the-art deterministic models. We report the likelihood, which is a widely used metric in the image domain but is rarely reported for speech models, or is reported in ways that cannot be compared. To assess the quality of the learned representations, we also compare their usefulness for phoneme recognition. Finally, we adapt the Clockwork VAE, a state-of-the-art temporal LVM for video generation, to the speech domain. We find that, despite being autoregressive only in latent space, the Clockwork VAE can outperform previous LVMs and reduce the gap to deterministic models by using a hierarchy of latent variables.
Original language: English
Title of host publication: Proceedings of ICLR Workshop on Deep Generative Models for Highly Structured Data
Number of pages: 23
Publication date: 2022
Publication status: Published - 2022
Event: ICLR Workshop on Deep Generative Models for Highly Structured Data - Los Angeles, United States
Duration: 29 Apr 2022 → 29 Apr 2022


Conference: ICLR Workshop on Deep Generative Models for Highly Structured Data
Country/Territory: United States
City: Los Angeles

