A ubiquitous source of nuisance in data analysis is missing values, where individual data points are incomplete, with observed parts and missing parts. This nuisance is not new, and a vast literature has developed in the field over the past 50 years. Missing values force the analyst to make choices, either explicitly or implicitly, about how to proceed with a given analysis. These choices are mostly perceived to be of a practical nature, but they often tacitly imply analytical assumptions. The challenges that missing data impose can be handled in a statistically principled manner by marginalizing over the missing data in probabilistic models. Probabilistic generative models with latent variables have proven useful for dimensionality reduction and density modelling and have recently been applied successfully in a diverse set of complex problem domains, with and without missing data. We develop and analyze probabilistic generative models with latent variables in missing data problems and consider their use from several angles. Missing values can be seen as a loss of information, leading to increased uncertainty about the individual data point. This loss of information in turn degrades performance when learning model parameters from a dataset. In an analytically tractable case, the effect of missing values on parameter estimates is investigated. The mechanisms giving rise to the missing data may be arbitrarily complex and potentially entangled with the values of the missing data, had they been observed. A missing mask indicates which data elements are observed or missing, and a model of the mask serves as an approximation to the true missing mechanism. Under some assumptions about the missing model it can be ignored during data analysis, but when these assumptions are unwarranted, ignoring the missing model leads to biased inference and learning.
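As a minimal sketch of what "marginalizing over the missing data" means in an analytically tractable case: for a multivariate Gaussian, the marginal over any subset of coordinates is again Gaussian, so the likelihood of a partially observed point is the Gaussian density restricted to the observed coordinates. The function names and the toy parameters below are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian at x."""
    d = x - mean
    k = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def observed_loglik(x, mask, mean, cov):
    """Log-likelihood of the observed part of x.

    mask[i] is True where x[i] is observed, False where it is missing.
    Marginalizing the Gaussian over the missing coordinates amounts to
    selecting the observed rows/columns of the mean and covariance.
    """
    obs = np.flatnonzero(mask)
    return gaussian_logpdf(x[obs], mean[obs], cov[np.ix_(obs, obs)])

mean = np.zeros(3)
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
x = np.array([0.1, np.nan, -0.4])   # middle coordinate is missing
mask = ~np.isnan(x)
ll = observed_loglik(x, mask, mean, cov)
```

For non-Gaussian latent-variable models this marginal is generally intractable, which is one motivation for the variational machinery discussed below.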
We introduce a modelling approach that utilizes the tools of amortized variational inference to model the observed data and the missing mask jointly. Finally, in supervised learning no distribution over the covariates is typically assumed, and the usual approach of marginalizing over the missing features is therefore not possible. We investigate different methods for handling missing values in the covariates and propose an approach to marginalizing over missing features in a joint model of targets and covariates, while keeping the discriminative model architecture intact.
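The idea of marginalizing over missing covariates while leaving the discriminative model untouched can be sketched with Monte-Carlo averaging: draw the missing features from a covariate model conditioned on the observed ones, and average the discriminative model's predictions over those draws. The stand-in model `f`, the bivariate Gaussian covariate model, and all numbers below are hypothetical illustrations, not the thesis's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Stand-in discriminative model: logistic regression on two covariates."""
    return 1.0 / (1.0 + np.exp(-(x[..., 0] + x[..., 1])))

# Assumed joint covariate model: x ~ N(0, [[1, rho], [rho, 1]]).
rho = 0.8
x0 = 0.5                         # first covariate observed, second missing
# Gaussian conditional: p(x1 | x0) = N(rho * x0, 1 - rho**2)
draws = rng.normal(rho * x0, np.sqrt(1 - rho**2), size=10_000)

# Average predictions over the imputed draws: E[f(x) | x_obs].
xs = np.stack([np.full_like(draws, x0), draws], axis=-1)
pred = f(xs).mean()
```

The discriminative model `f` is called unchanged on completed inputs; only the covariate model decides how the missing entries are filled in, which mirrors keeping the discriminative architecture intact.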
|Publisher||Technical University of Denmark|
|Number of pages||106|
|Publication status||Published - 2020|