Scalable Tensor Factorizations with Missing Data

Evrim Acar, Daniel M. Dunlavy, Tamara G. Kolda, Morten Mørup

    Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


    The problem of missing data is ubiquitous in domains such as biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, and communication networks, all domains in which data collection is subject to occasional errors. Moreover, these data sets can be quite large and have more than two axes of variation, e.g., sender, receiver, time. Many applications in those domains aim to capture the underlying latent structure of the data; in other words, they need to factorize data sets with missing entries. If we cannot address the problem of missing data, many important data sets will be discarded or improperly analyzed. Therefore, we need a robust and scalable approach for factorizing multi-way arrays (i.e., tensors) in the presence of missing data. We focus on one of the most well-known tensor factorizations, CANDECOMP/PARAFAC (CP), and formulate the CP model as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) using a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factor tensors with noise and up to 70% missing data. Moreover, our approach is significantly faster than the leading alternative and scales to larger problems. To show the real-world usefulness of CP-WOPT, we illustrate its applicability on a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes.
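
The weighted least squares formulation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' CP-WOPT implementation; it assumes a three-way tensor `X`, a binary weight tensor `W` (1 for known entries, 0 for missing ones), and factor matrices `A`, `B`, `C` with a shared number of columns. All names here are hypothetical and chosen for illustration. The function returns the objective over known entries together with its gradients, which is the kind of information a first-order optimizer would consume.

```python
import numpy as np


def cp_reconstruct(A, B, C):
    """Build the three-way CP model: M[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)


def weighted_cp_loss_and_grads(X, W, A, B, C):
    """Weighted least squares objective and gradients for a three-way CP model.

    Assumption: W is binary (1 = known entry, 0 = missing entry), so masking
    the residual once is enough for both the loss and the gradients.
    """
    # Residual restricted to the known entries; missing entries contribute zero.
    R = W * (cp_reconstruct(A, B, C) - X)
    loss = 0.5 * np.sum(R ** 2)
    # Gradient with respect to each factor matrix, contracting over the other two modes.
    grad_A = np.einsum("ijk,jr,kr->ir", R, B, C)
    grad_B = np.einsum("ijk,ir,kr->jr", R, A, C)
    grad_C = np.einsum("ijk,ir,jr->kr", R, A, B)
    return loss, (grad_A, grad_B, grad_C)
```

Because the residual is masked by `W`, whatever values sit in the missing positions of `X` have no effect on the loss or the gradients, which matches the idea of modelling only the known entries.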
    Original language: English
    Title of host publication: Proceedings of the 2010 SIAM International Conference on Data Mining
    Publication date: 2010
    Publication status: Published - 2010
    Event: Siam Datamining 2010 (SDM 2010)
    Duration: 1 Jan 2010 → …


    Conference: Siam Datamining 2010 (SDM 2010)
    Period: 01/01/2010 → …


    • missing data, tensor factorization, CANDECOMP/PARAFAC, optimization
