Abstract
The focus of this thesis is a puzzling capability of deep neural networks to "work well" on previously unseen data, i.e., to generalize well. This is a long-standing open problem in machine learning, with links to the cognitive abilities of biological neural networks. The thesis consists of an introductory chapter, three theoretical chapters addressing the main topic in increasing depth and complexity, and two complementary papers. After the introductory chapter sets the scene, the thesis starts by presenting a novel perspective on the evolution of the eigenspectrum of a general loss function during gradient descent (GD) optimization of the neural network model parameters. The following chapters specialize to a broad class of Bregman divergence losses, which includes common deep learning (DL) objectives. Several novel theoretical and experimental results are presented, including the formulation of the so-called Self Regularized Bregman Objective (SeReBrO), its equivalence to stochastic gradient descent optimizing Bregman losses, and a proof of the presence of a latent regularizer.
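For readers unfamiliar with the term, the Bregman divergence generated by a strictly convex, differentiable function has the standard textbook form below; the notation is illustrative and not taken from the thesis itself.

```latex
% Standard definition of the Bregman divergence generated by a strictly
% convex, differentiable function \phi (illustrative, not the thesis's notation).
\[
  D_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle .
\]
% Common DL objectives arise as special cases, e.g. \phi(x) = \|x\|^2 gives the
% squared Euclidean loss, and \phi(p) = \sum_i p_i \log p_i gives the
% Kullback--Leibler divergence.
```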
Next, the following fundamental research question is addressed: Could small gradient noise, possibly negligible when updating individual weight parameters via mini-batch backpropagation, have a significant effect by imposing a large model-variance prior through its non-negligible norm, which concentrates away from zero in high-dimensional (overparameterized) settings?
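The intuition behind this question can be illustrated with a standard concentration argument for the isotropic case; the notation below is illustrative and not taken from the thesis.

```latex
% Illustration (isotropic case, notation not from the thesis): why noise that is
% negligible per coordinate can still have a norm far from zero.
\[
  \varepsilon \sim \mathcal{N}\!\left(0, \sigma^2 I_d\right)
  \quad\Longrightarrow\quad
  \mathbb{E}\,\|\varepsilon\|^2 = d\,\sigma^2 ,
  \qquad
  \frac{\|\varepsilon\|}{\sigma\sqrt{d}} \;\xrightarrow{\;p\;}\; 1
  \quad \text{as } d \to \infty .
\]
% Even though each coordinate is only O(\sigma), the norm concentrates around
% \sigma\sqrt{d}, which grows with the number of parameters d.
```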
The main contribution of this thesis is a positive answer to this question, obtained in a sequence of new theoretical results for isotropic gradient noise and further generalized to an arbitrary gradient noise covariance matrix. It is shown that the generalization impact of overparameterization and noise is limited by the rank of the noisy gradient covariance matrix. This result is put into perspective, shedding light on existing experimental and theoretical challenges in generalization in deep learning, and it is demonstrated to provide a practical tool leading to better-generalizing models. An experiment on denoising auto-encoders (DAE) is presented in which the recommended explicit regularizers are replaced by stochasticity and overparameterization to boost the rank of the gradient noise covariance matrix (see the sketch below), along the lines of this thesis, and, contrary to previous expectations, the resulting models are shown to generalize well.

The thesis concludes with a paper on unsupervised Bayesian learning on graphs, making use of generative modeling of uncertainty that allows for inference, verified on real-world data sets and images. The experimental section demonstrates that Bayesian Cut outperforms popular spectral and modularity-based methods and offers itself as their probabilistic alternative. The Bayesian Cut source code has been made available.

Overall, the thesis benefits from taking a probabilistic approach, both when addressing deep learning in the first part and unsupervised learning in the Bayesian Cut. Intriguingly, a deeper link emerges in the form of hypergeometric functions, which in both cases provide the solution to the underlying mathematical problems.
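As referenced above, the following is a minimal numerical sketch of how the rank of the gradient noise covariance could be estimated from sampled mini-batch gradients; the function name and data are hypothetical, not code from the thesis or the Bayesian Cut repository.

```python
# Minimal sketch (hypothetical): estimate the numerical rank of the mini-batch
# gradient noise covariance from flattened per-batch gradients.
import numpy as np

def noise_covariance_rank(batch_grads, tol=1e-8):
    """batch_grads: array of shape (n_batches, n_params), one flattened
    mini-batch gradient per row. Returns the numerical rank of the empirical
    covariance of the gradient noise (deviations from the mean gradient)."""
    deviations = batch_grads - batch_grads.mean(axis=0, keepdims=True)
    # Singular values of the deviation matrix give the covariance spectrum
    # without forming the (n_params x n_params) covariance explicitly.
    svals = np.linalg.svd(deviations, compute_uv=False)
    return int(np.sum(svals > tol * svals.max()))

# Synthetic example: with n_batches gradient samples the empirical rank is
# capped at n_batches - 1, however overparameterized (large n_params) the model.
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10_000))
print(noise_covariance_rank(grads))  # at most 31
```

Note that the cap of n_batches - 1 is a property of the empirical estimate from finitely many sampled gradients, which is one way to see why the degree of stochasticity, rather than the parameter count alone, governs the rank one can observe.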
| Original language | English |
|---|---|
| Publisher | Technical University of Denmark |
| Number of pages | 149 |
| Publication status | Published - 2022 |
Projects
1 Finished
- Federated deep learning for privacy preserving mobile data modelling
Taborsky, P. (PhD Student), Jenssen, R. (Examiner), Tan, Z.-H. (Examiner), Hansen, L. K. (Main Supervisor), Nielsen, F. Å. (Supervisor) & Schmidt, M. N. (Examiner)
01/01/2018 → 14/12/2022
Project: PhD