## Abstract

Classification of multivariate observations into two or more populations based on a mixture of categorical and continuous variables is a problem often solved by transforming the variables to be all either continuous or categorical and then applying a classification method. We deal with the problem of classifying observations into two populations with binary and continuous variables, using the ratio of two decomposable tree-structured conditional Gaussian (CG) densities as the classification rule, where the tree structure and density for each population are estimated independently. The simplicity of CG densities with a tree structure alleviates the need for large sample sizes, whereas the decomposability property ensures the existence of analytic expressions for the maximum likelihood estimators of the CG distribution and allows the use of a modified version of Kruskal's algorithm to find the minimum spanning tree for structure estimation. Since feature selection often improves the classification performance of some methods, a stepwise procedure based on the cross-entropy loss is also proposed. We compare the empirical performance of the proposed method with that of other methods, classical and modern, using test error rates for a real data set and for simulated samples of different sizes from a CG density in each population. On the real data, the method ranked fourth among the methods compared. In the simulation, the proposed method was able to recover the structure of the CG densities from which the samples were generated and produced the lowest error rate; it was also observed that the error rates of all the methods were substantially larger than the population Bayes error for small sample sizes. The results suggest that the ratio of two tree-structured CG densities is a good method, sufficiently fast computationally, that is worth considering for the classification of observations with mixtures of variables.

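The spanning-tree step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes edge weights (in practice, estimated pairwise mutual informations for one population) are already given, and it uses plain Kruskal's algorithm with union–find; maximising these weights is equivalent to the minimum-spanning-tree formulation with negated weights.

```python
def kruskal_max_spanning_tree(n_vars, weighted_edges):
    """Return the edges of a maximum-weight spanning tree over n_vars nodes.

    weighted_edges: list of (weight, u, v) tuples, one per variable pair.
    Hypothetical sketch of the structure-estimation step; real weights
    would be mutual informations estimated from the mixed sample.
    """
    parent = list(range(n_vars))

    def find(x):  # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Greedily take the heaviest edge that joins two separate components.
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
        if len(tree) == n_vars - 1:
            break
    return tree


# Toy example with made-up mutual-information weights for 4 variables.
edges = [(0.9, 0, 1), (0.1, 0, 2), (0.8, 1, 2), (0.7, 2, 3), (0.2, 1, 3)]
print(sorted(kruskal_max_spanning_tree(4, edges)))  # → [(0, 1), (1, 2), (2, 3)]
```

The resulting tree then fixes which pairwise terms enter the decomposable CG density for that population.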

| Original language | English |
|---|---|
| Publisher | Technical University of Denmark |
| Number of pages | 24 |
| Publication status | Published - 2024 |
| Series | DTU Compute Technical Report |
| ISSN | 1601-2321 |

## Keywords

- Classification methods
- Conditional Gaussian distribution
- Decomposable tree graphs
- Deep neural networks
- LassoNet
- Logistic regression
- Mixed graphical models
- Mixed data