Abstract
A representative subset of protein chains was selected from the CATH 2.4 database [C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH - a hierarchic classification of protein domain structures, Structure 5 (8) (1997) 1093-1108] and used to train a feed-forward neural network to predict protein fold classes. The network takes the dipeptide frequency matrix as input and produces as output a novel representation of the protein chains in R^30 space, based on knot invariant values [P. Røgen, B. Fain, Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences of the United States of America 100 (1) (2003) 119-124; P. Røgen, H.G. Bohr, A new family of global protein shape descriptors, Mathematical Biosciences 182 (2) (2003) 167-181]. In the general case, when singletons (proteins that are the unique members of a topology or sequence homology set) are excluded, the prediction success rates were 77% at the class level, 60% for architecture, and 48% for topology. The present data set includes about 500 fold classes, ten times the number reported in earlier attempts, so this result represents an improvement on previous work, which reported on a few handpicked folds. Furthermore, distance analysis of the network outputs for singletons shows that novel topologies can be detected with very high confidence (∼85%); in these cases the network can be used as a sorting mechanism that identifies sequences which might need special attention. Such distance analysis also provides a direct measure of prediction confidence.
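As an illustration of the two quantities named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes that the dipeptide frequency matrix is the 20 x 20 table of adjacent residue pair counts normalised by the total number of pairs, that non-standard residues are skipped, and that the distance analysis uses the Euclidean distance from a network output to the nearest reference vector. The function names `dipeptide_frequencies` and `nearest_distance` are hypothetical; the abstract does not specify any of these details.

```python
import math

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_frequencies(sequence):
    """Return a 20x20 matrix of relative frequencies of adjacent residue pairs.

    Assumption: pairs containing a non-standard residue are skipped, and
    counts are normalised by the number of pairs that were counted.
    """
    counts = [[0.0] * 20 for _ in range(20)]
    total = 0
    seq = sequence.upper()
    for first, second in zip(seq, seq[1:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first]][INDEX[second]] += 1
            total += 1
    if total == 0:
        return counts  # all zeros for empty or too-short sequences
    return [[c / total for c in row] for row in counts]


def nearest_distance(output, references):
    """Euclidean distance from an output vector to the closest reference vector.

    Assumption: a large value would suggest a sequence unlike any known fold;
    any threshold for "novel" would have to be chosen empirically.
    """
    return min(math.dist(output, ref) for ref in references)
```

Flattening the 20 x 20 matrix row by row would give a 400-element input vector, and `nearest_distance` could be applied to the 30-dimensional network outputs; both uses are assumptions about how such a pipeline might be wired together.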
Original language | English |
---|---|
Journal | Mathematical and Computer Modelling |
Volume | 43 |
Issue number | 3-4 |
Pages (from-to) | 401-412 |
ISSN | 0895-7177 |
Publication status | Published - 2006 |
Keywords
- Proteins
- Fold class prediction
- CATH
- Neural networks