Abstract
Language-independent "bag-of-words" (BOW) representations are
surprisingly effective for text classification. The generic BOW
approach is based on a high-dimensional vocabulary which may
reduce the generalization performance of subsequent classifiers,
e.g., based on ill-posed principal component transformations. In
this communication we study the effect of sensitivity-based
pruning of the bag-of-words representation. We consider
neural-network-based sensitivity maps for determining term
relevance when pruning the vocabularies. With reduced
vocabularies documents are classified using a latent semantic
indexing representation and a probabilistic neural network
classifier. Pruning the vocabularies to approximately 20% of the
original size, we find consistent context recognition enhancement
for two mid-size data sets across a range of training set sizes. We
also study the applicability of the sensitivity measure for
automated keyword generation.
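
The pipeline described above (score each term by sensitivity, keep roughly the top 20%, then project onto a latent semantic indexing basis) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a linear softmax classifier stands in for the neural network, the sensitivity is taken to be the mean absolute derivative of the class posteriors with respect to each term, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_softmax(X, y, n_classes, lr=0.02, n_iter=500):
    # Assumption: a linear softmax model trained by plain gradient descent
    # stands in for the neural network classifier used in the paper.
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                  # one-hot targets
    for _ in range(n_iter):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n           # cross-entropy gradient step
    return W


def term_sensitivity(X, W):
    # Assumed sensitivity measure: average over documents and classes of
    # |d p_k / d x_j|, where d p_k / d x_j = p_k * (W_jk - sum_c p_c W_jc).
    P = softmax(X @ W)                        # (n_docs, n_classes)
    avg = P @ W.T                             # (n_docs, n_terms)
    grad = P[:, None, :] * (W[None, :, :] - avg[:, :, None])
    return np.abs(grad).mean(axis=(0, 2))     # one score per term


def prune_vocabulary(sens, keep_fraction=0.2):
    # Indices of the most sensitive terms, returned in ascending order.
    n_keep = max(1, int(round(keep_fraction * sens.size)))
    return np.sort(np.argsort(sens)[::-1][:n_keep])


def lsi(X, n_components):
    # Latent semantic indexing: project documents onto the leading
    # right singular vectors of the document-term matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T


# Synthetic document-term counts: 20 terms, of which only 0-3 carry class signal.
rng = np.random.default_rng(0)
n_docs, n_terms = 200, 20
y = rng.integers(0, 2, n_docs)
X = rng.poisson(1.0, size=(n_docs, n_terms)).astype(float)
X[:, 0:2] += 3.0 * (y == 0)[:, None]          # terms 0, 1 mark class 0
X[:, 2:4] += 3.0 * (y == 1)[:, None]          # terms 2, 3 mark class 1

W = fit_softmax(X, y, n_classes=2)
sens = term_sensitivity(X, W)
keep = prune_vocabulary(sens, keep_fraction=0.2)
Z = lsi(X[:, keep], n_components=2)           # features for a downstream classifier
print("kept term indices:", keep)
```

The highest-scoring entries of `sens` can also be read off directly as candidate keywords, which is one simple way to interpret the keyword-generation use mentioned above.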
Original language | English |
---|---|
Title of host publication | Proceedings of 17th International Conference on Pattern Recognition (ICPR 2004) |
Volume | 2 |
Publication date | 2004 |
Pages | 483-486 |
Publication status | Published - 2004 |
Event | 17th International Conference on Pattern Recognition, Cambridge, United Kingdom. Duration: 26 Aug 2004 → 26 Aug 2004. Conference number: 17 |