Abstract
Language-independent `bag-of-words' representations are
surprisingly effective for text classification. However, the
representation is high dimensional, containing many words that are
not consistent indicators of category. These inconsistent words
reduce the generalization performance of subsequent classifiers,
e.g., through ill-posed principal component transformations. In this
communication our aim is to study the effect of removing the least
relevant words from the bag-of-words representation. We consider a
new approach, using neural-network-based sensitivity maps and
information gain to determine term relevance when pruning
the vocabularies. With the reduced vocabularies, documents are
classified using a latent semantic indexing representation and a
probabilistic neural network classifier. Reducing the bag-of-words
vocabularies by 90%-98%, we find consistent classification
improvement on two mid-size data sets. We also study the
applicability of information gain and sensitivity maps for
automated keyword generation.
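The pipeline described above (rank terms by relevance, prune the vocabulary, project with latent semantic indexing, then classify) can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn, the 20 Newsgroups corpus as a stand-in data set, mutual information as the information-gain ranking, and a nearest-centroid classifier in place of the probabilistic neural network used in the paper; the `keep_fraction` value is likewise only illustrative.

```python
# Hedged sketch of information-gain-based vocabulary pruning followed by LSI.
# Corpus, keep_fraction, and the stand-in classifier are assumptions, not the
# authors' exact setup (the paper uses a probabilistic neural network).
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Labelled text corpus (any document/label collection would do).
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
X_text, y = data.data, data.target

# Full bag-of-words representation.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_text)

# Rank terms by information gain (mutual information with the class label)
# and keep only the top few percent, mirroring the 90%-98% pruning.
keep_fraction = 0.05  # illustrative value
mi = mutual_info_classif(X, y, discrete_features=True)
n_keep = max(1, int(keep_fraction * X.shape[1]))
keep_idx = np.argsort(mi)[-n_keep:]
X_pruned = X[:, keep_idx]

# Latent semantic indexing on the pruned vocabulary (truncated SVD).
lsi = TruncatedSVD(n_components=50)
X_lsi = lsi.fit_transform(X_pruned)

# Stand-in classifier on the LSI features.
X_tr, X_te, y_tr, y_te = train_test_split(X_lsi, y,
                                          test_size=0.3, random_state=0)
clf = NearestCentroid().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

A sensitivity-map ranking would follow the same pattern, with the mutual-information scores replaced by per-term sensitivities of a trained neural network classifier.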
Original language | English
---|---
Title of host publication | Proceedings of the 2004 IEEE International Joint Conference on Neural Networks: special session on machine learning for text mining
Publisher | IEEE Press
Publication date | 2004
Pages | 483-485
Publication status | Published - 2004
Event | 2004 IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 25 Jul 2004 → 29 Jul 2004
Conference
Conference | 2004 IEEE International Joint Conference on Neural Networks
---|---
Country/Territory | Hungary
City | Budapest
Period | 25/07/2004 → 29/07/2004