Vocabulary Pruning for Improved Context Recognition

Rasmus Elsborg Madsen, Sigurdur Sigurdsson, Lars Kai Hansen, Jan Larsen

    Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review



    Language-independent `bag-of-words' representations are surprisingly effective for text classification. The representation is high dimensional, however, containing many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach, using neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-sized data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
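    As a rough illustration of one of the two relevancy criteria the abstract names, the sketch below computes the information gain of each term in a binary document-term matrix and keeps only the highest-scoring terms. This is a generic information-gain computation, not the authors' implementation; the toy matrix, the variable names, and the choice of binary term presence are assumptions for illustration.

    ```python
    import numpy as np

    def information_gain(X, y):
        """Information gain of each binary term-presence feature w.r.t. class labels.

        X: (n_docs, n_terms) binary term-presence matrix
        y: (n_docs,) integer class labels
        Returns an (n_terms,) array of gains in bits.
        """
        def entropy(labels):
            if len(labels) == 0:
                return 0.0
            p = np.bincount(labels) / len(labels)
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        n_docs, n_terms = X.shape
        h_y = entropy(y)  # class entropy before observing any term
        gains = np.zeros(n_terms)
        for t in range(n_terms):
            present = X[:, t] > 0
            p_present = present.mean()
            # expected class entropy after observing whether term t occurs
            h_cond = (p_present * entropy(y[present])
                      + (1 - p_present) * entropy(y[~present]))
            gains[t] = h_y - h_cond
        return gains

    # toy corpus: 4 documents, 3 terms, 2 classes (hypothetical data)
    X = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0]])
    y = np.array([0, 0, 1, 1])

    gains = information_gain(X, y)
    # prune the vocabulary: keep only the single most informative term
    keep = np.argsort(gains)[::-1][:1]
    X_pruned = X[:, keep]
    ```

    In this toy example, terms 0 and 1 each separate the classes perfectly (gain of 1 bit), while term 2 is uninformative (gain of 0), so pruning discards it. In the paper's pipeline, the reduced matrix would then feed a latent semantic indexing step before classification.
    
    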
    Original language: English
    Title of host publication: Proceedings of the International Joint Conference on Neural Networks: special session on machine learning for text mining
    Publisher: IEEE Press
    Publication date: 2004
    Publication status: Published - 2004

