Pruning the vocabulary for better context recognition

Rasmus Elsborg Madsen, Sigurdur Sigurdsson, Lars Kai Hansen, Jan Larsen

    Research output: Chapter in Book/Report/Conference proceeding > Article in proceedings > Research > peer-review



    Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, though, and contains many words that are inconsistent for text categorization. These inconsistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
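    The information-gain half of the pruning idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard textbook definition of information gain over a binary "term present / term absent" feature, whitespace tokenization, and hypothetical helper names (`entropy`, `information_gain`, `prune_vocabulary`) and toy documents invented for illustration. The sensitivity-map criterion, the latent semantic indexing step, and the probabilistic neural network classifier are not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {term: IG}, treating each term as a binary present/absent feature.

    IG(t) = H(C) - [P(t) H(C | t present) + P(not t) H(C | t absent)]
    """
    n = len(docs)
    class_counts = Counter(labels)
    classes = sorted(class_counts)
    h_classes = entropy([class_counts[c] for c in classes])

    # Assumption: lowercase whitespace tokenization; one set of terms per document.
    doc_terms = [set(d.lower().split()) for d in docs]
    vocabulary = set().union(*doc_terms)

    gains = {}
    for term in vocabulary:
        present = Counter(
            label for terms, label in zip(doc_terms, labels) if term in terms
        )
        n_present = sum(present.values())
        n_absent = n - n_present
        absent = [class_counts[c] - present[c] for c in classes]
        h_conditional = (
            (n_present / n) * entropy([present[c] for c in classes])
            + (n_absent / n) * entropy(absent)
        )
        gains[term] = h_classes - h_conditional
    return gains


def prune_vocabulary(gains, keep_fraction):
    """Keep the top `keep_fraction` of terms ranked by information gain."""
    k = max(1, round(len(gains) * keep_fraction))
    ranked = sorted(gains, key=lambda t: (-gains[t], t))  # ties broken alphabetically
    return ranked[:k]


# Toy corpus (hypothetical): two classes, two documents each.
docs = [
    "the goal was scored",
    "the match ended",
    "the stock rose",
    "the market fell",
]
labels = ["sport", "sport", "finance", "finance"]

gains = information_gain(docs, labels)
kept = prune_vocabulary(gains, keep_fraction=0.5)
print(kept)
```

    In this toy corpus the word "the" occurs in every document, so it carries no class information, receives an information gain of zero, and is among the terms pruned. Keeping only 2%-10% of the vocabulary, as in the abstract, would correspond to `keep_fraction` values between 0.02 and 0.10 on a realistically sized corpus.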
    Original language: English
    Title of host publication: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.
    Publication date: 2004
    ISBN (Print): 0-7695-2128-2
    Publication status: Published - 2004
    Event: 17th International Conference on Pattern Recognition - Cambridge, United Kingdom
    Duration: 26 Aug 2004 to 26 Aug 2004
    Conference number: 17


    Conference: 17th International Conference on Pattern Recognition
    Country/Territory: United Kingdom

    Bibliographical note

    Copyright: 2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

