Improving information extraction using a probability-based approach

S. Kim, Saeema Ahmed, K. Wallace

    Research output: Contribution to journal > Journal article > Research > peer-review


    Information plays a crucial role during the entire life-cycle of a product. It has been shown that engineers frequently consult colleagues to obtain the information they require to solve problems. However, the industrial world is now more transient, and key personnel move to other companies or retire. It is becoming essential to retrieve vital information from archived product documents, if it is available. There is, therefore, great interest in ways of extracting relevant and sharable information from documents. A keyword-based search is commonly used, but studies have shown that these searches often prove unsuccessful. Searches can be improved if domain entities of interest, e.g., 'gas turbine', are explicitly associated with their types, i.e., a gas turbine is a type of engine. This reduces the ambiguity that arises when the same entity is referred to in various different ways. It would be helpful to compile a full list of entities associated with the relevant types before identifying them in texts. However, because entities are referred to in texts in various ways, manually defined identification rules tend to produce high precision but low recall. To increase recall while maintaining high precision, a learning approach that makes identification decisions based on a probability model, rather than simply looking up the presence of pre-defined variations, looks promising. This paper presents the results of developing such a probability-based entity-identification approach. Tests show that the proposed approach improves recall from 53% to 80%, with comparable precision.
    Original language: English
    Journal: Strojniski Vestnik
    Issue number: 7-8
    Pages (from-to): 429-441
    Publication status: Published - 2007


    • probability methods
    • natural language processing
    • name entity identification
    • taxonomy
    • information searches

