Information plays a crucial role throughout the entire life cycle of a product. It has been shown that engineers frequently consult colleagues to obtain the information they require to solve problems. However, the industrial world is now more transient, and key personnel move to other companies or retire. It is therefore becoming essential to retrieve vital information from archived product documents, where it is available, and there is great interest in ways of extracting relevant and sharable information from documents. Keyword-based search is commonly used, but studies have shown that such searches often prove unsuccessful. Searches can be improved if domain entities of interest, e.g., 'gas turbine', are explicitly associated with their types, i.e., a gas turbine is a type of engine, thus reducing the ambiguity that arises when the same entity is expressed in various ways. It would be helpful to compile a full list of entities associated with the relevant types before identifying them in texts. However, because entities are referred to in many different ways in texts, manually defined identification rules tend to produce high precision but low recall. To increase recall while maintaining high precision, a learning approach that makes identification decisions based on a probability model, rather than simply looking up pre-defined variations, looks promising. This paper presents the results of developing such a probability-based entity-identification approach. Tests show that the proposed approach achieves improved recall, rising from 53% to 80%, with comparable precision.
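As a reading aid for the precision and recall figures quoted above, the following minimal sketch shows how the two measures are conventionally computed for an entity identifier. The function name `precision_recall` and the toy entity sets are hypothetical illustrations, not data or code from the paper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted vs. gold-standard entity mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of identified mentions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard mentions that were identified.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: a fixed lookup list finds only the exact variants it
# knows, so everything it returns is correct but half the mentions are missed.
gold_mentions = {"gas turbine", "turbine engine", "jet engine", "compressor blade"}
lookup_hits = {"gas turbine", "compressor blade"}

p, r = precision_recall(lookup_hits, gold_mentions)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the lookup achieves perfect precision with only half the recall, which is the pattern the abstract attributes to manually defined identification rules.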
Publication status: Published - 2007
- probability methods
- natural language processing
- name entity identification
- information searches