Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus

Mohamad Mehdi, Chitu Okoli, Mostafa Mesgari, Finn Årup Nielsen, Arto Lanamäki

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Although Wikipedia is primarily an encyclopedia, its expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
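
As background for the extraction tools the review classifies, the sketch below illustrates one common, lightweight way to pull article text out of Wikipedia: a plain-text extract fetched through the public MediaWiki API (the prop=extracts endpoint of the TextExtracts extension). It is an illustrative assumption, not code or a tool taken from the article; the helper name fetch_plaintext and the example title are made up for the sketch.

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API endpoint

    def fetch_plaintext(title: str) -> str:
        """Return the plain-text extract of a Wikipedia article (hypothetical helper)."""
        params = {
            "action": "query",
            "prop": "extracts",    # TextExtracts extension
            "explaintext": 1,      # strip wiki markup and HTML
            "titles": title,
            "format": "json",
            "formatversion": 2,    # pages returned as a list, simpler to parse
        }
        response = requests.get(API_URL, params=params, timeout=10)
        response.raise_for_status()
        pages = response.json()["query"]["pages"]
        return pages[0].get("extract", "")

    if __name__ == "__main__":
        # Arbitrary example article; any title works.
        print(fetch_plaintext("Information retrieval")[:500])

This per-article API route suits small-scale studies; the dedicated extraction tools and pre-built data sets the article catalogues are typically aimed at corpus-scale work, where parsing the published Wikipedia dumps is more practical than repeated API calls.
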
Original language: English
Journal: Information Processing & Management
Volume: 53
Issue number: 2
Pages (from-to): 505–529
Number of pages: 25
ISSN: 0306-4573
DOI: 10.1016/j.ipm.2016.07.003
Publication status: Published - 2017

Keywords

  • Information retrieval
  • Information extraction
  • Natural language processing
  • Ontologies
  • Wikipedia
  • Literature review

Cite this

Mehdi, Mohamad; Okoli, Chitu; Mesgari, Mostafa; Nielsen, Finn Årup; Lanamäki, Arto. Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus. In: Information Processing & Management. 2017; Vol. 53, No. 2, pp. 505–529.
