Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus

Mohamad Mehdi, Chitu Okoli, Mostafa Mesgari, Finn Årup Nielsen, Arto Lanamäki

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Although primarily an encyclopedia, Wikipedia’s expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
Original language: English
Journal: Information Processing & Management
Volume: 53
Issue number: 2
Pages (from-to): 505–529
Number of pages: 25
ISSN: 0306-4573
Publication status: Published - 2017

Keywords

  • Information retrieval
  • Information extraction
  • Natural language processing
  • Ontologies
  • Wikipedia
  • Literature review
