A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds

G. Schiavo, Francesca Bertolini, G Galimberti, S Bovo, S Dall'Olio, L. Nanni Costa, M Gallo, L. Fontanesi*

*Corresponding author for this work

Research output: Contribution to journal; Journal article; Research; peer-review

Abstract

Single nucleotide polymorphisms (SNPs) able to describe population differences can be used for important applications in livestock, including breed assignment of individual animals, authentication of mono-breed products and parentage verification among several other applications. To identify the most discriminating SNPs among thousands of markers in the available commercial SNP chip tools, several methods have been used. Random forest (RF) is a machine learning technique that has been proposed for this purpose. In this study, we used RF to analyse PorcineSNP60 BeadChip array genotyping data obtained from a total of 2737 pigs of 7 Italian pig breeds (3 cosmopolitan-derived breeds: Italian Large White, Italian Duroc and Italian Landrace, and 4 autochthonous breeds: Apulo-Calabrese, Casertana, Cinta Senese and Nero Siciliano) to identify breed informative and reduced SNP panels using the mean decrease in the Gini Index and the Mean Decrease in Accuracy parameters with stability evaluation. Other reduced informative SNP panels were obtained using Delta, Fixation index and principal component analysis statistics, and their performances were compared with those obtained using the RF-defined panels using the RF classification method and its derived Out Of Bag rates and correct prediction proportions. Therefore, the performances of a total of six reduced panels were evaluated. The correct assignment of the animals to its breed was close to 100% for all tested approaches. Porcine chromosome 8 harboured the largest number of selected SNPs across all panels. Many SNPs were included in genomic regions in which previous studies identified signatures of selection or genes (e.g. ESR1, KITL and LCORL) that could contribute to explain, at least in part, phenotypically or economically relevant traits that might differentiate cosmopolitan and autochthonous pig breeds. 
Random forest used as preselection statistics highlighted informative SNPs that were not the same as those identified by other methods. This might be due to specific features of this machine learning methodology. It will be interesting to explore if the adaptation of RF methods for the identification of selection signature regions could be able to describe population-specific features that are not captured by other approaches.
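
As an illustration of two of the comparison statistics named in the abstract (Delta and the fixation index), the following is a minimal sketch for ranking biallelic SNPs between two populations. It is not the authors' pipeline: the abstract does not state which estimators were used, so the definitions here (Delta as the absolute allele-frequency difference, and a sample-size-weighted Wright's FST) are assumptions, and the function names, SNP identifiers and toy genotypes are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy genotypes, coded as copies (0, 1 or 2) of one allele.
# These values are invented for illustration and are not from the study.
BREED_A: Dict[str, List[int]] = {
    "snp1": [2, 2, 2, 2],
    "snp2": [0, 1, 1, 2],
    "snp3": [0, 0, 1, 0],
}
BREED_B: Dict[str, List[int]] = {
    "snp1": [0, 0, 0, 0],
    "snp2": [1, 1, 0, 2],
    "snp3": [2, 2, 1, 2],
}


def allele_freq(genotypes: Sequence[int]) -> float:
    """Frequency of the counted allele among diploid individuals."""
    return sum(genotypes) / (2 * len(genotypes))


def delta(p_a: float, p_b: float) -> float:
    """Assumed definition: absolute allele-frequency difference."""
    return abs(p_a - p_b)


def fst(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Assumed estimator: Wright's FST = (H_T - H_S) / H_T for one SNP."""
    n = n_a + n_b
    p_bar = (n_a * p_a + n_b * p_b) / n
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0  # monomorphic across both populations
    # Mean within-population expected heterozygosity, weighted by sample size
    h_s = (n_a * 2 * p_a * (1 - p_a) + n_b * 2 * p_b * (1 - p_b)) / n
    return (h_t - h_s) / h_t


def rank_snps(pop_a: Dict[str, List[int]],
              pop_b: Dict[str, List[int]],
              top_k: int) -> List[str]:
    """Return the top_k SNP names with the largest Delta."""
    scores = {
        snp: delta(allele_freq(pop_a[snp]), allele_freq(pop_b[snp]))
        for snp in pop_a
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    for snp in BREED_A:
        p_a, p_b = allele_freq(BREED_A[snp]), allele_freq(BREED_B[snp])
        value = fst(p_a, p_b, len(BREED_A[snp]), len(BREED_B[snp]))
        print(snp, delta(p_a, p_b), value)
    print(rank_snps(BREED_A, BREED_B, top_k=2))
```

A reduced panel chosen this way could then be passed to any classifier; per the abstract, the study itself evaluated all panels with RF classification and its Out Of Bag error rates.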
Original language: English
Journal: Animal
Volume: 14
Issue number: 2
Pages (from-to): 223-232
ISSN: 1751-7311
DOI: 10.1017/S1751731119002167
Publication status: Published - 2020

Keywords

  • Allocation
  • Random forest
  • Selection signature
  • Single nucleotide polymorphism
  • Sus scrofa

Cite this

Schiavo, G. ; Bertolini, Francesca ; Galimberti, G ; Bovo, S ; Dall'Olio, S ; Nanni Costa, L. ; Gallo, M ; Fontanesi, L. / A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds. In: Animal. 2020 ; Vol. 14, No. 2. pp. 223-232.
@article{9c6e788916b1477584e7d5bd487452de,
title = "A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds",
abstract = "Single nucleotide polymorphisms (SNPs) able to describe population differences can be used for important applications in livestock, including breed assignment of individual animals, authentication of mono-breed products and parentage verification among several other applications. To identify the most discriminating SNPs among thousands of markers in the available commercial SNP chip tools, several methods have been used. Random forest (RF) is a machine learning technique that has been proposed for this purpose. In this study, we used RF to analyse PorcineSNP60 BeadChip array genotyping data obtained from a total of 2737 pigs of 7 Italian pig breeds (3 cosmopolitan-derived breeds: Italian Large White, Italian Duroc and Italian Landrace, and 4 autochthonous breeds: Apulo-Calabrese, Casertana, Cinta Senese and Nero Siciliano) to identify breed informative and reduced SNP panels using the mean decrease in the Gini Index and the Mean Decrease in Accuracy parameters with stability evaluation. Other reduced informative SNP panels were obtained using Delta, Fixation index and principal component analysis statistics, and their performances were compared with those obtained using the RF-defined panels using the RF classification method and its derived Out Of Bag rates and correct prediction proportions. Therefore, the performances of a total of six reduced panels were evaluated. The correct assignment of the animals to its breed was close to 100{\%} for all tested approaches. Porcine chromosome 8 harboured the largest number of selected SNPs across all panels. Many SNPs were included in genomic regions in which previous studies identified signatures of selection or genes (e.g. ESR1, KITL and LCORL) that could contribute to explain, at least in part, phenotypically or economically relevant traits that might differentiate cosmopolitan and autochthonous pig breeds. 
Random forest used as preselection statistics highlighted informative SNPs that were not the same as those identified by other methods. This might be due to specific features of this machine learning methodology. It will be interesting to explore if the adaptation of RF methods for the identification of selection signature regions could be able to describe population-specific features that are not captured by other approaches.",
keywords = "Allocation, Random forest, Selection signature, Single nucleotide polymorphism, Sus scrofa",
author = "G. Schiavo and Francesca Bertolini and G Galimberti and S Bovo and S Dall'Olio and {Nanni Costa}, L. and M Gallo and L. Fontanesi",
year = "2020",
doi = "10.1017/S1751731119002167",
language = "English",
volume = "14",
pages = "223--232",
journal = "Animal",
issn = "1751-7311",
publisher = "Cambridge University Press",
number = "2",
}

A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds. / Schiavo, G.; Bertolini, Francesca; Galimberti, G; Bovo, S; Dall'Olio, S; Nanni Costa, L.; Gallo, M; Fontanesi, L.

In: Animal, Vol. 14, No. 2, 2020, p. 223-232.

TY - JOUR

T1 - A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds

AU - Schiavo, G.

AU - Bertolini, Francesca

AU - Galimberti, G

AU - Bovo, S

AU - Dall'Olio, S

AU - Nanni Costa, L.

AU - Gallo, M

AU - Fontanesi, L.

PY - 2020

Y1 - 2020

N2 - Single nucleotide polymorphisms (SNPs) able to describe population differences can be used for important applications in livestock, including breed assignment of individual animals, authentication of mono-breed products and parentage verification among several other applications. To identify the most discriminating SNPs among thousands of markers in the available commercial SNP chip tools, several methods have been used. Random forest (RF) is a machine learning technique that has been proposed for this purpose. In this study, we used RF to analyse PorcineSNP60 BeadChip array genotyping data obtained from a total of 2737 pigs of 7 Italian pig breeds (3 cosmopolitan-derived breeds: Italian Large White, Italian Duroc and Italian Landrace, and 4 autochthonous breeds: Apulo-Calabrese, Casertana, Cinta Senese and Nero Siciliano) to identify breed informative and reduced SNP panels using the mean decrease in the Gini Index and the Mean Decrease in Accuracy parameters with stability evaluation. Other reduced informative SNP panels were obtained using Delta, Fixation index and principal component analysis statistics, and their performances were compared with those obtained using the RF-defined panels using the RF classification method and its derived Out Of Bag rates and correct prediction proportions. Therefore, the performances of a total of six reduced panels were evaluated. The correct assignment of the animals to its breed was close to 100% for all tested approaches. Porcine chromosome 8 harboured the largest number of selected SNPs across all panels. Many SNPs were included in genomic regions in which previous studies identified signatures of selection or genes (e.g. ESR1, KITL and LCORL) that could contribute to explain, at least in part, phenotypically or economically relevant traits that might differentiate cosmopolitan and autochthonous pig breeds. 
Random forest used as preselection statistics highlighted informative SNPs that were not the same as those identified by other methods. This might be due to specific features of this machine learning methodology. It will be interesting to explore if the adaptation of RF methods for the identification of selection signature regions could be able to describe population-specific features that are not captured by other approaches.

KW - Allocation

KW - Random forest

KW - Selection signature

KW - Single nucleotide polymorphism

KW - Sus scrofa

U2 - 10.1017/S1751731119002167

DO - 10.1017/S1751731119002167

M3 - Journal article

C2 - 31603060

VL - 14

SP - 223

EP - 232

JO - Animal

JF - Animal

SN - 1751-7311

IS - 2

ER -