SNP variable selection by generalized graph domination

Shuzhen Sun, Zhuqi Miao, Blaise Ratcliffe, Polly Campbell, Bret Pasch, Yousry A. El-Kassaby, Balabhaskar Balasundaram, Charles Chen

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Background: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p ≫ n problem (far more predictors p than samples n), and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models.

Methods and findings: The k-dominating set, a parameterized graph-theoretic generalization of the dominating set, was used to search for representative SNP variables after modeling SNP (single nucleotide polymorphism) data as a similarity network. Each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; two vertices are adjacent, i.e. joined by an edge, if their pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then sought: the smallest subset of SNPs such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.

C++ source code that implements SNP-SELECT and uses the Gurobi optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
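
The selection rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' SNP-SELECT code, which is written in C++ and relies on the Gurobi solver; it is a brute-force search that is only practical for a handful of SNPs. The function names (`build_snp_graph`, `is_k_dominating`, `minimum_k_dominating_set`), the toy similarity matrix, and the threshold of 0.7 are all assumptions made for illustration.

```python
from itertools import combinations


def build_snp_graph(similarity, threshold):
    """Return adjacency sets: SNPs i and j are adjacent if similarity > threshold."""
    n = len(similarity)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def is_k_dominating(adj, selected, k):
    """True if every unselected vertex has at least k selected neighbors."""
    chosen = set(selected)
    return all(
        len(adj[v] & chosen) >= k
        for v in range(len(adj))
        if v not in chosen
    )


def minimum_k_dominating_set(adj, k):
    """Exhaustive search by increasing subset size (exponential; toy inputs only)."""
    n = len(adj)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_k_dominating(adj, subset, k):
                return set(subset)
    return set(range(n))  # not reached: the full vertex set is always k-dominating


# Hypothetical pairwise similarities (e.g. absolute correlations) for 5 SNPs.
# SNPs 0-2 are mutually correlated, SNP 3 is tied to SNP 2, SNP 4 is independent.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.2, 0.1],
    [0.9, 0.9, 1.0, 0.8, 0.0],
    [0.1, 0.2, 0.8, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
]
adj = build_snp_graph(sim, threshold=0.7)
print(sorted(minimum_k_dominating_set(adj, k=1)))  # [2, 4]
```

In this toy graph the independent SNP 4 has no neighbors, so it must always be selected, while SNP 2 stands in for the correlated group when k = 1. Raising k forces more representatives per correlated group. For realistic SNP counts an integer-programming formulation, as in the authors' Gurobi-based code, or a heuristic would be needed instead of enumeration.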

Original language: English
Article number: e0203242
Journal: PLoS ONE
Volume: 14
Issue number: 1
DOI: https://doi.org/10.1371/journal.pone.0203242
State: Published - 1 Jan 2019

Fingerprint

Polymorphism
single nucleotide polymorphism
Nucleotides
culling (plants)
Pseudotsuga
Abies
Grasshoppers
linkage disequilibrium
Pseudotsuga menziesii
pedigree
Identification (control systems)
Cluster Analysis
Noise
Biomedical Research
Throughput
Theoretical Models
Technology

Cite this

Sun, S., Miao, Z., Ratcliffe, B., Campbell, P., Pasch, B., El-Kassaby, Y. A., ... Chen, C. (2019). SNP variable selection by generalized graph domination. PLoS ONE, 14(1), [e0203242]. https://doi.org/10.1371/journal.pone.0203242
Sun, Shuzhen ; Miao, Zhuqi ; Ratcliffe, Blaise ; Campbell, Polly ; Pasch, Bret ; El-Kassaby, Yousry A. ; Balasundaram, Balabhaskar ; Chen, Charles. / SNP variable selection by generalized graph domination. In: PLoS ONE. 2019 ; Vol. 14, No. 1.
@article{aecd5f4e6d9e4fa89bb1f179178ca8a0,
title = "SNP variable selection by generalized graph domination",
author = "Shuzhen Sun and Zhuqi Miao and Blaise Ratcliffe and Polly Campbell and Bret Pasch and El-Kassaby, {Yousry A.} and Balabhaskar Balasundaram and Charles Chen",
year = "2019",
month = "1",
day = "1",
doi = "10.1371/journal.pone.0203242",
language = "English",
volume = "14",
journal = "PLoS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "1",

}

Sun, S, Miao, Z, Ratcliffe, B, Campbell, P, Pasch, B, El-Kassaby, YA, Balasundaram, B & Chen, C 2019, 'SNP variable selection by generalized graph domination', PLoS ONE, vol. 14, no. 1, e0203242. https://doi.org/10.1371/journal.pone.0203242

TY  - JOUR
T1  - SNP variable selection by generalized graph domination
AU  - Sun, Shuzhen
AU  - Miao, Zhuqi
AU  - Ratcliffe, Blaise
AU  - Campbell, Polly
AU  - Pasch, Bret
AU  - El-Kassaby, Yousry A.
AU  - Balasundaram, Balabhaskar
AU  - Chen, Charles
PY  - 2019/1/1
Y1  - 2019/1/1
UR  - http://www.scopus.com/inward/record.url?scp=85060495599&partnerID=8YFLogxK
U2  - 10.1371/journal.pone.0203242
DO  - 10.1371/journal.pone.0203242
M3  - Article
C2  - 30677030
AN  - SCOPUS:85060495599
VL  - 14
JO  - PLoS ONE
JF  - PLoS ONE
SN  - 1932-6203
IS  - 1
M1  - e0203242
ER  -

Sun S, Miao Z, Ratcliffe B, Campbell P, Pasch B, El-Kassaby YA et al. SNP variable selection by generalized graph domination. PLoS ONE. 2019 Jan 1;14(1). e0203242. https://doi.org/10.1371/journal.pone.0203242