Predicting breast cancer survivability: A comparison of three data mining methods

Dursun Delen, Glenn Walker, Amit Kadam

Research output: Contribution to journal > Article

562 Citations (Scopus)

Abstract

Objective: The prediction of breast cancer survivability has been a challenging research problem for many researchers. Since the early days of the related research, much advancement has been recorded in several related fields. For instance, thanks to innovative biomedical technologies, better explanatory prognostic factors are being measured and recorded; thanks to low-cost computer hardware and software technologies, high-volume, better-quality data are being collected and stored automatically; and finally, thanks to better analytical methods, those voluminous data are being processed effectively and efficiently. Therefore, the main objective of this manuscript is to report on a research project where we took advantage of those available technological advancements to develop prediction models for breast cancer survivability.

Methods and material: We used two popular data mining algorithms (artificial neural networks and decision trees) along with a commonly used statistical method (logistic regression) to develop the prediction models using a large dataset (more than 200,000 cases). We also used 10-fold cross-validation to obtain an unbiased performance estimate for each of the three prediction models for comparison purposes.

Results: The results indicated that the decision tree (C5) is the best predictor, with 93.6% accuracy on the holdout sample (a prediction accuracy better than any reported in the literature); artificial neural networks came second with 91.2% accuracy, and the logistic regression models were the worst of the three with 89.2% accuracy.

Conclusion: The comparative study of multiple prediction models for breast cancer survivability, using a large dataset along with 10-fold cross-validation, provided us with an insight into the relative prediction ability of different data mining methods. Sensitivity analysis on the neural network models provided us with the prioritized importance of the prognostic factors used in the study.
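The 10-fold cross-validation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the majority-class stand-in model, and the synthetic labels below are all assumptions made for illustration, and no SEER data is involved. The sketch only shows the general procedure of splitting the cases into k folds, training on k-1 of them, scoring on the held-out fold, and averaging the k accuracies.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Hypothetical stand-in model: always predict the most common training label.
# Any classifier exposing the same fit/predict shape could be swapped in.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


if __name__ == "__main__":
    # Synthetic toy labels (not SEER): 70 positive and 30 negative cases.
    X = [[i] for i in range(100)]
    y = [1] * 70 + [0] * 30
    print(cross_validated_accuracy(fit_majority, predict_majority, X, y))
```

Comparing several methods in this way amounts to calling `cross_validated_accuracy` once per model with the same `seed`, so that every model is scored on identical folds.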

Original language: English
Pages (from-to): 113-127
Number of pages: 15
Journal: Artificial Intelligence in Medicine
Volume: 34
Issue number: 2
DOI: 10.1016/j.artmed.2004.07.002
State: Published - 1 Jun 2005
Externally published: Yes

Fingerprint

Data Mining
Data mining
Breast Neoplasms
Decision Trees
Logistic Models
Research
Decision trees
Neural networks
Logistics
Biomedical Technology
Neural Networks (Computer)
Software
Research Personnel
Technology
Computer hardware
Sensitivity analysis
Costs and Cost Analysis
Statistical methods
Costs

Keywords

  • Breast cancer survivability
  • Data mining
  • k-Fold cross-validation
  • SEER

Cite this

@article{7706754cd6a44057a65d0c98eb8c701a,
title = "Predicting breast cancer survivability: A comparison of three data mining methods",
keywords = "Breast cancer survivability, Data mining, k-Fold cross-validation, SEER",
author = "Dursun Delen and Glenn Walker and Amit Kadam",
year = "2005",
month = "6",
day = "1",
doi = "10.1016/j.artmed.2004.07.002",
language = "English",
volume = "34",
pages = "113--127",
journal = "Artificial Intelligence in Medicine",
issn = "0933-3657",
publisher = "Elsevier",
number = "2",

}

Predicting breast cancer survivability: A comparison of three data mining methods. / Delen, Dursun; Walker, Glenn; Kadam, Amit.

In: Artificial Intelligence in Medicine, Vol. 34, No. 2, 01.06.2005, p. 113-127.

TY  - JOUR
T1  - Predicting breast cancer survivability
T2  - A comparison of three data mining methods
AU  - Delen, Dursun
AU  - Walker, Glenn
AU  - Kadam, Amit
PY  - 2005/6/1
Y1  - 2005/6/1
KW  - Breast cancer survivability
KW  - Data mining
KW  - k-Fold cross-validation
KW  - SEER
UR  - http://www.scopus.com/inward/record.url?scp=19344364327&partnerID=8YFLogxK
U2  - 10.1016/j.artmed.2004.07.002
DO  - 10.1016/j.artmed.2004.07.002
M3  - Article
C2  - 15894176
AN  - SCOPUS:19344364327
VL  - 34
SP  - 113
EP  - 127
JO  - Artificial Intelligence in Medicine
JF  - Artificial Intelligence in Medicine
SN  - 0933-3657
IS  - 2
ER  -