A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition

Dech Thammasiri, Dursun Delen, Phayung Meesad, Nihat Kasap

Research output: Contribution to journal (Article)

52 Citations (Scopus)

Abstract

Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but fewer students drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall predictive accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques - over-sampling, under-sampling and synthetic minority over-sampling (SMOTE) - along with four popular classification methods - logistic regression, decision trees, neural networks and support vector machines. We used a large and feature-rich institutional student data set (between the years 2005 and 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
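The accuracy problem and the two simpler balancing techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the 90/10 toy class split, the function names and the `seed` parameter are all assumptions made for illustration. SMOTE, which synthesizes new minority examples by interpolating between neighbouring minority samples, is not implemented here.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` cases that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 90 retained students (0) and 10 dropouts (1).
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

# A "classifier" that always predicts the majority class looks accurate
# overall but never identifies a single dropout.
always_majority = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_majority)
baseline_minority_recall = recall(labels, always_majority, positive=1)

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

In this toy setting the majority-only baseline has high overall accuracy and zero recall on the dropout class, which is the effect the abstract warns about. Balancing should be applied to the training portion only, so that evaluation data keeps its original class distribution.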

Original language: English
Pages (from-to): 321-330
Number of pages: 10
Journal: Expert Systems with Applications
Volume: 41
Issue number: 2
DOI: 10.1016/j.eswa.2013.07.046
State: Published - 1 Jan 2014
Externally published: Yes

Fingerprint

Students
Sampling
Support vector machines
Decision trees
Neurons
Logistics

Keywords

  • Attrition
  • Imbalanced class distribution
  • Prediction
  • Sampling
  • Sensitivity analysis
  • SMOTE
  • Student retention

Cite this

@article{42e62b75a7d94e57800cbf5e366f1cab,
title = "A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition",
abstract = "Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but fewer students drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall predictive accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques - over-sampling, under-sampling and synthetic minority over-sampling (SMOTE) - along with four popular classification methods - logistic regression, decision trees, neural networks and support vector machines. We used a large and feature-rich institutional student data set (between the years 2005 and 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24{\%} overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.",
keywords = "Attrition, Imbalanced class distribution, Prediction, Sampling, Sensitivity analysis, SMOTE, Student retention",
author = "Dech Thammasiri and Dursun Delen and Phayung Meesad and Nihat Kasap",
year = "2014",
month = "1",
day = "1",
doi = "10.1016/j.eswa.2013.07.046",
language = "English",
volume = "41",
pages = "321--330",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Ltd",
number = "2",

}

A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition. / Thammasiri, Dech; Delen, Dursun; Meesad, Phayung; Kasap, Nihat.

In: Expert Systems with Applications, Vol. 41, No. 2, 01.01.2014, p. 321-330.

TY - JOUR

T1 - A critical assessment of imbalanced class distribution problem

T2 - The case of predicting freshmen student attrition

AU - Thammasiri, Dech

AU - Delen, Dursun

AU - Meesad, Phayung

AU - Kasap, Nihat

PY - 2014/1/1

Y1 - 2014/1/1

N2 - Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but fewer students drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall predictive accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques - over-sampling, under-sampling and synthetic minority over-sampling (SMOTE) - along with four popular classification methods - logistic regression, decision trees, neural networks and support vector machines. We used a large and feature-rich institutional student data set (between the years 2005 and 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.

KW - Attrition

KW - Imbalanced class distribution

KW - Prediction

KW - Sampling

KW - Sensitivity analysis

KW - SMOTE

KW - Student retention

UR - http://www.scopus.com/inward/record.url?scp=84885955704&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2013.07.046

DO - 10.1016/j.eswa.2013.07.046

M3 - Article

AN - SCOPUS:84885955704

VL - 41

SP - 321

EP - 330

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

IS - 2

ER -