A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets

Saeed Piri, Dursun Delen, Tieming Liu

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

Developing decision support systems (DSS) based on imbalanced datasets is one of the critical challenges in data mining and decision analytics. A dataset is called imbalanced when the number of examples from one class exceeds the number of examples from another class. Learning from imbalanced datasets is one of the major challenges in machine learning. While a standard classifier may perform very well on a balanced dataset, its performance deteriorates dramatically when it is applied to an imbalanced dataset. This poor performance is especially troublesome in detecting the minority class, which is usually the class of interest. Therefore, the poor performance of the machine learning techniques used to develop DSS negatively affects the practicality of DSS in real-world problems. Over-sampling the minority class is one of the most promising remedies for imbalanced data learning. In this study, we propose a new synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine (SVM). In this algorithm, SVM is first applied to the original imbalanced dataset; then the minority examples close to the SVM decision boundary, which are treated as the informative minority examples, are over-sampled. We also developed another version of SIMO, called weighted SIMO (W-SIMO). W-SIMO differs from SIMO in the degree to which the informative minority examples are over-sampled. In W-SIMO, incorrectly classified informative minority examples are over-sampled to a higher degree than correctly classified informative minority examples. In this way, there is more focus on incorrectly classified minority examples. The over-sampled dataset can be used to train any classifier. We applied these algorithms to 15 publicly available benchmark imbalanced datasets and assessed their performance in comparison with existing approaches in the area of imbalanced data learning. The results showed that our algorithms had the best performance on all datasets compared to the other approaches.
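
The procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which SVM kernel is used, how "close to the decision boundary" is measured, or how synthetic points are generated. The sketch therefore assumes a linear SVM trained with a Pegasos-style sub-gradient method, takes the fraction `frac` of minority examples with the smallest absolute decision value as the informative ones, and creates synthetic points by SMOTE-style interpolation toward one of the `k` nearest minority neighbours. The names `simo`, `frac`, `k`, and `mis_weight` are hypothetical; `mis_weight > 1` gives misclassified informative examples a higher chance of being over-sampled, in the spirit of W-SIMO.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style sub-gradient descent for a linear SVM (labels +1 / -1)."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                          # decaying step size
            violated = y[i] * (dot(w, X[i]) + b) < 1       # hinge loss is active
            w = [(1 - eta * lam) * wj for wj in w]         # regularisation shrink
            if violated:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def simo(X, y, frac=0.5, k=3, mis_weight=1.0, seed=0):
    """Append synthetic minority (+1) points until both classes are equal in size.

    mis_weight=1.0 treats all informative examples alike (SIMO-like);
    mis_weight>1.0 favours misclassified informative examples (W-SIMO-like).
    All parameters are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    w, b = train_linear_svm(X, y, seed=seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    n_needed = max(0, (len(X) - len(minority)) - len(minority))

    # Informative examples: minority points with the smallest |w.x + b|,
    # i.e. those nearest the linear decision boundary.
    ranked = sorted(minority, key=lambda x: abs(dot(w, x) + b))
    informative = ranked[:max(1, int(frac * len(minority)))]

    # A negative decision value means the minority point is misclassified.
    weights = [mis_weight if dot(w, x) + b < 0 else 1.0 for x in informative]

    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choices(informative, weights=weights, k=1)[0]
        others = [m for m in minority if m is not base] or [base]
        nearest = sorted(
            others, key=lambda m: sum((a - c) ** 2 for a, c in zip(base, m))
        )[:k]
        mate = rng.choice(nearest)
        gap = rng.random()                                 # position on the segment
        X_new.append([a + gap * (c - a) for a, c in zip(base, mate)])
        y_new.append(1)
    return X_new, y_new
```

Calling `simo(X, y)` with labels coded as +1 (minority) and -1 (majority) returns the original examples followed by the synthetic ones, and the result can then be passed to any classifier. Ranking by the absolute decision value avoids dividing by the norm of `w`, since that norm is the same for every example.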

Original language: English
Pages (from-to): 15-29
Number of pages: 15
Journal: Decision Support Systems
Volume: 106
DOI: 10.1016/j.dss.2017.11.006
State: Published - 1 Feb 2018
Externally published: Yes

Fingerprint

Support vector machines
Learning
Sampling
Decision support systems
Learning systems
Classifiers
Data mining
Support Vector Machine
Datasets
Support vector machine
Minorities
Benchmarking
Data Mining

Keywords

  • Imbalanced data
  • Machine learning
  • Over-sampling
  • Performance metrics
  • Predictive modeling
  • Support vector machines

Cite this

@article{85e06d5a45634cc2896dcd358b4f2136,
title = "A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets",
abstract = "Developing decision support systems (DSS) based on imbalanced datasets is one of the critical challenges in data mining and decision analytics. A dataset is called imbalanced when the number of examples from one class exceeds the number of examples from another class. Learning from imbalanced datasets is one of the major challenges in machine learning. While a standard classifier may perform very well on a balanced dataset, its performance deteriorates dramatically when it is applied to an imbalanced dataset. This poor performance is especially troublesome in detecting the minority class, which is usually the class of interest. Therefore, the poor performance of the machine learning techniques used to develop DSS negatively affects the practicality of DSS in real-world problems. Over-sampling the minority class is one of the most promising remedies for imbalanced data learning. In this study, we propose a new synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine (SVM). In this algorithm, SVM is first applied to the original imbalanced dataset; then the minority examples close to the SVM decision boundary, which are treated as the informative minority examples, are over-sampled. We also developed another version of SIMO, called weighted SIMO (W-SIMO). W-SIMO differs from SIMO in the degree to which the informative minority examples are over-sampled. In W-SIMO, incorrectly classified informative minority examples are over-sampled to a higher degree than correctly classified informative minority examples. In this way, there is more focus on incorrectly classified minority examples. The over-sampled dataset can be used to train any classifier. We applied these algorithms to 15 publicly available benchmark imbalanced datasets and assessed their performance in comparison with existing approaches in the area of imbalanced data learning. The results showed that our algorithms had the best performance on all datasets compared to the other approaches.",
keywords = "Imbalanced data, Machine learning, Over-sampling, Performance metrics, Predictive modeling, Support vector machines",
author = "Saeed Piri and Dursun Delen and Tieming Liu",
year = "2018",
month = "2",
day = "1",
doi = "10.1016/j.dss.2017.11.006",
language = "English",
volume = "106",
pages = "15--29",
journal = "Decision Support Systems",
issn = "0167-9236",
publisher = "Elsevier",

}

TY - JOUR

T1 - A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets

AU - Piri, Saeed

AU - Delen, Dursun

AU - Liu, Tieming

PY - 2018/2/1

Y1 - 2018/2/1

N2 - Developing decision support systems (DSS) based on imbalanced datasets is one of the critical challenges in data mining and decision analytics. A dataset is called imbalanced when the number of examples from one class exceeds the number of examples from another class. Learning from imbalanced datasets is one of the major challenges in machine learning. While a standard classifier may perform very well on a balanced dataset, its performance deteriorates dramatically when it is applied to an imbalanced dataset. This poor performance is especially troublesome in detecting the minority class, which is usually the class of interest. Therefore, the poor performance of the machine learning techniques used to develop DSS negatively affects the practicality of DSS in real-world problems. Over-sampling the minority class is one of the most promising remedies for imbalanced data learning. In this study, we propose a new synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine (SVM). In this algorithm, SVM is first applied to the original imbalanced dataset; then the minority examples close to the SVM decision boundary, which are treated as the informative minority examples, are over-sampled. We also developed another version of SIMO, called weighted SIMO (W-SIMO). W-SIMO differs from SIMO in the degree to which the informative minority examples are over-sampled. In W-SIMO, incorrectly classified informative minority examples are over-sampled to a higher degree than correctly classified informative minority examples. In this way, there is more focus on incorrectly classified minority examples. The over-sampled dataset can be used to train any classifier. We applied these algorithms to 15 publicly available benchmark imbalanced datasets and assessed their performance in comparison with existing approaches in the area of imbalanced data learning. The results showed that our algorithms had the best performance on all datasets compared to the other approaches.

AB - Developing decision support systems (DSS) based on imbalanced datasets is one of the critical challenges in data mining and decision analytics. A dataset is called imbalanced when the number of examples from one class exceeds the number of examples from another class. Learning from imbalanced datasets is one of the major challenges in machine learning. While a standard classifier may perform very well on a balanced dataset, its performance deteriorates dramatically when it is applied to an imbalanced dataset. This poor performance is especially troublesome in detecting the minority class, which is usually the class of interest. Therefore, the poor performance of the machine learning techniques used to develop DSS negatively affects the practicality of DSS in real-world problems. Over-sampling the minority class is one of the most promising remedies for imbalanced data learning. In this study, we propose a new synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine (SVM). In this algorithm, SVM is first applied to the original imbalanced dataset; then the minority examples close to the SVM decision boundary, which are treated as the informative minority examples, are over-sampled. We also developed another version of SIMO, called weighted SIMO (W-SIMO). W-SIMO differs from SIMO in the degree to which the informative minority examples are over-sampled. In W-SIMO, incorrectly classified informative minority examples are over-sampled to a higher degree than correctly classified informative minority examples. In this way, there is more focus on incorrectly classified minority examples. The over-sampled dataset can be used to train any classifier. We applied these algorithms to 15 publicly available benchmark imbalanced datasets and assessed their performance in comparison with existing approaches in the area of imbalanced data learning. The results showed that our algorithms had the best performance on all datasets compared to the other approaches.

KW - Imbalanced data

KW - Machine learning

KW - Over-sampling

KW - Performance metrics

KW - Predictive modeling

KW - Support vector machines

UR - http://www.scopus.com/inward/record.url?scp=85039742713&partnerID=8YFLogxK

U2 - 10.1016/j.dss.2017.11.006

DO - 10.1016/j.dss.2017.11.006

M3 - Article

AN - SCOPUS:85039742713

VL - 106

SP - 15

EP - 29

JO - Decision Support Systems

JF - Decision Support Systems

SN - 0167-9236

ER -