A scalable classification algorithm for very large datasets

Dursun Delen, Marilyn G. Kletke, Jin Hwa Kim

Research output: Contribution to journal > Article

1 Citation (Scopus)

Abstract

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600,000 records for a seven-class classification problem) yielded more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
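The build-then-refine pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' IRA: the abstract does not specify how rules are induced or refined, so the sketch swaps in a simple count-based rule table (one majority-class rule per attribute value) purely to show the control flow of learning from a first chunk and then updating with each remaining chunk. The names `chunks`, `CountRuleModel`, and `fit_in_chunks` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import islice


def chunks(records, size):
    """Yield successive lists of at most `size` records from an iterable."""
    it = iter(records)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


class CountRuleModel:
    """Hypothetical stand-in for a rule base: one rule per attribute value.

    Assumption: this is NOT the paper's IRA; it only mimics the
    build-then-refine shape with a simple majority-class rule table.
    """

    def __init__(self):
        # counts[value][label] = number of records seen with that pair
        self.counts = defaultdict(Counter)
        # overall[label] = number of records seen with that label
        self.overall = Counter()

    def refine(self, block):
        """Fold one chunk of (value, label) records into the rule counts."""
        for value, label in block:
            self.counts[value][label] += 1
            self.overall[label] += 1

    def predict(self, value):
        """Return the majority label for `value`, else the global majority."""
        table = self.counts.get(value) or self.overall
        return table.most_common(1)[0][0]


def fit_in_chunks(records, chunk_size):
    """Build rules from the first chunk, then refine with each later chunk."""
    model = CountRuleModel()
    for block in chunks(records, chunk_size):
        model.refine(block)
    return model


# Toy data: (attribute value, class label) pairs, processed two at a time.
data = [("a", "yes"), ("a", "no"), ("b", "no"), ("a", "yes"),
        ("b", "no"), ("b", "yes"), ("a", "yes")]
model = fit_in_chunks(data, chunk_size=2)
print(model.predict("a"), model.predict("b"))
```

Only one chunk is held in memory at a time, which is the property that makes this style of learner attractive for very large datasets; a real rule-induction system would replace the count table with richer rules and a more selective refinement step.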

Original language: English
Pages (from-to): 83-94
Number of pages: 12
Journal: Journal of Information and Knowledge Management
Volume: 4
Issue number: 2
DOI: 10.1142/S0219649205001092
State: Published - 1 Dec 2005
Externally published: Yes

Keywords

  • classification
  • data mining
  • knowledge bases
  • Massive datasets
  • refinement techniques
  • rule induction

Cite this

Delen, Dursun ; Kletke, Marilyn G. ; Kim, Jin Hwa. / A scalable classification algorithm for very large datasets. In: Journal of Information and Knowledge Management. 2005 ; Vol. 4, No. 2. pp. 83-94.
@article{6fe025d8f14d488ab2a2b44c83b831ee,
title = "A scalable classification algorithm for very large datasets",
keywords = "classification, data mining, knowledge bases, Massive datasets, refinement techniques, rule induction",
author = "Delen, Dursun and Kletke, {Marilyn G.} and Kim, {Jin Hwa}",
year = "2005",
month = "12",
day = "1",
doi = "10.1142/S0219649205001092",
language = "English",
volume = "4",
pages = "83--94",
journal = "Journal of Information and Knowledge Management",
issn = "0219-6492",
publisher = "World Scientific Publishing Co.",
number = "2",

}

TY  - JOUR
T1  - A scalable classification algorithm for very large datasets
AU  - Delen, Dursun
AU  - Kletke, Marilyn G.
AU  - Kim, Jin Hwa
PY  - 2005/12/1
Y1  - 2005/12/1
KW  - classification
KW  - data mining
KW  - knowledge bases
KW  - Massive datasets
KW  - refinement techniques
KW  - rule induction
UR  - http://www.scopus.com/inward/record.url?scp=84862961324&partnerID=8YFLogxK
U2  - 10.1142/S0219649205001092
DO  - 10.1142/S0219649205001092
M3  - Article
AN  - SCOPUS:84862961324
VL  - 4
SP  - 83
EP  - 94
JO  - Journal of Information and Knowledge Management
JF  - Journal of Information and Knowledge Management
SN  - 0219-6492
IS  - 2
ER  -