A scalable classification algorithm for very large datasets

Dursun Delen, Marilyn G. Kletke, Jin Hwa Kim

Research output: Contribution to journal › Article

1 Scopus citations

Abstract

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks drawn from the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem, and another with approximately 600,000 records for a seven-class classification problem) produced more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
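The paper's IRA implementation is not reproduced here, but the chunk-wise learning pattern the abstract describes (build initial knowledge from a subset, then refine it with the remaining data) can be illustrated with a minimal sketch. The `IterativeRefiner` class, its frequency-count rule representation, and the toy records below are all illustrative assumptions, not the authors' actual algorithm:

```python
# Hypothetical sketch of chunk-wise iterative refinement: an initial model
# is built from one subset of the data, then updated chunk by chunk rather
# than learned from the full dataset all at once. The rule representation
# (per-feature class-frequency counts) is an illustrative stand-in for the
# paper's induced domain knowledge.
from collections import defaultdict
from typing import Iterable, Tuple

Record = Tuple[Tuple[str, ...], str]  # (feature values, class label)

class IterativeRefiner:
    def __init__(self) -> None:
        # counts[(feature_index, feature_value)][label] -> occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))

    def refine(self, chunk: Iterable[Record]) -> None:
        """Incorporate one chunk of records into the running model."""
        for features, label in chunk:
            for i, value in enumerate(features):
                self.counts[(i, value)][label] += 1

    def predict(self, features: Tuple[str, ...]) -> str:
        """Classify by voting across accumulated per-feature frequencies."""
        votes = defaultdict(int)
        for i, value in enumerate(features):
            for label, n in self.counts[(i, value)].items():
                votes[label] += n
        return max(votes, key=votes.get) if votes else "unknown"

# Build initial knowledge from a first subset, then refine with a later chunk.
model = IterativeRefiner()
model.refine([(("red", "round"), "apple"), (("yellow", "long"), "banana")])
model.refine([(("red", "round"), "apple"), (("yellow", "round"), "apple")])
print(model.predict(("red", "round")))  # accumulated votes favour "apple"
```

Because each `refine` call touches only one chunk, memory usage is bounded by the chunk size rather than the full dataset, which is the scalability property the abstract attributes to this style of learning.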

Original language: English
Pages (from-to): 83-94
Number of pages: 12
Journal: Journal of Information and Knowledge Management
Volume: 4
Issue number: 2
DOIs
State: Published - 1 Dec 2005
Externally published: Yes


Keywords

  • classification
  • data mining
  • knowledge bases
  • massive datasets
  • refinement techniques
  • rule induction
