Abstract
Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale well enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600 K records for a seven-class classification problem) resulted in more accurate domain knowledge than that produced by other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
Original language | English |
---|---|
Pages (from-to) | 83-94 |
Number of pages | 12 |
Journal | Journal of Information and Knowledge Management |
Volume | 4 |
Issue number | 2 |
State | Published - 1 Dec 2005 |
Externally published | Yes |
Keywords
- classification
- data mining
- knowledge bases
- massive datasets
- refinement techniques
- rule induction