Data quality in very large, multiple-source, secondary datasets for data mining applications

Marilyn G. Kletke, Dursun Delen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The data mining research community is increasingly addressing data quality issues, including problems of dirty data. Hand, Blunt, Kelly and Adams (2000) have identified high-level and low-level quality issues in data mining. Kim, Choi, Hong, Kim and Lee (2003) have compiled a useful, complete taxonomy of dirty data that provides a starting point for research in effective techniques and fast algorithms for preprocessing data, and ways to approach the problems of dirty data. In this study we create a classification scheme for data errors by transforming their general taxonomy to apply to very large multiple-source secondary datasets. These types of datasets are increasingly being compiled by organizations for use in their data mining applications. We contribute this classification scheme to the body of research addressing quality issues in the very large multiple-source secondary datasets that are being built through today's global organizations' massive data collection from the Internet.

Original language: English
Title of host publication: Association for Information Systems - 11th Americas Conference on Information Systems, AMCIS 2005
Subtitle of host publication: A Conference on a Human Scale
Pages: 501-505
Number of pages: 5
State: Published - 1 Dec 2005
Externally published: Yes
Event: 11th Americas Conference on Information Systems, AMCIS 2005 - Omaha, NE, United States
Duration: 11 Aug 2005 to 15 Aug 2005

Publication series

Name: Association for Information Systems - 11th Americas Conference on Information Systems, AMCIS 2005: A Conference on a Human Scale
Volume: 1

Conference

Conference: 11th Americas Conference on Information Systems, AMCIS 2005
Country/Territory: United States
City: Omaha, NE
Period: 11/08/05 to 15/08/05

Keywords

  • Data mining
  • Data quality
  • Dirty data
  • Multiple-source data
  • Preprocessing
  • Very large dataset
