Wednesday, 10 October 2012

Data Mining: Missing Data and Imputation Solutions

Missing data has been one of my constant problems, especially in data mining. It strongly affects the accuracy of my data sets and the choice of the correct model in IBM data modeler. We have been discussing whether an imputation solution would be good to use, but my colleague still has hesitations about that method. So I moved to my second option: generating more data sets to reach the number of records we need. Our director is also very demanding about our outputs.



Here are some imputation approaches you can use; first ask your colleagues or boss whether they would allow it. IBM data modeler also has an option for when you're lacking records: just choose the (Reduce or Boost) nodes in the model when using the C & R Decision Tree.
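To make the Reduce and Boost idea concrete, here is a minimal Python sketch. It is not IBM's implementation; it only assumes that "boost" means randomly duplicating records of the smaller class and "reduce" means randomly sampling down the larger class. The function name `balance` and the `churn` field are hypothetical, chosen just for illustration.

```python
import random
from collections import defaultdict


def balance(records, label_key, mode="boost", seed=0):
    """Equalize class counts by duplicating (boost) or sampling down (reduce)."""
    if mode not in ("boost", "reduce"):
        raise ValueError("mode must be 'boost' or 'reduce'")
    rng = random.Random(seed)

    # Group the records by their class label.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[label_key]].append(rec)

    sizes = [len(group) for group in groups.values()]
    target = max(sizes) if mode == "boost" else min(sizes)

    balanced = []
    for group in groups.values():
        if len(group) >= target:
            # Reduce: random subset without replacement
            # (returns the whole group when it is already at the target size).
            balanced.extend(rng.sample(group, target))
        else:
            # Boost: keep the originals, then add random duplicates.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical data: 8 "no" records and 2 "yes" records.
records = [{"churn": "no"}] * 8 + [{"churn": "yes"}] * 2
boosted = balance(records, "churn", mode="boost")
reduced = balance(records, "churn", mode="reduce")
print(len(boosted), len(reduced))
```

Boosting keeps every original record but repeats some of them, while reducing throws records away, so which one fits depends on how many records you can afford to lose.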

I have always wanted to use default values to replace the missing values. For age, for example, if the valid values ranged from 18 to 21, I would probably put a value of 17 in the missing fields, so that no records are dropped and my data would be more robust.
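The default-value idea can be sketched in a few lines of Python. This is only an illustration of the age example above; the names `MISSING_AGE` and `fill_default` and the sample ages are made up for the sketch.

```python
import statistics

# Sentinel just below the valid range of 18 to 21 (taken from the example above).
MISSING_AGE = 17


def fill_default(values, default):
    """Replace missing entries (None) with a default value."""
    return [default if v is None else v for v in values]


# Hypothetical ages with two missing entries.
ages = [18, None, 21, 19, None, 20]
filled = fill_default(ages, MISSING_AGE)

# Because 17 lies outside the valid range, the filled-in rows stay identifiable
# and can be excluded again when a statistic should use observed values only.
observed = [a for a in filled if a != MISSING_AGE]

mean_with_default = statistics.mean(filled)
mean_observed_only = statistics.mean(observed)
print(filled, mean_with_default, mean_observed_only)
```

Note that the default value takes part in any statistic computed on the filled column, so the mean with the default included is lower than the mean of the observed ages alone.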


MULTIPLE IMPUTATION in IBM-SPSS: Command Additional Features

The command syntax language also allows you to:
• Specify a subset of variables for which descriptive statistics are shown (IMPUTATIONSUMMARIES subcommand).
• Specify both an analysis of missing patterns and imputation in a single run of the procedure.
• Specify the maximum number of model parameters allowed when imputing any variable (MAXMODELPARAM keyword).
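For readers who want to see the general idea behind multiple imputation outside SPSS, here is a simplified Python sketch. It is not the SPSS procedure: it assumes missing values are filled by random draws from the observed values, repeats that `m` times, and pools the estimated mean across the imputed data sets using Rubin's rules (total variance = within variance + (1 + 1/m) × between variance). The function name and the sample ages are hypothetical.

```python
import random
import statistics


def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate a mean from data with missing entries (None) using m imputations."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]

    estimates = []  # mean of each completed data set
    variances = []  # squared standard error of each mean
    for _ in range(m):
        # Fill each gap with a random draw from the observed values.
        completed = [v if v is not None else rng.choice(observed) for v in values]
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / len(completed))

    # Pool the results with Rubin's rules.
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates) if m > 1 else 0.0
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var


# Hypothetical ages with two missing entries.
ages = [18, None, 21, 19, None, 20]
pooled_mean, total_variance = multiple_imputation_mean(ages, m=5)
print(pooled_mean, total_variance)
```

The point of running several imputations instead of one is that the spread between the imputed data sets shows up in the pooled variance, so the uncertainty from the missing values is not hidden.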


Download IBM DATA MODELER 12.0 
