Thursday, 18 October 2012

One Platform, Fewer Headaches

by Albert Anthony D. Gavino

When building a process flow for data mining, it's best to stick to one platform or one database. Whether it's MS-SQL Server 2008, MS-Access 2005, or any other platform, just use one; you won't have problems connecting your tables, and your queries will be much simpler.
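As a minimal sketch of why a single platform keeps queries simple, the example below puts two related tables in one database and connects them with a single plain join. It uses Python's built-in sqlite3 module as a stand-in for whatever platform you choose; the table names, columns, and sample rows are made up for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the single platform you settle on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrollments (student_id INTEGER, term TEXT);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO enrollments VALUES
        (1, '2011-2012-1'), (1, '2011-2012-2'), (2, '2011-2012-1');
""")


def enrollment_counts(connection):
    """Return (name, number_of_terms) rows using one plain join."""
    query = """
        SELECT s.name, COUNT(*) AS terms
        FROM students AS s
        JOIN enrollments AS e ON e.student_id = s.student_id
        GROUP BY s.student_id, s.name
        ORDER BY s.name
    """
    return connection.execute(query).fetchall()


print(enrollment_counts(conn))
```

Because both tables live in the same database, no export, import, or cross-platform connection step is needed before the join.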

Let's hope you're not using MS-Excel or similar spreadsheet files, since their rows and columns are limited (a worksheet tops out at 1,048,576 rows and 16,384 columns in Excel 2007 and later). Trust me, Excel was not built for tables, not even for small database files. Excel files are for financial statements and balance sheets, and that's probably it.

TIP: DO NOT USE EXCEL FOR DATA MINING

In IBM SPSS Modeler, there are tools that let you run queries straight from your SQL Server, which makes life much easier.

Wednesday, 17 October 2012

Star Schema and Snowflake Schema

| Item | Snowflake Schema | Star Schema |
| --- | --- | --- |
| Ease of use | More complex queries, hence less easy to understand | Less complex queries, easy to understand |
| Query performance | More foreign keys, hence more query execution time | Fewer foreign keys, hence less query execution time |
| Normalization | Has normalized tables | Has denormalized tables |
| Type of data warehouse | Good to use for small data warehouses/data marts | Good for large data warehouses |
| Joins | Higher number of joins | Fewer joins |
| Dimension table | May have more than one dimension table for each dimension | Contains only a single dimension table for each dimension |
| When to use | When the dimension table is relatively big in size, snowflaking is better as it reduces space | When the dimension table contains fewer rows, we can go for a star schema |
| Ease of maintenance/change | No redundancy, hence easier to maintain and change | Has redundant data, hence less easy to maintain/change |
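The "Joins" and "Normalization" points above can be sketched with a tiny example. This is only an illustration using Python's built-in sqlite3 module; the fact table, the dimension tables, and the sample rows are all hypothetical. The same question (total sales per category) takes one join against the star design and two joins against the snowflake design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fact table shared by both designs.
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (3, 25.0);

    -- Star: one denormalized dimension table; the category name is repeated.
    CREATE TABLE star_product (
        product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);
    INSERT INTO star_product VALUES
        (1, 'Pen', 'Supplies'), (2, 'Pad', 'Supplies'), (3, 'Bag', 'Gear');

    -- Snowflake: the same dimension normalized into two tables.
    CREATE TABLE snow_category (
        category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE snow_product (
        product_id INTEGER PRIMARY KEY, product_name TEXT,
        category_id INTEGER REFERENCES snow_category(category_id));
    INSERT INTO snow_category VALUES (10, 'Supplies'), (20, 'Gear');
    INSERT INTO snow_product VALUES (1, 'Pen', 10), (2, 'Pad', 10), (3, 'Bag', 20);
""")

# Star schema: a single join from the fact table to the dimension.
STAR_QUERY = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN star_product AS p ON p.product_id = f.product_id
    GROUP BY p.category_name
    ORDER BY p.category_name
"""

# Snowflake schema: one extra join to reach the normalized category table.
SNOWFLAKE_QUERY = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN snow_product AS p ON p.product_id = f.product_id
    JOIN snow_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

star_totals = conn.execute(STAR_QUERY).fetchall()
snowflake_totals = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_totals)
print(snowflake_totals)
```

Both queries return the same totals. The trade-off is that the star table stores 'Supplies' twice, while the snowflake design stores it once at the cost of the extra join.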

Unstructured tables and Foreign Keys

by Albert Anthony D. Gavino

To be able to data mine correctly, one must have organized tables and structures, usually designed by the Database Administrator (DBA).

One of my worst nightmares is dealing with manual tables and files coming from Excel. Not only did this cause efficiency problems, but the design of the foreign key was also faulty. Imagine a foreign key that resets every term and every school year; that is what made creating schemas for the data structure so difficult.

Let me go over some important aspects of tables.

If your foreign key and its status repeat, create another key that will never repeat. This key should also appear in your other tables so that you can connect them all together.

Say, for example:

SCHOOLYEAR+TERM+STATUS+CASENO

  • 2011-2012-1-FR-1234 (freshman)
  • 2011-2012-1-TR-1234 (transferee)
  • 2011-2012-1-UG-1234 (2nd UG degree)
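A minimal sketch of the idea, assuming the case number alone is not unique (it resets every term) and that the full SCHOOLYEAR+TERM+STATUS+CASENO combination identifies one record. The function names and the sequential numbering scheme are hypothetical; in a real database the non-repeating key would usually be an identity or auto-increment column.

```python
from itertools import count


def parse_composite_key(key):
    """Split 'SCHOOLYEAR-TERM-STATUS-CASENO' into its named parts."""
    start_year, end_year, term, status, case_no = key.split("-")
    return {
        "school_year": f"{start_year}-{end_year}",
        "term": int(term),
        "status": status,
        "case_no": int(case_no),
    }


def assign_surrogate_keys(keys):
    """Map each distinct composite key to a sequential integer that never repeats."""
    next_id = count(1)
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = next(next_id)
    return mapping


keys = [
    "2011-2012-1-FR-1234",
    "2011-2012-1-TR-1234",
    "2011-2012-1-UG-1234",
    "2011-2012-1-FR-1234",  # same record seen again: keeps its existing key
]
surrogates = assign_surrogate_keys(keys)
print(parse_composite_key(keys[0]))
print(surrogates)
```

Storing the resulting integer in every related table gives the tables one stable column to join on, even though the case number 1234 appears under three different statuses.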

SQL Server 2008
  1. Primary Key Constraint: A PRIMARY KEY constraint prevents duplicate values in the key column(s) and gives each row a unique identifier. By default it also creates a clustered index on those columns, unless the table already has a clustered index or a nonclustered index is specified.
  2. Foreign Key Constraint: When a FOREIGN KEY constraint is added to an existing column or columns in a table, SQL Server by default checks the existing data in those columns to ensure that all values, except NULL, exist in the column(s) of the referenced PRIMARY KEY or UNIQUE constraint.
  3. Default Constraint: A DEFAULT constraint supplies the value defined in the constraint whenever a row is inserted without a value for that column.
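The three constraints can be tried out with a small sketch. Note that this uses SQLite through Python's built-in sqlite3 module rather than SQL Server 2008, so details such as clustered indexes do not apply, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`. The table names, columns, and helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'FR'
    );
    CREATE TABLE grades (
        grade_id INTEGER PRIMARY KEY,
        student_id INTEGER REFERENCES students(student_id),
        grade REAL
    );
""")


def violates_constraint(connection, sql):
    """Return True if the statement is rejected with an IntegrityError."""
    try:
        connection.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Default constraint: status is omitted, so the default value is stored.
conn.execute("INSERT INTO students (student_id) VALUES (1)")
default_status = conn.execute(
    "SELECT status FROM students WHERE student_id = 1"
).fetchone()[0]

# Primary key constraint: a second row with the same key is rejected.
duplicate_rejected = violates_constraint(
    conn, "INSERT INTO students (student_id) VALUES (1)")

# Foreign key constraint: student 99 does not exist, so the row is rejected.
orphan_rejected = violates_constraint(
    conn, "INSERT INTO grades (student_id, grade) VALUES (99, 1.5)")

# NULL is exempt from the foreign key check.
null_allowed = not violates_constraint(
    conn, "INSERT INTO grades (student_id, grade) VALUES (NULL, 2.0)")

print(default_status, duplicate_rejected, orphan_rejected, null_allowed)
```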


Wednesday, 10 October 2012

Data Mining: Missing Data and Imputation Solutions

Missing data has been one of my constant problems in data mining. It greatly affects the accuracy of my data sets and the choice of the correct model in IBM SPSS Modeler. We have been discussing whether an imputation solution would be good to use, but my colleague still has hesitations about that method. So I am moving to my second option: generating more data sets to reach the number of records we need. Our director is also very demanding about our outputs.



Here are some imputation approaches you can use; first ask your colleagues or your boss whether they would allow them. When you are lacking records, IBM SPSS Modeler also offers an option: just choose the Reduce or Boost nodes in the model when using a C&R decision tree.

I have always wanted to use default values to replace missing values. For age, for example, if the values ranged from 18 to 21, I would probably put 17 in the missing fields to make my data more robust.
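Here is a minimal sketch of the default-value idea in plain Python, with mean imputation shown next to it purely for comparison. The sample ages are made up, missing values are represented as `None`, and the function names are hypothetical; this is not how IBM SPSS Modeler implements imputation.

```python
from statistics import mean


def impute_with_default(values, default):
    """Replace missing entries (None) with a fixed default value."""
    return [default if v is None else v for v in values]


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [18, None, 21, 19, None, 20]  # made-up sample data

default_filled = impute_with_default(ages, 17)
mean_filled = impute_with_mean(ages)

print(default_filled)
print(mean_filled)
```

One thing to check with your colleagues before using a default such as 17: it sits below every observed age, so it pulls the average of the column below the observed mean of 19.5, whereas mean imputation leaves the average unchanged.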


MULTIPLE IMPUTATION Command Additional Features in IBM SPSS

The command syntax language also allows you to:
• Specify a subset of variables for which descriptive statistics are shown (IMPUTATIONSUMMARIES subcommand).
• Specify both an analysis of missing patterns and imputation in a single run of the procedure.
• Specify the maximum number of model parameters allowed when imputing any variable (MAXMODELPARAM keyword)



Tuesday, 9 October 2012

BayesiaLab, a New Competitor to IBM SPSS Modeler

I came across this ad on my Facebook account: a BayesiaLab seminar in Singapore from November 6 to 8, 2012.


Bayesian Network Application


The course covers the basics of probabilistic graphical models and introduces BayesiaLab as the software platform for manually modeling and machine-learning Bayesian networks. Participants will learn how to generate Bayesian networks for a wide range of analytics tasks, including:
  • prediction/forecasting 
  • diagnostics
  • classification
  • clustering
  • missing values imputation
  • what-if scenario simulation
  • target optimization

Pricing Information

As a general reference point, prices for renting a commercial single-user license of BayesiaLab 5.0 start at approx. USD 3,500/year (Standard Edition).

Trial Version:

Download BayesiaLab 5.0.7 Trial

A free 30-day evaluation version of the latest release, BayesiaLab 5.0.7 Professional Edition, is available for immediate download from the Bayesia S.A.S. server after registration (Windows, OS X, Linux/Unix, 32/64-bit). This will allow you to experiment with the new features of BayesiaLab 5.0.7, plus you can try out all the Bayesian network examples explained in our series of white papers.


Personal Take:

At the rate of 41 pesos to 1 USD, the software costs only 143,500 pesos per year, compared to IBM SPSS Modeler, which costs 450,000 pesos or more per license. Note that the "S.A.S." in Bayesia S.A.S. is a French company designation (société par actions simplifiée), not SAS Institute, the competitor of the SPSS line of products. I have to say IBM will need to be more aggressive in marketing its data mining software and business intelligence products.
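The conversion can be checked with a few lines of Python. The exchange rate and both prices are the figures quoted in this post; the function and variable names are just for illustration.

```python
PHP_PER_USD = 41  # exchange rate quoted in this post


def usd_to_php(amount_usd, rate=PHP_PER_USD):
    """Convert a US dollar amount to Philippine pesos at the given rate."""
    return amount_usd * rate


bayesialab_php = usd_to_php(3500)  # yearly rental, Standard Edition
ibm_php = 450_000                  # lower bound quoted for the IBM license

ratio = ibm_php / bayesialab_php   # how many times more the IBM license costs

print(bayesialab_php)
print(round(ratio, 1))
```

At these figures the IBM license costs roughly three times the yearly BayesiaLab rental.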