Thursday, 25 April 2013

A Decision Tree Guide on what Statistical Tool to Use

Statistical Decision Tree based on type of data

Determine the data type
Data types range from nominal and ordinal to interval/scale.

If the data are ordinal, or the variable has a natural order, you may opt to test the relationships between variables or the differences among rankings. If the groups are independent of each other, use the Mann-Whitney U test to compare two groups or the Kruskal-Wallis ANOVA for three or more groups. If the groups are dependent (related), use the Wilcoxon signed-rank test for two groups and Friedman's two-way ANOVA for three or more groups.
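To make this branch concrete, here is a minimal sketch in Python using scipy.stats (not part of the original post); the rankings below are made-up data purely for illustration.

from scipy import stats

# Hypothetical rankings of the same eight items by three raters (made-up data).
rater_a = [1, 3, 2, 5, 4, 6, 8, 7]
rater_b = [2, 1, 4, 3, 6, 5, 7, 8]
rater_c = [1, 2, 3, 4, 5, 7, 6, 8]

# Independent groups (in practice, rankings from different subjects):
print(stats.mannwhitneyu(rater_a, rater_b))               # two groups
print(stats.kruskal(rater_a, rater_b, rater_c))           # three or more groups

# Dependent (related) groups, e.g. repeated measures on the same subjects:
print(stats.wilcoxon(rater_a, rater_b))                   # two related samples
print(stats.friedmanchisquare(rater_a, rater_b, rater_c)) # three or more related samples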



Thursday, 18 April 2013

E Learning Edge and Data Analytics

by Albert Anthony D. Gavino, MBA

I just attended a free session on E-Learning Edge: Data Analytics 102 (with the help of our good marketing expert Adolfo Aran III). Looking around, there were around 15 to 20 participants. Our speaker, who has a background in Statistics and an MBA, discussed the basics of data analytics, which involve the following:

  • Information
  • Statistics
  • Technology
  • Strategy
  • Communication
First of all, data analytics doesn't come at the push of a button. It involves proper problem statements for cases, cleaning up databases, using statistical software, and lastly proper reporting to stakeholders, for example through infographics. Models such as CHAID, decision trees, and logistic regression are all scientific, but at the end of the day it is our stakeholders whom we need to communicate with.

What are the resources that you need?
  1. Statistical software such as SPSS or SAS
  2. A data warehouse that is up and running
  3. Research-oriented people who have a background in Statistics
  4. Marketing people who know how to shape the information into palatable content
  5. And lastly, stakeholders who know how to use technology
Data mining, business intelligence, and data analytics are shaping the information technology industry with concepts such as Big Data, terabyte-scale hardware, cloud computing with the entry of Google Drive, and even aggressive marketing with data-driven ads targeted at customers, which Facebook and LinkedIn are now using.

Always keep in mind that innovation in data mining is growing at a fast rate, and new careers are now right in front of us.

For more information about E-Learning Edge


Sunday, 27 January 2013

SEM software in the market today


  • LISREL (one of the leaders in SEM and longtime advocate of SEM)
  • IBM-SPSS AMOS (a bit expensive from IBM, but tech support is worth it)
  • SMART PLS (freeware, uses partial least squares method and can run on a small sample size)
Most of these structural equation modeling (SEM) packages are used by statisticians who want to look deeper into their models through moderation and mediation analysis, and most of them appear in research journals and higher education.

Thursday, 10 January 2013

BULK insert statement in MS SQL SERVER 2008

by: ALBERT ANTHONY D. GAVINO

Using the BULK INSERT statement in MS SQL Server 2008 helps in importing data files from other sources, such as Excel exports or plain text (txt) files, into your tables in MS SQL Server.

Create a new table in MS SQL Server:
Open SQL Server Management Studio.
Select an instance.
Right click Databases > New Database,
name the database NBA_PLAYERS,
and create a table named Contacts.

Create the fields as follows (the column order must match the order of the values in the text file below):

IDNO (int, not null)
Firstname (varchar(50), not null)
Lastname (varchar(50), not null)

Create a text file in a C:\sample folder with the following records:

101, juan, delacruz
102, michael, jordan
103, kobe, bryant

and save it as sample.txt (so the full path is c:\sample\sample.txt, as used in the code below).

Then use the following code:

-- Run this against the NBA_PLAYERS database (USE NBA_PLAYERS)
BULK INSERT Contacts FROM 'c:\sample\sample.txt'
WITH
   (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
   )
GO

To view your appended table, run:

SELECT * FROM Contacts

Thursday, 03 January 2013

Using AdventureWorks 2008 as a Sample Database

BY: ALBERT ANTHONY D. GAVINO

RECOMMENDED DOWNLOAD

  • AdventureWorks2012_Database.zip (37,158 KB, uploaded Aug 29, 2012, 36,665 downloads)

OTHER AVAILABLE DOWNLOADS

  • AdventureWorksLT2012_Database.zip (1,131 KB, uploaded Aug 29, 2012, 5,988 downloads)
  • AdventureWorks2008R2_Database.zip (36,098 KB, uploaded Aug 29, 2012, 48,729 downloads)
  • AdventureWorksLT2008R2_Database.zip (1,028 KB, uploaded Aug 29, 2012, 12,976 downloads)
  • AdventureWorks2008_Database.zip (58,410 KB, uploaded Aug 29, 2012, 29,436 downloads)
  • AdventureWorksLT2008_Database.zip (1,353 KB, uploaded Aug 29, 2012, 7,854 downloads)

RELEASE NOTES

About this release
This release consolidates AdventureWorks databases for SQL Server 2012,
2008R2 and 2008 versions to one page. Each zip file contains an
mdf database file and ldf log file.

This should make it easier to find and download AdventureWorks databases
since all OLTP versions are on one page.

There are no database schema changes.

For each release of the product, there is a light-weight and full version of the
AdventureWorks sample database.
The light-weight version is denoted by the LT abbreviation in the filename.

An AdventureWorks database can be installed by attaching the database, or using the
 CREATE DATABASE statement.

To attach an AdventureWorks database
  1. Download a database for your SQL Server version.
  2. Unzip the database (mdf) file and log (ldf) file.
  3. From Microsoft SQL Server Management Studio,
    connect to a SQL Server instance.
  4. Right click Databases.
  5. Click Attach.
  6. Click the Add button.
  7. Locate the AdventureWorks database mdf file. For instance,
    AdventureWorks2012_Data.mdf.
  8. Click the OK button on the Locate Database Files dialog window.
  9. Click the OK button on the Attach Databases dialog window to
     attach the database.
CREATE DATABASE statement
From SQL Server Management Studio, execute the following T-SQL:

Example T-SQL
CREATE DATABASE AdventureWorks2012
ON (FILENAME = 'M:\Data\AdventureWorks2012_Data.mdf'),
(FILENAME = 'L:\Tlogs\AdventureWorks2012_Log.ldf')
FOR ATTACH;

MERGE Statement With Table-Valued Parameters


Using a Table-Valued Parameter (User-Defined Table Type) as the Source Table in the MERGE Statement
(Available in MS SQL Server 2008 and later)
This article continues the discussion of the Table-Valued Parameters article and will combine this new feature with the MERGE statement, yet another new feature of SQL Server 2008.  The stored procedure at the end of that article is as follows:

CREATE PROCEDURE [dbo].[usp_ProcessContact]
@Contact ContactTemplate READONLY
AS

-- Update First Name and Last Name for Existing Emails
UPDATE A
SET [FirstName] = B.[FirstName],
[LastName] = B.[LastName]
FROM [dbo].[Contact] A INNER JOIN @Contact B
ON A.[Email] = B.[Email]

-- Add New Email Addresses
INSERT INTO [dbo].[Contact] ( [Email], [FirstName], [LastName] )
SELECT [Email], [FirstName], [LastName]
FROM @Contact A
WHERE NOT EXISTS (SELECT 'X' FROM [dbo].[Contact] B
WHERE A.[Email] = B.[Email])
GO

The definition of the ContactTemplate user-defined table type is as follows:
CREATE TYPE [ContactTemplate] AS TABLE (
[Email] VARCHAR(100),
[FirstName] VARCHAR(50),
[LastName] VARCHAR(50)
)
GO

As can be seen from this stored procedure, it first updates the FirstName and LastName in the [dbo].[Contact] table for all contacts that exist in the @Contact table-valued parameter, matched on the Email address.  Then it adds the contacts that are in the @Contact table-valued parameter but are not yet in the [dbo].[Contact] table.
Converting to the MERGE Statement
The first step in converting the UPDATE and INSERT statements into the MERGE statement is to identify the source table and the destination or target table.  In this case the source table is the @Contact user-defined table variable and the destination or target table is the [dbo].[Contact] table.  Once the tables have been identified, we then need to determine the columns that will be used to "merge" the two tables.  In our case, this will be the [Email] column of the [dbo].[Contact] table joined with the [Email] column of the @Contact user-defined table variable.
We then use these tables in the MERGE statement and join them on the columns identified, which will look as follows:
MERGE [dbo].[Contact] AS [Target]
USING @Contact AS [Source]
ON [Target].[Email] = [Source].[Email]

After identifying the tables to "merge" and the columns to join them on, the next step is to determine what to do for those records that exist in both tables based on the joined column.  In our case, we want to update the [FirstName] and [LastName] in the [dbo].[Contact] target table with the values from the @Contact source table.
WHEN MATCHED THEN
UPDATE SET [FirstName] = [Source].[FirstName],
[LastName] = [Source].[LastName]

Since the target table has already been identified in the MERGE statement, there's no need to include the target table in the UPDATE statement.  If the target table is included in the UPDATE statement, you will get the following error message:
WHEN MATCHED THEN
UPDATE [dbo].[Contact]
SET [FirstName] = [Source].[FirstName],
[LastName] = [Source].[LastName]

Msg 102, Level 15, State 1, Procedure usp_ProcessContact, Line 11
Incorrect syntax near 'dbo'.

After specifying the statement to perform for those records that exist in both the source table and the target table, the next step is to determine what to do for those records that exist in the source table but are not in the target table.  These records are the new contacts that are not yet in the [dbo].[Contact] table.  In our case, we want to add those new contacts to the [dbo].[Contact] destination table.  The INSERT statement that will be included in the MERGE statement will look like this:
WHEN NOT MATCHED THEN
INSERT ( [Email], [FirstName], [LastName] )
VALUES ( [Source].[Email], [Source].[FirstName], [Source].[LastName] );

Just like in the UPDATE statement above, since the target table has already been specified in the MERGE statement, there's no need to include the target table in the INSERT statement.  If the target table is included in the INSERT statement, you will get the following error message:
WHEN NOT MATCHED THEN
INSERT INTO [dbo].[Contact] ( [Email], [FirstName], [LastName] )
VALUES ( [Source].[Email], [Source].[FirstName], [Source].[LastName] );

Msg 156, Level 15, State 1, Procedure usp_ProcessContact, Line 14
Incorrect syntax near the keyword 'INTO'.


Putting It All Together


Converting the UPDATE and INSERT statements into a single MERGE statement, the stored procedure will now look like this (drop or ALTER the earlier version of the procedure first if it already exists):

CREATE PROCEDURE [dbo].[usp_ProcessContact]
@Contact ContactTemplate READONLY
AS

MERGE [dbo].[Contact] AS [Target]
USING @Contact AS [Source]
ON [Target].[Email] = [Source].[Email]
WHEN MATCHED THEN
UPDATE SET [FirstName] = [Source].[FirstName],
[LastName] = [Source].[LastName]
WHEN NOT MATCHED THEN
INSERT ( [Email], [FirstName], [LastName] )
VALUES ( [Source].[Email], [Source].[FirstName], [Source].[LastName] );
GO
To use this stored procedure, you simply declare a local variable of type ContactTemplate, fill it with data, and pass it as a table-valued parameter to the stored procedure.
DECLARE @Contacts ContactTemplate

INSERT INTO @Contacts
VALUES ( 'mickey@mouse.com', 'Mickey', 'Mouse' ),
( 'minnie@mouse.com', 'Minnie', 'Mouse' )

EXECUTE [dbo].[usp_ProcessContact] @Contacts
SELECT * FROM [dbo].[Contact]

Email              FirstName   LastName
------------------ ----------- -----------
mickey@mouse.com   Mickey      Mouse
minnie@mouse.com   Minnie      Mouse

Monday, 03 December 2012

Hot Deck Imputation for Missing Values

BY ALBERT ANTHONY D. GAVINO

Hot Deck Imputation

This method sorts respondents and non-respondents into a number of imputation subsets according to a user-specified set of covariates. An imputation subset comprises cases with the same values as those of the user-specified covariates. Missing values are then replaced with values taken from matching respondents (i.e. respondents that are similar with respect to the covariates).
If there is more than one matching respondent for any particular non-respondent, the user has two choices:
  1. The first respondent’s value as counted from the missing entry downwards within the imputation subset is used to impute. The reason for this is that the first respondent’s value may be closer in time to the case that has the missing value. For example, if cases are entered according to the order in which they occur, there may possibly be some type of time effect in some studies.
  2. A respondent's value is randomly selected from within the imputation subset. If a matching respondent does not exist in the initial imputation class, the subset will be collapsed by one level, starting with the last variable that was selected as a sort variable, until a match can be found. Note that if no matching respondent is found, even after all of the sort variables have been collapsed, three options are available:
  1. Re-specify new sort variables: the user can specify up to five sort variables.
  2. Perform random overall imputation: the missing value will be replaced with a value randomly selected from the observed values in that variable.
  3. Do not impute the missing value: SOLAS will not impute any missing values for which no matching respondent is found.
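
To make the mechanics concrete, below is a minimal hot deck sketch in Python with pandas. It illustrates the general idea (grouping cases by user-specified covariates and drawing a random donor value from matching respondents), not the SOLAS implementation; the covariates, column names, and data are made up.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up survey data: "region" and "gender" are the covariates, "income" has gaps.
df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S", "S"],
    "gender": ["F", "F", "M", "M", "M", "F"],
    "income": [45000, np.nan, 52000, 38000, np.nan, 41000],
})

covariates = ["region", "gender"]  # the user-specified covariates / sort variables

def impute_group(s):
    """Randomly impute missing values in one imputation subset from its donors."""
    donors = s.dropna()
    if donors.empty:
        return s  # no matching respondent: leave the value missing (option 3 above)
    filled = s.copy()
    mask = filled.isna()
    filled[mask] = rng.choice(donors.to_numpy(), size=mask.sum())
    return filled

# Cases sharing the same covariate values form an imputation subset; each missing
# value is replaced by a value taken from a randomly selected matching respondent.
df["income"] = df.groupby(covariates)["income"].transform(impute_group)
print(df)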