Wednesday, 02 October 2013

Big Data vs Small Data (why our traditional Stat teachers are against it)

by Albert Anthony D. Gavino

Parametric vs Non-Parametric, Small Data vs Big Data:
Which is the superior breed?





The traditional "parametric" tests, such as t-tests and the analysis of variance, assume the population(s) to be normally distributed; they generally assume that one's measures derive from an equal-interval scale. 

Non-parametric tests do not assume that the data follow a normal distribution. Some of them are the following:
  • the various forms of the chi-square test
  • Fisher's Exact Probability Test
  • Mann-Whitney Test
  • Wilcoxon Signed-Rank Test
  • Kruskal-Wallis Test
  • and the Friedman Test

In the field of Big Data, non-parametric statistics is the higher science and the more powerful one, as noted by one of the UP professors.

We no longer assume that distributions are normal, so we can't simply reach for t-tests or ANOVA.
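To make the idea concrete, here is a minimal sketch of the Mann-Whitney U statistic from the list above, written in plain Python. It only ranks the pooled observations, so it makes no normality assumption. The function name and the two small samples are made up for illustration, and the sketch stops at the U statistic (no p-value).

```python
from itertools import chain


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, averaging ranks for ties."""
    pooled = sorted(chain(x, y))
    # Map each distinct value to the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_x = sum(rank_of[v] for v in x)
    n_x, n_y = len(x), len(y)
    # U_x counts the (x, y) pairs in which x is larger (ties count one half).
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Hypothetical samples: group_b contains an extreme value that would distort a t-test.
group_a = [1.1, 2.3, 2.9, 3.5]
group_b = [4.2, 5.0, 5.8, 40.0]
print(mann_whitney_u(group_a, group_b))
```

Because only the ranks matter, replacing 40.0 with 400.0 leaves the statistic unchanged, which is the robustness that non-parametric tests trade for some power when the data really are normal.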


Thursday, 26 September 2013

The book has arrived!

Predictive Analytics Book by Eric Siegel

Finally, the book I requested from Prediction Impact has arrived! The package was sent by my good friend Bobbe Cook from the Marketing Dept of Predictive Analytics World via UPS, with shipping at $20.15 and the book priced at $28.00. (A thumbs up to these guys!)

Fortunately, I got it all for free for being a Predictive Analytics World Blog Partner, a partnership program by Prediction Impact Inc. to reach out to Asian countries.

Package delivered on Sept 26, 2013

The book is hardbound with glossy pages

Front Cover

"What Nate Silver did for poker and politics, this does for everything else. A broad, well-written book easily accessible to non-nerd readers."
-DAVID LEINWEBER, author of Nerds on Wall Street: Math, Machines and Wired Markets

"This is Moneyball for business, government, and healthcare"
-JIM STERNE, founder, eMetrics Summit; chairman, Digital Analytics Association

Back Cover of the Book

Some of its pages are printed in color on glossy paper. I can't wait to read all the chapters, from mortgage risk to the Ensemble Effect, crowdsourcing, and Persuasion by the Numbers: How Telenor, U.S. Bank and the Obama Campaign Engineered Influence (that's uplift!). Geeky as it sounds, this book is for beginners and is friendly to those just starting out in predictive analytics.

Inside Pages of the book

Tuesday, 24 September 2013

12th National Convention on Statistics


The Philippine Statistical System (PSS) through the National Statistical Coordination Board (NSCB) invites you to participate in the 12th National Convention on Statistics (NCS) which will be held on 1-2 October 2013 at EDSA Shangri-La Hotel, Mandaluyong City.

Day 1 – October 1, 2013 (Tuesday)

7:00 – 8:45     Registration
8:45 – 9:00     Ribbon Cutting of the Statistical Information Management Exhibit (SIMEX) 2013
9:00 – 10:00    JOINT OPENING CEREMONIES of the 24th National Statistics Month and 12th National Convention on Statistics (Isla Ballroom)
                Welcome Remarks: Jose Ramon G. Albert, NSCB Secretary General
                Keynote Address: Senate President Franklin Drilon
10:10 – 10:30   Coffee Break
10:30 – 12:15   Plenary Session 1 (Isla Ballroom)
                Accelerating Inclusive Growth and Competitiveness through Statistics
                Session Organizer: Prof. Solita Collas-Monsod
                Presenters: Mr. Guillermo Luz, Dr. Josef T. Yap
                Discussant: Dr. Johannes Jutting
12:15 – 1:15    Lunch Break
1:15 – 2:45     Simultaneous Scientific Sessions (10 sessions)
2:50 – 4:20     Simultaneous Scientific Sessions (10 sessions)
4:25 – 4:40     Coffee Break
4:40 – 6:10     Simultaneous Sessions (10 sessions)

Day 2 – October 2, 2013 (Wednesday)

8:30 – 10:00    Simultaneous Scientific Sessions (10 sessions)
10:00 – 10:15   Coffee Break
10:15 – 12:00   Plenary Session 2 (Isla Ballroom)
                Social media and statistics (big data, organic data and privacy)
                Session Organizer: Dr. Corinne Grace Burgos
                Presenters: Mr. Johannes Jutting, Dr. Jose Ramon G. Albert, Ms. Jeanette Beltran
12:00 – 1:00    Lunch Break
1:00 – 2:30     Simultaneous Scientific Sessions (10 sessions)
2:35 – 4:20     Plenary Session 3 (Isla Ballroom)
                Managing risks to development (disaster and financial risk management, social protection, food security)
                Session Organizer: Asst. Secretary Lila Shahani
                Presenters: Asst. Secretary Romeo Recide, Gov. Jose Clemente "Joey" Salceda
4:20 – 4:35     Coffee Break
4:35 – 5:00     CLOSING CEREMONY of the 12th NCS (Isla Ballroom)

Tuesday, 03 September 2013

Root Mean Square Error of Approximation (RMSEA) and WarpPLS

by Albert Anthony D. Gavino, MBA

Model fit is one important aspect of SEM, otherwise known as Structural Equation Modeling. RMSEA, or Root Mean Square Error of Approximation, is an index of how well the model fits our data, with experts saying an acceptable value is less than 0.05.
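As a rough illustration, the RMSEA point estimate is usually computed from the model chi-square, its degrees of freedom and the sample size. The sketch below uses the common textbook form with N - 1 in the denominator; some programs use N instead, so treat the exact convention as an assumption. The function name and the input numbers are made up.

```python
from math import sqrt


def rmsea(chi_square, df, n):
    """RMSEA point estimate from a model chi-square, its degrees of freedom, and sample size n."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    # Misfit beyond what the degrees of freedom alone would predict, floored at zero.
    excess = max(chi_square - df, 0.0)
    return sqrt(excess / (df * (n - 1)))


# Hypothetical model output: chi-square = 85.0 on 40 degrees of freedom, 301 respondents.
value = rmsea(85.0, 40, 301)
print(f"RMSEA = {value:.3f}", "(below 0.05)" if value < 0.05 else "(not below 0.05)")
```

A chi-square at or below its degrees of freedom gives an RMSEA of exactly zero, which is why the index is floored rather than allowed to go negative.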

My colleagues in this field use two software programs: IBM SPSS Amos and WarpPLS, which uses the Partial Least Squares method. Both are popular among moderation and mediation experts. WarpPLS is recommended for users who want deeper insight into their moderating variables, while IBM SPSS Amos is good for researchers looking into mediating variables. Nevertheless, both programs are powerful and give researchers a good tool for arriving at the best model.

I guess this is what happens when your colleagues are statistical nerds, Psychometricians and highly cognitive people who just have fun with RMSEA values and Structural Equation Modeling.

Wednesday, 28 August 2013

Predictive Analytics World Upcoming Events this Sept to Nov 2013

Predictive Analytics World is the business-focused event for predictive analytics professionals, managers and commercial practitioners, covering today's commercial deployment of predictive analytics, across industries and across software vendors. The conference delivers case studies, expertise, and resources in order to strengthen the business impact delivered by predictive analytics.  

Upcoming Events:

Online Videos on Predictive Analytics

Additional Information:

Please visit www.predictiveanalyticsworld.com and www.textanalyticsworld.com for more information about these events.

Sunday, 28 July 2013

EDB Singapore and Data Scientists

by Albert Anthony D. Gavino

Would you consider working in Singapore? Predictive analytics, business intelligence, and analytics in areas ranging from HR and business to non-profits and even medical science are now the "in" fields when talking about the sexiest occupations of the next two to three years.

Business-sector software is dominated by three big names: IBM, SAP and SAS. Companies prefer a background in statistics, machine learning and databases, and it would be a big help if applicants and beginners got a grasp of concepts such as logistic regression, time series, cluster analysis and association. There is still a lot of open source software available, such as WEKA and RapidMiner, though the free tools are more raw and need a bit more sharpening compared to the big names. They won't hurt to try, though, and they offer more power than your Excel spreadsheets and pivot tables.

Friday, 26 July 2013

Global News Network: Big Data Analytics

I just finished a program interview on GNN entitled Big Data Analytics, which covers Business Intelligence and Predictive Analytics. It will be aired on Channel 8, Destiny Cable/Sky Cable, on Tuesday, July 30, from 10 am to 11 am. Mr. Toti Casino is the host and the current president of the Philippine Computer Society (PCS).


Tuesday, 23 July 2013

Predictive Analytics World Conference 2013

Cross-Industry, Cross-Vendor Sessions
The only conference of its kind, Predictive Analytics World delivers vendor-neutral sessions across verticals such as banking, financial services, e-commerce, entertainment, government, healthcare, high technology, insurance, non-profits, publishing, and retail.
And PAW covers the gamut of commercial applications of predictive analytics, including response modeling, customer retention with churn modeling, product recommendations, online marketing optimization, behavior-based advertising, fraud detection, insurance pricing and credit scoring.
Why bring together such a wide range of endeavors? No matter how you use predictive analytics, the story is the same: Predictively scoring customers and other organizational elements optimizes business performance. Predictive analytics initiatives across industries leverage the same core predictive modeling technology, share similar project overhead and data requirements, and face common process challenges and analytical hurdles.
The Cross-Vendor Summit:
  • Meet the vendors and learn about their solutions, software and services
  • Discover the best predictive analytics vendors available to serve your needs
  • Learn what they do and see how they compare.
Valuable Colleagues:
  • Mingle, network and hang out with your best and brightest colleagues
  • Exchange experiences over lunch, breaks and the conference reception, connecting with those professionals who face the same challenges as you.

Conference scope

Predictive Analytics World's sessions cover business applications of predictive analytics, including:
  • Marketing and CRM (offline and online)
    • Response modeling
    • Customer retention with churn modeling
    • Acquisition of high-value customers
    • Direct marketing
    • Database marketing
    • Profiling and cloning
  • Online marketing optimization
    • Behavior-based advertising
    • Email targeting
    • Website content optimization
  • Product recommendation systems
  • Insurance pricing
  • Credit scoring
  • Fraud detection

Sunday, 21 July 2013

Predictive Analytics book by Eric Siegel

by Albert Anthony D. Gavino

For Predictive Analytics newbies, this would be a good book. It discusses analytics in a simpler way and shows how companies are using analytics to their advantage.

The book focuses more on applications than on the technical know-how of predictive analytics.

Predictive Analytics book by Eric Siegel
In this rich, entertaining primer, former Columbia University professor and Predictive Analytics World founder Eric Siegel reveals the power and perils of prediction:
  • What type of mortgage risk Chase Bank predicted before the recession.

  • Predicting which people will drop out of school, cancel a subscription, or get divorced before they are even aware of it themselves.

  • Why early retirement decreases life expectancy and vegetarians miss fewer flights.

  • Five reasons why organizations predict death, including one health insurance company.

  • How U.S. Bank, European wireless carrier Telenor, and Obama's 2012 campaign calculated the way to most strongly influence each individual.

  • How IBM's Watson computer used predictive modeling to answer questions and beat the human champs on TV's Jeopardy!.

  • How companies ascertain untold, private truths — how Target figures out you're pregnant and Hewlett-Packard deduces you're about to quit your job.

  • How judges and parole boards rely on crime-predicting computers to decide who stays in prison and who goes free.

  • What's predicted by the BBC, Citibank, ConEd, Facebook, Ford, Google, IBM, the IRS, Match.com, MTV, Netflix, Pandora, PayPal, Pfizer, and Wikipedia.


Sunday, 14 July 2013

What are Lift Charts?


by Albert Anthony D. Gavino

Lift Charts help us evaluate data mining models.


You can tell from the chart that the ideal line peaks at around 40 percent, meaning that if you had a perfect model, you could reach 100 percent of your targeted customers by sending a mailing to only 40% of the total population. The actual lift for the filtered model when you target 40 percent of the population is between 60 and 70 percent, meaning you could reach 60-70 percent of your targeted customers by sending the mailing to 40 percent of the total customer population.


A lift chart example from Microsoft Business Intelligence
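The reading of the chart described above can be sketched in a few lines of Python: rank customers by model score, contact the top slice, and see what share of the real responders falls inside it. The function name, scores and response labels here are invented for illustration and are not taken from the Microsoft example.

```python
def cumulative_gains(scores, labels, fraction):
    """Share of all responders captured when contacting the top `fraction` of customers by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_contacted = round(len(ranked) * fraction)
    captured = sum(label for _, label in ranked[:n_contacted])
    return captured / sum(labels)


# Hypothetical mailing list of 10 customers; label 1 means the customer responded.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gains = cumulative_gains(scores, labels, 0.40)
lift = gains / 0.40  # how much better than mailing a random 40 percent
print(f"Top 40% of the list captures {gains:.0%} of responders (lift {lift:.2f})")
```

A random mailing to 40 percent of the list would be expected to reach about 40 percent of responders, so a lift above 1 is what the model buys you.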

Sunday, 07 July 2013

HR analytics

Today, offices have become smarter. Even the HR department has become more evil, complete with devil tails, collecting our data from time to time. Take this for example:

Predictive Analytics on Employee data

  • On what days do employees take sick leave?
  • Can we predict whether an employee who gets sick more often is more likely to resign?
  • How about predicting in which months female employees get pregnant or give birth?
  • What common illnesses are present during the month of June? Are employees filing sick leave for the common cold?

Now consider yourself an HR executive with all this rich big data at your fingertips. You can connect it with Facebook and LinkedIn data. Would employees who have undergone training post about it on Facebook or LinkedIn? Are your employees joining professional organizations? How do you track your employees' whereabouts? Are they drinking on Friday nights, or are they watching movies with friends and family?

a Call Center Employee


IBM Products:

IBM SPSS Modeler and IBM predictive analytics have the power to analyze your employee data.

Wednesday, 26 June 2013

IBM SPSS Data Modeler

My Review on IBM SPSS Data Modeler

Usability and interface:

Overall, the software is very easy to use. You can data mine without knowing SQL syntax or the scripts used on tables. It also works easily with Excel and SPSS files, and it is compatible with SAS files.


Tools 

The Auto Classifier and Auto Cluster modes are very helpful for the lazy data miner who would like to compare the accuracy of three or more models on their data marts or data sets.

There are a lot of models to choose from: CHAID and C5 decision trees, Logistic Regression, Cox Regression, Generalized Linear Models and Generalized Mixed Models. There are also models tailored to the financial sector, such as the Recency, Frequency and Monetary node, otherwise known as the RFM node.

It also features the Anomaly node for fraud detection, which is very useful for credit card and loan accounts and for detecting default risk.

I still have to see what SAS and SAP have to offer in terms of the range of models available in their predictive analytics software. Some do say they are number one in business intelligence, but each software package has its strengths and weaknesses depending on the user.

Monday, 24 June 2013

Data modeling Tips for the newbie

What tips can we give a newbie on data modeling?

Here is some simple advice:

Create and plan your Data Warehouse, Data Structure and Architecture

  1. Scope and plan your data. Data architecture matters because it lets you plan well ahead of data troubles such as data integrity and compatibility issues. Plan what files you will be dealing with, from flat files to cubes.
  2. Have a background in Statistics. Knowing a little bit about normal curves and testing for normal distributions will help, but you have to keep on reading for complex models such as Neural Networks and Bayesian Networks.
  3. Know a little bit of scripting. Learning SQL can help in extracting, transforming and loading your data (ETL), so a few select statements would be a good way to start, such as "SELECT * FROM customers WHERE customerid = 200" (with "customers" standing in for your own table name).
  4. Know your business objectives. Do you want models that give you leverage over your competitors, or do you want customer relationship management?
  5. Lastly, execute your model. Your boss would be happy if you are able to deploy your complex model and show it to the board of trustees.
Overall, data mining is not as simple as it seems, but at least you can get some pointers from these :)
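
For tip 3, here is a small sketch of a select statement you can run without installing a database server, using Python's built-in sqlite3 module. The "customers" table, its columns and its rows are all made up for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(100, "Ana", "Manila"), (200, "Ben", "Cebu"), (300, "Carla", "Davao")],
)

# The "extract" step: a parameterized SELECT keeps the value separate from the SQL text.
row = conn.execute(
    "SELECT customerid, name, city FROM customers WHERE customerid = ?", (200,)
).fetchone()
print(row)
conn.close()
```

Passing the customer id as a parameter instead of pasting it into the string is a habit worth picking up early, since it avoids quoting mistakes and SQL injection.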

Happy Data Mining!

Thursday, 23 May 2013

Data Mining the 2013 COMELEC Results

by Albert Anthony D. Gavino

All this trending hype about 60-30-10 is just simple math. From the viewpoint of math and statistics there is no conspiracy to it, and of course there will be variability from precinct to precinct.

Things that you should consider before making your conclusions:

1. Provide Statistical Power Analysis Estimates like Sample Power

A simple explanation from the Indiana University website says that "statistical power analysis estimates the power of the test to detect a meaningful effect, given sample size, test size (significance level), and standardized effect size." 


Sample Power used for two-sample proportions, two-tailed, with alpha at 0.10

Sample-power software is a tool for showing general readers how much statistical confidence can be placed in the COMELEC results, given COMELEC data from each precinct. (No, I won't do that since it would be tedious, but a team of individuals might.)

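For readers without sample-power software, the same kind of estimate can be roughed out with Python's standard library. This is a minimal sketch of the usual normal-approximation power formula for a two-tailed test of two proportions, with alpha at 0.10 as in the caption above. The function name and the proportions and precinct sizes in the example are hypothetical, not COMELEC figures.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.10):
    """Approximate power of a two-tailed z-test comparing two independent proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error under the null (pooled proportion) and under the alternative.
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    diff = abs(p1 - p2)
    # Probability of landing in either rejection region when the true difference is `diff`.
    upper = 1 - z.cdf((z_crit * se_null - diff) / se_alt)
    lower = z.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


# Hypothetical question: can we tell a 60% share from a 55% share with n voters per precinct?
for n in (200, 1000, 2000):
    print(n, round(power_two_proportions(0.60, 0.55, n), 3))
```

When the two proportions are equal, the function returns alpha itself, which is a handy sanity check: with no real effect, the test rejects only at its false-alarm rate.
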
2. Show the Data sheet


NAME                              PARTY LIST         TOTAL        %
HONTIVEROS, RISA (AKBAYAN)        INDEPENDENT      8900861    3.73%
HAGEDORN, ED                      INDEPENDENT      6876841    2.88%
VILLANUEVA, BRO.EDDIE (BP)        INDEPENDENT      5603663    2.35%
CASIÑO, TEDDY (MKB)               INDEPENDENT      3491581    1.46%
DELOS REYES, JC (KPTRAN)          INDEPENDENT       988795    0.41%
ALCANTARA, SAMSON (SJS)           INDEPENDENT       957212    0.40%
BELGICA, GRECO (DPP)              INDEPENDENT       898719    0.38%
PENSON, RICARDO                   INDEPENDENT       825149    0.35%
DAVID, LITO (KPTRAN)              INDEPENDENT       821033    0.34%
MONTAÑO, MON                      INDEPENDENT       777484    0.33%
LLASOS, MARWIL (KPTRAN)           INDEPENDENT       564291    0.24%
SEÑERES, CHRISTIAN (DPP)          INDEPENDENT       561041    0.23%
FALCONE, BAL (DPP)                INDEPENDENT       516863    0.22%
POE, GRACE                        PNOY            16340333    6.84%
LEGARDA, LOREN (NPC)              PNOY            14942824    6.26%
ESCUDERO, CHIZ                    PNOY            14137127    5.92%
CAYETANO, ALAN PETER (NP)         PNOY            14129783    5.92%
ANGARA, EDGARDO (LDP)             PNOY            12853305    5.38%
AQUINO, BENIGNO BAM (LP)          PNOY            12376372    5.18%
PIMENTEL, KOKO (PDP)              PNOY            11846088    4.96%
TRILLANES, ANTONIO IV (NP)        PNOY            11389173    4.77%
VILLAR, CYNTHIA HANEPBUHAY (NP)   PNOY            11070265    4.64%
ENRILE, JUAN PONCE JR.(NPC)       PNOY             9167583    3.84%
MAGSAYSAY, RAMON JR. (LP)         PNOY             9153842    3.83%
MADRIGAL, JAMBY (LP)              PNOY             5409440    2.26%
BINAY, NANCY (UNA)                UNA             13310851    5.57%
EJERCITO ESTRADA, JV (UNA)        UNA             11010630    4.61%
HONASAN, GRINGO (UNA)             UNA             10620981    4.45%
GORDON, DICK (UNA)                UNA             10160019    4.25%
ZUBIRI, MIGZ (UNA)                UNA              9490215    3.97%
MAGSAYSAY, MITOS (UNA)            UNA              4484515    1.88%
MACEDA, MANONG ERNIE (UNA)        UNA              2746359    1.15%
COJUANGCO, TINGTING (UNA)         UNA              2405682    1.01%
TOTAL                                            238828920  100.01%

PARTY LIST         TOTAL     %
INDEPENDENT     31783533   13%
PNOY           142816135   60%
UNA             64229252   27%
TOTAL          238828920  100%

Computing this in Excel, you would probably get the same results, give or take a small margin. Yes, it's close to 60-30-10 if you add up all the total votes. It's just pure math, not based on conspiracy theories.
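
If you would rather not use Excel, the same arithmetic is a few lines of Python. This sketch takes the three party-list totals from the data sheet above and divides each by the grand total; the variable names are just for illustration.

```python
# Party-list totals as tabulated in the data sheet above.
coalition_votes = {
    "INDEPENDENT": 31_783_533,
    "PNOY": 142_816_135,
    "UNA": 64_229_252,
}

total_votes = sum(coalition_votes.values())
shares = {name: votes / total_votes for name, votes in coalition_votes.items()}

for name, share in shares.items():
    print(f"{name:<12}{share:6.1%}")
print(f"{'TOTAL':<12}{total_votes:,}")
```

Any fixed split of national totals like this will reappear in running tallies whenever the batches being added are large and reasonably mixed, which is all the pattern shows on its own.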


3. Compare your results with the top 12 Senatoriables

NAME                              PARTY LIST    TOTAL VOTES
POE, GRACE                        PNOY           16,340,333
LEGARDA, LOREN (NPC)              PNOY           14,942,824
ESCUDERO, CHIZ                    PNOY           14,137,127
CAYETANO, ALAN PETER (NP)         PNOY           14,129,783
BINAY, NANCY (UNA)                UNA            13,310,851
ANGARA, EDGARDO (LDP)             PNOY           12,853,305
AQUINO, BENIGNO BAM (LP)          PNOY           12,376,372
PIMENTEL, KOKO (PDP)              PNOY           11,846,088
TRILLANES, ANTONIO IV (NP)        PNOY           11,389,173
VILLAR, CYNTHIA HANEPBUHAY (NP)   PNOY           11,070,265
EJERCITO ESTRADA, JV (UNA)        UNA            11,010,630
HONASAN, GRINGO (UNA)             UNA            10,620,981

PARTY LIST    TOTAL VOTES     %
PNOY          119,085,270   77%
UNA            34,942,462   23%
TOTAL         154,027,732  100%

Results show that about 77 percent of the votes cast for the top 12 went to candidates from the PNOY party list (9 of the 12 seats), while about 23 percent went to the opposition, UNA (3 seats).

4. Plan for future research

Purugganan says that there is much that we can learn from Comelec's election data. "In all this furor, I think political analysts are missing something important: that, while votes for individual senators may be different in different regions, the country may be pretty much coalesced into clear pro- and anti-administration voting blocs. This may actually mean that we do have such a thing as national parties that voters across the country vote for," he said.


Going through these statistics, the party list a candidate runs under appears to be an important variable in predicting who is most likely to land in the Senatorial top 12. Other factors also have to be considered, such as the impact of well-regarded family names like BINAY and POE.