Thursday, 25 April 2013

Starting a Career in Business Intelligence and Data Mining

by Albert Anthony D. Gavino

New grads like to start a career that's cool and techie. But what do college students need to learn to get a career in this new industry?

  • A background in statistics, which involves choosing the right statistical tool from a research perspective (for example, choosing a parametric test instead of a non-parametric test) and using techniques such as logistic regression to predict outcomes.
  • A background in IT. A basic knowledge of SQL scripts and statements is helpful, and some knowledge of data warehousing is an advantage, as data ranges from flat files to cubes and some warehouses use a snowflake schema.
  • A background in marketing, as you need to present reports as infographics to your stakeholders. Creating a story out of your models and theories gives your company a successful story to tell.
More to come

One and Two Tailed Tests

Going in what direction?

Suppose we have a null hypothesis H0 and an alternative hypothesis H1. We consider the distribution given by the null hypothesis and perform a test to determine whether or not the null hypothesis should be rejected in favor of the alternative hypothesis.

There are two different types of tests that can be performed. A one-tailed test looks for a change in the parameter in one specified direction (an increase or a decrease), whereas a two-tailed test looks for any change in the parameter, whether an increase or a decrease.

We can perform the test at any level (usually 1%, 5% or 10%). For example, performing the test at the 5% level means that, if H0 is in fact true, there is a 5% chance of wrongly rejecting it.

If we perform the test at the 5% level and decide to reject the null hypothesis, we say "there is significant evidence at the 5% level to suggest the null hypothesis is false".


 One-Tailed Test

We choose a critical region. In a one-tailed test, the critical region has just one part, a single tail of the distribution. If our sample value lies in this region, we reject the null hypothesis in favor of the alternative.

Suppose we are looking for a definite decrease. Then the critical region will be to the left. Note, however, that in the one-tailed test the value of the parameter can be as high as you like.

Example


Suppose we are given that X has a Poisson distribution and we want to carry out a hypothesis test on the mean, λ, based upon a sample observation of 3.

Suppose the hypotheses are:

H0: λ = 9
H1: λ < 9

We want to test if it is "reasonable" for the observed value of 3 to have come from a Poisson distribution with parameter 9. So what is the probability that a value as low as 3 has come from a Po(9)?

P(X ≤ 3) = 0.0212 (this has come from a Poisson table)

The probability is less than 0.05, so if the null hypothesis were true there would be less than a 5% chance of observing a value as low as 3 from a Po(9) distribution. We therefore reject the null hypothesis in favour of the alternative at the 5% level.
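
The table lookup above can also be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the function names (poisson_pmf, poisson_cdf, lower_critical_value) are made up for this illustration and are not from any statistics package. It also assumes the convention that we reject when the tail probability is at or below the significance level.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Po(lam), summed term by term."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def lower_critical_value(lam: float, alpha: float) -> int:
    """Largest c with P(X <= c) <= alpha, or -1 if even c = 0 is too likely."""
    c = -1
    while poisson_cdf(c + 1, lam) <= alpha:
        c += 1
    return c


lam_null = 9.0  # mean under H0
observed = 3    # the sample observation
alpha = 0.05    # testing at the 5% level

p_value = poisson_cdf(observed, lam_null)
critical = lower_critical_value(lam_null, alpha)
reject = p_value <= alpha

print(f"P(X <= {observed}) = {p_value:.4f}")
print(f"critical region: X <= {critical}")
print("reject H0" if reject else "do not reject H0")
```

Because the test looks for a decrease, only the lower tail is checked, which is what makes it one-tailed.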

Two-Tailed Test

In a two-tailed test, we are looking for either an increase or a decrease. So, for example, H0 might be that the mean is equal to 9 (as before).

This time, however, H1 would be that the mean is not equal to 9. In this case, therefore, the critical region has two parts, one in each tail of the distribution.

Example


Let's test the parameter p of a Binomial distribution at the 10% level.

Suppose a coin is tossed 10 times and we get 7 heads. We want to test whether or not the coin is fair. If the coin is fair, p = 0.5.


Put this as the null hypothesis:
H0: p = 0.5
H1: p ≠ 0.5

Now, because the test is 2-tailed, the critical region has two parts. Half of the critical region is to the right and half is to the left. So the critical region contains both the top 5% of the distribution and the bottom 5% of the distribution (since we are testing at the 10% level).

If H0 is true, X ~ Bin(10, 0.5).

If the null hypothesis is true, what is the probability that X is 7 or above?


P(X ≥ 7) = 1 - P(X < 7) = 1 - P(X ≤ 6) = 1 - 0.8281 = 0.1719

Is this in the critical region? No, because the probability that X is at least 7 is not less than 0.05 (5%), which is what we need it to be.

So there is not significant evidence at the 10% level to reject the null hypothesis.
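
The coin example can be checked the same way. The sketch below uses only the standard library and assumes the approach described above, where the 10% level is split into 5% for each tail; the helper names (binom_pmf, upper_tail, lower_tail) are hypothetical and chosen just for this illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Bin(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n = 10        # number of tosses
p_null = 0.5  # H0: the coin is fair
heads = 7     # observed number of heads
alpha = 0.10  # testing at the 10% level, so 5% in each tail

tail = upper_tail(heads, n, p_null)
reject = tail <= alpha / 2

# Values of X that fall in either half of the two-part critical region.
critical_region = [
    x for x in range(n + 1)
    if lower_tail(x, n, p_null) <= alpha / 2
    or upper_tail(x, n, p_null) <= alpha / 2
]

print(f"P(X >= {heads}) = {tail:.4f}")
print(f"critical region: {critical_region}")
print("reject H0" if reject else "do not reject H0")
```

Listing the critical region makes the two parts visible: only very low or very high head counts would lead us to reject the fair-coin hypothesis, and 7 is not among them.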

Selecting Statistical Tests

by Albert Anthony D. Gavino

Parametric and Non-parametric tests

My office mate uses technical terms in statistics like parametric and non-parametric tests, but what are they actually? Parametric tests are those that involve interval-like data such as weight and height, on which numerical calculations (such as a mean) can be carried out. They also assume that the data follow a particular distribution, usually the normal distribution. Non-parametric tests make no such assumption and are used for ranked data or for variables we cannot compute such values on, like the categories male and female.

These distinctions matter to researchers as they plan their data collection around a specific statistical tool.

Experimental Conditions

Any research design has experimental conditions. A researcher may want one condition, two conditions, or three or more conditions. The more conditions there are, the more complex the design becomes.

Related or Unrelated designs

This simply refers to whether the same participants are used more than once. If you use one group of participants and then test them again, the design is related; if you use independent groups, the design is unrelated.

Decision Charts

Decision charts are useful for working out which statistical test fits the research problem you want to solve. Here is an example of a statistical decision chart.



A Decision Tree Guide on what Statistical Tool to Use

Statistical Decision Tree based on type of data

Determine the data types
Data types are nominal, ordinal, or interval/scale.

If the data is ordinal, that is, the variable has an order, you may opt to test the relationships between variables or the differences among rankings. If the groups are independent of each other, you can use the Mann-Whitney test for two groups or the Kruskal-Wallis ANOVA for three or more groups. If the groups are dependent on each other, use the Wilcoxon test for two groups and Friedman's two-way ANOVA for three or more groups.
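
The ordinal branch of the decision tree can be written down as a small lookup function. This is just a sketch of the rules stated above, not a complete test selector; the function name choose_ordinal_test and its parameters are invented for this example.

```python
def choose_ordinal_test(related: bool, groups: int) -> str:
    """Pick a test for ordinal data from the design and the number of groups.

    related=True means the same participants are measured more than once
    (dependent groups); related=False means independent groups.
    """
    if groups < 2:
        raise ValueError("need at least two groups to compare")
    if related:
        return "Wilcoxon" if groups == 2 else "Friedman's two-way ANOVA"
    return "Mann-Whitney" if groups == 2 else "Kruskal-Wallis ANOVA"


# Two independent groups, then four repeated measurements on the same people.
print(choose_ordinal_test(related=False, groups=2))
print(choose_ordinal_test(related=True, groups=4))
```

Only two questions are needed for ordinal data: whether the groups are related, and whether there are two groups or more than two.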



Thursday, 18 April 2013

E Learning Edge and Data Analytics

by Albert Anthony D. Gavino, MBA

I just attended a free session on E-Learning Edge: Data Analytics 102 (with the help of our good marketing expert Adolfo Aran III). We looked around and there were around 15 to 20 participants. Our speaker had a background in statistics and an MBA, and discussed the basics of data analytics, which involve the following:

  • Information
  • Statistics
  • Technology
  • Strategy
  • Communication
First of all, data analytics doesn't come with the push of a button. It involves a proper problem statement for each case, cleaning up databases, using statistical software, and lastly proper reporting to stakeholders, for example through infographics. Models such as CHAID, decision trees and logistic regression models are all scientific, but at the end of the day it's our stakeholders we need to communicate with.

What are the resources that you need?
  1. Statistical software such as SPSS or SAS
  2. A data warehouse that is up and running
  3. Research-oriented people who have a background in statistics
  4. Marketing people who know how to shape the info into palatable content
  5. Lastly, shareholders who know how to use technology
Data mining, business intelligence and data analytics are shaping the information technology industry with concepts such as big data, terabyte-scale storage hardware, cloud computing (with the entry of Google Drive), and even aggressive marketing through data-driven ads targeted at customers, which Facebook and LinkedIn are now using.

Always keep in mind that innovation in data mining is growing exponentially, and we now have new careers right in front of us.
