What is ANOVA and when to use it
This article explains the concept of ANOVA (analysis of variance) and its different types in simple terms with examples, and shows how to run ANOVA in Minitab and interpret the output. ANOVA is an important and versatile technique for analyzing relationships in data. It is used when X is categorical and Y is continuous.
Definition: ANOVA is an analysis of the variation present in an experiment. It is used for examining the differences in the mean values of the dependent variable associated with the effect of the independent variables. Essentially, ANOVA is used as a test of means for two or more populations.
The tests in an ANOVA are based on the F-ratio: the variation due to an experimental treatment or effect divided by the variation due to experimental error.
Before we move ahead, we need to understand the following four terms very clearly:
- Dependent Variable – Analysis of variance must have a dependent variable that is continuous. This is our “Y” (total sales, in our example); its value depends on the levels of “X” or “Xs” in our experiment or analysis.
- Independent Variable – ANOVA must have one or more categorical independent variables, such as sales promotion. These variables are also called factors.
- Null hypothesis – All means are equal.
- Factor level – Each factor can have multiple levels; for example, Heavy, Medium and Low are three levels of sales promotion.
Different forms of ANOVA
There are three types of ANOVA, chosen based on the number and type of independent variables (Xs). The dependent variable (Y) is always continuous.
Fig 1 explains the types of Anova with an example. In this example “Y” is total sales of a general store in $ which is a continuous variable and it is common for the three examples.
Eta square : The strength of the effects of X on Y is measured by Eta square. The value of Eta square varies between 0 and 1.
F Statistic : The null hypothesis that the category means are equal in the population is tested by an F statistic based on the ratio of mean square related to X and mean square related to error.
Mean square : The mean square is the sum of squares divided by the appropriate degrees of freedom.
SS(between/x) : This is the variation in Y related to the variation in the means of the categories of X. It represents the variation between the categories of X, or the portion of the sum of squares in Y related to X.
SS(within/error) : Also referred to as SS(error), this is the variation in Y due to the variation within each of the categories of X. This variation is not accounted for by X.
SS(y) : The total variation in Y.
The total variation in Y, denoted by SSy, can be decomposed into two components:
SSy = SSx + SSerror
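The decomposition can be checked numerically. The following is a minimal Python sketch, separate from the Minitab workflow used in this article; the function name `sum_of_squares` and the sales figures are made up for illustration only.

```python
from statistics import mean

def sum_of_squares(groups):
    """Return (SSy, SSx, SSerror) for a list of samples, one per category of X."""
    all_values = [y for group in groups for y in group]
    grand_mean = mean(all_values)
    # Total variation of Y around the grand mean
    ss_y = sum((y - grand_mean) ** 2 for y in all_values)
    # Between-category variation: category means around the grand mean
    ss_x = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-category variation: observations around their own category mean
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return ss_y, ss_x, ss_error

# Hypothetical sales for three promotion levels (not data from this article)
sales_by_promotion = [[10, 9, 10, 8, 9], [8, 8, 7, 9, 6], [5, 7, 6, 4, 5]]
ss_y, ss_x, ss_error = sum_of_squares(sales_by_promotion)
print(round(ss_y, 2), round(ss_x, 2), round(ss_error, 2))
```

For this made-up data, SSx works out to about 36.4 and SSerror to about 13.2, which together give the total of about 49.6.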
Test the Significance in ANOVA
In one-way ANOVA, the interest lies in testing the null hypothesis that the category means are equal in the population.
H0 : Mean1 = Mean2 = Mean3 = ... = Meanc, where c is the number of categories of X
Under the null hypothesis, SSx and SSerror come from the same source of variation. In that case, the population variance of Y can be estimated from either the between-category or the within-category variation.
Recall that X has multiple levels or categories; refer back to the introduction of ANOVA if this is unclear.
The null hypothesis is tested by the F statistic, which is the ratio of two estimates: the mean square due to X (between) and the mean square due to error (within):
F = MSx/MSerror
Under the null hypothesis, this ratio follows the F distribution, which is the probability distribution of the ratio of two sample variances.
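As a rough illustration of how the F ratio is assembled from the sums of squares and degrees of freedom, here is a small Python sketch. The helper `one_way_f` and the data are hypothetical; statistical packages such as Minitab also report the P value, which this sketch does not compute.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, DF between, DF within) for a list of samples, one per category of X."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    ss_x = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_x = len(groups) - 1                    # number of categories - 1
    df_error = len(all_values) - len(groups)  # observations - number of categories
    ms_x = ss_x / df_x                        # mean square due to X (between)
    ms_error = ss_error / df_error            # mean square due to error (within)
    return ms_x / ms_error, df_x, df_error

# Hypothetical sales for three promotion levels (not data from this article)
sales_by_promotion = [[10, 9, 10, 8, 9], [8, 8, 7, 9, 6], [5, 7, 6, 4, 5]]
f_value, df_x, df_error = one_way_f(sales_by_promotion)
print(round(f_value, 2), df_x, df_error)
```

A large F means the between-category mean square is large relative to the within-category mean square, which is evidence against the null hypothesis of equal means.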
Interpretation of ANOVA test
If the null hypothesis of equal category means is not rejected, then the independent variable doesn’t have a significant effect on the dependent variable. On the other hand, if the null hypothesis is rejected, the effect of the independent variable is significant.
A significant effect means that the mean value of the dependent variable differs across the categories of the independent variable X.
How to measure strength of effect of X on Y : Eta square
The effect of X on Y is measured by SSx. SSx is related to the variation in the means of the categories of X. The relative magnitude of SSx increases as the differences among the category means increase, and also as the variation in Y within the categories of X decreases.
The strength of the effect of X on Y is measured as:
Eta square = SSx / SSy
The value of Eta square varies between 0 and 1. It takes a value of 0 when all the category means are equal, indicating that X has no effect on Y. The higher the value of Eta square and the closer it is to 1, the more of the variation in Y is explained by the independent variable X.
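The two extremes of Eta square can be seen with a short Python sketch. The function `eta_squared` and all three datasets are invented for illustration.

```python
from statistics import mean

def eta_squared(groups):
    """Eta square = SSx / SSy for a list of samples, one per category of X."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    ss_y = sum((y - grand_mean) ** 2 for y in all_values)
    ss_x = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_x / ss_y

# All category means are equal, so X explains none of the variation in Y
no_effect = eta_squared([[4, 6], [5, 5], [3, 7]])
# No variation within categories, so X explains all of the variation in Y
full_effect = eta_squared([[2, 2], [5, 5], [8, 8]])
# A more typical in-between case (hypothetical sales data)
partial_effect = eta_squared([[10, 9, 10, 8, 9], [8, 8, 7, 9, 6], [5, 7, 6, 4, 5]])
print(no_effect, full_effect, round(partial_effect, 3))
```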
Objective: To test the effect of cause X on the CTQ (critical-to-quality) measure Y
Usage: When cause X is categorical (grouped) and CTQ Y is continuous data
- A project was taken to Reduce the Processing Time.
- One of the causes suspected was lack of experience.
The following data on processing time was collected for 3 levels of experience. Analyze the data and verify whether lack of experience is a cause of high processing time.
As you can see below, we have divided our staff into three categories based on their experience. Employees in their first month on the job are grouped under “0” months of experience. Employees with more than 1 month but less than 6 months of experience are grouped under “6” months. Employees with more than 6 months of experience are grouped under the “12” months category.
We have taken samples of processing time from employees in each category. The samples and sample sizes are for illustration only, so the sample counts are kept low.
We are using Minitab for the calculations, though you can use other software such as SPSS or R. The output will look broadly similar and can be interpreted in the same way.
The results of running ANOVA in Minitab are presented above. DF is degrees of freedom; Experience level has 2 DF (3 levels - 1 = 2).
SS is the Sum of Squares. For the factor row it is also known as SS(between): the variation in Y related to the variation in the means of the categories of X. It represents the variation between the categories of X, or the portion of the sum of squares in Y related to X.
MS is the Mean Square, which is SS divided by DF (please refer to the earlier section “Test the Significance in ANOVA” for more details).
N-Way Analysis of Variance explained with Two-Way ANOVA example
N-way ANOVA can be two-way ANOVA, three-way ANOVA, or higher, depending on the number of independent variables. We are going to take an example of two-way ANOVA here.
As we have already seen, there are three types of ANOVA, chosen based on the number and type of independent variables (Xs). The dependent variable (Y) is always continuous.
Just as a recap, the figure below explains the types of ANOVA with an example. In this example “Y” is the total sales of a general store in $, which is a continuous variable and is common to the three examples.
Two-Way ANOVA explained with example
In most studies, more than one factor affects the response variable, and we would like to study these factors together rather than limiting ourselves to one.
Some examples are:
- How do consumers’ intentions to buy a brand vary with different levels of advertising and different levels of features?
- How do advertising levels (high, medium, low) interact with pricing levels (high, medium, low) to influence overall sales?
- How does quality score vary with different levels of training (high, low) and different levels of tenure (high, medium, low)?
N-way ANOVA can be used to determine these effects on the dependent variable. A major advantage of n-way ANOVA is that it enables us to examine interactions between the factors. Interactions occur when the effect of one factor on the dependent variable depends on the level of another factor.
The procedure of conducting two-way ANOVA is similar to one-way ANOVA, which we have already seen.
Please refer to the store sales table below. The last column has the sales data, which is continuous and is the dependent variable in our example. We want to examine the effect of the level of in-store promotion and the level of couponing on store sales.
So, we have two independent factors: in-store promotion and coupon level. This is a case of two-way ANOVA because there are two categorical independent factors.
Our hypothesis would be:
This table has sales data for 30 stores; the 2nd and 3rd columns hold the categorical independent variables.
Two-Way ANOVA with Minitab
Understanding Two Way ANOVA Minitab output
The results of running a 2*3 ANOVA in Minitab are presented below. DF is degrees of freedom; Coupon Level has 1 DF (2 levels - 1 = 1) and In Store Promotion has 2 DF (3 levels - 1 = 2).
SS is the Sum of Squares. For each factor row it is also known as SS(between): the variation in Y related to the variation in the means of the categories of X. It represents the variation between the categories of X, or the portion of the sum of squares in Y related to X.
MS is the Mean Square, which is SS divided by DF. F is the F test statistic, which is used in ANOVA.
To keep it simple, check the “P Value” of each row to reach the conclusion of the hypothesis test: a P value below your chosen significance level (commonly 0.05) means the null hypothesis for that factor or interaction is rejected.
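To make the DF, SS, MS and F columns less of a black box, here is a minimal Python sketch of a two-way ANOVA with interaction. It assumes a balanced design (the same number of stores in every coupon and promotion combination) and uses the textbook formulas for that case. The function `two_way_anova` and the sales numbers are hypothetical and are not the 30-store table from this article. P values are not computed here.

```python
from statistics import mean

def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction.

    cells maps (level of A, level of B) -> list of Y values. Every combination
    must be present with the same number of observations (at least 2).
    Returns {source: (DF, SS, MS, F)}; F is None for the error row.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    complete = len(cells) == len(levels_a) * len(levels_b)
    if not complete or n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch needs a balanced design with replication")

    grand = mean([y for v in cells.values() for y in v])
    mean_a = {a: mean([y for b in levels_b for y in cells[a, b]]) for a in levels_a}
    mean_b = {b: mean([y for a in levels_a for y in cells[a, b]]) for b in levels_b}
    cell_mean = {key: mean(v) for key, v in cells.items()}

    ss = {
        "A": len(levels_b) * n * sum((mean_a[a] - grand) ** 2 for a in levels_a),
        "B": len(levels_a) * n * sum((mean_b[b] - grand) ** 2 for b in levels_b),
        "A*B": n * sum((cell_mean[a, b] - mean_a[a] - mean_b[b] + grand) ** 2
                       for a in levels_a for b in levels_b),
        "Error": sum((y - cell_mean[key]) ** 2 for key, v in cells.items() for y in v),
    }
    df = {"A": len(levels_a) - 1, "B": len(levels_b) - 1}
    df["A*B"] = df["A"] * df["B"]
    df["Error"] = len(cells) * (n - 1)

    ms_error = ss["Error"] / df["Error"]
    table = {}
    for source in ("A", "B", "A*B", "Error"):
        ms = ss[source] / df[source]
        f = None if source == "Error" else ms / ms_error
        table[source] = (df[source], ss[source], ms, f)
    return table

# Hypothetical sales: factor A = coupon level, factor B = in-store promotion
sales = {
    ("coupon", "high"): [10, 9], ("coupon", "medium"): [8, 8], ("coupon", "low"): [5, 7],
    ("no coupon", "high"): [8, 9], ("no coupon", "medium"): [4, 5], ("no coupon", "low"): [2, 3],
}
table = two_way_anova(sales)
for source, (df, ss, ms, f) in table.items():
    print(source, df, round(ss, 3), round(ms, 3), None if f is None else round(f, 2))
```

The "A*B" row is the interaction: it captures how much the effect of couponing changes across promotion levels, beyond what the two main effects explain on their own.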
ANCOVA or Analysis of Covariance
ANCOVA is different from ANOVA because ANCOVA includes at least one categorical independent variable and at least one interval or metric independent variable.
The categorical independent variable is called a factor and the metric (continuous) independent variable is called a covariate.
The most common use of a covariate is to remove extraneous variation from the dependent variable, because the effects of the factors (the categorical independent variables) are of major concern. The variation in the dependent variable due to the covariate is removed by adjusting the dependent variable’s mean value within each treatment condition.
An analysis of variance is then performed on the adjusted scores. The significance of the covariate is checked using an F test.
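The adjustment can be sketched in Python for the simplest case: one factor and one covariate, assuming the covariate has the same slope in every category. This is the classic textbook calculation, not necessarily what Minitab does internally, and the function `ancova_one_factor` and the data are invented for illustration.

```python
from statistics import mean

def ancova_one_factor(groups):
    """One-factor ANCOVA with a single covariate and a common slope.

    groups maps a factor level -> list of (covariate, y) pairs.
    """
    pairs_all = [p for pairs in groups.values() for p in pairs]
    x_grand = mean([x for x, _ in pairs_all])
    y_grand = mean([y for _, y in pairs_all])

    # Total sums of squares and cross products (around the grand means)
    sxx_t = sum((x - x_grand) ** 2 for x, _ in pairs_all)
    sxy_t = sum((x - x_grand) * (y - y_grand) for x, y in pairs_all)
    syy_t = sum((y - y_grand) ** 2 for _, y in pairs_all)

    # Within-category sums of squares and cross products (around category means)
    sxx_w = sxy_w = syy_w = 0.0
    level_means = {}
    for level, pairs in groups.items():
        xm = mean([x for x, _ in pairs])
        ym = mean([y for _, y in pairs])
        level_means[level] = (xm, ym)
        sxx_w += sum((x - xm) ** 2 for x, _ in pairs)
        sxy_w += sum((x - xm) * (y - ym) for x, y in pairs)
        syy_w += sum((y - ym) ** 2 for _, y in pairs)

    slope = sxy_w / sxx_w                       # pooled within-category slope
    ss_error_adj = syy_w - sxy_w ** 2 / sxx_w   # error after removing the covariate
    ss_total_adj = syy_t - sxy_t ** 2 / sxx_t
    ss_factor_adj = ss_total_adj - ss_error_adj
    df_factor = len(groups) - 1
    df_error = len(pairs_all) - len(groups) - 1  # one extra DF is spent on the slope
    ms_error = ss_error_adj / df_error
    return {
        "slope": slope,
        "adjusted_means": {lvl: ym - slope * (xm - x_grand)
                           for lvl, (xm, ym) in level_means.items()},
        "F_factor": (ss_factor_adj / df_factor) / ms_error,
        "F_covariate": (sxy_w ** 2 / sxx_w) / ms_error,
    }

# Hypothetical (clientele rating, sales) pairs for two promotion levels
stores = {
    "low promotion": [(1, 3), (2, 4), (3, 6)],
    "high promotion": [(3, 8), (4, 9), (5, 12)],
}
result = ancova_one_factor(stores)
print(result["slope"], result["adjusted_means"])
```

In this made-up data the raw mean sales differ a lot between the two promotion levels, but part of that gap comes from the high-promotion stores also having a higher clientele rating. The adjusted means are closer together because that part has been removed.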
Example of ANCOVA
We can refer back to the data table used in the two-way ANOVA example to illustrate analysis of covariance. That table has two independent variables, in-store promotion and coupon level, both of which are categorical. Suppose we want to check the effect of these independent variables on sales while controlling for the effect of clientele.
It is felt that the affluence of the clientele may also have an effect on sales. Clientele is measured on an interval scale and serves as the covariate in our example.