Chi Square – Contingency Table and Goodness of Fit Examples

What is Chi Square and when to use it

Chi Square is a widely used test for checking association between variables, and it is explained here with very simple examples so that the concept is easy to follow. It is used to check whether a factor has an effect on an output, and also to check the goodness of fit of various distributions.

Important to note: Chi Square is used when both X and Y are discrete (categorical) data types. The Chi Square statistic should be computed only on counts of data. If the data are in percentage form, they should be converted to counts first. Another assumption is that the observations are drawn independently.

Chi Square is used for the following two purposes:

Purpose I

1. To test a hypothesis about several proportions (contingency table): Chi Square is used to test the significance of the observed association in a cross tabulation. The null hypothesis is that there is no association between the variables. The test is conducted by computing the cell frequencies that would be expected if no association were present between the variables, given the row and column totals.
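
To make the idea of expected cell frequencies concrete, here is a minimal Python sketch. The function name `expected_counts` and the counts are assumptions made up for illustration (two hypothetical shifts, good vs. defective parts). Under "no association", each expected count is the row total times the column total, divided by the grand total.

```python
def expected_counts(observed):
    """Expected cell counts under no association:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts (illustration only):
# rows = two shifts, columns = (good parts, defective parts)
observed = [[90, 10],
            [70, 30]]

expected = expected_counts(observed)
print(expected)  # [[80.0, 20.0], [80.0, 20.0]]
```

Both shifts get the same expected split because they have the same row total. The test then asks whether the observed counts are too far from this split to be explained by chance.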

Chi-Square Test of Independence

  • Ho: A factor has no effect on the output
  • Ha: A factor has an effect on the output

Purpose II
2. Chi-Square Goodness-of-Fit Test

Chi Square can also be used to determine whether a certain model fits the observed data. These tests are conducted by calculating the significance of the sample's deviation from the assumed theoretical (expected) distribution. This can be performed on cross tabulations as well as on frequencies (one-way tabulation). The calculation of the Chi Square statistic and the determination of its significance are the same as for the first purpose.

  • Ho: The hypothesized distribution is a good fit of the data
  • Ha: The hypothesized distribution is not a good fit of the data

Chi Square example (Contingency table)

To test a hypothesis about several proportions (contingency table)

It is often necessary to compare proportions representing various process conditions. Machines may be compared as to their ability to produce precise parts. The ability of inspectors to identify defective parts can be evaluated. This application of Chi Square is called the Contingency table or row and column analysis.

The procedure is as follows:

1. Take one subgroup from each of the various processes and determine the observed frequencies (O) for the various conditions being compared.

2. Calculate for each condition the expected frequencies (E) under the assumption that no differences exist among the processes.

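3. Compare the observed and expected frequencies to see how far the data depart from the no-difference assumption. The following calculation is made for each condition:

$$\frac{(O - E)^2}{E}$$

4. Total across all the process conditions:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

This is the most “famous” Chi-Square statistic.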

5. A critical value is determined. The degrees of freedom are calculated as (R-1)(C-1): the number of rows minus 1, times the number of columns minus 1.

6. A comparison between the test statistic and the critical value shows whether a significant difference exists (at the selected confidence level). If the statistic exceeds the critical value, the null hypothesis is rejected.
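
The six steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_contingency`, the three hypothetical machines and their counts are assumptions, not data from this article. The critical values are the standard Chi-Square table values at the 5% significance level.

```python
def chi_square_contingency(observed):
    """Return (chi-square statistic, degrees of freedom) for an R x C table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Step 2: expected frequency if no differences exist among the processes
            e = row_totals[i] * col_totals[j] / grand_total
            # Steps 3 and 4: (O - E)^2 / E, summed over every cell
            statistic += (o - e) ** 2 / e

    # Step 5: degrees of freedom = (R - 1)(C - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Standard chi-square critical values at the 5% significance level, keyed by degrees of freedom
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Step 1: hypothetical observed counts (illustration only)
# rows = machines A, B, C; columns = (good parts, defective parts)
observed = [[95, 5],
            [90, 10],
            [85, 15]]

statistic, dof = chi_square_contingency(observed)

# Step 6: compare the statistic with the critical value
significant = statistic > CRITICAL_05[dof]
print(f"chi-square = {statistic:.3f}, dof = {dof}, significant at 5%: {significant}")
```

With these made-up counts the statistic is about 5.56 on 2 degrees of freedom, which is below the critical value of 5.991, so the difference among the machines is not significant at the 5% level.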

Comparing proportions (Contingency tables)

Suppose we are analyzing the performance of the German soccer team in Germany and overseas during the last 2 years. We look at the performance data and come up with the following figures.

The data have two classifications.

This table is called a 2 x 2 contingency table (2 rows, 2 columns).

We are comparing two proportions here, i.e. victories in Germany and victories overseas.

Let’s hypothesize that the proportion of victories is the same in home conditions and abroad.
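
For a 2 x 2 table like this one, the statistic can be computed with a well-known shortcut formula, and with 1 degree of freedom the p-value can be obtained from the normal distribution. The sketch below uses only the Python standard library. The win counts are invented purely for illustration and are not the article's figures, the function names are hypothetical, and no continuity (Yates) correction is applied.

```python
import math


def two_by_two_chi_square(a, b, c, d):
    """Chi-square statistic for the 2 x 2 table [[a, b], [c, d]]
    using the shortcut n(ad - bc)^2 / (row and column totals multiplied together).
    No continuity correction is applied."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts (illustration only): matches won vs. not won
home_wins, home_other = 15, 5    # played in Germany
away_wins, away_other = 9, 11    # played overseas

statistic = two_by_two_chi_square(home_wins, home_other, away_wins, away_other)
p_value = p_value_one_dof(statistic)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3f}")
```

With these invented counts the statistic is 3.75, slightly below the 5% critical value of 3.841 for 1 degree of freedom, so the p-value is just above 0.05 and the hypothesis of equal victory proportions would not be rejected.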

Preparing Contingency table

Chi-Square in Minitab

Chi-Square Goodness-of-Fit Test

Goodness-of-fit tests are procedures in which the data are structured in cells (as in a contingency table). In each cell there is an observed frequency (O). The expected or theoretical frequency (E) is either known from the nature of the problem or can be calculated. Chi Square is then summed across the cells as per the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

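The calculated Chi Square is then compared to the Chi Square critical value for the following degrees of freedom: for a one-way tabulation, the number of cells minus 1, reduced by one more for each distribution parameter estimated from the data.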

Chi-Square Goodness-of-Fit Test

  • Ho: The hypothesized distribution is a good fit of the data
  • Ha: The hypothesized distribution is not a good fit of the data
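
As a small worked example of a goodness-of-fit test, the sketch below checks whether a die is fair. The roll counts, the function name `goodness_of_fit` and the choice of a uniform distribution are assumptions for illustration only. Because no parameter is estimated from the data, the degrees of freedom are taken as the number of cells minus 1.

```python
def goodness_of_fit(observed, expected_proportions):
    """Return (chi-square statistic, degrees of freedom) for a one-way table.
    Assumes no distribution parameters were estimated from the data."""
    total = sum(observed)
    statistic = 0.0
    for o, p in zip(observed, expected_proportions):
        e = p * total                    # expected count in this cell
        statistic += (o - e) ** 2 / e    # (O - E)^2 / E
    dof = len(observed) - 1
    return statistic, dof


# Standard chi-square critical value at the 5% significance level for 5 degrees of freedom
CRITICAL_05_DOF_5 = 11.070

# Hypothetical data (illustration only): 60 rolls of a die, counts for faces 1 to 6
observed = [8, 9, 12, 11, 6, 14]
fair_die = [1 / 6] * 6   # Ho: every face is equally likely

statistic, dof = goodness_of_fit(observed, fair_die)
reject_h0 = statistic > CRITICAL_05_DOF_5
print(f"chi-square = {statistic:.2f}, dof = {dof}, reject Ho at 5%: {reject_h0}")
```

Here the statistic is 4.2 on 5 degrees of freedom, well below 11.070, so with these made-up counts there is no evidence that the fair-die model is a poor fit.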
