**What is a Boxplot (Box and Whisker Plot)?**

There are many graphical methods for summarizing data, such as boxplots, stem-and-leaf plots, scatter plots, histograms and probability distribution plots. Of these, the boxplot is one of the simplest and most useful ways to show data graphically.

A boxplot is useful for visually comparing different data sets (preferably of the same size) taken from the same population. For example, you may want to compare the performance of different teams doing similar work, or the performance of different lots or batches in a manufacturing environment.

**When do we use it**

- In the first stages of data analysis
- To get a basic feel for the data without assuming which distribution it follows

The boxplot is credited to John W. Tukey. It is simply a picture of the five-number summary of a data set: the minimum, the lower quartile, the median, the upper quartile and the maximum. The median is the line dividing the box, and the lower and upper quartiles define the ends of the box.

The minimum and maximum data points sit at the ends of the lines (whiskers) extending from the box. A simple boxplot is shown below.

Outliers can also be identified: they are points (often drawn as asterisks) that lie more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. When outliers are drawn separately, the whiskers stop at the most extreme points that are not outliers.

Before we move forward, let us quickly look at what the spread of data is and how it is measured. Spread describes how much the values in a data set vary and how wide a range they cover.
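The 1.5 × IQR rule can be sketched in a few lines of Python. This is only an illustration: the helper name `iqr_outliers` and the `scores` data are made up for the example, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions. Other tools may place the fences slightly differently.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Quartile convention is an assumption; other methods shift the fences a little.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Anything outside the fences is flagged as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical data, invented for illustration only
scores = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 45]
lower, upper, outliers = iqr_outliers(scores)
print(lower, upper, outliers)  # 6.0 26.0 [45]
```

Here the value 45 falls above the upper fence, so it would be drawn as a separate point rather than as the end of the whisker.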

**Measures of Spread**

**Range**

Numerical distance between the highest and the lowest values in a data set.

Range = Max – Min

**Standard Deviation**

The square root of the variance. It is the most commonly used measure of variability.

If you have two data sets “A” and “B”, and the standard deviation of A is higher, then the spread of A is greater than that of B. If we plot histograms of both, the histogram of A will be wider than that of B. A boxplot can be read as a compact view of the same distribution that a histogram shows: if the histogram is wider, the boxplot will also be wider, and vice versa.

We can also see this by looking at the two bell curves shown below. A bell curve with higher variation is wider than a curve with less variation. The same principle applies to boxplots: the greater the variation or spread in the data, the wider the boxplot.

So now we know that we have to look at the size of the boxes to compare the performance shown by different boxplots. Another important point is to look at the median and the mean, and how close they are to the target.
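As a rough illustration of comparing two data sets A and B, the sketch below computes the range and the sample standard deviation with Python's standard `statistics` module. The helper `spread` and the values in `data_a` and `data_b` are invented for this example.

```python
import statistics

def spread(values):
    """Return (range, sample standard deviation) for a list of numbers."""
    data_range = max(values) - min(values)   # Range = Max - Min
    std_dev = statistics.stdev(values)       # sample standard deviation (n - 1 denominator)
    return data_range, std_dev

# Hypothetical data sets with the same mean but different spread
data_a = [2, 6, 10, 14, 18]
data_b = [8, 9, 10, 11, 12]

range_a, sd_a = spread(data_a)
range_b, sd_b = spread(data_b)
print(f"A: range={range_a}, sd={sd_a:.2f}")
print(f"B: range={range_b}, sd={sd_b:.2f}")
```

Both measures come out larger for A, so its histogram and its boxplot would both be wider than those of B.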
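The comparison described here (box size plus the position of the median and mean relative to a target) can be put into numbers. The sketch below is only one way to do it: the team names, the data, the target of 50 and the helper `summarize_group` are all hypothetical, and it assumes that a higher value is better, so a group "meets the target" when its median is at or above it.

```python
import statistics

def summarize_group(values, target):
    """Summarize one group: centre, box size (IQR), and position versus the target."""
    # Quartile convention is an assumption; other methods give slightly different IQRs.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "iqr": q3 - q1,                    # height of the box
        "meets_target": median >= target,  # assumes higher is better
    }

target = 50  # hypothetical target
groups = {
    "Team X": [48, 50, 52, 54, 56],
    "Team Y": [30, 40, 45, 60, 70],
}
summaries = {name: summarize_group(vals, target) for name, vals in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

In this made-up data, Team X has a small box and a median above the target, while Team Y has a much larger box and a median below it.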

**Quartiles in layman's terms –**

We need to understand quartiles because they are used to build boxplots. With the help of quartiles, we break the data set into four equal parts.

Dividing the data set into two parts – We can divide the data set into two parts using the median, which is the middle value of the sorted data set. The data must therefore be sorted before the median can be found.

**Creating quartiles –**

Now that we have divided the data set into two parts, the next step is to divide each part into two again in the same manner.

In the picture below you can see that we have divided Part 1 into two equal parts and Part 2 into two equal parts. Now we have four equal parts, or quartiles.

Now that we have created our quartiles, we can plot a boxplot using MS Excel. In the boxplot, the median is shown as a line inside the box, with the value “16”.
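The halving procedure described above can be sketched in Python as follows. The function name `quartiles_by_halving` and the sample data are made up for illustration. The sketch also assumes that, when the number of values is odd, the median itself is left out of both halves, which is one common convention but not the only one.

```python
import statistics

def quartiles_by_halving(values):
    """Return (Q1, median, Q3) by splitting the sorted data at the median."""
    data = sorted(values)              # the median requires sorted data
    n = len(data)
    half = n // 2
    lower_half = data[:half]           # Part 1
    # For an odd count, skip the middle value so both halves have equal size.
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]   # Part 2
    q1 = statistics.median(lower_half)
    q2 = statistics.median(data)
    q3 = statistics.median(upper_half)
    return q1, q2, q3

# Hypothetical data, invented for illustration only
print(quartiles_by_halving([3, 5, 7, 9, 11, 13, 15, 17]))  # (6.0, 10.0, 14.0)
```

Spreadsheet and statistics packages may use interpolation rules that differ from this halving approach, so their quartiles can differ slightly on small data sets.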

**Things to look for in a Box & Whisker plot:**

- Are the boxes about equal or different?
- Do the groups appear normal (symmetrical box halves and whiskers) or skewed?
- Are there outliers?

Name | Observations | Inference |
---|---|---|
Agent 1 | Long upper whisker; mean and median above the target | On average the agent is meeting the target, but the positive variation is very high |
Agent 2 | Long lower whisker; mean and median below the target | On average the agent is not meeting the target, and the variation is high |
Agent 3 | Most of the boxplot lies below the target line | The agent is not meeting the target, but the variation is comparatively small |
Agent 7 | Very small box; short lower and upper whiskers | The agent is meeting the target and the variation is minimal |

**My Recommendations**

There are great online courses available for Six Sigma, PMP, Data Science, Big Data, Machine Learning and Python.

If you want a course from a recognized university, then **Coursera** is the place for you. Otherwise I would recommend **Simplilearn**.

The **Simplilearn** certificate is well recognized in the industry and the courses are really helpful.