How to Create a Histogram
The adage "a picture is worth a thousand words" is certainly true when it comes to visually representing data. Means and medians don't provide as much information as a graph, chart or histogram, and can even be misleading. When you want to provide information about the distribution of data – more specifically, the density of the underlying distribution of numerical data – a histogram may be the way to go.
What are histograms?
Histograms are graphical displays of data using bars of different heights, called "bins," whose widths represent ranges of data points. The bars show the underlying frequency distribution of the data.
If histograms better communicate information than averages or medians, then why aren't they used more often? One reason is simply that creating them takes a little more time than calculating averages or medians, especially if you take the time to do it correctly.
Histograms can be very useful at showing the distribution of data, but if the number of bins (or bars) is incorrect, the data can easily be misrepresented, and patterns won't emerge. Did you know there are methods to calculate the number of bins and the bin width based on the data set itself? If these methods aren't used, the histogram will tell a very different story.
To illustrate the process of creating a histogram, I pulled Speed Index scores from HTTP Archive for the top 100 web sites for April 15, 2016. After filtering out results with no value, and three highest that had a value of 61,000 (which appears to be timeouts), I have a data set of 94 data points ranging from a Speed Index value of 764 to 49,600. It takes almost no time in a spreadsheet to calculate the average of 5,065 and the median of 2,976, but to summarize such a diverse set of data this way doesn't tell the whole story. I want to know if there are patterns and if the data has a normal distribution.
Creating Histograms with Excel
If you are like me and don't have a fancy statistics and charting program, you would probably turn to either Excel, Google Sheets or some other spreadsheet program to create charts. When using Excel, you first need to install the Analysis ToolPak add-on. This tool pak provides the ability to create histograms.
Creating a histogram this way is a very manual process and no guidance is provided in terms of proper bin sizes; you are left to guess what the best bin size is.
I decided to randomly create bins with a fixed width of 2500. This breaks the data into 13 distinct bins, which seems like a good number. The resulting histogram has a long tail and is relatively skewed. With this data set the width of the bins need to be smaller to see more relevant patterns.
Some of you may be looking at this chart and say, "that's not a histogram, that's a bar chart." You are correct. There are numerous differences between histograms and bar charts, while both have columns using them interchangeably is incorrect. A histogram represents binned quantitative data and should have no spaces or gaps unless a given distribution has a value of 0.
Aside from being a manual process, a bigger problem with using Excel to create histograms is you can't actually create a histogram. You can only create bar charts. For this reason I suggest using a tool other than Excel to create histograms. I will switch over to Google Sheets for the remainder of this post.
I suspect that people making histograms with Excel use the method of randomly choosing bin sizes. The problem is, with a large data set and range of results, this histogram doesn't tell us much when almost half of the results fall into the first bin. To create an accurate histogram, there are two different formulas I turn to, depending on the data.
Square Root Method
An easy way to determine the number of bins needed is to calculate the square root of the number of data points. For this data set there are 94 data points. The square root of 94 is 9.69 which I round up to 10.
Next step is to determine the width of the bins. The Speed Index scores range from 764 to 49,600. The easiest way with this data set is to divide the range of values by the number of bins, giving us a bin width of 626.
I like to have histogram bins in nice round numbers, so I round up to 5000. The distribution now looks like this:
If you are using Google Sheets, you don't actually have to go through any of these calculations as this is the method used to create their default histogram chart.
This histogram is actually worse than our Excel bar chart. If you recall, I mentioned that the bins needed to be smaller: using this method the bins are larger. For this dataset, this formula doesn't work, so let's move on to another method.
The Freedman-Diaconis rule is another method to create histograms. This method calculates the width of the bin and doesn't attempt to limit the number of bins. This works well on datasets where there is a large range between the minimum and maximum values, but where data is skewed towards one end. It is a little more complicated to calculate, but much more practical and I think it is worth the time.
Bin Width = 2 (IQR(x))/n⅓
This may look scary, especially if you're not a math person, but it is not so bad and relatively easy to do in a spreadsheet. First let's translate the formula into English.
IQR stands for interquartile range, also known as the "middle fifty" of a data set. Quartiles are the three points that divide a data set into four equal groups, each group comprising a quarter of the data. Basically, it's the 25%, 50% and 75% marks, where the 25% mark is the first quartile, the 50% mark is the median, and the 75% is the third quartile. This indicates where the bulk of the values are. IQR(x) indicates we need to calculate the interquartile range for the values in the data set.
n^⅓ or n⅓ stands for the cube root, where n is the total number of data points. A cube root is a number that, when multiplied by itself three times, produces the number. For example, 3 x 3 x 3 = 27, so the cube root of 27 is 3.
Combining these terms, to calculate the width of the bins, divide the inter-quartile range of the data by the cube root of the number of data points and then multiple by two.
First, to calculate the IQR, the 1st and 3rd quartiles have to be calculated. This is very easy to do in Excel or Google Sheets, as both provide a QUARTILE formula:
where cell A2 is the location of the first number in your data set, cell A95 is the data cell containing the last number in the data set, and 1 indicates the quartile. To calculate the third quartile, simply replace 1 with 3. Once these numbers are obtained, subtract Q1 from Q3.
For my dataset quartile 1 is 1604 and quartile 3 is 5346.5. To calculate IQR:
The formula now reads 2 *(3742.5/n^⅓). The next item to calculate is n^⅓. I again turn to my spreadsheet. N stands for the number of data points in the data set; in this case it is 94. I enter the following formula:
resulting in a rounded value of 4.5. Inserting this into the formula:
Following the order of operations PEMDAS, we first solve the problem within the parentheses
Updating the equation:
I round down to 1500.This gives us 33 bins with a width of 1500 and the following histogram:
Now we are able to see some patterns, although this chart has a number of outliers, and some bins have 0 results. At this point you might consider trimming some of the outliers, by focusing on only sites with scores in the 97th percentile. Use the following formula to determine the 97th percentile:
where A2 is the cell of the first data point, A95 is the cell for the last data point, and 0.97 is the percentile. For our data set the 97th percentile is 24,559. This reduces our data set to 91, which is still a reasonably sized data set. Using the Freedman-Diaconis rule our new equation is:
In this case our bin size when rounded is the same;, the difference is the maximum bin will be 25,500 and not 49,500, reducing the length of the tail.
The resulting histogram looks like this:
Taking a few minutes to create a histogram when dealing with data that has a long tail can provide more insights than presenting an average or median. Histograms and other diagram types help people understand data quickly, enabling your audience to "see" what you are talking about well beyond the insight a single number, like an average, can provide.