Stata tip 141: Adding marginal spike histograms to quantile and cumulative distribution plots


Marginal spike histograms
Some intriguing examples given by Harrell (2015) prompted experiments with adding marginal spike histograms to quantile and (empirical) cumulative distribution (function) plots. This tip explains why this might be helpful and gives sample code using the official command quantile.
The point is at least twofold.
• Pedagogic. Such additions help clarify what (for example) a quantile plot means and how it can be explained as stacking values in terms of their associated cumulative probabilities. Every kind of graph becomes comfortable only with increased familiarity. I have encountered students and colleagues evidently long past their first histogram who struggle a little on their first experience with quantile plots. A little help does no harm.
• Practical. In principle, marginal histograms add no more information to a quantile plot. In practice, they offer a complementary view of each distribution, affording another way to think about distribution level, spread, and shape, and indeed fine structure too.
Discussion here is phrased entirely in terms of official Stata commands, with the quantile command (see [R] Diagnostic plots) the main workhorse, but the idea extends easily to community-contributed (user-written) commands such as qplot or distplot from the Stata Journal (Cox 1999a, 1999b, 2004, 2005; use search to find recent updates).
As usual, the idea has longer roots. See Galton (1889, 38) for (in modern terms) a quantile plot and a histogram sharing the same vertical magnitude axis.

Enhancing quantile plots
The name "quantile plot" goes back at least to Wilk and Gnanadesikan (1968), but the device is much older, being used several times in the 19th century and early 20th century. The main idea is just to plot ordered values, often but not invariably on the y axis, against their associated cumulative probabilities on the other axis. More generally, ordered values are plotted against corresponding quantiles of some relevant distribution. Small print aside, associated cumulative probabilities are just quantiles of a uniform (rectangular, flat) distribution on the interval from 0 to 1. Cumulative distribution plots are-again, often but not invariably-plotted with those axes reversed.
There are many small variations in both terminology and format. Quantile-quantile plot and empirical cumulative distribution function plot are some other terms. As if in apology or compensation for such long-windedness, concise but more cryptic labels such as q-q plots and ECDF plots can be found. As a matter of curious convention rather than compelling logic, quantile plots commonly use markers or point symbols for each distinct value, while cumulative distribution plots commonly use connected lines. That is a distinction without a difference if any distribution is such that points merge into lines, as may well be true with large sample sizes.
Let's start with a simple quantile plot. The official Stata command quantile adds a reference line that allows comparison with a uniform (rectangular, flat) distribution with the same range as the observed data. I almost never want that, so I usually make it invisible by changing its color. I also often vary the marker symbol and vertical axis label from the default. Figure 1 shows such a plot.
. sysuse auto
(1978 automobile data)

. set scheme sj

. quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h))

As implemented in quantile, the idea is to plot values in rank order versus cumulative probability, specifically (unique rank − 0.5) / sample size. As explained in, say, Cox (2021), the rule rank / sample size does not treat lower and upper quantiles symmetrically and would cause other problems.
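Should you want those plotting positions as a variable for your own plots, a minimal sketch is as follows, assuming no missing values on mpg; the variable names rank and p are my own choices, not anything created by quantile.

. egen rank = rank(mpg), unique

. generate p = (rank - 0.5) / _N

Here the unique option of the rank() function breaks ties arbitrarily, which matches the treatment of ties in quantile; _N is the number of observations in the dataset.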
Because one axis is a probability scale, that is compatible in principle with a marginal histogram showing probabilities in each distinct bin. There are various fairly simple ways to get that directly. Here's one.
. count if mpg < .
  74

. egen prob = total(1/`r(N)'), by(mpg)

. egen tag = tag(mpg)

Counting the sample size explicitly does no harm and is needed if there are missing values (not the case in this example but often true) or if you want to look at a subset of your data (again, not the case here but also often true).
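For what it is worth, the same two variables can be produced with bysort rather than egen; this sketch uses names prob2 and tag2 of my own choosing to avoid clashing with the variables above.

. count if mpg < .

. local N = r(N)

. bysort mpg: generate prob2 = _N / `N'

. by mpg: generate byte tag2 = (_n == 1)

Under bysort mpg, _N is the number of observations with each distinct value of mpg, so prob2 is the probability in each bin, and tag2 flags just one observation per bin.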
A first attempt just adds spikes to the vertical axis, as in figure 2.
. quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h))
> addplot(spike prob mpg if tag, horizontal) legend(off)

The point about adding if tag is partly one of efficiency (the same spikes need not, and should not, be plotted repeatedly) and partly one of avoiding monitor artifacts.
In this example, we were lucky: There are 21 distinct values, so the mean probability in each bin is about 0.05. The maximum probability may naturally be much higher, but here taking probabilities literally (which means numerically) seems to work fine.
In other examples, we might have to do more work and make some decisions to get an extra histogram that is informative yet restrained. Let's look at a larger dataset. Figure 4 shows a distribution of wages on a natural logarithm scale.  Here there are 8,173 distinct values, and so the mean probability in each bin is about 0.0001. The maximum probability again can be much higher, but here taking probabilities literally (which means numerically) means that the histogram degenerates to a rug plot. If you want a rug plot, you can always get one (for example, Cox [2004]), but that is not the goal here.
There are various ways to move forward. One is to decide how much space to give the histogram and to scale its probabilities accordingly. Then the probability scale for the quantile plot and that for the histogram are different, which needs to be explained somewhere. Figure 5 is an example. Another is to use a square-root scale for probabilities, which often works well to make structure clearer. Figure 6 is an example. The two ideas can be combined.
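A minimal sketch of the rescaling idea, assuming prob and tag have been computed as before and giving the histogram (say) a tenth of the probability axis:

. summarize prob, meanonly

. generate prob3 = 0.1 * prob / r(max)

. quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h))
> addplot(spike prob3 mpg if tag, horizontal) legend(off)

The longest spike then reaches 0.1 on the probability axis, whatever the maximum bin probability; as mentioned, that the spikes no longer show probabilities literally needs to be flagged in a note or caption.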

Square-root scales for probability or frequency
If the idea of using a square-root scale is unfamiliar, let's spell that out. Square-root scales stretch back at least a century. See discussion and references in Cox (2012) and especially also Perrin (1913, 198; 1916, 131).
Histograms using square-root scales, rootograms, go back at least to John W. Tukey circa 1965. The idea is to show not bin frequencies but their square roots. Frequencies, as counted variables, tend to have variability that is stabilized by a root transformation, at least approximately. It is now a little harder to see bins with higher frequency and much easier to see bins with lower frequency. Bins that are empty remain no problem because the square root of zero is zero. Note also that the square root of a normal or Gaussian density is a multiple of another normal or Gaussian density. Hence, if the normal is a reference distribution, we are looking for the same shape on a rootogram, and experience in assessing histograms for approximate normality can be applied directly in assessing rootograms. However, taking the root is only the first step in Tukey's procedure, and we do not implement his hanging or suspended rootograms. See Tukey (1972, 1977, 1986), Tukey and Wilk (1965), or Velleman and Hoaglin (1981).
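In the spike-histogram setting, a square-root scale is easy to improvise. Assuming prob and tag as computed earlier, a sketch is

. generate rootprob = sqrt(prob)

. quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h))
> addplot(spike rootprob mpg if tag, horizontal) legend(off)
> note(Spike lengths show square roots of bin probabilities)

Here the spike lengths are square roots of probabilities while the quantile plot keeps its ordinary probability scale, so again a note or caption explaining the convention is essential.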

A bimodal example
As suggested by Frank Harrell (personal communication), we close with an example that shows bimodal structure. We use the famous iris dataset of Edgar Anderson, as publicized by Fisher (1936) and often used as a sandbox in multivariate statistics and machine learning exercises. Stebbins (1978) gave an appreciation of Anderson, a distinguished and idiosyncratic botanist, and comments on the scientific background to distinguishing three species of the genus Iris. Kleinman (2002) surveys Anderson's graphical contributions with statistical flavor.
A Stata-readable copy of the iris data is bundled with the media for this issue. This dataset includes measurements of the width and length of petals and sepals for 150 flowers, 50 each of the species Iris setosa, Iris versicolor, and Iris virginica. Using the same recipe as above, figure 7 shows the data for petal length.
The bimodal structure is, I suggest, clearly shown. It is no discovery and easily related to which species is being measured.

Graphics choices
In graphics, choices small and large abound, and taste and circumstance should determine your own choices. The device in this tip is for use whenever it is helpful. Other possibilities include offsetting the histogram slightly by introducing some space between it and the display of quantiles, using a marginal dot or strip plot instead, and smoothing the quantiles slightly because sometimes fine structure is just noise.

Acknowledgments
I thank Frank Harrell and Antony Unwin for encouraging and helpful correspondence.