Source code and input files for this document are available on GitHub.
How should researchers check the assumptions of their statistical models? It is dangerous and foolish to ignore the possibility that those assumptions could be badly violated. On the other hand, the often-taught procedure of doing statistical tests against a null hypothesis that the assumptions are correct (e.g. using the Shapiro-Wilk test for Normality on residuals from a linear model) is problematic for several reasons:
Many statisticians emphasize graphical diagnostic methods such as the quantile-quantile (Q-Q) plot over formal testing procedures (e.g. Ramsey and Schafer (1997), section 3.6.1.3: "Prefer graphical methods over formal tests for model adequacy"; Faraway (2016), p. 14: "[w]e prefer graphical methods because they tend to be more versatile and informative"; Burdick et al. (2017), p. 45: "we prefer graphical representations over statistical tests"). We agree: graphical procedures give a more holistic view of the data and de-emphasize dichotomous procedures that make sharp distinctions between data sets where the null hypothesis can or cannot be rejected at a level of \(\alpha=0.05\). However, this well-meaning advice contains a catch-22 for newcomers: how is one supposed to judge whether the patterns of deviation shown in a diagnostic plot are worth worrying about? Some implementations of diagnostic plots (e.g. qqPlot from the car package for R) superimpose envelopes showing the expected range of variation under the null hypothesis, but this just re-implements the significance test in graphical form. Otherwise, students are shown a few examples and told which ones they are supposed to be concerned about and which look OK (according to the whim of the instructor or textbook author).
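For concreteness, here is a minimal sketch (our own illustration, not code from the survey or from any of the sources cited above) contrasting the formal-test and graphical approaches on the same simulated regression fit; it assumes the car package is installed:

```r
set.seed(101)
d <- data.frame(x = runif(50))
d$y <- 1 + 2 * d$x + rgamma(50, shape = 1)  ## deliberately skewed errors
m <- lm(y ~ x, data = d)
shapiro.test(residuals(m))  ## formal test: returns W and a p-value
car::qqPlot(m)              ## Q-Q plot of studentized residuals with a pointwise envelope
```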
We have been stewing over this problem for many years; at the ISEC 2020 meeting, we decided to prepare and disseminate a Google Forms survey asking attendees (who ranged from graduate students to experienced ecological statisticians) to answer some questions about Q-Q plots of simulated data.
What we asked:
You will be presented with 12 plots; after each plot you will be asked two Yes/No questions and one numerical question. The goal is to assess the ability of researchers to differentiate normally distributed from non-normally distributed data on the basis of Q-Q plots. In particular, lack of normality may increase the Type I error rates of different parametric (normality-based) tests.
- Does the Q-Q plot above represent normally distributed data?
- If you saw the Q-Q plot above in the course of diagnostic checking would you take it as evidence that you ought to modify your analysis (e.g. transform response variable, use rank-based non-parametrics, etc)?
- The generated data (if not normally distributed) can inflate the Type I error rate of the F-test of equality of two sample variances. From the Q-Q plot, what would be your estimate of that rate (a number between 0 and 1, e.g. 0.32)?
We generated 12 samples of size \(n=26\), some drawn from a normal (Gaussian) distribution and the rest from non-normal distributions (Gamma and \(t\) distributions with a range of parameter values), and drew a Q-Q plot of each. We presented the plots to participants in a random order (the same order for all participants).
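Here is a sketch of the general procedure (the exact distributions and parameter values we used are available in the GitHub repository linked above; the ones below are illustrative only):

```r
set.seed(2020)
n <- 26
gens <- list(
  "Gaussian"        = function(n) rnorm(n),
  "Gamma (shape=1)" = function(n) rgamma(n, shape = 1),
  "t (df=2)"        = function(n) rt(n, df = 2),
  "t (df=5)"        = function(n) rt(n, df = 5)
)
op <- par(mfrow = c(2, 2))
for (nm in names(gens)) {
  x <- gens[[nm]](n)  ## one realization of size 26 from this distribution
  qqnorm(x, main = nm)
  qqline(x)
}
par(op)
```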
Below, we present the results of question 1 as "were the data non-Normal?" (i.e. using the complement of the answers given), so that the results match up better with question 2, "would you modify your analysis?" (in future runs of this type of experiment, we should change the question framing up-front).
We had a total of 42 participants (we did not collect any information about participant characteristics).
First, we look at all of the Q-Q plots, in order, with their true labels attached:
The first (surprising?) point is just how much variability there is in individual samples of size \(n=26\). The most extreme non-Gaussian cases (i.e. Gamma shape=1 and \(t\) df=2, the leftmost panels in rows 2 and 3) are clearly non-normal, but Gaussian rep 2 looks suspiciously non-normal; \(t\) (df=5) looks pretty normal, and arguably better than \(t\) (df=20).
The take-home message here is a reminder (we all need one from time to time) that the statistical properties we know about are properties of ensembles, while the data we analyze nearly always represent a single realization from that ensemble ...
The modal number of samples rated as non-normal was 7-8, but the range was from 2 to 12 (out of a total of 12 samples).
Participants often (23% = 69/305) said the data were non-normal but that they would not modify their analysis; they sometimes (8% = 15/199) said that the data were normal but that they would nevertheless modify their analyses (???).
There was no obvious relationship between actual normality (models in the blue region) and judgment of normality: for example, normal reps 1, 2, and 4 were selected as non-normal by most participants, while t(df=5) was chosen as normal by most participants ...
Boxplots show the range of responses; red dots are the long-run expected type 1 error rate (at \(\alpha=0.05\)) for tests of equality of variance applied to data sets with the specified distribution and parameters.
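Such long-run rates can be approximated by simulation; the sketch below is our own illustration and assumes the test in question is the classical F test for equality of two variances (var.test in R), with two groups of size \(n=26\) and \(\alpha=0.05\):

```r
sim_var_type1 <- function(rfun, n = 26, nsim = 5000, alpha = 0.05) {
  ## proportion of simulations in which the F test (wrongly) rejects equal variances
  mean(replicate(nsim, var.test(rfun(n), rfun(n))$p.value < alpha))
}
set.seed(1)
sim_var_type1(function(n) rnorm(n))              ## close to the nominal 0.05
sim_var_type1(function(n) rgamma(n, shape = 1))  ## inflated by skewness
sim_var_type1(function(n) rt(n, df = 2))         ## badly inflated by heavy tails
```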
Participants often estimated that the type 1 error rate would be zero (the number of zero responses per plot ranged from 1 to 12, out of 40 responses). Although type 1 error rates can fall below as well as above the nominal level, we would guess that these answers might reflect confusion between type 1 error rates achieving their nominal level (i.e. equal to \(\alpha=0.05\)) and "no inflation of the type 1 error rate".
Here, "normal" means the sample was actually drawn from a normal distribution, not that the participants necessarily judged it to be normal ...
On average, participants tended to overestimate the type 1 error rate. Participants may have been unfamiliar with tests of differences in variances (they are much less common than tests of differences in means). On the other hand, we expect that participants would have overestimated type 1 error rates for \(t\)-tests much more: even for the most extreme models in our sample, the \(t\)-test's type 1 error rate never exceeded 0.0514!
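For comparison, applying the same simulation scheme to a two-sample \(t\)-test of the difference in means (again our own sketch, with the same assumed group size and \(\alpha\)) shows how little the \(t\)-test's type 1 error rate moves even under the most extreme distributions considered here:

```r
sim_t_type1 <- function(rfun, n = 26, nsim = 5000, alpha = 0.05) {
  ## proportion of simulations in which the t test (wrongly) rejects equal means
  mean(replicate(nsim, t.test(rfun(n), rfun(n))$p.value < alpha))
}
set.seed(1)
sim_t_type1(function(n) rgamma(n, shape = 1))  ## stays near the nominal 0.05
sim_t_type1(function(n) rt(n, df = 2))         ## likewise
```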
To explore the relationship between expected properties of a given distribution and the specific properties of a particular realization, we compare the Shapiro-Wilk \(W\)-statistic and \(p\)-value averaged over large ensembles with the particular \(W\) and \(p\) values for the realizations examined here ...
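A sketch of this kind of comparison (our own illustration; the ensemble size and the example distribution are arbitrary choices):

```r
set.seed(42)
n <- 26
sw_ensemble <- function(rfun, nsim = 2000) {
  ## average Shapiro-Wilk W and p-value over many simulated samples
  res <- replicate(nsim, {
    s <- shapiro.test(rfun(n))
    c(W = unname(s$statistic), p = s$p.value)
  })
  rowMeans(res)
}
sw_ensemble(function(n) rt(n, df = 5))  ## expected (W, p) over the ensemble
shapiro.test(rt(n, df = 5))             ## (W, p) for one particular realization
```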
What if we arrange replicates by Shapiro-Wilk \(W\) statistic (a reasonably objective measure of non-normality of a particular replicate) instead of by whether they were actually drawn from a normal distribution?
This helps a little, but things are still pretty messy; samples with \(W<0.93\) are usually judged to be non-normal, but samples with \(W>0.96\) are all over the place.
Box, G. E. P. 1953. “Non-Normality and Tests on Variances.” Biometrika 40 (3/4): 318–35. doi:10.2307/2333350.
Burdick, Richard K., David J. LeBlond, Lori B. Pfahler, Jorge Quiroz, Leslie Sidor, Kimberly Vukovinsky, and Lanju Zhang. 2017. Statistical Applications for Chemistry, Manufacturing and Controls (CMC) in the Pharmaceutical Industry. Springer.
Campbell, H., and C. B. Dean. 2014. “The Consequences of Proportional Hazards Based Model Selection.” Statistics in Medicine 33 (6): 1042–56. doi:10.1002/sim.6021.
Campbell, Harlan. 2019. “The Consequences of Checking for Zero-Inflation and Overdispersion in the Analysis of Count Data.” arXiv Preprint arXiv:1911.00115.
Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Second Edition. CRC Press.
Ramsey, Fred L., and Daniel W. Schafer. 1997. The Statistical Sleuth: A Course in Methods of Data Analysis. Duxbury Press.
Rochon, Justine, Matthias Gondan, and Meinhard Kieser. 2012. “To Test or Not to Test: Preliminary Assessment of Normality When Comparing Two Independent Samples.” BMC Medical Research Methodology 12 (1): 81. doi:10.1186/1471-2288-12-81.