

3 Testing the Normal Distribution Assumption

The problem of having no degrees of freedom to test the goodness of fit of the bivariate normal distribution to two binary variables is solved when we have at least three categories in one variable and at least two in the other. To illustrate this point, compare the contour plots shown in Figure 2.4, in which two thresholds have been specified for the two variables.

[Figure 2.4: Contour plots of a bivariate normal distribution with correlation .9 (a, top) and a mixture of bivariate normal distributions, one with correlation .9 and the other with correlation -.9 (b, bottom). Two thresholds in each dimension are shown.]

With the bivariate normal distribution, a very strong pattern is imposed on the relative magnitudes of the cells on the diagonal and elsewhere. The mixture of normals is subject to a similar set of constraints, but it makes quite different predictions about the off-diagonal cells: given a sufficient sample size, all four corner cells would have an appreciable frequency, probably in excess of that in each of the four cells in the middle of each side [e.g., (1,2)]. The bivariate normal distribution could never be adjusted to predict perfectly the cell proportions obtained from the mixture of distributions.

This intuitive idea of opportunities for failure translates directly into the concept of degrees of freedom. When we use a bivariate normal liability model to predict the proportions in a contingency table with $r$ rows and $c$ columns, we use $r-1$ thresholds for the rows, $c-1$ thresholds for the columns, and one parameter for the correlation in liability, giving $r+c-1$ parameters in total. The table itself contains $rc-1$ proportions, neglecting the total sample size as above. Therefore we have degrees of freedom equal to:
$\displaystyle \mbox{df}$ $\textstyle =$ $\displaystyle rc - 1 - (r+c-1)$  
  $\textstyle =$ $\displaystyle rc - r - c$     (5)

For a $2\times 2$ table this gives $4-2-2=0$ degrees of freedom, which is the problem noted above for two binary variables; a $3\times 2$ table gives $6-3-2=1$ degree of freedom with which to test the fit.
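To make the model's predicted cell proportions concrete, the following sketch computes the expected proportions for an $r \times c$ table by integrating the bivariate normal density over each rectangle of liability. It is a minimal illustration, assuming SciPy is available; the correlation and threshold values used are hypothetical, not taken from the text.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def cell_proportions(rho, row_thresh, col_thresh):
    """Expected cell proportions under the bivariate normal liability
    model: r-1 row thresholds, c-1 column thresholds, one correlation."""
    # Extend the thresholds with effectively infinite bounds (+-8 SD).
    a = np.concatenate(([-8.0], row_thresh, [8.0]))
    b = np.concatenate(([-8.0], col_thresh, [8.0]))
    mvn = multivariate_normal(mean=[0.0, 0.0],
                              cov=[[1.0, rho], [rho, 1.0]])
    F = lambda x, y: mvn.cdf([x, y])  # P(X <= x, Y <= y)
    P = np.empty((len(a) - 1, len(b) - 1))
    for i in range(P.shape[0]):
        for j in range(P.shape[1]):
            # Rectangle probability by inclusion-exclusion on the CDF.
            P[i, j] = (F(a[i + 1], b[j + 1]) - F(a[i], b[j + 1])
                       - F(a[i + 1], b[j]) + F(a[i], b[j]))
    return P

# Two thresholds per variable give a 3 x 3 table, as in Figure 2.4;
# the threshold placement here is illustrative.
P = cell_proportions(0.9, [-0.5, 0.5], [-0.5, 0.5])
print(P.round(3))  # strong diagonal pattern for rho = .9
print(P.sum())     # proportions sum to 1, up to integration error
\end{verbatim}

A mixture such as that in Figure 2.4(b) would instead be a weighted sum of two such tables (one with correlation .9, one with -.9), producing the corner-heavy pattern that no single correlation value can reproduce.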

The discrepancy between the frequencies predicted by the model ($E_{ij}$) and those actually observed in the data ($O_{ij}$) can be measured using the $\chi^2$ statistic given by:

\begin{displaymath}
\chi^2=\sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
\end{displaymath}
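As a minimal sketch of this computation (assuming NumPy; the observed and expected counts below are hypothetical), the statistic can be computed directly from two tables of frequencies:

\begin{verbatim}
import numpy as np

def pearson_chi2(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over cells."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical 3 x 2 table of observed counts and model-expected counts.
O = np.array([[30, 10],
              [20, 20],
              [10, 30]])
E = np.array([[27, 13],
              [23, 17],
              [13, 27]])
print(pearson_chi2(O, E))  # refer to a chi-square with rc - r - c = 1 df
\end{verbatim}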

Given a large enough sample, the model's failure to predict the observed data would be reflected in a significant $\chi^2$ for the goodness of fit. In principle, models could be fitted by maximum likelihood directly to contingency tables, employing the observed and expected cell proportions. This approach is general and flexible, especially for the multigroup case -- the programs LISCOMP (Muthén, 1987) and Mx (Neale, 1991) use the method -- but it is currently limited by computational considerations. When we move from two variables to larger examples involving many variables, integration of the multivariate normal distribution (which has to be done numerically) becomes extremely time-consuming, perhaps increasing by a factor of ten or so for each additional variable.

An alternative approach is to use PRELIS 2 to compute each correlation in a pairwise fashion, and to compute a weight matrix. The weight matrix is an estimate of the variances and covariances of the correlations. The variances of the correlations have some intuitive appeal, being a measure of how precisely each correlation is estimated. The idea of one correlation covarying with another may seem strange to a newcomer to the field, yet this covariation between correlations is precisely what we need in order to represent how much additional information the second correlation supplies over and above that provided by the first. Armed with these two types of summary statistics (the correlation matrix and the covariances of the correlations), we may fit models using a structural equation modeling package such as Mx or LISREL, and make statistical inferences from the goodness of fit of the model.

It is also possible to use the bivariate normal liability distribution to infer the patterns of statistics that would be observed if ordinal and continuous variables were correlated. Essentially, specific predictions are made about the expected mean and variance of the continuous variable in each category of the ordinal variable. For example, the continuous variable's means are predicted to increase monotonically across the categories if there is a correlation between the liabilities; an observed pattern of a high mean in category 1, a low mean in category 2, and a high mean again in category 3 would not be consistent with the model. The number of parameters used to describe this model for an ordinal variable with $r$ categories is $r+2$: $r-1$ for the thresholds, one each for the mean and variance of the continuous variable, and one for the covariance between the two variables. The observed statistics are the $r-1$ free cell proportions (the final proportion may be obtained by subtraction from 1) and the mean and variance of the continuous variable in each category, giving $3r-1$ in total. Therefore we have:
$\displaystyle \mbox{df}_{\mbox{oc}}$ $\textstyle =$ $\displaystyle (r-1)+2r-(r+2)$ (6)
  $\textstyle =$ $\displaystyle 2r-3$  

So the number of degrees of freedom for such a test is $2r-3$, where $r$ is the number of categories of the ordinal variable.
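The monotone pattern of means can be checked against a closed-form prediction: for standard bivariate normal liabilities, the expected value of the continuous variable within the category bounded by thresholds $t_{i-1}$ and $t_i$ is $\rho\,[\phi(t_{i-1})-\phi(t_i)]/[\Phi(t_i)-\Phi(t_{i-1})]$, the truncated-normal mean scaled by the correlation. The sketch below computes these means; it assumes SciPy, and the correlation and thresholds are illustrative.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def category_means(rho, thresholds):
    """Predicted mean of the continuous variable within each category
    of the ordinal variable under the bivariate normal liability model."""
    t = np.concatenate(([-np.inf], thresholds, [np.inf]))
    # E[Y | a < X < b] = rho * (phi(a) - phi(b)) / (Phi(b) - Phi(a))
    return np.array([rho * (norm.pdf(a) - norm.pdf(b))
                     / (norm.cdf(b) - norm.cdf(a))
                     for a, b in zip(t[:-1], t[1:])])

# rho = .5 with thresholds at -1 and 1 (three categories).
print(category_means(0.5, [-1.0, 1.0]).round(3))  # [-0.763  0.     0.763]
\end{verbatim}

With a positive correlation the predicted means increase strictly across the categories; a high-low-high pattern, as noted above, could not arise under the model.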