1 Calculating Summary Statistics by Hand

The variances and covariances used in twin analyses often are computed using a statistical package such as SPSS [SPSS, 1988] or SAS [SAS, 1988], or by PRELIS []. Nevertheless, it is useful to examine how they are calculated in order to ensure a comprehensive understanding of one's observed data. In this section we describe the calculation of means, variances, covariances, and correlations. Some simulated measurements from 16 MZ and 16 DZ twin pairs are presented in Table 2.1. The observed values in the columns labelled Twin 1 and Twin 2 have been selected to illustrate some elementary principles of variation in twins.

**Table 2.1:** Simulated measurements from 16 MZ and 16 DZ Twin Pairs.

MZ		DZ
Twin 1	Twin 2	Twin 1	Twin 2
3	2	0	1
3	3	2	3
2	1	1	2
1	2	4	3
0	0	3	1
2	2	2	2
2	2	2	2
3	2	1	3
3	3	3	4
2	3	1	0
1	1	1	1
1	1	2	1
4	4	3	3
2	3	3	2
2	1	2	2
1	2	2	2

In order to obtain the summary statistics of variances and covariances for genetic analysis, it is first necessary to compute the average value for a set of measurements, called the mean. The mean is typically denoted by a bar over the variable name for a group of observations, for example $\overline{X}$ or $\overline{Twin 1}$ or $\overline{Twin 2}$. The formula for calculation of the mean is:

$\begin{eqnarray*}\overline{X} & = & \frac{X_1 + X_2 + \cdots + X_n}{n} \\ & = & \frac{\sum_{i=1}^{n} X_i}{n}, \end{eqnarray*}$	(1)

in which $X_i$ represents the $i^{th}$ observation and $n$ is the total number of observations. In the twin data of Table 2.1, the mean of the measurements on Twin 1 of the MZ pairs is

$\begin{eqnarray*}\overline{Twin 1} & = & \frac{3 + 3 + 2 + \cdots + 2 + 2 + 1}{16} \\ & = & 32/16 \\ & = & 2.0 \end{eqnarray*}$
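The same arithmetic can be checked mechanically. The following is a minimal Python sketch, not part of the original text, that applies Eq. 1 to each column of Table 2.1. The variable names and the `mean` helper are illustrative assumptions; exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction

# Columns of Table 2.1, transcribed in row order (16 pairs per group).
# The variable names are illustrative, not taken from the text.
mz_twin1 = [3, 3, 2, 1, 0, 2, 2, 3, 3, 2, 1, 1, 4, 2, 2, 1]
mz_twin2 = [2, 3, 1, 2, 0, 2, 2, 2, 3, 3, 1, 1, 4, 3, 1, 2]
dz_twin1 = [0, 2, 1, 4, 3, 2, 2, 1, 3, 1, 1, 2, 3, 3, 2, 2]
dz_twin2 = [1, 3, 2, 3, 1, 2, 2, 3, 4, 0, 1, 1, 3, 2, 2, 2]


def mean(xs):
    """Eq. 1: the sum of the observations divided by their number n."""
    return Fraction(sum(xs), len(xs))


columns = {
    "MZ Twin 1": mz_twin1,
    "MZ Twin 2": mz_twin2,
    "DZ Twin 1": dz_twin1,
    "DZ Twin 2": dz_twin2,
}
means = {label: mean(values) for label, values in columns.items()}

for label, values in columns.items():
    # Show the numerator and denominator of Eq. 1 alongside the mean.
    print(f"{label}: sum = {sum(values)}, n = {len(values)}, "
          f"mean = {float(means[label])}")
```

Each column sums to 32 over 16 observations, so every mean is 2.0.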

The mean for the second MZ twin ($\overline{Twin 2}$) is also 2.0, as are the means for both DZ twins.

The variance of the observations represents a measure of dispersion around the mean; that is, how much, on average, observations differ from the mean. The variance formula for a sample of measurements, often represented as $s^2$ or $V_{MZ}$ or $V_{DZ}$, is

$\begin{eqnarray*}s^2 & = & \frac{(X_1 - \overline{X})^2 + (X_2 - \overline{X})^2 + \cdots + (X_n - \overline{X})^2}{n - 1} \\ & = & \frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{n - 1} \end{eqnarray*}$	(2)

We note two things. First, the difference between each observation and the mean is squared. In principle, absolute differences from the mean could be used as a measure of variation, but absolute differences have a greater variance than squared differences [Fisher, 1920], and are therefore less efficient for use as a summary statistic. Likewise, higher powers (e.g. $\sum_{i=1}^{n} (X_i-\overline{X})^4$) also have greater variance. In fact, Fisher showed that, for normally distributed observations, the square of the difference is the most informative measure of variation, i.e., it is a sufficient statistic. Second, the sum of the squared deviations is divided by $n - 1$ rather than $n$. The denominator is $n - 1$ in order to compensate for the underestimate of the variance that would be obtained if the sum of squared deviations were divided by $n$. (This arises from the fact that we have already used one parameter, the mean, to describe the data; see Mood & Graybill, 1963 for a discussion of bias in sample variance.) Again using the twin data in Table 2.1 as an example, the variance of MZ Twin 1 is

$\begin{eqnarray*}V_{MZT1} & = & \frac{(3 - 2)^2 + (3 - 2)^2 + \cdots + (2 - 2)^2 + (1 - 2)^2}{15} \\ & = & \frac{1 + 1 + 0 + \cdots + 0 + 0 + 1}{15} \\ & = & 16/15 \end{eqnarray*}$
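To make the effect of the divisor concrete, here is a small Python sketch, offered as an illustration rather than as the authors' procedure. It computes the sum of squared deviations for MZ Twin 1 and divides it both by $n - 1$, as in Eq. 2, and by $n$. The function names are assumptions introduced for this example.

```python
from fractions import Fraction

# MZ Twin 1 column of Table 2.1 (variable name is illustrative).
mz_twin1 = [3, 3, 2, 1, 0, 2, 2, 3, 3, 2, 1, 1, 4, 2, 2, 1]


def sum_of_squares(xs):
    """Sum of squared deviations of the observations from their mean."""
    x_bar = Fraction(sum(xs), len(xs))
    return sum((x - x_bar) ** 2 for x in xs)


def sample_variance(xs):
    """Eq. 2: sum of squared deviations divided by n - 1."""
    return sum_of_squares(xs) / (len(xs) - 1)


def variance_divided_by_n(xs):
    """The same sum divided by n, shown only for comparison."""
    return sum_of_squares(xs) / len(xs)


ss = sum_of_squares(mz_twin1)                  # numerator of Eq. 2
v_sample = sample_variance(mz_twin1)           # divisor n - 1 = 15
v_by_n = variance_divided_by_n(mz_twin1)       # divisor n = 16

print("sum of squares:", ss)
print("divided by n - 1:", v_sample)
print("divided by n:", v_by_n)
```

The sum of squares is 16, so dividing by 15 gives 16/15 while dividing by 16 gives the smaller value 1, which illustrates the direction of the underestimate described above.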

The variances of data from the second MZ twin, DZ Twin 1, and DZ Twin 2 also equal $16/15$.

Covariances are computationally similar to variances, but represent mean deviations which are shared by two sets of observations. In the twin example, covariances are useful because they indicate the extent to which deviations from the mean by Twin 1 are similar to the second twin's deviations from the mean. Thus, the covariance between observations of Twin 1 and Twin 2 represents a scale-dependent measure of twin similarity. Covariances are often denoted by $s_{x,y}$ or Cov$_{MZ}$ or Cov$_{DZ}$, and are calculated as

$\begin{eqnarray*}s_{x,y} & = & \frac{(X_1 - \overline{X})(Y_1 - \overline{Y}) + (X_2 - \overline{X})(Y_2 - \overline{Y}) + \cdots + (X_n - \overline{X})(Y_n - \overline{Y})}{n - 1} \\ & = & \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{n - 1} \end{eqnarray*}$	(3)

Note that the variance formula shown in Eq. 2 is just a special case of the covariance when $Y_i = X_i$. In other words, the variance is simply the covariance between a variable and itself. For the twin data in Table 2.1, the covariance between MZ twins is

$\begin{eqnarray*}\mbox{Cov}_{MZ} & = & \frac{(3 - 2)(2 - 2) + (3 - 2)(3 - 2) + \cdots + (1 - 2)(2 - 2)}{15} \\ & = & \frac{0 + 1 + 0 + 0 + \cdots + 4 + 0 + 0 + 0}{15} \\ & = & 12/15 \end{eqnarray*}$
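As a check on the hand calculation, the following Python sketch (an illustration under assumed names, not the authors' code) evaluates Eq. 3 for both zygosity groups. It also confirms that the covariance of a column with itself reproduces the variance of Eq. 2.

```python
from fractions import Fraction

# Columns of Table 2.1 in row order (variable names are illustrative).
mz_twin1 = [3, 3, 2, 1, 0, 2, 2, 3, 3, 2, 1, 1, 4, 2, 2, 1]
mz_twin2 = [2, 3, 1, 2, 0, 2, 2, 2, 3, 3, 1, 1, 4, 3, 1, 2]
dz_twin1 = [0, 2, 1, 4, 3, 2, 2, 1, 3, 1, 1, 2, 3, 3, 2, 2]
dz_twin2 = [1, 3, 2, 3, 1, 2, 2, 3, 4, 0, 1, 1, 3, 2, 2, 2]


def covariance(xs, ys):
    """Eq. 3: sum of cross-products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both twins must have the same number of observations")
    n = len(xs)
    x_bar = Fraction(sum(xs), n)
    y_bar = Fraction(sum(ys), n)
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


cov_mz = covariance(mz_twin1, mz_twin2)        # 12/15, stored reduced as 4/5
cov_dz = covariance(dz_twin1, dz_twin2)        # 8/15
var_mz_twin1 = covariance(mz_twin1, mz_twin1)  # special case Y_i = X_i

print("Cov MZ:", cov_mz)
print("Cov DZ:", cov_dz)
print("V MZ Twin 1:", var_mz_twin1)
```

Note that `Fraction` reduces 12/15 to 4/5 when printing; the value is unchanged.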

The covariance between DZ pairs may be calculated similarly to give 8/15.

The correlation coefficient is closely related to the covariance between two sets of observations. Correlations may be interpreted in a similar manner to covariances, but are rescaled to give a lower bound of -1.0 and an upper bound of 1.0. The correlation coefficient, $r$, may be calculated using the covariance between two measures and the square root of the variance (the standard deviation) of each measure:

$\begin{displaymath}r = \frac{\mbox{Cov}_{x,y}}{\sqrt{V_x V_y}} \end{displaymath}$	(4)

For the simulated MZ twin data, the correlation between twins is

$\begin{eqnarray*}r_{MZ}& =& \frac{12/15}{\sqrt{(16/15) (16/15)}} \\ & = & 12/16 = .75, \end{eqnarray*}$

and the DZ twin correlation is

$\begin{eqnarray*}r_{DZ}& =& \frac{8/15}{\sqrt{(16/15) (16/15)}} \\ & = & 8/16 = .50 \end{eqnarray*}$
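Eq. 4 can also be evaluated directly from the summary statistics already obtained. The sketch below is a hedged Python illustration: the `correlation` helper is a name introduced here, and the inputs are the variances (16/15) and covariances (12/15 and 8/15) reported in the text. Because a square root is involved, the result is a floating-point number and is rounded for display.

```python
import math
from fractions import Fraction


def correlation(cov_xy, var_x, var_y):
    """Eq. 4: covariance divided by the product of the standard deviations."""
    return float(cov_xy) / math.sqrt(float(var_x) * float(var_y))


# Summary statistics reported in the text for the simulated twin data.
variance = Fraction(16, 15)   # the same for all four columns
cov_mz = Fraction(12, 15)
cov_dz = Fraction(8, 15)

r_mz = correlation(cov_mz, variance, variance)
r_dz = correlation(cov_dz, variance, variance)

print("r MZ:", round(r_mz, 2))
print("r DZ:", round(r_dz, 2))
```

Because both twins in a pair have the same variance here, the denominator reduces to that variance, which is why the ratios simplify to 12/16 and 8/16.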

Although variances and covariances typically define the observed information for biometrical analyses of twin data, correlations are useful for comparing resemblances between twins as a function of genetic relatedness. In the simulated twin data, the MZ twin correlation ($r = .75$) is greater than that of the DZ twins ($r = .50$). This greater similarity of MZ twins may be due to several sources of variation (discussed in subsequent chapters), but at the least is suggestive of a heritable basis for the trait, as increased MZ similarity could result from the fact that MZ twins are genetically identical, whereas DZ twins share only 1/2 of their genes on average.

Jeff Lessem 2002-03-21