The Central Limit Theorem

There's another key point in the notion of the SE of the mean, contained in those three little words "of the mean." Just about everything we do in statistical inference (at least up to Chapter 6) has to do with differences between means. We use the original data to estimate the mean of the sample and its SE and work forward from there.

It seems that most people who have taken a stats course forget this basic idea when they start to worry about when to use parametric statistics such as t tests. Although it is true that parametric statistics hang on the idea of a normal distribution, all we need is a normal distribution of the means, not of the original data.

Think about this one a little more. The basic idea of all of this is that the results of any one experiment are influenced by the operation of chance. If we did the experiment a zillion times and calculated the mean each time, these mean values would form some distribution centered on the true "population" mean with an SD equal to the original SD divided by the square root of the sample size. This would seem to imply that the data must be normally distributed if we're going to use this approach. Strangely, this is not so.

Time for a break. Let's talk football. All those giants (mere mortals inflated by plastic and foam padding) grunt out on the field in long lines at the start of a game. Because they're covered with more stuff than medieval knights, they all wear big numbers stitched on the back of their bulging jerseys so we can tell who's who. Typically, the numbers range from 1 to 99.

It turns out that we really don't like spectator sports, so if you gave us season tickets to the local NFL games, likely as not, we would sit there dreaming up some statistical diversion to keep ourselves amused. For example, let's test the hypothesis that the average jersey number of each team is the same. (Boy, we must be bored!) We would sit on the 50-yard line, and as the lads ran onto the field, we would rapidly add up their jersey numbers and then divide by the number of players on each side. We could do this for every game and gradually assemble a whole bunch of team means. Now, what would those means look like?

Well, we know the original numbers are essentially a rectangular distribution: any number from 1 to 99 is as likely as any other. The actual distribution generated is shown in Figure 3-2. It has a mean of 49.6. Would the calculated means also be distributed at random from 1 to 99 with a flat distribution? Not at all! Our bet is that, with 11 players on each side, we would almost never get a mean value less than 40 or more than 60 because, on the average, there would be as many numbers below 50 as above 50, so the mean of the 11 players would tend to be close to 50.

In Figure 3-3, we've actually plotted the distribution of means for 100 samples of size 11 drawn at random from a flat (rectangular) distribution ranging from 1 to 99. These data certainly look like a bell-shaped curve. The calculated mean of this distribution is 49.62 (the same, naturally), and the SE is 8.86. If we had pretended that the original distribution was normal, not rectangular, the SD would have equaled 28. The SE would have been 28/√11 = 8.45, which is not far from 8.86.

Figure 3-2 Distribution of original jersey numbers. (Horizontal axis: jersey number.)

Figure 3-3 Distribution of means of sample size = 11. (Horizontal axis: mean.)
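For readers who would rather let a computer sit on the 50-yard line, here is a minimal Python sketch of the experiment behind Figure 3-3. It is our illustration, not the procedure used to produce the figure. The function name `simulate_team_means`, the fixed seed, and the decision to draw numbers with replacement (so two players on a team may share a number) are all assumptions made for the example.

```python
import math
import random
import statistics

JERSEYS = list(range(1, 100))  # jersey numbers 1..99, each equally likely


def simulate_team_means(n_teams=100, team_size=11, seed=0):
    """Return the mean jersey number of each of n_teams random teams.

    Hypothetical helper: numbers are drawn with replacement, and the
    seed is fixed only so that repeated runs give the same output.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_teams):
        team = [rng.choice(JERSEYS) for _ in range(team_size)]
        means.append(statistics.mean(team))
    return means


means = simulate_team_means()

# Empirical SE: the SD of the simulated team means.
empirical_se = statistics.stdev(means)

# Predicted SE: SD of the original numbers divided by sqrt(sample size).
population_sd = statistics.pstdev(JERSEYS)
predicted_se = population_sd / math.sqrt(11)

print(f"mean of team means: {statistics.mean(means):.2f}")
print(f"empirical SE:       {empirical_se:.2f}")
print(f"population SD:      {population_sd:.2f}")
print(f"predicted SE:       {predicted_se:.2f}")
```

The exact figures depend on the random draws, so a run will not reproduce the 49.62 and 8.86 quoted in the text. The sketch also computes the SD directly from the numbers 1 to 99 rather than using the rounded value of 28. The point is only that the empirical SE and the predicted SE should land close to each other.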

Therein lies the magic of one of the most forgotten theorems in statistics: the Central Limit Theorem, which states that for sample sizes sufficiently large (and large means greater than 5 or 10), the means will be approximately normally distributed regardless of the shape of the original distribution. So if we are making inferences on means, we can use parametric statistics to do the computations, whether or not the original data are normally distributed.
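To see the "regardless of the shape" part at work on something lumpier than jersey numbers, the following sketch draws samples from a strongly lopsided distribution and measures how lopsided the sample means are. The exponential distribution, the skewness measure, the sample sizes of 1, 10, and 30, and the helper names `skewness` and `skew_of_means` are our choices for illustration and do not come from the text. A skewness of 0 indicates a symmetric distribution such as the normal.

```python
import random
import statistics


def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def skew_of_means(sample_size, n_samples=10_000, seed=1):
    """Skewness of n_samples means, each from sample_size exponential draws.

    Hypothetical helper: the exponential distribution and the fixed seed
    are assumptions chosen for this illustration.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    return skewness(means)


for n in (1, 10, 30):
    print(f"sample size {n:2d}: skewness of means = {skew_of_means(n):.2f}")
```

With a sample size of 1 the "means" are just the raw data, so the skewness is large. As the sample size grows, the skewness of the means should shrink toward 0, which is the bell shape creeping in.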