The approach to statistics is a personal one. The integration of one's academic background and experience determines the degree of emphasis placed on its various facets. Because of this, the agricultural scientist, the veterinary scientist, the psychologist, the economist and the mathematical statistician each come to regard the study of statistics from a different perspective.
This paper outlines the writer's approach. It has, to some extent, a mathematical bias. The subject matter of the address will be mathematical, although no complexities beyond high school algebra will be brought forward. The writer believes that a clear understanding of general principles renders the particular examples within any generalisation less formidable. Also, this approach allows a greater cover of the study of statistics in the course of one paper.
Mathematical statistics has two fields of interest. The first aspect could be called descriptive statistics. Here, the statistician is concerned basically with the condensation and description of quantitative data. The second aspect could be called statistical inference; where the statistician checks deviations of observed data from theoretical material and gives comments based on his findings. From this latter aspect have evolved tests of significance, the analysis of variance technique and experimental designs to exploit this approach.
To amplify these points:—
In descriptive statistics the statistician often is asked to examine univariate, bivariate and multivariate populations. With univariate populations he is given a string of individual values to summarise. For example, he may have the weights of sixty A.I.S. steers when each was, say, 220 days old. A list of the weights would not convey the general picture. A convenient way of handling these weights would be to observe the difference between the highest and lowest value, divide this into nine or ten equal class-intervals, and to tally the number of animals which fall into each class-interval. This is known as determining the frequency distribution; the frequency of animals falling in each class-interval. Graphically, this can be handled by drawing two axes at right angles, the axis parallel to the lower edge of the paper representing the class-intervals and the second axis at right angles to this representing the frequencies. Knowing a class-interval and its frequency, the two can be represented by a point on the graph, the class-interval mid-point and the frequency acting as co-ordinates. In this way we can obtain a pictorial summary of all the class-interval and frequency combinations in the data. Joining all these points by straight lines gives the frequency polygon. If, instead, lines are drawn perpendicular to the class-interval axis at the extremes of each class-interval, and these are cut by a horizontal line through the point representing the frequency of that class-interval, the result is the histogram. The latter seems more logical, since the frequency of values within a class-interval is a fixed number and graphically it should remain fixed for that class-interval also; a sloping line within a class-interval, as one gets in a frequency polygon, implies a change of frequency within the class-interval.
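The tallying procedure described above can be sketched in a few lines of Python; the steer weights below are invented purely for illustration, and five class-intervals are used rather than the nine or ten suggested, simply to keep the output short.

```python
# Sketch of building a frequency distribution: find the range, split it
# into equal class-intervals, and tally the values falling in each.
def frequency_distribution(values, n_classes=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for v in values:
        # place each value in its class-interval (the top value goes
        # into the last class rather than spilling over the edge)
        k = min(int((v - lo) / width), n_classes - 1)
        counts[k] += 1
    midpoints = [lo + (k + 0.5) * width for k in range(n_classes)]
    return midpoints, counts

# Invented weights (lb) for illustration only
weights = [182, 195, 201, 210, 214, 219, 223, 228, 231, 240, 247, 255]
mids, freqs = frequency_distribution(weights, n_classes=5)
for m, f in zip(mids, freqs):
    print(f"{m:6.1f}: {'*' * f}")   # a crude text histogram
```

Plotting the (mid-point, frequency) pairs and joining them gives the frequency polygon; drawing a bar of that height across each class-interval gives the histogram.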
The next task is to describe the distribution numerically. To do this satisfactorily requires two measures, that of central tendency and the spread or dispersion of the values around the central value. Skewness can be measured also; that is, a measure of the degree of asymmetry of the data about the central value.
Central tendency mostly is measured by the arithmetic mean, but in some cases by the mode, and in psychological work often by the median. Measures of dispersion are the range; that is, the difference between the highest and lowest value in a set of data, the variance, and the standard deviation. Skewness can be measured using higher moments of the distribution, which need not bother us here. There is another measure of skewness which often is used. The data is placed in descending order and split into four equally numbered groups; the three values which form the fences, as it were, are termed the first, second and third quartiles, and the difference between the first and second quartile is compared with the difference between the second and third quartile. This comparison gives an indication of the skewness of the distribution.
Often, observed univariate distributions are similar in appearance to theoretical distributions, the important ones being the normal, binomial and Poisson distributions. The agreement can be checked by other statistical procedures.
Bivariate populations involve paired values for each single item in the group. For example, in a field trial we may have yield per plot and the number of plants per plot. Again, in animal work we might have the age of A.I.S. steers and the weight of these same animals. The paired values can be summarised by correlation tables and graphically by a scattergram. These are very similar. In the first case, each of the two criteria is divided into a convenient number of class-intervals, the two sets of intervals are laid out at right angles, and the frequencies of observations falling into the various cells are tallied; this gives the correlation table. In the scattergram, two axes are drawn at right angles and each axis represents one of the factors. A point is plotted to represent each paired observation, the co-ordinates for the point being the two values of the single item in the group. The process is repeated for all items to give the scattergram.
Mathematical treatment of this data can be complex. One approach is to divide the scattergram into four areas by lines at right angles to the axes passing through the mean value of each of the two variables. The correlation coefficient measures the dominance of points in two diagonal quarters over and above the remaining two quarters. If the two were near equal there would be considerable scatter across the scattergram; whereas if one set was far in excess of the second set then there would be indication of some degree of association between the two factors. Regression analysis provides the algebraic relation between the two factors and is preferred. As well as obtaining the relationship between the variables it provides the means of obtaining the correlation coefficient.
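The standard computations behind these two quantities can be sketched as below: the least-squares line y = a + bx and the correlation coefficient r, both built from the same sums of products about the means (the paired data here is invented for illustration).

```python
# Sketch: correlation coefficient r and least-squares regression line
# y = a + b*x for paired observations (x_i, y_i).
def corr_and_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx                   # regression slope
    a = my - b * mx                 # intercept
    r = sxy / (sxx * syy) ** 0.5    # correlation coefficient
    return a, b, r

# Invented paired values lying exactly on y = 1 + 2x
a, b, r = corr_and_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b, r)   # 1.0 2.0 1.0
```

Note that r is built from the very same quantity sxy as the slope b, which is the sense in which regression analysis "provides the means of obtaining the correlation coefficient."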
With multivariate populations the graphical presentation becomes unwieldy. Even the trivariate populations require a certain degree of skill in their manipulation. The mathematical approach is to compute the regression equation relating the variables and to compute multiple and partial correlation coefficients.
We pass on now to the second phase of statistics, that of statistical inference. At the outset, the mathematical statistician here differentiates between parameters and statistics. Parameters are measures from a population (or universe); statistics are estimates of these parameters obtained from a sample or samples taken from that population (or universe). He adopts the convention of using Greek letters for parameters and the corresponding Roman letters for the statistics. He computes functions of these statistics and by mathematical theory determines the theoretical distribution of an infinite number of such like computations. The two are then brought together and he is able to estimate how often the observed sample results are likely to occur from chance.
The tossing of dice offers an example to illustrate the method of statistical inference. All scores between 2 and 12 are possible with a single throw of two dice. The theoretical frequencies of the different scores for a specified number of throws can be computed. The histogram can be plotted as described previously. It should be noted that the chance of a particular score or less can be estimated by computing the area under the histogram up to the score in question and dividing this by the total area under the histogram.
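The dice example is small enough to enumerate completely; the sketch below counts the 36 equally likely outcomes to obtain the theoretical frequencies, and then computes the chance of a particular score or less exactly as described above.

```python
from fractions import Fraction

# Theoretical distribution of the score from a single throw of two dice:
# count each of the 36 equally likely (die1, die2) outcomes.
freq = {s: 0 for s in range(2, 13)}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        freq[d1 + d2] += 1

def prob_at_most(score):
    # area under the histogram up to the score, over the total area
    return Fraction(sum(f for s, f in freq.items() if s <= score), 36)

print(freq[7])           # 6 ways of scoring 7
print(prob_at_most(4))   # 1/6
```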
This process can be done with any theoretical distribution. The convention in biological work is to put the cart before the horse. The value is located which is likely to be exceeded from chance only once in twenty times. In effect the distribution is plotted; imagine a frequency polygon with an infinite number of extremely small class-intervals, and a point located along the lower axis from which a vertical line would chop off one-twentieth of the area under the curve. If the distribution is symmetrical we could have two values, equally spaced from its axis of symmetry, which would cut off areas 0.025 times the whole area to the left of the first perpendicular and to the right of the second perpendicular respectively.
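For the normal distribution this cut-off point can be located numerically. The sketch below uses the relation between the normal curve's area and the error function, Φ(z) = ½(1 + erf(z/√2)), and bisects for the point whose upper-tail area is 0.025, i.e. the two-tailed 5% point.

```python
import math

# Area under the standard normal curve to the left of z,
# via the error function available in the standard library.
def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Bisect for the point with phi(z) = 0.975, so that 0.025 of the
# area lies beyond it in the upper tail (two-tailed 5% level).
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if phi(mid) < 0.975:
        lo = mid
    else:
        hi = mid
print(round(lo, 2))   # the familiar 1.96
```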
Another important concept is that of an unbiased estimate of a parameter. An estimate is unbiased if the expected value of the statistic equals the population parameter.
Suppose we have a number of estimates of a parameter, and these statistics are plotted in a histogram, but with the frequencies replaced by their relative frequencies; i.e., the proportion of times the particular statistics have occurred in the observations. The expected value of the statistic is simply the addition of the paired products of individual statistic and relative frequency over the entire data. If this value equals the population parameter the estimate is said to be unbiased.
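The idea can be verified by brute force on a tiny population: enumerate every equally likely sample of size two (drawn with replacement, for simplicity), compute the statistic on each, and average. The sample mean comes out unbiased, while the variance computed with divisor n does not; dividing by n - 1 restores unbiasedness, which is why the sample variance is defined that way.

```python
from itertools import product

# Tiny population, so every possible sample can be enumerated.
population = [2, 4, 6, 8]
N = len(population)
pop_mean = sum(population) / N                                   # 5.0
pop_var = sum((x - pop_mean) ** 2 for x in population) / N       # 5.0

# All samples of size 2, drawn with replacement: each equally likely.
samples = list(product(population, repeat=2))

def expected(stat):
    # expected value = average of the statistic over all samples
    return sum(stat(s) for s in samples) / len(samples)

mean_of_means = expected(lambda s: sum(s) / 2)
var_n  = expected(lambda s: sum((x - sum(s) / 2) ** 2 for x in s) / 2)
var_n1 = expected(lambda s: sum((x - sum(s) / 2) ** 2 for x in s) / 1)
print(mean_of_means, var_n, var_n1)   # 5.0 2.5 5.0
```

The sample mean's expectation (5.0) matches the population mean, so it is unbiased; the divisor-n variance averages only 2.5 against a true 5.0, while the divisor-(n-1) version recovers 5.0 exactly.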
We will see now how this is put into practical use. Basically, a sample is taken from a population, and unbiased estimates of the parameters computed from the sample. These estimates are then re-constituted into a new combination. The theoretical distribution of such a combination, and the key values marking off specified areas under this distribution, are known. The process of statistical inference is matching the computed combination with the theoretical values and offering conclusions which this pairing brings forth.
The most common statistics computed are the t-value, the chi-squared, and Fisher's z value. The last named is identical with Snedecor's F value. In each case the procedure is basically the same; that of matching an observed value with the theoretical and offering appropriate conclusions. There are variations to this theme, but these are merely the manipulation of the relationships, much in the same way as in school algebra the subject of a formula can be changed. Another point is that the format of presentation of the theoretical values is not standardised. Tables for t-values may even differ from one text book to another.
Chi-squared tables list probabilities 0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95, 0.98 and 0.99 across the table; in the extreme left column are the numbers of degrees of freedom in the sample, and in the body of the table, the theoretical chi-squared values. Once a chi-squared value is determined, the table shows how often such a result is likely to arise from chance.
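A goodness-of-fit use of the table can be sketched as follows; the counts are invented, representing, say, two classes of offspring against a hypothetical 3:1 expected ratio.

```python
# Sketch of a chi-squared goodness-of-fit computation: observed counts
# against a hypothetical 3:1 expectation (figures invented).
observed = [84, 16]          # e.g. two classes out of 100 individuals
expected = [75, 25]          # the 3:1 theoretical split

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))      # 4.32
# The tabulated 5% value for 1 degree of freedom is 3.841; an observed
# value this large would arise from chance less than once in twenty
# times, so the 3:1 hypothesis would come under suspicion.
```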
Probably the importance of the chi-squared test in regard to animal experimentation and agronomic work is Fisher's extension of it and the practical procedures which have followed from this extension.
Snedecor's F-distribution is the ratio of two chi-squared statistics, each previously divided by its respective degrees of freedom.
By making the transformation z = ½ logₑF we have Fisher's z distribution.
The tabulated theoretical values for F are presented in a two-way table with degrees of freedom heading both aspects of the table, and in each cell in the body of the table are two values, the 5% and 1% values for F. The 5 per cent. values are printed in Roman face and the 1 per cent. in bold face type. For large values of degrees of freedom the F-values have to be interpolated.
The practical application of the F-distribution is seen in the analysis of variance technique. Briefly, the data is classified into different sources of variation. If the data were simply random fluctuations, that is, if the differences between classes were simply due to chance, independent chi-squared estimates derived from the data, each divided by its respective degrees of freedom, would be nearly equal, so that the F-value would be unity or thereabouts. The "thereabouts" figure would fluctuate depending on the number of degrees of freedom for each chi-squared estimate. When the observed ratio exceeds the value in the table we suspect factors other than chance are operating.
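A one-way classification illustrates the arithmetic; the three groups of figures below are invented. The between-class and within-class sums of squares are each divided by their degrees of freedom, and the ratio of the resulting mean squares is the F-value to be checked against the table.

```python
# Sketch of the analysis-of-variance F ratio for a one-way
# classification (all figures invented for illustration).
groups = [[8, 9, 11], [12, 14, 13], [18, 17, 19]]
n = sum(len(g) for g in groups)
grand = sum(sum(g) for g in groups) / n

# sum of squares between classes, and within classes
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within  = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
df_between = len(groups) - 1          # 2
df_within  = n - len(groups)          # 6

F = (ss_between / df_between) / (ss_within / df_within)
print(round(F, 1))
# The tabulated 5% value of F for 2 and 6 degrees of freedom is 5.14;
# an observed ratio far above this points to real class differences.
```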
The randomised block design is a method of studying the effect of classification of experimental data. It provides a chi-squared estimate using the classification, with its degrees of freedom, and also a chi-squared estimate due to unassignable causes, based on its degrees of freedom. In a two-way classification the experimental data provides two chi-squared estimates over and above the one from unassignable causes. The observed ratio or ratios of the scaled chi-squares give F-values which are then checked against the values in the tables.
There are elaborate statistical designs which have evolved from the randomised block design layout. Latin squares, lattice designs and the factorial experiments are not basically different from one another, in the sense that the experimental data subsequently provide ratios of chi-squared estimates appropriately scaled to their respective degrees of freedom, and these are matched with the theoretical values.
In conclusion, it should be realised that the value of experimental work rests heavily on the subsequent statistical analysis of the experimental results. In many instances the experimenter, whether he be agricultural scientist, veterinary scientist, psychologist, biologist or chemist, prior to plunging into his investigations, would be advised to find out if his proposed plan lends itself to worthwhile statistical analysis. This information would be given willingly by any statistician, and further, it is quite likely that the statistician, from his knowledge of designs and analyses, together with the experimenter, with his knowledge of the test material, would bring improvements to the original plan.
(This article, the basis of an address at Conference, 1960, has been abridged to the extent of omission of certain formulae relating to t-value, chi-squared and Fisher's z values and Snedecor's F distribution. This information is available readily, however, from the author, should any further particulars be required.—EDITOR)