Unbiased Research: William Sealy Gosset (a.k.a. Student)

Sunday, April 24, 2016

William Sealy Gosset (a.k.a. Student)

William Sealy Gosset was born on June 13, 1876, the oldest of five children. After studying chemistry and mathematics, Gosset joined the Guinness brewery in Dublin in 1899, where he worked unofficially as a statistician. When the brewery began research on the best yielding but cheapest varieties of barley, Gosset was able to really work on his statistical craft. By 1903 Gosset was able to calculate standard error, and in 1904 he wrote a report titled The Application of the Law of Error to the Work of the Brewery that got the attention of Karl Pearson. Gosset spent two terms in 1906 and 1907 in Pearson’s biometrics lab studying distributional theory and the correlation coefficient.

After learning under Pearson, Gosset took up the study of the law of error and published The Probable Error of a Mean in 1908 in Pearson’s journal Biometrika. Due to disclosure issues at the brewery, Gosset was forced to publish under the pseudonym Student. This explains why his most noteworthy achievement is now called Student’s t-distribution and not the Gosset t-distribution. Gosset’s work attracted the attention of Ronald Fisher, who believed that Gosset had caused a “logical revolution”.

Gosset would eventually go on to publish 22 manuscripts on topics ranging from experimental design and robustness to the theory of Natural Selection. When not studying statistics, Gosset was noted to be interested in gardening, boat-building, biking, golfing, sailing, and fishing. William Gosset left Dublin in 1935 to head a new Guinness brewery in London, where he died of a heart attack in 1937 at the age of 61. He was survived by his wife, three children, and grandson.

In pursuing his goal of estimating population parameters from a small sample, Gosset ran into some heavyweight opposition, not the least of which was his former mentor Karl Pearson, who said, “only naughty brewers take n so small that the difference is not of the order of the probable error”. Increasing n was too expensive for Guinness, so Gosset set out to make his own probability tables for small samples. He designed an analog simulation, and the tediousness really makes one appreciate Excel. Man, that thing is nice. Sans Excel, he wrote down the heights and left middle finger lengths of 3000 convicts (he chose convicts just because that was the data available to him) on separate pieces of cardboard. He then separated the data into 750 groups of 4, recording the mean and standard deviation of each group. For each group he calculated his z statistic, (x̄ - µ)/s, where x̄ is the group mean, µ is the population mean, and s is the group standard deviation, and plotted the resulting values against the population (of 3000 convicts) parameters.
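For readers who want to try the card experiment without the cardboard, here is a minimal Python sketch of it. It is not Gosset's data or his exact procedure: the 3000 "heights" are hypothetical values drawn from a normal distribution (the mean of 166 cm and standard deviation of 6.5 cm are made-up stand-ins), and the sketch assumes s is computed with divisor n, so that z can be rescaled to the modern t statistic as t = z·√(n − 1).

```python
import math
import random
import statistics

random.seed(1908)  # fixed seed so the sketch is reproducible

POP_SIZE = 3000    # number of "cards"
GROUP_SIZE = 4     # sample size per group

# Hypothetical stand-in for the convict heights (cm); NOT Gosset's data.
population = [random.gauss(166.0, 6.5) for _ in range(POP_SIZE)]
mu = statistics.fmean(population)  # population mean, known because we hold every card

# Shuffle the cards and deal them into 750 groups of 4.
random.shuffle(population)
groups = [population[i:i + GROUP_SIZE] for i in range(0, POP_SIZE, GROUP_SIZE)]

z_values = []
for group in groups:
    mean = statistics.fmean(group)
    s = statistics.pstdev(group)      # assumption: standard deviation with divisor n
    z_values.append((mean - mu) / s)  # the z statistic, (mean - mu) / s

# Rescale z to the modern t statistic (valid when s uses divisor n).
t_values = [z * math.sqrt(GROUP_SIZE - 1) for z in z_values]

NORMAL_CRIT = 1.960  # two-sided 5% cutoff of the standard normal
T_CRIT = 3.182       # two-sided 5% cutoff of Student's t, 3 degrees of freedom

frac_beyond_normal = sum(abs(t) > NORMAL_CRIT for t in t_values) / len(t_values)
frac_beyond_t = sum(abs(t) > T_CRIT for t in t_values) / len(t_values)

print(f"groups: {len(groups)}")
print(f"share of |t| beyond the normal cutoff: {frac_beyond_normal:.3f}")
print(f"share of |t| beyond the t cutoff:      {frac_beyond_t:.3f}")
```

If the groups of 4 behaved like large samples, about 5% of the values would fall beyond the normal cutoff. The heavier tails that show up here are the reason small samples needed their own tables.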

Awesomely, a team at McGill ran Gosset’s methods through R to see if his observations would hold up to the high output of computerized simulations.

As they report: “Scotland Yard precision [of measuring by 1/8” instead of 1” increments] and today’s computing power would have left Gosset in no doubt that the distribution of s which he ‘assumed’ was correct was in fact correct.”

The data handed to Gosset when he began working for Guinness were difficult for him to interpret at first. The sample sizes were very small, and these small sample sizes resulted in large standard errors that made analyzing differences between means impossible. At the time there were no distribution curves or probability tables that were appropriate to apply to data from small sample sizes. One of Gosset’s statistical focuses became how to properly analyze these types of data. It is evident today that this biased the focus of his more famous accomplishments towards studies that primarily dealt with resolving issues of small sample sizes.

Gosset needed a version of the normal distribution curve and probability table that would allow for differences between two means to be observed in experiments with small sample sizes. After studying under Karl Pearson, Gosset came up with an alternative distribution for smaller sample sizes. He published this work in Biometrika, and in his article titled “The Probable Error of a Mean,” he offered an alternative distribution (t-distribution table) that was modeled after normal distributions for larger sample sizes. He provided evidence that his new t-distribution could be applied to data from small sample sizes. He also came up with the Student’s t-test, which was later refined by a collaborator, R. A. Fisher.
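To see why the normal table was not good enough for small samples, here is a rough Python sketch (not anything from Gosset's paper). It builds 95% confidence intervals for the mean from simulated samples of 4 and counts how often they cover the true mean, once with the normal cutoff of 1.960 and once with the t cutoff of 3.182 for 3 degrees of freedom. The function name `coverage`, the seed, and the number of trials are all arbitrary choices made for illustration.

```python
import math
import random
import statistics

Z_CRIT = 1.960  # two-sided 95% cutoff of the standard normal
T_CRIT = 3.182  # two-sided 95% cutoff of Student's t with 3 degrees of freedom (n = 4)


def coverage(critical_value, n=4, trials=20000, seed=1908):
    """Share of simulated intervals mean +/- critical_value * SE that contain the true mean."""
    rng = random.Random(seed)
    true_mean = 0.0
    hits = 0
    for _ in range(trials):
        # Samples come from a normal population, matching the t-test's assumption.
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        mean = statistics.fmean(sample)
        std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev uses divisor n - 1
        if abs(mean - true_mean) <= critical_value * std_err:
            hits += 1
    return hits / trials


coverage_normal = coverage(Z_CRIT)
coverage_t = coverage(T_CRIT)

print(f"coverage with the normal cutoff: {coverage_normal:.3f}")
print(f"coverage with the t cutoff:      {coverage_t:.3f}")
```

With samples this small, the intervals built from the normal cutoff should cover the true mean noticeably less often than the nominal 95%, while the t cutoff should land close to it. Note that `T_CRIT` is only right for n = 4; other sample sizes need the cutoff for their own degrees of freedom.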

Since Gosset verified his t-distribution by comparing his calculations, curves, and tabled data to those that were previously well-established using normally distributed data, this biased the application of t-distribution tables and the t-test to data from small sample sizes that are normally distributed. Today, when deciding whether to use a t-test, we still have to be aware of one of the assumptions that a t-test makes, which is that the data you apply it to are normally distributed.
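The normality assumption can be poked at with a small simulation. The Python sketch below is only an illustration under stated assumptions: it runs a one-sample t-test at the 5% level on samples of 4 for which the null hypothesis is true, first with normally distributed data and then with strongly skewed (exponential) data. The helper `rejection_rate`, the choice of the exponential distribution, and the seed are all hypothetical choices for this example.

```python
import math
import random
import statistics

T_CRIT = 3.182  # two-sided 5% cutoff of Student's t with 3 degrees of freedom (n = 4)


def rejection_rate(draw, true_mean, n=4, trials=20000):
    """Share of samples for which a one-sample t-test rejects a TRUE null hypothesis."""
    rejections = 0
    for _ in range(trials):
        sample = [draw() for _ in range(n)]
        std_err = statistics.stdev(sample) / math.sqrt(n)
        t = (statistics.fmean(sample) - true_mean) / std_err
        if abs(t) > T_CRIT:
            rejections += 1
    return rejections / trials


rng = random.Random(1937)

# Normal data with mean 0: the assumption behind the t-test holds.
rate_normal = rejection_rate(lambda: rng.gauss(0.0, 1.0), true_mean=0.0)

# Exponential data with mean 1: strongly skewed, so the assumption is violated.
rate_skewed = rejection_rate(lambda: rng.expovariate(1.0), true_mean=1.0)

print(f"false-positive rate, normal data: {rate_normal:.3f}")
print(f"false-positive rate, skewed data: {rate_skewed:.3f}")
```

With normal data the false-positive rate should sit near the nominal 5%. With skewed data and samples this small it tends to come out well above that, which is the practical cost of ignoring the assumption.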

It is hard to imagine a world in which inferences about a whole cannot be made from a small number of samples. Without this capability much of the research in the biological sciences would be impossible. William Gosset and his “Student’s t test” were instrumental in laying the statistical foundation upon which much of research now rests.

The Student’s t test is one of the most widely utilized statistical tests in biological research to this day. Gosset’s work informed the work of Fisher and the development of the concept of “significant difference.” While Gosset found the idea of statistical significance “nearly valueless,” it has nonetheless become a pillar of modern scientific interpretation.

Gosset also made substantial contributions to the field of industrial quality control. His ideas heavily influenced W. Edwards Deming, a pioneer of the quality revolution and the idea of “six sigma.”

Despite these contributions, Gosset remains little known in the lore of statistics. This may be due to his use of the pseudonym “Student” in his publications, the fact that he was not a professor like Fisher or Pearson, or his apparent lack of an ideology. Despite his significant role in the development of modern statistics, Gosset seems to have been less interested in revolutionizing a scientific field than in solving practical problems.
