Primer of Biostatistics, by Stanton Glantz.
Suppose IPMB did contain statistics. What would that look like? I suspect Russ and I would summarize this topic in an appendix. The logical place seems to be right after Appendix G (The Mean and Standard Deviation). We would probably not want to go into great detail, so we would only consider the simplest case: a “student’s t-test” of two data sets. It would be something like this (but probably less wordy).
Appendix G ½ Student’s T Test
Suppose you divide a dozen patients into two groups. Six patients get a drug meant to lower their blood pressure, and six others receive a placebo. After a month of treatment, their blood pressure is measured. The data is given in Table G ½.1.
Table G ½.1. Systolic Blood Pressure (in mmHg)
Drug    Placebo
115     99
90      106
99      100
108     119
107     96
96      104
Is the drug effective in lowering blood pressure? Statisticians typically phrase the question differently: they adopt the null hypothesis that the drug has no effect, and ask if the data justifies the rejection of this hypothesis.
The first step is to calculate the mean, using the methods described in Appendix G. The mean for those receiving the drug is 102.5 mmHg, and the mean for those receiving the placebo is 104.0 mmHg. So, the mean systolic blood pressure was lower with the drug. The crucial question is: could this difference arise merely from chance, or does it represent a real difference? In other words, is it likely that this difference is a coincidence caused by taking too small a sample?
To answer this question, we need to next calculate the standard deviation σ of each data set. We calculate this using Eq. G.4, except that because we do not know the mean of the data but only estimate it from our sample, we should use the factor N/(N-1) for the best estimate of the variance, where N = 6 in this example. The standard deviation is then σ = √( Σ (x - x_mean)² / (N-1) ). The calculated standard deviation for the patients who took the drug is 9.1 mmHg, whereas for the patients who took the placebo it is 8.2 mmHg.
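As a quick check of these numbers, here is a minimal Python sketch that uses only the standard library. The variable names are my own, not anything from IPMB. Note that `statistics.stdev` divides by N - 1, which matches the formula above.

```python
import statistics

# Systolic blood pressure (mmHg) from Table G ½.1
drug = [115, 90, 99, 108, 107, 96]
placebo = [99, 106, 100, 119, 96, 104]

mean_drug = statistics.mean(drug)          # 102.5
mean_placebo = statistics.mean(placebo)    # 104

# statistics.stdev is the sample standard deviation (divisor N - 1)
sd_drug = statistics.stdev(drug)           # about 9.1
sd_placebo = statistics.stdev(placebo)     # about 8.2

print(mean_drug, mean_placebo)
print(round(sd_drug, 1), round(sd_placebo, 1))
```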
The standard deviation describes the spread of the data within the sample, but what we really care about is how accurately we know the mean of the data. The standard deviation of the mean is calculated by dividing the standard deviation by the square root of N. This gives 3.7 for patients taking the drug, and 3.3 for patients taking the placebo.
We are primarily interested in the difference of the means, which is 104.0 – 102.5 = 1.5 mmHg. The standard deviation of the difference in the means can be found by squaring each standard deviation of the mean, adding them, and taking the square root (standard deviations add like in the Pythagorean theorem). You get √(3.7² + 3.3²) = 5.0 mmHg.
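The last two steps can be sketched in a few lines of Python. The helper name `sem` is my own choice, and `math.hypot` is just a convenient way to add two numbers in quadrature.

```python
import math
import statistics

# Systolic blood pressure (mmHg) from Table G ½.1
drug = [115, 90, 99, 108, 107, 96]
placebo = [99, 106, 100, 119, 96, 104]

def sem(sample):
    """Standard deviation of the mean: sample standard deviation / sqrt(N)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

sem_drug = sem(drug)          # about 3.7
sem_placebo = sem(placebo)    # about 3.3

# Add the two standard deviations of the mean in quadrature
sd_diff = math.hypot(sem_drug, sem_placebo)   # about 5.0

print(round(sem_drug, 1), round(sem_placebo, 1), round(sd_diff, 1))
```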
Compare the difference of the means to the standard deviation of the difference of the means by taking their ratio. Following tradition we will call this ratio T, so T = 1.5/5.0 = 0.3. If the drug has a real effect, we would expect the difference of the means to be much larger than the standard deviation of the difference of the means, so the absolute value of T should be much greater than 1. On the other hand, if the difference of the means is much smaller than the standard deviation of the difference of the means, the result could arise easily from chance and |T| should be much less than 1. Our value is 0.3, which is less than 1, suggesting that we cannot reject the null hypothesis, and that we have not shown that the drug has any effect.
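Putting the whole calculation into one function makes it easy to reuse. This is only a sketch of the recipe described above, and the function name `t_statistic` is my own.

```python
import math
import statistics

def t_statistic(a, b):
    """Difference of the means of a and b divided by its standard deviation."""
    diff = statistics.mean(a) - statistics.mean(b)
    # Square each standard deviation of the mean, add, and take the square root
    sd_diff = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return diff / sd_diff

# Systolic blood pressure (mmHg) from Table G ½.1
drug = [115, 90, 99, 108, 107, 96]
placebo = [99, 106, 100, 119, 96, 104]

T = t_statistic(placebo, drug)
print(round(T, 2))   # about 0.3
```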
But can we say more? Can we transform our value of T into a probability? We can. If the drug truly had no effect, then we could repeat the experiment many times and get a distribution of T values. We would expect the values of T to be centered about T = 0 (remember, T can be positive or negative), with small values much more common than large. We could interpret this as a probability distribution: a bell shaped curve peaked at zero and falling as |T| becomes large. In fact, although we will not go into the details here, we can determine the probability p that |T| is greater than some critical value when the null hypothesis is true. By tradition, one usually requires the probability p to be smaller than one twentieth (p less than 0.05) if we want to reject the null hypothesis and claim that the drug does indeed have a real effect. The critical value of T depends on N, and values are tabulated in many places. In our case, the tables suggest that |T| would have to be greater than 2.23 in order to reject the null hypothesis and say that the drug has a true (or, in the technical language, a “significant”) effect.
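The thought experiment of repeating the study many times is easy to try on a computer. The sketch below rests on my own assumptions, not on anything in the data: both groups are drawn from the same normal distribution (mean 104 mmHg, standard deviation 8.5 mmHg), so the null hypothesis is true by construction. It then counts how often |T| exceeds 2.23.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Difference of the means of a and b divided by its standard deviation."""
    diff = statistics.mean(a) - statistics.mean(b)
    sd_diff = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return diff / sd_diff

def simulate_null(trials=10000, n=6, mean=104.0, sd=8.5, seed=0):
    """Fraction of no-effect experiments in which |T| exceeds 2.23."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    exceed = 0
    for _ in range(trials):
        # Both groups come from the same distribution: the "drug" does nothing
        a = [rng.gauss(mean, sd) for _ in range(n)]
        b = [rng.gauss(mean, sd) for _ in range(n)]
        if abs(t_statistic(a, b)) > 2.23:
            exceed += 1
    return exceed / trials

fraction = simulate_null()
print(fraction)   # should land near 0.05
```

If the tabulated critical value is right, roughly one simulated experiment in twenty should cross it even though no real effect exists.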
If taking p less than 0.05 seems like an arbitrary cutoff for significance, then you are right. Nothing magical happens when p reaches 0.05. All it means is that the probability that a difference of the means this large could have arisen by chance is less than 5%. It is always possible that you were really, really unlucky and that your results arose by chance but |T| just happened to be very large. You have to draw a line somewhere, and the accepted tradition is that p less than 0.05 means that the probability of the results being caused by random chance is small enough to ignore.
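To attach an actual value of p to the blood pressure example, one can integrate the t distribution numerically. The sketch below is my own, not the method used to build the published tables. It assumes the standard result that two groups of six patients give 6 + 6 - 2 = 10 degrees of freedom, and it uses Simpson's rule on the textbook formula for the density.

```python
import math

def t_pdf(t, dof):
    """Probability density of Student's t distribution with dof degrees of freedom."""
    norm = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return norm * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_tailed_p(t_value, dof, steps=2000):
    """Probability that |T| exceeds |t_value|, using Simpson's rule (steps must be even)."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    # Integral from 0 to |t|, doubled because the density is symmetric about zero
    inside = 2 * total * h / 3
    return 1 - inside

dof = 6 + 6 - 2                        # assumed degrees of freedom for two groups of six
p_example = two_tailed_p(0.3, dof)     # roughly 0.77 for our T = 0.3
p_critical = two_tailed_p(2.23, dof)   # very close to 0.05

print(round(p_example, 2), round(p_critical, 3))
```

A p of about 0.77 is nowhere near 0.05, which agrees with the conclusion that the data do not let us reject the null hypothesis.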
Problem 1 Analyze the following data and determine if X and Y are significantly different. Use the table of critical values for the T distribution at http://en.wikipedia.org/wiki/Student%27s_t-distribution.
X       Y
94      122
93      118
104     119
105     123
115     102
96      115
I should mention a few more things.
1. Technically, what we considered above is a two-tailed t-test, so we’re testing if we can reject the null hypothesis that the two means are the same, implying that either the drug had a significant effect of lowering blood pressure or the drug had a significant effect of raising blood pressure. If we wanted to test only if the drug lowered blood pressure, we should use a one-tailed test.
2. We analyzed what is known as an unpaired test. The patients who got the drug are different than the patients who did not. Suppose we gave the drug to the patients in January, let them go without the drug for a while, then gave the same patients the placebo in July (or vice versa). In that case, we have paired data. It may be that patients vary a lot among themselves, but that the drug reduced everyone’s blood pressure by the same fixed percentage, say 12%. There are special ways to generalize the t-test for paired data.
3. It’s easy to generalize these results to the case when the two samples have different numbers N.
4. Please remember, if you found 20 papers in the literature that all observed significant effects with p less than but on the order of 0.05, then on average one of those papers is going to be reporting a spurious result: the effect is reported as significant when in fact it is a statistical artifact. Given that there are thousands (millions?) of papers out there reporting the results of t-tests, there are probably hundreds (hundreds of thousands?) of such spurious results in the literature. The key is to remember what p means, and to not over-interpret or under-interpret your results.
5. Why is this called the “student’s t-test”? The inventor of the test, William Gosset, was a chemist working for Guinness, and he devised the t-test to assess the quality of stout. Guinness would not let its chemists publish, so Gosset published under the pseudonym “student.”
6. The t-test is only one of many statistical methods. As is typical of IPMB, we have just scratched the surface of an exciting and extensive topic.
7. There are many good books on statistics. One that might be useful for readers of IPMB (focused on biological and medical examples, written in engaging and nontechnical prose) is Primer of Biostatistics, 7th edition, by Stanton Glantz.
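For readers who want to experiment with the variations in notes 2 and 3, here is a rough Python sketch. Everything in it is my own illustration: the function names are invented, the three paired readings are hypothetical numbers rather than real data, and the unequal-N formula is one common choice (the one used in Welch's t-test) among several. The critical value of T also changes in both cases because the number of degrees of freedom changes, and I do not compute that here.

```python
import math
import statistics

def paired_t(first, second):
    """T for paired data: mean per-patient difference over its standard deviation of the mean."""
    diffs = [x - y for x, y in zip(first, second)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def unequal_n_t(a, b):
    """T for two independent samples that may have different numbers of patients."""
    sd_diff = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / sd_diff

# Hypothetical readings (mmHg) for the same three patients, invented for illustration
on_placebo = [120, 130, 140]
on_drug = [110, 121, 128]
print(round(paired_t(on_placebo, on_drug), 1))

# With equal group sizes, unequal_n_t reproduces T = 0.3 from Table G ½.1
drug = [115, 90, 99, 108, 107, 96]
placebo = [99, 106, 100, 119, 96, 104]
print(round(unequal_n_t(placebo, drug), 2))
```

In the hypothetical paired data the patients differ from one another by tens of mmHg, yet each one drops by about 10 mmHg on the drug, so the paired T comes out large even with only three patients.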