Frequently-Asked Questions - March


"What's wrong with the histograms Minitab draws for Question 1 of Assignment #3?"

1998-03-23

You know that the chi-square distribution with 2 degrees of freedom has its mode at the origin (see the graph on page 170 of Rosner, or page 23-1 of the March 4 lecture notes). However, some or all of the histograms Minitab draws for your samples with n = 50 might have the first bar much shorter than the second. (With n = 5 the samples are too small for the histograms to be of any use, so the problem is less noticeable.)

Look at the X-axis. If you used the Minitab default histogram, you will probably find that the mid-point of the first interval is 0. This isn't a good choice for non-negative data because no data can fall in the left half of the first interval, hence the first bar of the histogram, as an estimate of probability density, will be too short by a factor of 2.

To avoid this when you define the histogram, set the first cut-point to be zero, or force the first cut-point to be zero by setting the first mid-point to be half the interval width.
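
The size of the effect can be seen without any sampling noise by working out the expected bar heights from the chi-square(2) distribution function, 1 - exp(-x/2). The Python sketch below is only an illustration: the interval width of 1 and the function names are assumptions made for the example, not Minitab's actual defaults.

```python
import math

def chi2_2df_cdf(x):
    """CDF of the chi-square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) if x > 0 else 0.0

def expected_bar_heights(first_cut, width, n_bars):
    """Expected histogram heights on the density scale (probability / width)."""
    heights = []
    for i in range(n_bars):
        lo = first_cut + i * width
        hi = lo + width
        heights.append((chi2_2df_cdf(hi) - chi2_2df_cdf(lo)) / width)
    return heights

WIDTH = 1.0  # assumed interval width, chosen only for illustration

# First mid-point at 0, so the first interval runs from -0.5 to 0.5.
centred = expected_bar_heights(-WIDTH / 2, WIDTH, 4)
# First cut-point at 0, so the first interval runs from 0 to 1.
cut_at_zero = expected_bar_heights(0.0, WIDTH, 4)

print("first mid-point at 0:", [round(h, 3) for h in centred])
print("first cut-point at 0:", [round(h, 3) for h in cut_at_zero])
```

With the first mid-point at 0, the first expected bar comes out shorter than the second even though the density is highest at the origin. With the first cut-point at 0, the bars decrease steadily, as they should.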

In an ideal statistics package, the default setting should always give something reasonable. Some packages will by default set the first histogram cut-point to 0 when the data are non-negative. Minitab doesn't, and the result is not satisfactory here.


"How do I know whether to use a one-sided or two-sided test?"

1998-03-22

In this course we consider "simple" hypotheses like μ = 3 that specify a single value, and "composite" alternative hypotheses like μ > 3, μ < 3, or μ <> 3 that specify a range of values. (Here <> means "is not equal to"; the usual symbol, a crossed-out equals sign, is unavailable on web browsers.) If the alternative is two-sided (μ <> 3), do a two-sided test. If it is one-sided, do a one-sided test in the direction specified by the hypothesis. A two-sided test is sometimes called a "double-tail test". A one-sided test can be a left-tail or right-tail test.

Take the example of the castings that are supposed to have a mean weight of 5 kg. Usually, you are checking for a difference in either direction (is the mean weight really 5 kg, or is it off in either direction?), so your alternative will be two-sided. You will get a one-sided alternative in situations where you are only interested in one direction (you are concerned that the castings might be overweight and you don't care if they are underweight) or where you know that only one direction is possible (perhaps you know that when the casting process goes wrong it always increases the mean weight and never decreases it).

Now, suppose you do a 5% right-tail z-test for a normal mean and get z0 = 1.843. The right-tail 5% point is 1.645 so this result is significant at the 5% level and you reject the hypothesis. But if you had chosen to do a two-sided test, the two-sided critical value is 1.960 so you would accept the hypothesis. In the one case the P-value is slightly less than 5%, in the other it is slightly more than 5%, so you could explain the contradiction by saying that the test is borderline in either case.
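
The two P-values in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name std_normal_cdf is an assumption for the example, and it plays the same role as the standard normal table.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

z0 = 1.843  # observed test statistic from the example above

p_right_tail = 1.0 - std_normal_cdf(z0)
p_two_sided = 2.0 * (1.0 - std_normal_cdf(abs(z0)))

print(f"right-tail P-value: {p_right_tail:.4f}")
print(f"two-sided P-value:  {p_two_sided:.4f}")
```

The right-tail P-value comes out at roughly 3.3% and the two-sided one at roughly 6.5%, one on each side of 5%.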

The theory of hypothesis testing assumes that you pick α and decide on a one-sided or two-sided test before you get the data. But what if you decide to do a 5% right-tail test, get the data, and find that the test statistic is away off in the left tail? You would probably abandon the right-tail test and do a two-tailed test. Your presuppositions going into the experiment appear to have been wrong so you revise your planned analysis. This is an example where the paradigm of statistics doesn't match the paradigm of science and we have to be pragmatic.


"How do I know whether to use a two-sample t-test or a paired t-test?"

1998-03-21

Simple. If the data are paired you must use a paired test. If they aren't, you can't.

If you have 10 subjects and each was observed before and after treatment, the data are paired.

If you have 10 pairs of subjects, with the members of each pair chosen to be as similar as possible (race, gender, age, medical history, etc), and one member of each pair is randomly selected to be the control while the other gets the treatment, the data are paired.

If you have 10 subjects randomly assigned to a control group and another 10 subjects randomly assigned to a treatment group, there is no pairing, no matching. There is no way to match a subject from one group with any particular subject in the other group. You have two independent samples.


"What data do I use in Questions 3, 4, and 5 of Assignment #3?"

1998-03-21

No data. None. In these questions you are not analysing data. You are plotting mathematical functions that help you to understand how the statistical methods will work when you apply them to data.

Question 3

This is the power curve for a two-sided test for the mean of a normal distribution. I derived the beta curve for this test on March 13 (page 27-5 of the lecture notes) and power = 1 - beta, so everything you need is there. I suggest that you program the formula in a spreadsheet. Use the function NORMSDIST() to get the standard normal probability integral, denoted by Φ in the formula. You have μ0 = 3, σ^2 = 3, n = 5 or 50 and z0.975 = 1.960. Pick a convenient grid of values for μ1 (I went from 0 to 6 in steps of 0.2), compute the power at each value of μ1, and plot the curve.

As I mentioned in class on March 20, formula 7.25 for the power of this test on page 217 of Rosner has only one Φ term. The formula we derived in class has two terms. Rosner's version is an approximation that works OK out in either tail, where μ1 is far from μ0 in either direction, because out there only one of the two Φ terms is important while the other is negligible. It will not work near μ0; in particular, it gives α/2 instead of α for the power right at μ0.

What does the power curve tell us about the test? For a given sample size n, it shows how far μ1 has to be from μ0 in order to have a high probability of rejecting the hypothesis that μ = μ0.
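
For anyone who would rather check the spreadsheet against a program, here is a minimal Python sketch of the same calculation. It assumes the standard two-term expression for the power of a two-sided z-test, which should agree with the formula in the lecture notes but is written from the textbook definition rather than copied from them. The function names are illustrative, and power_one_term is a one-term approximation of the kind described above.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function (the role NORMSDIST plays in a spreadsheet)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def power_two_sided(mu1, mu0=3.0, sigma2=3.0, n=5, z_crit=1.960):
    """Power of the two-sided z-test, keeping both tail terms."""
    shift = (mu1 - mu0) * math.sqrt(n) / math.sqrt(sigma2)
    return std_normal_cdf(-z_crit + shift) + std_normal_cdf(-z_crit - shift)

def power_one_term(mu1, mu0=3.0, sigma2=3.0, n=5, z_crit=1.960):
    """One-term approximation: keeps only the dominant tail."""
    shift = abs(mu1 - mu0) * math.sqrt(n) / math.sqrt(sigma2)
    return std_normal_cdf(-z_crit + shift)

grid = [round(0.2 * i, 1) for i in range(31)]  # 0.0, 0.2, ..., 6.0

for n in (5, 50):
    curve = [(m, power_two_sided(m, n=n)) for m in grid]
    print(f"n = {n}")
    for m, p in curve[::5]:  # every fifth grid point, to keep the output short
        print(f"  mu1 = {m:.1f}  power = {p:.3f}")

print("at mu0, two terms:", round(power_two_sided(3.0), 3))
print("at mu0, one term: ", round(power_one_term(3.0), 3))
```

The last two lines show the point made above: right at the hypothesized mean the two-term formula returns the significance level, while the one-term version returns only half of it.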

Question 4

This question is from last year's final examination. It can easily be done with your pocket calculator (if you remember the formula for the Poisson pdf) or by using the Poisson probabilities in Table 2.

First, plot a Poisson distribution with μ = 3. The probabilities you need are found at the top of page 640. Plotting the probabilities for k = 0, 1, 2, ..., 11 as a bar chart gives the graph. The left-tail rejection region with α as close as possible to 5% is obviously just the one point, k = 0. So the power curve is P[ k=0 | μ1] plotted against μ1 over a suitable grid of μ1 values. If you use Table 2, you can get μ1 = 0.5, 1.0, ..., and going up to 4 will show the curve nicely. It is just as easy with a pocket calculator if you remember that for the Poisson distribution P[ k=0 | μ1] = exp(- μ1). The only advantage of using a spreadsheet here is that it can draw the graph for you.
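
The same numbers can be produced with a short Python sketch, offered here only as a cross-check on the table or calculator. The names poisson_pmf and MU0 are illustrative assumptions.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

MU0 = 3.0  # mean under the hypothesis being tested

# Size of the two smallest left-tail rejection regions under mu = 3.
size_k0 = poisson_pmf(0, MU0)              # reject only when k = 0
size_k01 = size_k0 + poisson_pmf(1, MU0)   # reject when k = 0 or k = 1
print(f"P[k = 0  | mu = 3] = {size_k0:.4f}")
print(f"P[k <= 1 | mu = 3] = {size_k01:.4f}")

# Power curve for the region {k = 0}: P[k = 0 | mu1] = exp(-mu1).
grid = [0.5 * i for i in range(1, 9)]  # 0.5, 1.0, ..., 4.0
for mu1 in grid:
    print(f"mu1 = {mu1:.1f}  power = {math.exp(-mu1):.4f}")
```

The first two printed lines show why the region is the single point k = 0: that region has size just under 5%, while adding k = 1 pushes it to nearly 20%.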

Question 5

Spreadsheets don't have a function to calculate the pdf of the F distribution, so it is easiest to use Minitab, then paste the results into a spreadsheet to draw the graph. The F distribution with 1 degree of freedom in the numerator, like the chi-square distribution on 1 degree of freedom, is infinite at the origin, so compute the pdf over a grid that starts just above 0. I went from 0.001 to 2.001 in steps of 0.02, but this grid is finer than I really needed.

MTB > set c1
DATA> 0.001:2.001/0.02
DATA> end
MTB > pdf c1 c2;
SUBC> f 1 6.
MTB > pdf c1 c3;
SUBC> f 4 6.

After you copy C1-C3 to a spreadsheet, you can set up the graph axes and labels to match the graph in the book. Your graph will be accurate. How well drawn was the one in the book?
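
If Minitab is not to hand, the same two columns can be computed directly from the formula for the F density. The Python sketch below is an alternative under that assumption, not a replacement for the Minitab commands above. It uses only the standard library, and the function name f_pdf is illustrative.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 and d2 degrees of freedom, for x > 0."""
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)

# Same grid as the Minitab session: 0.001 to 2.001 in steps of 0.02.
grid = [0.001 + 0.02 * i for i in range(101)]
f_1_6 = [f_pdf(x, 1, 6) for x in grid]
f_4_6 = [f_pdf(x, 4, 6) for x in grid]

# Print every 25th row; the full lists can be written out for plotting.
for x, a, b in list(zip(grid, f_1_6, f_4_6))[::25]:
    print(f"x = {x:.3f}  F(1,6) pdf = {a:.4f}  F(4,6) pdf = {b:.4f}")
```

Working on the log scale with lgamma avoids overflow in the gamma functions and keeps the value near the origin, where the F(1,6) density is very large, well behaved.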


"What formulas do I need for Question 2 of Assignment #3?"

1998-03-21

First, you have to show that Var(x_bar) = σ^2 for perfectly correlated data. The formula for Var(L) from the February 25 class will give you that, after you work out what σ_ij is when σ_i^2 = σ_j^2 = σ^2 and ρ_ij = 1, for all i and j.

Going back to the March 4 class, we have the identity

Σ (x_i - x_bar)^2 = Σ (x_i - μ)^2 - n (x_bar - μ)^2

Taking expected values of both sides and remembering the definitions of variance, etc., gives

E[Σ (x_i - x_bar)^2] = n σ^2 - n Var(x_bar)

and so, remembering the definition of s^2,

(n-1) E[s^2] = n σ^2 - n Var(x_bar)

For independent data, Var(x_bar) = σ^2/n, giving E[s^2] = σ^2, a result we derived in class. For perfectly correlated data, Var(x_bar) = σ^2 and hence E[s^2] = 0.
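
A quick numerical check can make the identity and the perfectly correlated case more concrete. The Python sketch below is an illustration only: the mean, standard deviation, sample size and seed are hypothetical values chosen for the example, and the perfectly correlated sample is built on the assumption that the observations share a common mean as well as a common variance.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
mu, sigma, n = 10.0, 2.0, 8  # hypothetical values, for illustration only

def sample_variance(x):
    """Usual sample variance with divisor n - 1."""
    x_bar = sum(x) / len(x)
    return sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)

# Independent observations: check the sum-of-squares identity numerically.
x = [random.gauss(mu, sigma) for _ in range(n)]
x_bar = sum(x) / n
lhs = sum((xi - x_bar) ** 2 for xi in x)
rhs = sum((xi - mu) ** 2 for xi in x) - n * (x_bar - mu) ** 2
print("identity holds:", abs(lhs - rhs) < 1e-9)

# Perfectly correlated observations with a common mean and variance all
# share one deviation from mu, so every value in the sample is the same.
shared = random.gauss(mu, sigma)
x_corr = [shared] * n
print("sample variance, perfectly correlated:", sample_variance(x_corr))
```

Every perfectly correlated sample has a sample variance of zero (up to rounding), which is the simulation counterpart of the expected value being 0.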


"How do I get F values with large degrees of freedom from Table 9?"

1998-03-12

[Note: I have used ∞ for the "infinity" symbol in this note; it may not display correctly on some platforms but that is what it is supposed to be.]

Take F67,35,0.975 as an example. The closest values in Table 9 are:

F24,30,0.975 = 2.14

F∞,30,0.975 = 1.79

F24,40,0.975 = 2.01

F∞,40,0.975 = 1.64

As a quick but crude approximation, you could note that all four values are pretty close to each other so the simple average, which is 1.90, won't be too far off. Under test and examination conditions, this will be good enough.

A more exact answer comes from linear interpolation. First, take the two values with 24 numerator degrees of freedom and interpolate between 30 and 40 denominator degrees of freedom to get the value for 35 denominator degrees of freedom:

F24,35,0.975 = 2.14 + [(35 - 30)/(40 - 30)]*(2.01 - 2.14) = 2.075

Repeating for ∞ numerator degrees of freedom gives:

F∞,35,0.975 = 1.79 + [(35 - 30)/(40 - 30)]*(1.64 - 1.79) = 1.715

Finally, we have to interpolate these values between 24 and ∞ numerator degrees of freedom but the ∞ is a problem. The solution is to interpolate the reciprocal of degrees of freedom between 1/24 and 0, since 1/∞ = 0.

F67,35,0.975 = 2.075 + {[(1/67) - (1/24)]/[0 - (1/24)]}*(1.715 - 2.075) = 1.844
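
The interpolation steps above can be sketched in a few lines of Python, which is handy for repeating the calculation with other degrees of freedom. The helper name lerp and the variable names are assumptions for the example; the four starting values are the table entries quoted above.

```python
def lerp(x, x0, x1, y0, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

# Upper 2.5% points copied from the four table entries quoted above.
f_24_30, f_24_40 = 2.14, 2.01
f_inf_30, f_inf_40 = 1.79, 1.64

# Step 1: interpolate on denominator degrees of freedom, 30 to 40, at 35.
f_24_35 = lerp(35, 30, 40, f_24_30, f_24_40)
f_inf_35 = lerp(35, 30, 40, f_inf_30, f_inf_40)

# Step 2: interpolate on the reciprocal of the numerator degrees of
# freedom, between 1/24 and 0 (treating 1/infinity as 0), at 1/67.
f_67_35 = lerp(1 / 67, 1 / 24, 0.0, f_24_35, f_inf_35)

print(round(f_24_35, 3), round(f_inf_35, 3), round(f_67_35, 3))
```

Interpolating on the reciprocal scale is what lets the infinite degrees of freedom enter as an ordinary end point of the interval.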

That was a lot of bother. It is much easier to use Minitab and get the exact value, which is 1.8459.

 


"What material should I study for Test #2?"

1998-03-06

Linear Combinations of Random Variables, Normal Approximation to the Binomial and Poisson Distributions
    Chapter 5, especially 5.6 - 5.8
Estimation
    All of Chapter 6 except 6.7.3, 6.8, 6.9
Hypothesis Testing: One-Sample Inference
    All of Chapter 7
Hypothesis Testing: Two-Sample Inference
    All of Chapter 8 except 8.8 (Case Study)

You should also review the material covered in Test #1. There is a lot of material here, but once you understand the patterns you will see the same ideas repeated over and over again.


"How do I get a bar chart and a line chart on the same graph?"

1998-03-06

This would have been useful for Q1 on A01.

In Excel, choose a "Line-Bar" chart from the "Combination Chart" selection.

In Quattro Pro, from the Graphics menu, the "Type" selection box has radio buttons down the left side; the default is "2-D" but further down is "Combo". When you click on this, a selection of combination charts appears, including "Line-Bar".

In ClarisWorks, the "Series" dialog box lets you select a different type of Display for each series on the chart.


"In Quattro Pro, how do I get a line chart without the points marked on it?"

1998-03-06

A line chart connects a series of points, and by default each point appears as a marker on the line. To eliminate the markers, select the chart by clicking on it and go to the "Properties" menu at the right of the menu bar under the toolbar (not the main menu bar). Select "Series", then select "Marker Style". Uncheck "AutoSize" and set the "Weight" to 0; you can do this by typing a number 0 in the box, or by sliding the scale first to the right, then left to 0.


Back to the Statistics 2MA3 Home Page