## Statistics 2MA3 - Assignment #3

### 2002-03-25 - Q10 added 2002-03-26

#### Q1

Suppose you want to find a two-sided 99% confidence interval for the variance of a normal population. How many observations are needed to ensure that the upper limit is no more than 4 times the lower limit? Will this calculation be valid for non-normal data?
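
One way to set up the search in R, offered as a sketch rather than a model answer. For a normal sample, the ratio of the upper to the lower limit of the interval equals the ratio of the upper and lower chi-square quantiles on n - 1 degrees of freedom, so it does not depend on the sample variance. The helper name `ci_ratio` and the search range 2 to 200 are assumptions made for illustration.

```r
# Ratio of upper to lower limit of a two-sided CI for a normal variance.
# The limits are (n-1)s^2/qchisq(1-alpha/2, n-1) and (n-1)s^2/qchisq(alpha/2, n-1).
ci_ratio <- function(n, conf = 0.99) {
  alpha <- 1 - conf
  qchisq(1 - alpha / 2, n - 1) / qchisq(alpha / 2, n - 1)
}

n <- 2:200                           # assumed search range
ratio <- ci_ratio(n)                 # qchisq() is vectorized over df
n_min <- n[which(ratio <= 4)[1]]     # smallest n whose ratio is at most 4
n_min
```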

#### Q2

If your time to run a marathon has a mean of 4 hr 10 min with a standard deviation of 15 min, and your friend's time has a mean of 3 hr 55 min with a standard deviation of 10 min, and you both run in the same race, what is the probability that you will finish ahead of your friend? State any assumptions you make and discuss their validity in this example.
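
A minimal sketch in R, assuming both times are normally distributed and independent of each other (both assumptions are yours to state and criticize). Times are converted to minutes; the variable names are illustrative only.

```r
# Times in minutes; normality and independence are ASSUMED here.
mean_you    <- 4 * 60 + 10   # 250 min
sd_you      <- 15
mean_friend <- 3 * 60 + 55   # 235 min
sd_friend   <- 10

# D = (your time) - (friend's time) is normal under the assumptions,
# with mean equal to the difference in means and variance equal to the sum.
mean_d <- mean_you - mean_friend
sd_d   <- sqrt(sd_you^2 + sd_friend^2)

# You finish ahead when D < 0.
p_ahead <- pnorm(0, mean = mean_d, sd = sd_d)
p_ahead
```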

#### Q3

Samples of four different brands of diet margarine were analysed to determine the percentage of physiologically active polyunsaturated fatty acids (PAPFUA). Give an appropriate analysis, including an ANOVA table and a graph. Give a 95% confidence interval for the residual variance. State any assumptions you make and do what you can to test the assumptions. State your conclusions.

```
Brand           PAPFUA (%)
Imperial:       14.1  13.6  14.4  14.3
Parkay:         12.8  12.5  13.4  13.0  12.3
Mazola:         16.8  17.2  16.4  17.3  18.0
Fleischmann's:  18.1  17.2  18.7  18.4
```
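
The following R sketch shows one possible way to enter the data and obtain the pieces asked for. The object names (`papfua`, `brand`, `fit`) are illustrative, and the choice of `bartlett.test()` and `shapiro.test()` as assumption checks is only one option among several.

```r
papfua <- c(14.1, 13.6, 14.4, 14.3,
            12.8, 12.5, 13.4, 13.0, 12.3,
            16.8, 17.2, 16.4, 17.3, 18.0,
            18.1, 17.2, 18.7, 18.4)
brand <- factor(rep(c("Imperial", "Parkay", "Mazola", "Fleischmann's"),
                    times = c(4, 5, 5, 4)))

fit <- aov(papfua ~ brand)
summary(fit)                      # one-way ANOVA table
boxplot(papfua ~ brand)           # one choice of graph

# 95% CI for the residual variance, based on the chi-square distribution
df_res <- df.residual(fit)
ms_res <- deviance(fit) / df_res  # residual mean square
ci95   <- df_res * ms_res / qchisq(c(0.975, 0.025), df_res)
ci95

# Possible checks of the assumptions
bartlett.test(papfua ~ brand)     # homogeneity of variance
shapiro.test(residuals(fit))      # normality of residuals
```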

#### Q4

(a) Two types of fish attractors, one made from clay pipes and the other from cement blocks and brush, were used over 8 consecutive time periods spanning two years at Lake Tohopekaliga, Florida. The data give fish caught per fishing day in each time period.

```
period    1    2    3    4    5    6    7    8
pipe   6.64 7.89 1.83 0.42 0.85 0.29 0.57 0.63
brush  9.73 8.21 2.17 0.75 1.61 0.75 0.83 0.56
```

Is there evidence that one attractor is more effective than the other? Which is the most appropriate analysis: a two-sample t-test, a paired t-test, or a paired t-test on log-transformed data? Give a p-value. State your assumptions and your conclusions.

(c) Suppose that you are planning a larger study of the same attractors in the same lake. If one attractor attracts 50% more fish than the other, how many time periods would you need to ensure that a 5% test has a power of 99%?
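
A sketch of how the candidate analyses and the sample-size calculation might be run in R. It assumes that "50% more fish" is read as a ratio of 1.5, that is a difference of log(1.5) on the log scale, and that the standard deviation of the log differences in these 8 periods is a reasonable planning value; both are assumptions, not part of the question.

```r
pipe  <- c(6.64, 7.89, 1.83, 0.42, 0.85, 0.29, 0.57, 0.63)
brush <- c(9.73, 8.21, 2.17, 0.75, 1.61, 0.75, 0.83, 0.56)

# The three candidate analyses named in the question
two_sample <- t.test(pipe, brush)
paired_raw <- t.test(pipe, brush, paired = TRUE)
paired_log <- t.test(log(pipe), log(brush), paired = TRUE)
c(two_sample$p.value, paired_raw$p.value, paired_log$p.value)

# Planning a larger study on the log scale (ASSUMED planning values)
d_log <- log(brush) - log(pipe)   # within-period log differences
plan  <- power.t.test(delta = log(1.5), sd = sd(d_log),
                      sig.level = 0.05, power = 0.99, type = "paired")
ceiling(plan$n)                   # number of time periods (pairs)
```

With `type = "paired"`, `power.t.test()` treats `sd` as the standard deviation of the within-pair differences, which is why `sd(d_log)` is passed.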

#### Q5

A camera was developed to determine the gray level over the lens of the human eye. Twelve patients were randomly selected, six with normal eyes and another six with cataractous eyes. One eye was tested on each patient. The data show the gray level on a scale of 1 (black) to 256 (white).

```
patient        1   2   3   4   5   6
cataractous  161 140 136 171 106 149
normal       158 182 185 145 167 177
```

(a) Display the data on an appropriate graph. Is there evidence of a significant difference between the two groups? Give a p-value, state your assumptions and your conclusions, and do what you can to assess the validity of your assumptions.

(b) How useful would this be as a screening test for cataracts? Plot a ROC curve. Choose a suitable cutoff. At this cutoff, what would the predictive value positive be if the prevalence of cataracts were 12% in the population being screened?
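
As a hedged sketch, the comparison and the ROC calculation could be set up in R as below. It assumes that a low gray level counts as a positive test, and the cutoff of 149 in the last line is purely illustrative; choosing and defending a cutoff is part of the question. The function name `ppv` is hypothetical.

```r
cataractous <- c(161, 140, 136, 171, 106, 149)
normal      <- c(158, 182, 185, 145, 167, 177)

stripchart(list(cataractous = cataractous, normal = normal))  # one possible graph
t.test(cataractous, normal, var.equal = TRUE)   # assumes normality, equal variances
wilcox.test(cataractous, normal)                # rank-based alternative

# ROC: call the test positive when gray level <= cutoff (ASSUMED direction)
cutoffs <- sort(unique(c(cataractous, normal)))
sens <- sapply(cutoffs, function(k) mean(cataractous <= k))
spec <- sapply(cutoffs, function(k) mean(normal > k))
plot(c(0, 1 - spec), c(0, sens), type = "b",
     xlab = "1 - specificity", ylab = "sensitivity")

# Predictive value positive from Bayes' rule
ppv <- function(sens, spec, prev) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
i <- which(cutoffs == 149)          # illustrative cutoff only
ppv(sens[i], spec[i], prev = 0.12)
```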

#### Q6

The following data are from a 1979 paper classifying 445 college students according to their level of marijuana use, and their parents' use of alcohol and psychoactive drugs. Do the data suggest that parental usage and student usage are independent in the population from which the sample was drawn? State your assumptions and your conclusions.

```
                        Student level of marijuana use
                          never  occasional  regular
Parental         neither    141          54       40
use of alcohol   one         68          44       51
& drugs          both        17          11       19
```
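
One way to enter the table and run a chi-square test of independence in R, given as a sketch; the object name `use` is illustrative.

```r
use <- matrix(c(141, 54, 40,
                 68, 44, 51,
                 17, 11, 19),
              nrow = 3, byrow = TRUE,
              dimnames = list(parental = c("neither", "one", "both"),
                              student  = c("never", "occasional", "regular")))
sum(use)                 # should match the 445 students in the study

res <- chisq.test(use)   # Pearson chi-square test of independence
res
res$expected             # expected counts, to check the large-sample approximation
```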

#### Q7

Analyze the following data from a study to determine the effect of different plate materials and ambient temperature on the effective life (in hours) of a storage battery. There were 3 replicates at each combination of material and temperature. Give a 99% confidence interval for the residual variance. State your assumptions and your conclusions.

```
                               Ambient temperature
                         low           medium             high
Plate material  1   130,  74, 155     34,  80,  40     20,  82,  70
                2   150, 159, 188    136, 106, 122     25,  58,  70
```
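
A possible R setup for a two-way analysis with interaction, as a sketch only. The factor names `material` and `temperature` are illustrative, and treating the design as a two-factor fixed-effects layout with interaction is an assumption you should justify.

```r
life <- c(130,  74, 155,   34,  80,  40,   20, 82, 70,    # material 1
          150, 159, 188,  136, 106, 122,   25, 58, 70)    # material 2
material    <- factor(rep(1:2, each = 9))
temperature <- factor(rep(rep(c("low", "medium", "high"), each = 3), times = 2),
                      levels = c("low", "medium", "high"))

fit <- aov(life ~ material * temperature)
summary(fit)                                  # two-way ANOVA table
interaction.plot(temperature, material, life) # cell means by temperature

# 99% CI for the residual variance
df_res <- df.residual(fit)
ms_res <- deviance(fit) / df_res
ci99   <- df_res * ms_res / qchisq(c(0.995, 0.005), df_res)
ci99
```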

#### Q8

A study of nitrogen emissions from power boilers reported x = burner area liberation rate (in Mbtu/hr-ft²) and y = NOx emission rate (in ppm) for 14 boilers.

```
x 100  125  125  150  150  200  200  250  250  300  300  350  400  400
y 150  140  180  210  190  320  280  400  430  440  390  600  610  670
```

(a) Fit a straight line to the data by least squares, with NOx emission rate as the dependent variable. Plot the data and the fitted line on a graph. Can NOx emission rate be predicted as a linear function of burner area liberation rate? Present your analysis in an ANOVA table with F-Tests for non-linearity and for the slope of the regression line. Give a 95% confidence interval for the residual variance. State your assumptions and your conclusions.

(b) Predict the NOx emission rate at burner area liberation rates 10 and 325. How reliable do you think your predictions are?
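
The following R sketch shows one way the fit, the non-linearity test and the predictions might be obtained. The non-linearity test here compares the straight line with a model that fits a separate mean at each distinct x, which is possible because several x values are replicated; the object names are illustrative.

```r
x <- c(100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400)
y <- c(150, 140, 180, 210, 190, 320, 280, 400, 430, 440, 390, 600, 610, 670)

line_fit  <- lm(y ~ x)            # straight line
means_fit <- lm(y ~ factor(x))    # separate mean at each distinct x

plot(x, y)
abline(line_fit)

anova(line_fit)                   # F-test for the slope
anova(line_fit, means_fit)        # F-test for non-linearity (lack of fit)

# 95% CI for the residual variance of the straight-line fit
df_res <- df.residual(line_fit)
ci95   <- deviance(line_fit) / qchisq(c(0.975, 0.025), df_res)

# Predictions; x = 10 lies far outside the observed range of x
pred <- predict(line_fit, newdata = data.frame(x = c(10, 325)),
                interval = "prediction")
pred
```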

#### Q9

(a) In a genetics study, you plan to observe 20 F1 offspring. Your theory says that the probability of getting a wild-eyed female is 1/4 but you are concerned that the actual probability may be greater than 1/4. Using the exact binomial distribution, find a critical region such that the type I error rate is as close as possible to 5%. What is the exact significance level of the test? What is the probability of a type II error if the true probability of a wild-eyed female is 1/2? If the true probability of a wild-eyed female is 3/4?

(b) Suppose you do the study and find that 10 out of the 20 F1 offspring are wild-eyed females. Apply the test you derived in (a). State your conclusions. Compute a P-value and state your conclusions in terms of a P-value.

(c) Repeat (b) using the normal approximation to the binomial (with continuity correction).
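
A sketch of the exact and approximate calculations in R. It assumes a one-sided critical region of the form "reject if X >= c" and reads "as close as possible to 5%" as minimizing the absolute difference from 0.05; other readings, such as not exceeding 5%, are possible. Variable names are illustrative.

```r
n  <- 20
p0 <- 1 / 4
k  <- 0:n

# P(X >= k) under the null hypothesis, for every possible cutoff k
upper_tail <- pbinom(k - 1, n, p0, lower.tail = FALSE)
crit  <- k[which.min(abs(upper_tail - 0.05))]  # reject when X >= crit
alpha <- upper_tail[crit + 1]                  # exact significance level

# Type II error: P(X < crit) under each alternative
beta_half      <- pbinom(crit - 1, n, 1 / 2)
beta_three_qtr <- pbinom(crit - 1, n, 3 / 4)

# Observed X = 10: exact P-value and normal approximation
# with continuity correction
p_exact <- pbinom(10 - 1, n, p0, lower.tail = FALSE)
z       <- (10 - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
p_norm  <- pnorm(z, lower.tail = FALSE)

c(crit = crit, alpha = alpha, beta_half = beta_half,
  beta_three_qtr = beta_three_qtr, p_exact = p_exact, p_norm = p_norm)
```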

#### Q10

In Assignment #2, Chapter 4 Question C you looked at data giving the number of leaves on 53 plants and you computed the variance-mean ratio to test whether it was overdispersed. Because you didn't know the sampling distribution of the variance-mean ratio, you could not do a test of significance or compute a P-value. Try the same simulation method you used to study the sampling distribution of the sample mean empirically in Assignment #1 Part C Question 2. Generate 1000 samples, each with 53 independent observations from a Poisson distribution with the same mean as the leaf data. Compute the variance-mean ratio for each sample, and use functions like max(), sort() and quantile() to see where the variance-mean ratio for the leaf data lies in this distribution. What would you give as a P-value in this case? Note that it is convenient to define a function vmratio() to compute the ratio.

```
> vmratio <- function(x) var(x)/mean(x)
> vmratio(nleaf)
[1] 2.269805
> nleafdat <- apply(matrix(rpois(53*1000,mean(nleaf)), ncol=1000), 2, vmratio)
```
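
The remaining steps might be wrapped up as in the sketch below. The leaf data from Assignment #2 are not reproduced here, so `fake_leaf` is a made-up placeholder sample used only to make the code runnable; substitute `nleaf` in practice. The helper `vm_pvalue()` is hypothetical, and the (1 + count)/(nsim + 1) form of the simulation P-value is one common convention, assumed here rather than prescribed.

```r
vmratio <- function(x) var(x) / mean(x)

# Simulation P-value for overdispersion relative to a Poisson distribution
vm_pvalue <- function(x, nsim = 1000) {
  obs  <- vmratio(x)
  sims <- apply(matrix(rpois(length(x) * nsim, mean(x)), ncol = nsim),
                2, vmratio)
  list(observed  = obs,
       simulated = sims,
       p.value   = (1 + sum(sims >= obs)) / (nsim + 1))
}

set.seed(1)
fake_leaf <- rnbinom(53, size = 3, mu = 6)   # PLACEHOLDER data, not the leaf data
out <- vm_pvalue(fake_leaf)

out$observed
max(out$simulated)
quantile(out$simulated, c(0.95, 0.99))
out$p.value
```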