Statistics 2MA3 - Assignment #1

2001-01-21

Due: 2001-01-30 09:00

A. Probability Calculations

Do the following problems from Rosner, Fundamentals of Biostatistics, 5th Edition.

Problems 4.39 - 4.43 on p. 110
Problems 4.57 - 4.62 on p. 112
Problems 4.78 - 4.80 on p. 114
Problems 5.27 - 5.30 on p. 148

B. Exploratory Data Analysis

Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.

1. Niagara River Pollution Case Study

Find the Niagara River Pollution Case Study archived at http://www.ssc.ca/Documents/Case%20Studies/1999/E-niagara.html. Extract the "PCB in Solids" readings at Fort Erie. Study the time sequence graphically, looking for the following features: trend, cyclic effects, change-points, autocorrelation. Are the "detection limits" a problem? Does a log transformation make the data easier to interpret?
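As a starting point, the graphical checks listed above might be sketched as follows. The vector pcb here is simulated and purely a placeholder for the Fort Erie readings; how you extract and name the real series is up to you.

```r
# Placeholder series: simulated, NOT the real Fort Erie data.
# Replace pcb with the "PCB in Solids" readings you extracted.
set.seed(1)
pcb <- exp(rnorm(120, mean = 1, sd = 0.5))   # positive and right-skewed

op <- par(mfrow = c(3, 1))
# Raw time sequence: look for trend, cycles and change-points.
plot(pcb, type = "b", xlab = "Observation number", ylab = "PCB")
# Same sequence on the log scale.
plot(log(pcb), type = "b", xlab = "Observation number", ylab = "log(PCB)")
# Sample autocorrelation function; if the series contains NA values,
# acf() needs na.action = na.pass.
acf(log(pcb), main = "Autocorrelation of log(PCB)")
par(op)
```

Comparing the first two panels is one way to judge whether the log transformation helps.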

2. Effect of exposure to lead on child development

The data set LEAD on the data disk accompanying Rosner, Fundamentals of Biostatistics, 5th Edition, is described in the file LEAD.DOC. I think the easiest way to import it into R is to open the Excel file LEAD.XLS, save the worksheet as an ASCII file, then import it with read.table().
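A minimal sketch of the import step, assuming the worksheet was saved as tab-delimited text with the variable names in the first line. The temporary file and its three columns are made up for illustration and stand in for your exported file.

```r
# Write a small tab-delimited stand-in for the exported worksheet
# (hypothetical contents, not the real LEAD data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Id\tLead.type\tHyperact",
             "101\t1\t0",
             "102\t2\t99"), tmp)

# header = TRUE takes the variable names from the first line;
# sep = "\t" assumes a tab-delimited export.
lead <- read.table(tmp, header = TRUE, sep = "\t")
str(lead)
```

With your own file, replace tmp by the path to the saved ASCII file.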

I noticed two things to watch out for in this file: variable names with underscores (why is this a problem?) and the missing-value code. You could edit the variable names in Excel, changing Lead_type to Lead.type, etc., before importing into R. Although it isn't entirely clear in LEAD.DOC, the missing-value code is 99. You could use the argument na.strings = "99" in read.table() to change it to NA throughout, but unfortunately some IQ scores equal 99 and they would be changed to NA as well. Instead, you could edit the missing-value codes one variable at a time in R. For example, if you called the data frame lead and want to change the missing-value code 99 to NA in the variable Hyperact, you could use

lead$Hyperact <- ifelse(lead$Hyperact == 99, NA, lead$Hyperact)

Note the use of == for a logical test of equality. Note how conveniently R operates on the whole column in a single command.
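An equivalent way to do the same recoding uses logical indexing. The toy data frame below is made up to show that only the chosen variable is touched; the column name Iqf for full-scale IQ is an assumption, so check LEAD.DOC for the actual name.

```r
# Toy data frame (hypothetical values; Iqf is an assumed column name).
lead <- data.frame(Iqf = c(99, 85, 101), Hyperact = c(0, 99, 2))

# Recode 99 to NA in Hyperact only; the genuine IQ score of 99 is kept.
lead$Hyperact[lead$Hyperact == 99] <- NA

lead
```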

Draw a comparative box plot to compare full-scale IQ between the control group, the currently exposed group and the previously exposed group. What can you conclude?

Draw a comparative box plot to compare full-scale IQ between those with no hyperactivity recorded (Hyperact == 0) and the others. What can you conclude?
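One possible way to draw the two comparative box plots uses the formula interface of boxplot(). The data frame here is simulated, and the names Group and Iqf and the group labels are assumptions for illustration only.

```r
# Simulated stand-in for the LEAD data (hypothetical names and values).
set.seed(2)
lead <- data.frame(
  Group    = factor(rep(c("control", "current", "previous"), each = 20)),
  Iqf      = round(rnorm(60, mean = 95, sd = 12)),
  Hyperact = sample(c(0, 1, 2, NA), 60, replace = TRUE)
)

# Full-scale IQ by exposure group.
boxplot(Iqf ~ Group, data = lead, ylab = "Full-scale IQ")

# Two-level factor: no hyperactivity recorded versus the others.
# Rows with missing Hyperact get NA and are dropped by boxplot().
lead$NoHyper <- factor(lead$Hyperact == 0,
                       levels = c(TRUE, FALSE),
                       labels = c("none recorded", "some recorded"))
boxplot(Iqf ~ NoHyper, data = lead, ylab = "Full-scale IQ")
```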

3. Diet Record versus Food Frequency Questionnaire

The data set VALID is described in VALID.DOC and explained in more detail under "Nutrition" on p. 42 of the text.

For each of the four nutrition indicators (Saturated fat, Total fat, Alcohol consumption, Total calories), plot the Diet Record (DR) values against the Food Frequency Questionnaire (FFQ) values. Is FFQ a good substitute for DR?

For the DR values only, draw a pairs() plot of the four nutrition indicators. What does it show?
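The two kinds of plot could be sketched like this. The data frame valid and its column names (sfat.dr, sfat.ffq, and so on) are simulated placeholders; take the real names from VALID.DOC.

```r
# Simulated stand-in for the VALID data (hypothetical names and values).
set.seed(3)
n <- 50
valid <- data.frame(
  sfat.dr = rnorm(n, 25, 6),
  tfat.dr = rnorm(n, 70, 15),
  alco.dr = rexp(n, 1 / 8),
  cal.dr  = rnorm(n, 1800, 300)
)
valid$sfat.ffq <- valid$sfat.dr + rnorm(n, 0, 5)

# DR against FFQ for one indicator, with the line of perfect agreement.
plot(valid$sfat.ffq, valid$sfat.dr,
     xlab = "Saturated fat (FFQ)", ylab = "Saturated fat (DR)")
abline(0, 1, lty = 2)

# Scatterplot matrix of the four DR indicators.
pairs(valid[, c("sfat.dr", "tfat.dr", "alco.dr", "cal.dr")])
```

Points scattered tightly around the dashed line would suggest that FFQ tracks DR well for that indicator.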

C. Simulation Exercises

Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.

1. When do normal data look normal?

In Exercise #2 you drew histograms and density estimates for samples of different sizes from the standard normal distribution and the chi-square distribution on 1 degree of freedom. How large does the sample size have to be for the data to give a reliable indication of the shape of the underlying distribution?
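One way to organize the comparison is a small helper that draws a histogram and density estimate for each of several sample sizes. The function show_shapes and the particular sizes are illustrative choices, not part of Exercise #2.

```r
# Hypothetical helper: rgen is a function of n returning n random values.
show_shapes <- function(rgen, label, sizes = c(10, 50, 200, 1000)) {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))                      # restore the plot layout on exit
  for (n in sizes) {
    x <- rgen(n)
    hist(x, freq = FALSE, main = paste(label, "n =", n))
    lines(density(x))                   # overlay the density estimate
  }
  invisible(sizes)
}

set.seed(4)
show_shapes(rnorm, "Normal")
show_shapes(function(n) rchisq(n, df = 1), "Chi-square(1)")
```

Rerunning with a different seed gives a feel for how much the pictures vary from sample to sample at each size.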

2. The sampling distribution of the sample mean

Generate 200 samples, each with n = 4 independent observations, from the standard normal distribution, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 5.10 on p. 135 of the text.

A clever way to do this in R is to generate 800 independent standard normal observations and arrange them in a matrix with 4 rows and 200 columns, then use apply() to compute the 200 column means. (The modification of the code to generate samples of size n = 100 is obvious.)

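# 800 standard normal values arranged as 4 rows by 200 columns:
# each column is one sample of size n = 4
simdat <- matrix(rnorm(800), ncol = 200)
# mean of each column (MARGIN = 2), giving 200 sample means
xbars <- apply(simdat, 2, mean)
hist(xbars)
mean(xbars)
var(xbars)
# standard deviation; sd(xbars) gives the same value
sqrt(var(xbars))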
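To avoid editing the numbers by hand for each sample size, the same idea can be wrapped in a function. This is only a sketch: sample_means and its arguments are names introduced here. For independent standard normal observations the variance of the sample mean is 1/n, which gives a value to compare the simulation against.

```r
# Hypothetical helper: nsamp samples of size n, one sample per column.
sample_means <- function(n, nsamp = 200, rgen = rnorm) {
  simdat <- matrix(rgen(n * nsamp), nrow = n)
  colMeans(simdat)                      # one mean per column
}

set.seed(5)
xbars4   <- sample_means(4)
xbars100 <- sample_means(100)

# Simulated variance of the sample means next to the theoretical 1/n.
c(simulated = var(xbars4),   theory = 1 / 4)
c(simulated = var(xbars100), theory = 1 / 100)
```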

Generate 200 samples, each with n = 4 independent observations, from the chi-square distribution on 1 degree of freedom, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 6.3 on p. 174 of the text.
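For the chi-square case, a minimal sketch only needs rnorm() swapped for rchisq() with df = 1. The chi-square distribution on 1 degree of freedom has mean 1 and variance 2, so the sample means should centre near 1 with variance near 2/n.

```r
set.seed(6)
# 800 chi-square(1) values, 4 rows by 200 columns: 200 samples of size 4.
simdat <- matrix(rchisq(800, df = 1), ncol = 200)
xbars <- colMeans(simdat)

hist(xbars)
c(mean = mean(xbars), var = var(xbars), sd = sd(xbars))
```

For n = 100, generate 20000 values and keep ncol = 200; comparing the two histograms shows how the shape changes with n.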