S2MA3 Assignment #1

Statistics 2MA3 - Assignment #1

2002-01-20

Due: 2002-01-30 09:00

A. Probability Calculations

Do the following problems from Rosner, Fundamentals of Biostatistics, 5th Edition.

Problems 3.108 - 3.111 on p. 74
Problems 3.112 - 3.115 on p. 75
- Hint for 3.113: use the binomial distribution to compute the probability of observing 100 or more cases in 1000 men if the probability of a case is 8%.
Problems 4.34 - 4.35 on p. 110
Problems 5.41 - 5.45 on p. 149
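
The hint for 3.113 can be sketched in R as follows. This is only a minimal sketch, assuming the number of cases X is Binomial(1000, 0.08); the names n, p, p.upper and p.check are illustrative. Since pbinom(q, ...) returns P(X <= q), the probability of 100 or more cases is the upper tail beyond 99.

```r
n <- 1000   # number of men
p <- 0.08   # assumed probability that any one man is a case

# P(X >= 100) = P(X > 99), i.e. the upper tail beyond 99
p.upper <- pbinom(99, size = n, prob = p, lower.tail = FALSE)

# Cross-check by summing the individual binomial probabilities
p.check <- sum(dbinom(100:n, size = n, prob = p))

print(c(p.upper, p.check))
```

A common slip is to write pbinom(100, ...), which would leave out the case X = 100.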

B. Exploratory Data Analysis

Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.

1. Effect of smoking on bone density

The data set BONEDEN on the data disk accompanying Rosner, Fundamentals of Biostatistics, 5th Edition, is described in the file BONEDEN.DOC. An easy way to import it into R is as follows: open the Excel file BONEDEN.XLS and save the worksheet as a tab-delimited text file boneden.txt; edit that file with NotePad to make sure that there are no extraneous lines and that there is a single <CR> after the last line; then import it with read.table(), specifying that there is a header line and that the separator is a tab character. After creating the data frame boneden, check that it is complete (dim(boneden) shows the number of rows and columns, for example) and save your workspace (frequently!) to protect against a system crash.

> boneden <- read.table("boneden.txt", header = TRUE, sep = "\t")
> dim(boneden)
[1] 41 25
> save.image()
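
The checks on an imported file can be tried out on a small example first. The sketch below does not use BONEDEN: it writes a tiny tab-delimited file with made-up values to a temporary location, reads it back with read.table(), and inspects the result. The names tmp and toy and the column names are hypothetical.

```r
# Write a tiny tab-delimited file with a header line (toy values, not BONEDEN)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tage\tx1\tx2",
             "1\t27\t0.76\t0.79",
             "2\t42\t0.78\t0.79",
             "3\t59\t0.67\t0.67"), tmp)

toy <- read.table(tmp, header = TRUE, sep = "\t")

print(dim(toy))              # rows and columns actually read
str(toy)                     # column names and types
print(colSums(is.na(toy)))   # number of missing values in each column
unlink(tmp)                  # remove the temporary file
```

If dim() reports fewer columns than expected, or str() shows a numeric column read as character, the text file probably still contains stray lines or the wrong separator.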

Follow the instructions in 2.37 - 2.45 on p. 43 of Rosner, but with the following modification. He asks for a scatter plot of % difference in bone density, grouped by difference in tobacco use, where difference has been categorized into 5 levels. Instead, give scatter plots like those in Figure 2.12 on p. 38 (using different plotting symbols to distinguish monozygotic and dizygotic twins), and then give comparative box plots to compare the 5 levels. You can use cut() to categorize the continuous variable.
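
A minimal sketch of the two kinds of plot and of cut(), using simulated data rather than BONEDEN. The variable names (pyr.diff, pct.diff, zyg), the zygosity labels, and the break points at 10, 20, 30 and 40 pack-years are assumptions made for illustration; take the actual categories from Rosner's problem statement and the actual codes from BONEDEN.DOC.

```r
set.seed(1)
n.pairs  <- 41
pyr.diff <- runif(n.pairs, 0, 60)                           # difference in pack-years (simulated)
pct.diff <- rnorm(n.pairs, mean = -0.1 * pyr.diff, sd = 5)  # % difference in bone density (simulated)
zyg      <- sample(c("MZ", "DZ"), n.pairs, replace = TRUE)  # zygosity labels (simulated)

# Scatter plot with a different plotting symbol for each type of twin
plot(pyr.diff, pct.diff, pch = ifelse(zyg == "MZ", 1, 16),
     xlab = "Difference in pack-years",
     ylab = "% difference in bone density")
abline(h = 0, lty = 2)
legend("topright", legend = c("MZ", "DZ"), pch = c(1, 16))

# Categorize the continuous variable into 5 levels: [0,10), [10,20), ..., [40,Inf)
pyr.cat <- cut(pyr.diff, breaks = c(0, 10, 20, 30, 40, Inf), right = FALSE)
print(table(pyr.cat))

# Comparative box plots across the 5 levels
boxplot(pct.diff ~ pyr.cat,
        xlab = "Difference in pack-years",
        ylab = "% difference in bone density")
```

Setting right = FALSE makes each interval closed on the left, so a difference of exactly 10 falls in the second category rather than the first.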

Note that the heavier-smoking twin is defined as the one with the higher pack-years and that this is always Twin 2; you can quickly verify this by plotting pyr2 against pyr1 and adding the line of equality. The calculation of C, the % difference in bone density, is illustrated below using ls1 and ls2. If you attach(boneden), you can refer to variables in the data frame without the boneden$ prefix, but you must use the prefix when you create a new variable in the data frame. Since we have added a new variable to the data frame, we have to detach the data frame and attach it again before we can use the new variable without the boneden$ prefix.

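> attach(boneden)
> plot(pyr1, pyr2)    # pack-years of Twin 2 against pack-years of Twin 1
> abline(0, 1)        # line of equality; points above it have pyr2 > pyr1
> boneden$lsc <- 100*(ls2 - ls1)/((ls2 + ls1)/2)    # C computed from ls1 and ls2
> detach(boneden)     # detach and re-attach so that lsc is visible without the prefix
> attach(boneden)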

Explore the data set graphically and report anything else interesting you find.

2. Niagara River Pollution Case Study

Find the Niagara River Pollution Case Study archived at http://www.ssc.ca/Documents/Case%20Studies/1999/E-niagara.html. Extract the "Dieldrin in water" readings at Fort Erie and Niagara-on-the-Lake. Study the two time sequences graphically, looking for the following features: trend, cyclic effects, change-points, autocorrelation. Are the "detection limits" a problem? Does a log transformation make the data easier to interpret? Are there differences between the two stations?
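
The graphical checks listed above might be approached along the following lines. This is a sketch on simulated data, not the Niagara River readings: the series, the names (week, conc, dl, below) and the detection limit of 0.1 are all hypothetical, and the simulated trend, cycle and autocorrelation were put in deliberately so that the plots have something to show.

```r
set.seed(2)
n.obs <- 200
week  <- seq_len(n.obs)

# Simulated log concentrations: downward trend + annual cycle + AR(1) noise
noise    <- as.numeric(arima.sim(list(ar = 0.5), n = n.obs, sd = 0.3))
log.conc <- log(0.4) - 0.003 * week + 0.3 * sin(2 * pi * week / 52) + noise
conc     <- exp(log.conc)

dl    <- 0.1          # hypothetical detection limit
below <- conc < dl    # readings that would be reported only as "< dl"

op <- par(mfrow = c(3, 1))
plot(week, conc, type = "l", main = "Raw scale")
abline(h = dl, lty = 2)                    # mark the detection limit
plot(week, log(conc), type = "l", main = "Log scale")
lines(lowess(week, log(conc)), lwd = 2)    # smooth curve to show the trend
acf(log(conc), main = "Autocorrelation of log readings")
par(op)

print(mean(below))    # proportion of readings below the detection limit
```

With the real data, the same three plots would be drawn for each station, ideally on common axes so that the two stations can be compared directly.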

C. Simulation Exercises

Your response to each question should be in the form of a report. Prepare the report in a word processor, integrating graphics and discussion.

1. When do normal data look normal?

In Exercise #2 you drew histograms and density estimates for samples of different sizes from the standard normal distribution and the chi-square distribution on 1 degree of freedom. How large does the sample size have to be for the data to give a reliable indication of the shape of the underlying distribution?
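
One possible way to organize the comparison is sketched below. The sample sizes 10, 100 and 1000 are an arbitrary choice, not part of the assignment; each panel overlays the density estimate (solid) and the true density (dashed) on the histogram.

```r
set.seed(3)
sizes <- c(10, 100, 1000)    # sample sizes to compare (arbitrary choice)

op <- par(mfrow = c(2, length(sizes)))
for (n in sizes) {
  x <- rnorm(n)                            # sample from the standard normal
  hist(x, freq = FALSE, main = paste("Normal, n =", n))
  lines(density(x))                        # density estimate from the sample
  curve(dnorm(x), add = TRUE, lty = 2)     # true density
}
for (n in sizes) {
  y <- rchisq(n, df = 1)                   # sample from chi-square on 1 df
  hist(y, freq = FALSE, main = paste("Chi-square(1), n =", n))
  lines(density(y))
  curve(dchisq(x, df = 1), add = TRUE, lty = 2)
}
par(op)
```

Running the code several times without set.seed() gives a sense of how much the pictures vary from sample to sample at each n.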

2. The sampling distribution of the sample mean

Generate 200 samples, each with n = 4 independent observations, from the standard normal distribution, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 5.10 on p. 135 of the text.

A clever way to do this in R is to generate 800 independent standard normal observations and arrange them in a matrix with 4 rows and 200 columns, then use apply() to compute the 200 column means. (The modification of the code to generate samples of size n = 100 is obvious.)

> simdat <- matrix(rnorm(800), ncol = 200)
> xbars <- apply(simdat, 2, mean)
> hist(xbars)
> mean(xbars)
> var(xbars)
> sqrt(var(xbars))
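
The same steps can be wrapped in a small function so that both sample sizes are handled by one piece of code. This is a sketch, not the required solution; sim.means is a hypothetical helper name. It relies on the standard result that the mean of n independent observations with variance 1 has variance 1/n, which is printed alongside the simulated value for comparison.

```r
set.seed(4)

# Hypothetical helper: means of n.samples standard normal samples, each of size n
sim.means <- function(n, n.samples = 200) {
  simdat <- matrix(rnorm(n * n.samples), nrow = n)   # one sample per column
  colMeans(simdat)                                   # equivalent to apply(simdat, 2, mean)
}

for (n in c(4, 100)) {
  xbars <- sim.means(n)
  hist(xbars, main = paste("Sample means, n =", n))
  cat("n =", n,
      " mean =", mean(xbars),
      " var =", var(xbars),
      " sd =", sd(xbars),
      " theoretical var =", 1/n, "\n")
}
```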

Generate 200 samples, each with n = 4 independent observations, from the chi-square distribution on 3 degrees of freedom, and compute the 200 sample means. Display the 200 sample means on a histogram and compute the mean, variance and standard deviation of this distribution. Repeat for another 200 samples, this time each with n = 100 independent observations. Explain how this illustrates Equation 6.3 on p. 174 of the text.
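
For the chi-square case only the random number generator changes. The sketch below assumes the standard facts that a chi-square distribution on 3 degrees of freedom has mean 3 and variance 6, so the sample mean has variance 6/n; the name nu and the added normal quantile plot are illustrative choices, not requirements of the assignment.

```r
set.seed(5)
nu <- 3    # degrees of freedom

op <- par(mfrow = c(2, 2))
for (n in c(4, 100)) {
  simdat <- matrix(rchisq(n * 200, df = nu), nrow = n)   # one sample per column
  xbars  <- apply(simdat, 2, mean)
  hist(xbars, main = paste("Chi-square(3) means, n =", n))
  qqnorm(xbars, main = paste("Normal Q-Q plot, n =", n))
  qqline(xbars)
  cat("n =", n,
      " mean =", mean(xbars),
      " var =", var(xbars),
      " theoretical var =", 2 * nu / n, "\n")
}
par(op)
```

Comparing the histogram and quantile plot for n = 4 with those for n = 100 shows how the skewness of the parent distribution fades as the sample size grows.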