Statistics 3N03/3J04 - Assignment #3 Solutions

2004-12-03


Use R to do the graphics on this assignment. Do the ANOVA calculations in R and with your calculator, and submit both. The text references are to Montgomery & Runger, Applied Statistics and Probability for Engineers, 3rd edition.

Full marks = 140


Question 1

Figure 8-4

[Graph 3]

> xgr <- seq(-4,4,length=50)
> plot(xgr, dnorm(xgr), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,dt(xgr,10),lty=2)
> lines(xgr,dt(xgr,1),lty=3)
> legend(1.8,.38,c("infinite df","10 df","1 df"),lty=1:3)
> title("t density")

Figure 8-8

[Graph 3]

> xgr <- seq(0,30,length=50)
> plot(xgr, dchisq(xgr, 2), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,dchisq(xgr,5),lty=2)
> lines(xgr,dchisq(xgr,10),lty=3)
> legend(15,.4,c("2 df","5 df","10 df"),lty=1:3)
> title("Chi-square density")

Figure 10-4

[Graph 3]

> xgr <- seq(0,8,length=90)
> plot(xgr, df(xgr,5,15), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,df(xgr,5,5),lty=3)
> legend(3,.6,c("F(5,15)","F(5,5)"),lty=c(1,3))
> title("F density")


Question 2

[8 marks]

When n = 5 the coverage is only about 86%, much less than the 95% intended. When n = 100 the coverage is 95%. If n > 40 it appears that the coverage will be close enough to 95%.

> n <- 5
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.864
 
> plot(c(0,20),c(-10,40),type="n",xlab="Interval Number",ylab="x")
> for (i in 1:20)
    { points(i, xbare[i], pch=16)
      lines(c(i,i), c(llim[i],ulim[i])) }
> abline(h=10, lty=3)

> n <- 10
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.92
> n <- 20
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.924
> n <- 40
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.935
> n <- 100
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.952

[8 marks]

The sample mean and variance from normal data are statistically independent, as demonstrated by the scatterplot below. This means that t confidence intervals from normal data will be too wide or too narrow independently of whether they are to low or too high. Exponential data, however, give positively correlated sample mean and variance. This is because, being positively skewed, if there happen to be observations from the right tail in a sample, the sample mean and the sample variance will both be inflated. The result is that the t-based confidence intervals that are too high will tend to be too wide (and hence more likely to include the true value) and those that are too low will tend to be too narrow (and hence more likely to miss the true value).

 

 

 

Question 3

(a) [5 marks]

> qchisq(.975,27)/qchisq(.025,27)
[1] 2.963932
> qchisq(.975,26)/qchisq(.025,26)
[1] 3.028276

Hence 27 degrees of freedom are required.

(b) [10 marks]

The number of pathogens found in a sample will ,with probability 0.7, be Poisson with mean 30, and, with probability 0.3, be Poisson with mean 60. A Bayes Theorem calculation gives the conditional probability of being at the higher mean, given that 40 pathogens were observed.

> .3*dpois(40,60)/(.3*dpois(40,60)+.7*dpois(40,30))
[1] 0.04223268
 

Question 4

(a) [valid analysis 2, graph 3, analysis 4, assumptions 2, conclusions 1]

> hex
   indoor outdoor
1    0.22    0.90
2    0.18    0.66
3    0.28    1.24
4    0.34    0.37
5    0.18    1.55
6    0.12    0.54
7    0.29    0.27
8    0.08    0.68
9    0.39    1.26
10   0.28    0.48
> anova(lm(indoor~outdoor,hex))
Analysis of Variance Table
 
Response: indoor
          Df   Sum Sq  Mean Sq F value Pr(>F)
outdoor    1 0.000153 0.000153  0.0144 0.9076
Residuals  8 0.085487 0.010686               
> summary(lm(indoor~outdoor,hex))
 
Call:
lm(formula = indoor ~ outdoor, data = hex)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.15489 -0.06114  0.01135  0.05606  0.14951 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 0.228326   0.071893   3.176   0.0131 *
outdoor     0.009652   0.080542   0.120   0.9076  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
Residual standard error: 0.1034 on 8 degrees of freedom
Multiple R-Squared: 0.001792,	Adjusted R-squared: -0.123 
F-statistic: 0.01436 on 1 and 8 DF,  p-value: 0.9076 
 
> plot(indoor~outdoor,hex)
> abline(lm(indoor~outdoor,hex))

Assumptions:

Conclusions:


> t.test(hex$ind,hex$outd,pair=T)
 
	Paired t-test
 
data:  hex$ind and hex$outd 
t = -4.066, df = 9, p-value = 0.002816
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.8700034 -0.2479966 
sample estimates:
mean of the differences 
                 -0.559 
> stem(hex$ind-hex$outd)
 
  The decimal point is at the |
 
  -1 | 40
  -0 | 9765
  -0 | 420
   0 | 0
 

Assumptions:

Conclusions:

(b) [valid analysis 2, graph 3, analysis 5, assumptions 2, conclusions 1]

> wafer
   defects clean
1       53     S
2      193     S
3      113     S
4      640     S
5      800     S
6      140     S
7       85     S
8      658     S
9      140     S
10     140     S
11      26     M
12      90     M
13     546     M
14      90     M
15     120     M
> t.test(defects~clean,wafer,var.eq=T)
 
	Two Sample t-test
 
data:  defects by clean 
t = -0.8445, df = 13, p-value = 0.4137
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -433.3820  189.7820 
sample estimates:
mean in group M mean in group S 
          174.4           296.2 
 
> boxplot(defects~clean,wafer)
> var(wafer$defects[wafer$clean=="S"])/var(wafer$defects[wafer$clean=="M"])
[1] 1.814901
> 2*pf(var(wafer$defects[wafer$clean=="S"])/var(wafer$defects[wafer$clean=="M"]),9,4,lower.tail=F)
[1] 0.5929594

Assumptions:

Conclusions:


 

Question 5

[either interaction plot 5, ANOVA with interaction 14, CI for residual variance 3, conclusions 3]

> paper
   pctwr etime temp
1     24    10  100
2     26    10  100
3     21    10  100
4     25    10  100
5     39    20  100
6     34    20  100
7     37    20  100
8     40    20  100
9     58    30  100
10    55    30  100
11    56    30  100
12    53    30  100
13    33    10  120
14    33    10  120
15    36    10  120
16    32    10  120
17    51    20  120
18    50    20  120
19    47    20  120
20    52    20  120
21    75    30  120
22    71    30  120
23    70    30  120
24    73    30  120
25    45    10  140
26    49    10  140
27    44    10  140
28    45    10  140
29    67    20  140
30    64    20  140
31    68    20  140
32    65    20  140
33    89    30  140
34    87    30  140
35    86    30  140
36    83    30  140
> anova(lm(pctwr~as.factor(etime)*as.factor(temp), paper))
Analysis of Variance Table
 
Response: pctwr
                                 Df Sum Sq Mean Sq  F value    Pr(>F)    
as.factor(etime)                  2 8200.4  4100.2 856.5203 < 2.2e-16 ***
as.factor(temp)                   2 4376.7  2188.4 457.1431 < 2.2e-16 ***
as.factor(etime):as.factor(temp)  4  103.3    25.8   5.3936  0.002522 ** 
Residuals                        27  129.2     4.8                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Conclusions:

> interactplot(paper$pctwr,as.factor(paper$etime),as.factor(paper$temp))
> interactplot(paper$pctwr,as.factor(paper$temp),as.factor(paper$etime))
> anova(lm(pctwr~as.factor(etime)*as.factor(temp),paper))[4, 3]
         
4.787037 
> anova(lm(pctwr~as.factor(etime)*as.factor(temp),paper))[4, 3]/(qchisq(c(.975,.025),27)/27) 
[1] 2.992278 8.868909


Question 6

[ANOVA with lack of fit 13, CI for residual variance 3, assumptions 3, conclusions 3, predictions and comment 3]

> summary(lm(pctwr~etime, paper[paper$temp==100,]))
 
Call:
lm(formula = pctwr ~ etime, data = paper[paper$temp == 100, ])
 
Residuals:
   Min     1Q Median     3Q    Max 
-5.000 -1.812  0.500  1.375  3.250 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.50000    1.89407    3.96  0.00269 ** 
etime        1.57500    0.08768   17.96 6.12e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
Residual standard error: 2.48 on 10 degrees of freedom
Multiple R-Squared: 0.9699,	Adjusted R-squared: 0.9669 
F-statistic: 322.7 on 1 and 10 DF,  p-value: 6.116e-09 
 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))
Analysis of Variance Table
 
Response: pctwr
          Df  Sum Sq Mean Sq F value    Pr(>F)    
etime      1 1984.50 1984.50  322.68 6.116e-09 ***
Residuals 10   61.50    6.15                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))[2,3]
     
6.15 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))[2,3]/(qchisq(c(.975,.025),10)/10)
[1]  3.002464 18.940719
 
> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))
Analysis of Variance Table
 
Response: pctwr
                 Df  Sum Sq Mean Sq  F value    Pr(>F)    
etime             1 1984.50 1984.50 372.0937 1.249e-08 ***
as.factor(etime)  1   13.50   13.50   2.5313    0.1461    
Residuals         9   48.00    5.33                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 

Assumptions:

Conclusions:

> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))[3,3]
         
5.333333 
> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))[3,3]/(qchisq(c(.975,.025),9)/9)
[1]  2.523292 17.775214 
 
> predict(lm(pctwr~etime, paper[paper$temp==100,]), newdata=data.frame(etime=c(0,25,60)))
      1       2       3 
  7.500  46.875 102.000 
 

Predictions:

> plot(pctwr~etime, paper[paper$temp==100,])
> abline(lm(pctwr~etime, paper[paper$temp==100,]))


Question 7

[ANOVA with interaction 12, conclusions 3, plots of residuals and discussion 10]

> baked
   density pos temp posf tempf
1      570   1  800    1   800
2      565   1  800    1   800
3      583   1  800    1   800
4     1063   1  825    1   825
5     1080   1  825    1   825
6     1043   1  825    1   825
7      565   1  850    1   850
8      510   1  850    1   850
9      590   1  850    1   850
10     528   2  800    2   800
11     547   2  800    2   800
12     521   2  800    2   800
13     988   2  825    2   825
14    1026   2  825    2   825
15    1004   2  825    2   825
16     526   2  850    2   850
17     538   2  850    2   850
18     532   2  850    2   850
> anova(lm(density~posf*tempf, baked))
Analysis of Variance Table
 
Response: density
           Df Sum Sq Mean Sq  F value   Pr(>F)    
posf        1   7160    7160   15.998 0.001762 ** 
tempf       2 945342  472671 1056.117 3.25e-14 ***
posf:tempf  2    818     409    0.914 0.427110    
Residuals  12   5371     448                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Conclusions

Analysis of Residuals

There are no obvious systematic patterns in the residuals. This suggests that the model is adequate and the homoscedasticity assumption is reasonable. The probability plot is straight enough to validate the assumpton of normality.

> plot(lm(density~posf*tempf, baked))




Statistics 3N03/3J04