Assignment #3 Solutions

Statistics 3N03/3J04 - Assignment #3 Solutions

2004-12-03

Use R to do the graphics on this assignment. Do the ANOVA calculations in R and with your calculator, and submit both. The text references are to Montgomery & Runger, Applied Statistics and Probability for Engineers, 3rd edition.

Full marks = 140

Question 1

Figure 8-4

[Graph 3]

> xgr <- seq(-4,4,length=50)
> plot(xgr, dnorm(xgr), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,dt(xgr,10),lty=2)
> lines(xgr,dt(xgr,1),lty=3)
> legend(1.8,.38,c("infinite df","10 df","1 df"),lty=1:3)
> title("t density")

Figure 8-8

[Graph 3]

> xgr <- seq(0,30,length=50)
> plot(xgr, dchisq(xgr, 2), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,dchisq(xgr,5),lty=2)
> lines(xgr,dchisq(xgr,10),lty=3)
> legend(15,.4,c("2 df","5 df","10 df"),lty=1:3)
> title("Chi-square density")

Figure 10-4

[Graph 3]

> xgr <- seq(0,8,length=90)
> plot(xgr, df(xgr,5,15), type = "l", lty = 1, xlab = "x", ylab ="f(x)")
> lines(xgr,df(xgr,5,5),lty=3)
> legend(3,.6,c("F(5,15)","F(5,5)"),lty=c(1,3))
> title("F density")

Question 2

[8 marks]

When n = 5 the coverage is only about 86%, much less than the 95% intended. When n = 100 the coverage is 95%. If n > 40 it appears that the coverage will be close enough to 95%.

> n <- 5
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.864
 
> plot(c(0,20),c(-10,40),type="n",xlab="Interval Number",ylab="x")
> for (i in 1:20)
    { points(i, xbare[i], pch=16)
      lines(c(i,i), c(llim[i],ulim[i])) }
> abline(h=10, lty=3)

> n <- 10
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.92
> n <- 20
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.924
> n <- 40
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.935
> n <- 100
> expdata <- matrix(rexp(1000*n, 1/10), ncol=n)
> xbare <- apply(expdata, 1, mean)
> sxe <- sqrt(apply(expdata, 1, var))
> llim <- xbare-qt(.975,n-1)*sxe/sqrt(n)
> ulim <- xbare+qt(.975,n-1)*sxe/sqrt(n)
> mean(10 > llim & 10 < ulim)
[1] 0.952

[8 marks]

The sample mean and variance from normal data are statistically independent, as demonstrated by the scatterplot below. This means that t confidence intervals from normal data will be too wide or too narrow independently of whether they are to low or too high. Exponential data, however, give positively correlated sample mean and variance. This is because, being positively skewed, if there happen to be observations from the right tail in a sample, the sample mean and the sample variance will both be inflated. The result is that the t-based confidence intervals that are too high will tend to be too wide (and hence more likely to include the true value) and those that are too low will tend to be too narrow (and hence more likely to miss the true value).

Question 3

(a) [5 marks]

> qchisq(.975,27)/qchisq(.025,27)
[1] 2.963932
> qchisq(.975,26)/qchisq(.025,26)
[1] 3.028276

Hence 27 degrees of freedom are required.

(b) [10 marks]

The number of pathogens found in a sample will ,with probability 0.7, be Poisson with mean 30, and, with probability 0.3, be Poisson with mean 60. A Bayes Theorem calculation gives the conditional probability of being at the higher mean, given that 40 pathogens were observed.

> .3*dpois(40,60)/(.3*dpois(40,60)+.7*dpois(40,30))
[1] 0.04223268

Question 4

(a) [valid analysis 2, graph 3, analysis 4, assumptions 2, conclusions 1]

> hex
   indoor outdoor
1    0.22    0.90
2    0.18    0.66
3    0.28    1.24
4    0.34    0.37
5    0.18    1.55
6    0.12    0.54
7    0.29    0.27
8    0.08    0.68
9    0.39    1.26
10   0.28    0.48
> anova(lm(indoor~outdoor,hex))
Analysis of Variance Table
 
Response: indoor
          Df   Sum Sq  Mean Sq F value Pr(>F)
outdoor    1 0.000153 0.000153  0.0144 0.9076
Residuals  8 0.085487 0.010686               
> summary(lm(indoor~outdoor,hex))
 
Call:
lm(formula = indoor ~ outdoor, data = hex)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.15489 -0.06114  0.01135  0.05606  0.14951 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 0.228326   0.071893   3.176   0.0131 *
outdoor     0.009652   0.080542   0.120   0.9076  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
Residual standard error: 0.1034 on 8 degrees of freedom
Multiple R-Squared: 0.001792,	Adjusted R-squared: -0.123 
F-statistic: 0.01436 on 1 and 8 DF,  p-value: 0.9076 
 
> plot(indoor~outdoor,hex)
> abline(lm(indoor~outdoor,hex))

Assumptions:

Independence (can't test)
Normality (can't test but residuals look OK on the graph)
Linearity (can't really test with only 1 df for pure error but looks OK on the graph)
Homoscedasticity (can't test but looks OK on the graph)

Conclusions:

Assuming a linear relationship, there is no evidence (P = 0.91) that indoor concentration of hexavalent chromium varies with outdoor concentration at the same house.


> t.test(hex$ind,hex$outd,pair=T)
 
	Paired t-test
 
data:  hex$ind and hex$outd 
t = -4.066, df = 9, p-value = 0.002816
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.8700034 -0.2479966 
sample estimates:
mean of the differences 
                 -0.559 
> stem(hex$ind-hex$outd)
 
  The decimal point is at the |
 
  -1 | 40
  -0 | 9765
  -0 | 420
   0 | 0

Assumptions:

differences independent (can't test)
differences normal (can't test but looks OK on the graph)

Conclusions:

There is some evidence (P = 0.003 two-sided or P = 0.0014 one-sided) that the concentration of hexavalent chromium is higher indoors than outdoors.

(b) [valid analysis 2, graph 3, analysis 5, assumptions 2, conclusions 1]

> wafer
   defects clean
1       53     S
2      193     S
3      113     S
4      640     S
5      800     S
6      140     S
7       85     S
8      658     S
9      140     S
10     140     S
11      26     M
12      90     M
13     546     M
14      90     M
15     120     M
> t.test(defects~clean,wafer,var.eq=T)
 
	Two Sample t-test
 
data:  defects by clean 
t = -0.8445, df = 13, p-value = 0.4137
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -433.3820  189.7820 
sample estimates:
mean in group M mean in group S 
          174.4           296.2 
 
> boxplot(defects~clean,wafer)
> var(wafer$defects[wafer$clean=="S"])/var(wafer$defects[wafer$clean=="M"])
[1] 1.814901
> 2*pf(var(wafer$defects[wafer$clean=="S"])/var(wafer$defects[wafer$clean=="M"]),9,4,lower.tail=F)
[1] 0.5929594

Assumptions:

Independence (can't test, sample too small)
Normality (can't test but looks OK by the graph)
Homoscedasticity (OK by F test, P = 0.59)

Conclusions:

There is no evidence (P = 0.41) that the mean number of defects differs between standard and megasonic cleaning. (A one-sided left-tail test would lead to the same conclusion with P = 0.21.)

Question 5

[either interaction plot 5, ANOVA with interaction 14, CI for residual variance 3, conclusions 3]

> paper
   pctwr etime temp
1     24    10  100
2     26    10  100
3     21    10  100
4     25    10  100
5     39    20  100
6     34    20  100
7     37    20  100
8     40    20  100
9     58    30  100
10    55    30  100
11    56    30  100
12    53    30  100
13    33    10  120
14    33    10  120
15    36    10  120
16    32    10  120
17    51    20  120
18    50    20  120
19    47    20  120
20    52    20  120
21    75    30  120
22    71    30  120
23    70    30  120
24    73    30  120
25    45    10  140
26    49    10  140
27    44    10  140
28    45    10  140
29    67    20  140
30    64    20  140
31    68    20  140
32    65    20  140
33    89    30  140
34    87    30  140
35    86    30  140
36    83    30  140
> anova(lm(pctwr~as.factor(etime)*as.factor(temp), paper))
Analysis of Variance Table
 
Response: pctwr
                                 Df Sum Sq Mean Sq  F value    Pr(>F)    
as.factor(etime)                  2 8200.4  4100.2 856.5203 < 2.2e-16 ***
as.factor(temp)                   2 4376.7  2188.4 457.1431 < 2.2e-16 ***
as.factor(etime):as.factor(temp)  4  103.3    25.8   5.3936  0.002522 ** 
Residuals                        27  129.2     4.8                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Conclusions:

There is evidence (P = 0.0025) that exposure time interacts with temperature. This means that both exposure time and temperature affect percentage of water removed, but the effect of exposure time is different at different temperatures and the effect of temperature is different at different exposure times. (Because of the interaction, tests of the main effects are not valid.)

> interactplot(paper$pctwr,as.factor(paper$etime),as.factor(paper$temp))
> interactplot(paper$pctwr,as.factor(paper$temp),as.factor(paper$etime))
> anova(lm(pctwr~as.factor(etime)*as.factor(temp),paper))[4, 3]
         
4.787037 
> anova(lm(pctwr~as.factor(etime)*as.factor(temp),paper))[4, 3]/(qchisq(c(.975,.025),27)/27) 
[1] 2.992278 8.868909

Question 6

[ANOVA with lack of fit 13, CI for residual variance 3, assumptions 3, conclusions 3, predictions and comment 3]

> summary(lm(pctwr~etime, paper[paper$temp==100,]))
 
Call:
lm(formula = pctwr ~ etime, data = paper[paper$temp == 100, ])
 
Residuals:
   Min     1Q Median     3Q    Max 
-5.000 -1.812  0.500  1.375  3.250 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.50000    1.89407    3.96  0.00269 ** 
etime        1.57500    0.08768   17.96 6.12e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
Residual standard error: 2.48 on 10 degrees of freedom
Multiple R-Squared: 0.9699,	Adjusted R-squared: 0.9669 
F-statistic: 322.7 on 1 and 10 DF,  p-value: 6.116e-09 
 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))
Analysis of Variance Table
 
Response: pctwr
          Df  Sum Sq Mean Sq F value    Pr(>F)    
etime      1 1984.50 1984.50  322.68 6.116e-09 ***
Residuals 10   61.50    6.15                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))[2,3]
     
6.15 
> anova(lm(pctwr~etime, paper[paper$temp==100,]))[2,3]/(qchisq(c(.975,.025),10)/10)
[1]  3.002464 18.940719
 
> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))
Analysis of Variance Table
 
Response: pctwr
                 Df  Sum Sq Mean Sq  F value    Pr(>F)    
etime             1 1984.50 1984.50 372.0937 1.249e-08 ***
as.factor(etime)  1   13.50   13.50   2.5313    0.1461    
Residuals         9   48.00    5.33                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Assumptions:

Independence
Normality
Homoscediaticity

Conclusions:

There is no evidence (P = 0.15) of nonlinearity, so we can test the slope. There is strong evidence (P < 0.001) that the slope is not zero, so we conclude that the percentage of water removed can be predicted as a linear function of exposure time when the temperature is held at 100°F.

> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))[3,3]
         
5.333333 
> anova(lm(pctwr~etime+as.factor(etime), paper[paper$temp==100,]))[3,3]/(qchisq(c(.975,.025),9)/9)
[1]  2.523292 17.775214 
 
> predict(lm(pctwr~etime, paper[paper$temp==100,]), newdata=data.frame(etime=c(0,25,60)))
      1       2       3 
  7.500  46.875 102.000

Predictions:

The prediction for 25 sec is a linear interpolation and should be OK. The extrapolations to 0 sec and 60 sec are not reliable because percantage is bounded by 0% and 100%; in particular, linear extrapolation to 60 sec gives a result greater than 100%.

> plot(pctwr~etime, paper[paper$temp==100,])
> abline(lm(pctwr~etime, paper[paper$temp==100,]))

Question 7

[ANOVA with interaction 12, conclusions 3, plots of residuals and discussion 10]

> baked
   density pos temp posf tempf
1      570   1  800    1   800
2      565   1  800    1   800
3      583   1  800    1   800
4     1063   1  825    1   825
5     1080   1  825    1   825
6     1043   1  825    1   825
7      565   1  850    1   850
8      510   1  850    1   850
9      590   1  850    1   850
10     528   2  800    2   800
11     547   2  800    2   800
12     521   2  800    2   800
13     988   2  825    2   825
14    1026   2  825    2   825
15    1004   2  825    2   825
16     526   2  850    2   850
17     538   2  850    2   850
18     532   2  850    2   850
> anova(lm(density~posf*tempf, baked))
Analysis of Variance Table
 
Response: density
           Df Sum Sq Mean Sq  F value   Pr(>F)    
posf        1   7160    7160   15.998 0.001762 ** 
tempf       2 945342  472671 1056.117 3.25e-14 ***
posf:tempf  2    818     409    0.914 0.427110    
Residuals  12   5371     448                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Conclusions

The hypothesis that there is no interaction between firing temperature and furnace position is accepted at the 5% level of significance. Hence we can proceed to test the man effects.
The hypothesis that the mean density is the same at each firing temperature is rejected at the 5% level of significance.
The hyppothesis that the mean density is the same at each oven position is rejected at the 5% level of significance.

Analysis of Residuals

There are no obvious systematic patterns in the residuals. This suggests that the model is adequate and the homoscedasticity assumption is reasonable. The probability plot is straight enough to validate the assumpton of normality.

> plot(lm(density~posf*tempf, baked))