1998-03-06 (Week 8)


Assignment #3

Due: As soon as possible after you complete Assignment #2

Fit the breast cancer survival data from pages 102-103 of Bishop, Fienberg & Holland, using one of GLMStat, Splus or SAS. Repeat the fit with the other two packages. Use binomial error with logit link.


Generalized Linear Models (GLMs)

There is a good (but brief!) introduction to GLMs in Chatfield, pages 207-209. McCullagh & Nelder is comprehensive but very difficult to read. Dobson is a good practical introduction. Bishop, Fienberg and Holland discusses only categorical data; it is now outdated, but still an excellent source of examples.

Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975) Discrete Multivariate Analysis: Theory and Practice. MIT Press.

Chatfield, C. (1988) Problem Solving: A Statistician's Guide. Chapman & Hall.

Dobson, A.J. (1983) An Introduction to Statistical Modelling. Chapman & Hall.

McCullagh, P. & Nelder, J.A. (1989) Generalized Linear Models, Second Edition. Chapman & Hall.

While many statistics packages claim to handle GLMs, the only ones I know of that are comprehensive and convenient are GLIM, GLMStat, Splus and SAS. GLMStat, available only as Macintosh shareware, is by far the most convenient and user-friendly. It is modelled after GLIM, the original GLM software. While there is no printed manual, there is complete online help. There is a copy of GLMStat on the Macintosh in BSB-102. We no longer have a copy of GLIM.

GLM analysis includes conventional linear models, but offers two advantages: (1) a whole range of models including Poisson, binomial, gamma, and other error structures are handled in the same way and with the same algorithm as linear regression with normal errors; (2) unbalanced designs and missing data are handled as easily as balanced designs.

Here are the breast cancer survival data from pages 102-103 of Bishop, Fienberg & Holland, set up for GLMStat. The design is complete and balanced. The columns for "response" and "Binomial n" are "linear", the other columns are "integer" or categorical (i.e. factors in Splus). I have coded histology as a single variable "hist" with four levels: 1 = "Minimal, Malignant", 2 = "Minimal, Benign", 3 = "Greater, Malignant", 4 = "Greater, Benign". The objective is to find the simplest possible model for which the residual deviance passes the chi-square goodness-of-fit test. The fitted model must be hierarchical, in the sense that if any interaction is included in the model, all the lower-order interactions and main effects implied by that interaction must also be in the model.

case

response

binomial n

city

age

hist

1

26

35

1

1

1

2

68

75

1

1

2

3

25

29

1

1

3

4

9

12

1

1

4

5

20

29

1

2

1

6

46

55

1

2

2

...

...

...

...

...

...

36

1

1

3

3

4


Experimental Error and Natural Variation

In any statistical study, we have to distinguish between natural variation and measurement error. Measurement error is the result of imperfect measurement devices; generally it can be reduced by using better equipment but the cost of doing so may be prohibitive and in any event it isn't worth reducing measurement error much below the level of natural variation.

Natural variation is an unavoidable natural property of the system we are studying. It may be of interest in its own right. A study design can reduce natural variation by breaking down the population into well-defined strata or sub-groups that are as homogeneous as possible.

When we are dealing with the sizes of animals, natural variation is frequently characterised by a coefficient of variation of about 10%, where coefficient of variation (c.o.v.) is defined as the standard deviation divided by the mean. See the Guinea Pig Growth Chart for an example where the c.o.v. is consistently near 10%.

Is the 10% c.o.v. rule true for fish length in the Kapuskasing river data? Is it true if we stratify by age? One problem with these data is that the length of each fish is repeated once for every age increment; if we leave all the repeats in the data, the variance of fish length will be underestimated.

To produce a vector of fish lengths and fish ages where each fish appears only once, the following code will work if we first resolve the non-unique fish identifiers ("fish"). There were two fish with identifier #1004; I labelled them #1004 and #2004, and there were two fish with identifier #1061 which I labelled #1061 and #2061.

flenr <- sapply(split(flen, fish), mean)
fager <- sapply(split(fage, fish), mean)

In the January 16 notes I showed the result of

boxplot(split(flen, fage))

but this uses all the repeated fish lengths. How different is the corrected graph?

boxplot(split(flenr, fager))


Back to S4P03