Frameworks for statistical inference and estimation

1 Frameworks for statistical inference and estimation

Use statistical procedures from a range of schools and strictly adhere to their respective methods and interpretation. For example, do a Fisherian significance test properly and interpret it properly. Then set up a formal Neymann-Pearson (sic) test and interpret it formally (this means setting both Type I and II error rates beforehand, among other things). Then do an estimation procedure. Then switch hats and do a Bayesian analysis. Take the results of all four, noting their different behavior, and come to your conclusion. Good analysis and interpretation are as important as the fieldwork, so allot adequate time and resources to both.
- Francis H. J. Crome, ``Researching Tropical Forest Fragementation: Shall We Keep On Doing What We're Doing?'' ch. 31 in Tropical Forest Remants: Ecology, Management and Conservation of Fragmented Communities (ed. W. F. Laurance and R. O. Bierregaard, Jr.), University of Chicago, Chicago: 1997 (p. 501).

1.1 Frequentist

1.1.1 Fisherian

Formulate H₀; calculate test statistic, find probability of data or more extreme values given null model H₀ (p value). p-value provides strength of evidence for rejecting H₀. No fixed p value.

1.1.2 Neyman-Pearson

Formulate null (H₀) and a specific alternative (H₁) hypothesis. Decide on a fixed value of a (probability of type-I error beforehand; e.g. p=0.05); with the sample size and the choice of test statistic, and the alternative hypothesis H₁, this determines b (probability of type-II error). Use this fixed value of a as a decision rule to decide whether to accept H₀ or H₁. (Decision analysis: associate costs with type I and type II errors, decide on balance.) Over the long term, the frequency of incorrect decisions will match a and b.

Note: modern "bad" statistical practice as criticized by Yoccoz and Johnson essentially muddles these two schools.

1.2 Likelihood

1.2.1 ``Classical'' likelihood

Formulate a likelihood model for P(data|H(x)), the probability of observing the data (not the data or a more extreme value as in frequentist approaches), where H(x) is a particular hypothesis (model/parameter value). Find the model or parameter value that gives the maximum likelihood. Accept this maximum likelihood estimate (or, in the case of a choice between two hypotheses, whichever hypothesis has the higher likelihood).

To establish confidence limits, you calculate likelihood profiles (how likelihood drops off as you go away from the MLE).

Example: suppose we want to test whether a coin is fair (p(head = 0.5); we flip it 10 times and get 7 heads. The likelihood that the probability of getting a head is p is proportional to p⁷ (1-p)³ (assuming flips are independent) and the log-likelihood is C + 7 ln(p) + 3 ln(1-p).

What can we tell?

The MLE of p is 7/10 (unsurprisingly).
p = 0.7 is 2.28 times more likely than p = 0.5 (do you think that is ``significant''?)
If we set limits of head-probabilities that are 10 times less likely than the maximum, we get limits (0.37,0.93).
If we set limits of head-probabilities that are 1.92 log-likelihood units (e^1.92=6.82 times) less likely than the maximum, we get limits (0.39,0.92).

Why should we use 1.92 log-likelihood units? The likelihood ratio test says that if we allow n parameters to vary, the q% confidence limits are (asymptotically) the q quantiles (upper tails) of c²_n/2: c²₁(0.95)/2 = 1.92. Essentially, this is a frequentist argument sneaking back in that says how close we can expect to get to the true MLE in repeated trials.

Example: do two populations have the same mean?

Suppose we have two populations (assume we know they are normally distributed and that they have the same [known] variance s², for simplicity). The normal distribution is ~ e^{-(x-m)²/(2 s²)}; therefore the joint log-likelihood of a set of values drawn independently from a normal distribution with mean m (the sum of the individual log-likelihoods) is µ -å_i (x_i-m)² - the sum of squares. If we assume there is really just a single distribution from which all of the values are drawn the log likelihood is µ -å_i (x_i - [`X]); if we are allowed different means for the two different populations we get µ -å_i (x_i - [`X]₁) - å_i(x_i-[`X]₂). The likelihood for the second case can't be worse than the first (since we could always set [`X]₁ = [`X]₂ = [`X]). If it is more than 1.92 greater than the first, we can conclude that the two populations have ``significantly'' different means. We can also find confidence limits, etc..

Points about likelihood:

Usually gives ``sensible'', unbiased answers
Only relative likelihoods matter
Gaussian, constant errors ® least-squares
``Everybody likes likelihood'' - it is an important component of all statistical schools, although they all use it differently

1.2.2 Information-theoretic

Very different approach, focuses on model selection. Starts from abstract definitions of the distance between models (cf. least-squares), arrives at an expression for the (approximate) distance between the true model and any given candidate, which turns out to be -2 log(L)+C. C is a ``penalization term'' that goes up with the number of parameters and varies according to your definition, but Akaike's Information Criterion uses C = 2.

1.3 Bayesian

Remember, what would really like to know is P(H₀|data), the probability of our hypothesis given the data we've observed.

Use Bayes' Rule (see below) to get probabilities on hypotheses and parameter values. In frequentist statistics, the underlying hypotheses/parameters/models are true and the data come from a probability distribution; in Bayesian statistics the data are true and the underlying hypotheses etc. have probability distributions.

(Bayesian decision analysis: associate costs with different outcomes, maximize expected return.)

Bayesian analysis

is not ``more powerful'' (unless you cheat) than other approaches
can be used to cheat, but should not ...
does not allow e.g. unreplicated designs, although (i) it does provide a relatively easy framework for estimating dependence among units and (ii) it gets around the philosophical problems of unreplicated designs (e.g. evolutionary biology)

Fisher N-P Likelihood Information Bayes

depends (at least conceptually) on replicated outcomes yes yes yes no no

outcome depends on sampling rules (violates Likelihood principle) yes yes no no no

gives decision rules no yes no (no) yes

requires alternative hypotheses specified no yes yes yes yes

intuitive probability interpretation no no yes no yes

subjective no no no no yes

requires specified priors no no no no yes

allows integrating previous results no no no no yes

2 Bayesian approaches in a little more detail

How can we figure out the probability of the hypothesis, P(H₀|data) from the likelihood, P(data|H₀)? (The likelihood P(data|H₀) is a probability of data, not hypotheses: the likelihoods for all our candidate hypotheses don't even add up to 1, as they would if they were probabilities of different hypotheses ...)

How? Bayes' Rule.

P(H₀|data) = P(data|H₀) ·P(H₀)
P(data)
= P(data|H₀) ·P(H₀)

å
i
P(data|H_i) ·P(H_i)

(1)

In words: multiply the likelihood P(data|H₀) by the prior probability P(H₀), and divide by the sum of (likelihood × prior) for all candidate hypotheses.

False-positive/testing example

My favorite example (which I will probably not have time to present in class): suppose the probability of having some deadly but rare disease (D) is 10^-4. There is a test for this disease which has no false negatives: if you have the disease, you will test positive (P(+|D) = 1). However, there are occasionally false positives; 1 person in 1000 who doesn't have the disease) will test positive anyway (P(+|not D) = 10^-3). We want to know the probability that someone who has a positive test is actually ill.

Using Bayes' Rule

P(D|+) = P(+|D) P(D)
P(+)
.
(2)
We know P(+|D) (=1) and P(D) ( = 10^-4), but we have to figure out P(+), the overall probability of testing positive. You can test positive if you are really diseased or if you are really healthy;

P(+) = (p(D and +)) + (p(not D and +)),
(3)
according to the rule that if A and B are mutually exclusive (you can't be both ill and not ill) P(A and B) = P(A)+P(B). We can then say

P(+) = P(D) P(+|D) + (1-P(D)) P(+|not D)
(4)
by the rule that P(A and B) = P(A) P(B|A). Putting it all together,

P(D|+)

=

P(+|D) P(D)
P(D) P(+|D) + (1-P(D))P(+|not D)

=

1 ×10^-4
1 ×10^-4 + (1-10^-4) ×10^-3

»

10^-4
10^-3

=

1
10
.

Even though false positives are rare, the chance of being ill if you test positive is still only 10%!

Priors If P(H₀) is constant reduces to likelihood rule (except that we can say something about probability) but what does a "flat" prior mean? (scale changes, subdividing hypotheses)

How do you do this?

(Continue with previous examples: coin-flipping and testing two populations.)

Coin-flipping:

flat prior
``probably fair'' prior
(paranoid prior?)

Two-population example:

testing point null hypotheses deprecated by Bayesians ...
can give (e.g.) probability that m₁-m₂ is greater (or less) than a certain value

Credible intervals (symmetric, contains 95% of probability).

Issues:

although the prior (and the interpretation of the probability) are subjective, the updating rule (Bayes' rule) is the rational response to fresh evidence
Simple models can be (practically speaking) harder: you may have to ``roll your own'' more often
Conjugate priors
Complex models: cf. Crome comment, MCMC

Final conclusions:

be suspicious of big differences between the conclusions of different approaches: is someone cheating?
usually differences are because different frameworks allow us to ask different questions: really think about what question you want to answer!

File translated from T_EX by T_TH, version 2.71.
On 4 Sep 2002, 08:59.

	Fisher	N-P	Likelihood	Information	Bayes
depends (at least conceptually) on replicated outcomes	yes	yes	yes	no	no
outcome depends on sampling rules (violates Likelihood principle)	yes	yes	no	no	no
gives decision rules	no	yes	no	(no)	yes
requires alternative hypotheses specified	no	yes	yes	yes	yes
intuitive probability interpretation	no	no	yes	no	yes
subjective	no	no	no	no	yes
requires specified priors	no	no	no	no	yes
allows integrating previous results	no	no	no	no	yes