## SOME DEFINITIONS...

### Updated 2003-04-07

An experiment that can result in different outcomes, even though it is repeated in the same manner every time, is called a random experiment.

The set of all possible outcomes of a random experiment is called the sample space of the experiment.

An event is a subset of the sample space of a random experiment.

A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes.

Probability is a measure of certainty on a scale of 0 to 1. The probability of an impossible event is 0; the probability of an inevitable event is 1. If A and B are events, then P(A+B) = P(A) + P(B) - P(A.B), where A+B denotes set union and A.B denotes set intersection. Any one of the following three definitions of probability can be used to assign a probability to an event that is neither impossible nor inevitable.

The relative frequency definition of probability (more commonly called the classical definition) applies when the sample space consists of elementary outcomes which, through physical symmetry, are recognized as being equally likely. The probability of an event E is the number of elementary outcomes in E divided by the number of elementary outcomes in the sample space.
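As a minimal sketch of counting equally likely outcomes, the example below assumes one roll of a fair six-sided die (the die and the events A and B are illustrative choices, not from the text) and checks the addition rule for P(A+B) stated above.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event (a subset of the sample space), outcomes equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

union_direct = prob(A | B)                       # P(A+B), counted directly
union_by_rule = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(A.B)

print(union_direct, union_by_rule)
```

Both quantities come out equal because the outcomes in A.B are counted once in P(A) and once in P(B), so they are subtracted once.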

The limiting frequency definition of probability applies when you can envisage a sequence of independent trials. Consider the number of trials that result in an event E, divided by the total number of trials. The probability of E is the hypothetical limit to which any such series of trials will tend.

The subjective definition of probability defines your personal probability of an event E as the maximum amount of money you are willing to bet in order to win \$1 if E occurs.

The odds for an event A are computed as the probability that A will happen divided by the probability that A will not happen.
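The conversion between probability and odds can be sketched as follows; the function names are hypothetical and the value 0.75 is only an illustration.

```python
def odds(p):
    """Odds for an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def probability(o):
    """Probability of an event whose odds are o (the inverse of odds())."""
    return o / (1 + o)


print(odds(0.75))        # 0.75 / 0.25
print(probability(3.0))  # 3 / (1 + 3)
```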

A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.

A discrete random variable is a random variable with a finite (or countably infinite) range.

A continuous random variable is a random variable with an interval (either finite or infinite) of real numbers for its range.

The probability density function for a random variable X is a non-negative function f(x) which gives the relative probability of each point x in the sample space of X. It integrates to 1 over the whole sample space. The integral over any subset of the sample space gives the probability that X will fall in that subset.

A parameter is a scalar or vector that indexes a family of probability distributions.

The expected value or mean or average of a random variable is computed as a sum (or integral) over all possible values of the random variable, each weighted by the probability of getting that value. It can be interpreted as the centre of mass of the probability distribution.

The variance of a random variable is the expected squared deviation from the mean.

The covariance of two random variables is the expected product of their deviations from their respective means.

The correlation coefficient is a dimensionless measure of association between two random variables. Pearson's correlation coefficient is computed as their covariance divided by the product of their standard deviations.
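The definitions of mean, variance, covariance and Pearson's correlation can be sketched for a small discrete case. The joint distribution below is a made-up example, and `expect` is a hypothetical helper that computes a probability-weighted sum.

```python
import math

# Hypothetical joint probability mass function of (X, Y): {(x, y): probability}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def expect(g):
    """Expected value of g(X, Y): sum of g over all outcomes, weighted by probability."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Variance: expected squared deviation from the mean.
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)

# Covariance: expected product of the deviations from the respective means.
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Pearson's correlation: covariance over the product of standard deviations.
corr_xy = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

print(mean_x, var_x, cov_xy, corr_xy)
```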

The sensitivity of a screening test is the conditional probability that the subject has the symptom, given that they have the disease.

The specificity of a screening test is the conditional probability that the subject does not have the symptom, given that they do not have the disease.

The predictive value positive PV+ of a screening test is the conditional probability that the subject has the disease, given that they have tested positive.

The predictive value negative PV- of a screening test is the conditional probability that the subject does not have the disease, given that they have tested negative.
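The four conditional probabilities above can be read off a two-by-two table of counts. The sketch below uses invented counts, and also recomputes PV+ from sensitivity, specificity and the proportion diseased by Bayes' formula, to show that the predictive values depend on how common the disease is.

```python
# Hypothetical counts from a screening study (not taken from the text).
true_pos = 90    # diseased, test positive
false_neg = 10   # diseased, test negative
false_pos = 50   # not diseased, test positive
true_neg = 850   # not diseased, test negative

sensitivity = true_pos / (true_pos + false_neg)   # P(positive | disease)
specificity = true_neg / (true_neg + false_pos)   # P(negative | no disease)
pv_pos = true_pos / (true_pos + false_pos)        # P(disease | positive)
pv_neg = true_neg / (true_neg + false_neg)        # P(no disease | negative)

# The same PV+ via Bayes' formula, using the proportion diseased in the study.
total = true_pos + false_neg + false_pos + true_neg
prevalence = (true_pos + false_neg) / total
pv_pos_bayes = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)

print(sensitivity, specificity, pv_pos, pv_neg, pv_pos_bayes)
```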

The receiver operating characteristic (ROC) curve for a screening test is a plot of sensitivity on the ordinate against (1-specificity) on the abscissa, where the different points on the curve correspond to different cut-off points used to designate test positive. The area under the ROC curve is the probability that, given two subjects, one with the disease and the other without the disease, the one with the disease will have the higher score on the test.
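The two descriptions of the area under the ROC curve can be compared numerically. This is a sketch on invented test scores; the function names are hypothetical, ties between a diseased and a non-diseased subject are counted as one half (an assumption, since the text does not address ties), and the area is computed with the trapezoidal rule.

```python
def auc_by_pairs(diseased, healthy):
    """Share of (diseased, healthy) pairs in which the diseased subject scores higher."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5  # assumed convention for ties
    return wins / (len(diseased) * len(healthy))


def roc_points(diseased, healthy):
    """(1 - specificity, sensitivity) for each cut-off, calling score >= cut-off positive."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        sens = sum(d >= c for d in diseased) / len(diseased)
        fpr = sum(h >= c for h in healthy) / len(healthy)
        points.append((fpr, sens))
    return points


def auc_by_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test scores.
diseased_scores = [0.9, 0.8, 0.6, 0.55]
healthy_scores = [0.7, 0.5, 0.4, 0.3, 0.2]

pair_auc = auc_by_pairs(diseased_scores, healthy_scores)
curve_auc = auc_by_trapezoid(roc_points(diseased_scores, healthy_scores))
print(pair_auc, curve_auc)
```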

The prevalence of a disease is the proportion of people in the study population who currently have the disease.

The (cumulative) incidence of a disease is the probability that a person who has never had the disease will develop a new case of the disease over a given period of time.

Risk ratio is the probability of the disease in the risk group divided by the probability of the disease in the non-risk group.

Odds ratio is the odds for the disease in the risk group divided by the odds for the disease in the non-risk group, or the odds for the risk factor in the disease group divided by the odds for the risk factor in the non-disease group, or the risk ratio for the disease divided by the risk ratio for not having the disease.
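The three expressions for the odds ratio can be checked against each other on a two-by-two table. The counts below are invented for illustration, and exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical counts (not from the text).
a, b = 30, 70   # risk group: diseased, not diseased
c, d = 10, 90   # non-risk group: diseased, not diseased


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)


p_risk = Fraction(a, a + b)       # probability of disease in the risk group
p_nonrisk = Fraction(c, c + d)    # probability of disease in the non-risk group

risk_ratio = p_risk / p_nonrisk

# (i) odds for the disease, risk group versus non-risk group
or_disease = odds(p_risk) / odds(p_nonrisk)
# (ii) odds for the risk factor, disease group versus non-disease group
or_exposure = odds(Fraction(a, a + c)) / odds(Fraction(b, b + d))
# (iii) risk ratio for the disease over the risk ratio for not having the disease
or_from_rr = risk_ratio / ((1 - p_risk) / (1 - p_nonrisk))

print(risk_ratio, or_disease, or_exposure, or_from_rr)
```

All three reduce to the cross-product ratio ad/bc of the table.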

When approximating a discrete distribution (defined on integer values) by a continuous distribution, the sum of probabilities up to and including the probability at a point x is usually approximated by the area under the continuous distribution up to x + 0.5; the 0.5 is called the continuity correction.

A population consists of the totality of the observations with which we are concerned.

A sample is a subset of observations selected from the population.

A statistic is any function of the observations in a sample. It may not include any unknown parameters.

The distribution of a statistic is called a sampling distribution. It describes how the statistic will vary from one sample to another.

A pivotal quantity is a function of a statistic and the parameter of interest that follows a standard distribution. The distribution may not include any unknown parameters.

A confidence interval is a random interval which includes the true value of the parameter of interest with probability 1-α. When we have computed a confidence interval from data, it is fixed by the data and no longer random, so we say that we are 100(1-α)% confident that it includes the true value of the parameter.
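The "random interval" reading can be illustrated by simulation. The sketch below assumes normal data with known standard deviation and uses the usual z-interval for the mean; the parameter values, the seed and the function names are arbitrary choices made for the example, and `alpha` stands for the error level in the definition above.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, alpha):
    """Two-sided interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def coverage(mu, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, alpha)
        hits += low <= mu <= high
    return hits / trials


observed = coverage(mu=10.0, sigma=2.0, n=25, alpha=0.05, trials=4000)
print(observed)  # should be close to 0.95
```

Each simulated interval either contains the true mean or does not; the stated probability describes the long-run proportion over repeated samples.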

In parametric statistical inference, an hypothesis is a statement about the parameters of a probability distribution.

A null hypothesis states that there is no difference between the hypothesized value of a parameter and its true value.

The alternative hypothesis is an hypothesis that applies when the null hypothesis is false.

A simple hypothesis specifies a single value for a parameter; a composite hypothesis specifies more than one value, or a range of values.

Any statistic which follows a different distribution under the null hypothesis than it does under the alternative hypothesis can be used as a test statistic. A test statistic can be derived from a pivotal quantity by replacing the unknown parameter by its hypothesized value.

The distribution of a test statistic when the null hypothesis is true is called the reference distribution for the test.

Rejecting the null hypothesis when it is true is defined as a type I error. Failing to reject the null hypothesis when it is false is defined as a type II error.

In an accept-reject test of hypothesis, the conditional probability of committing a type I error, given that the null hypothesis is true, is called the level of significance of the test.

In an accept-reject test of hypothesis, the conditional probability of rejecting the null hypothesis, given that the alternative hypothesis is true, is called the power of the test. The type II error rate is computed as (1-power).

There are three definitions of P-value. Satisfy yourself that all three mean exactly the same thing.

(1) P-value is the smallest level of significance that will lead to rejection of the null hypothesis with the given data.

(2) P-value is the largest level of significance that will lead to acceptance of the null hypothesis with the given data.

(3) P-value is the probability of getting a value of the test statistic as extreme as, or more extreme than, the value observed, if the null hypothesis were true. The alternative hypothesis determines the direction of "extreme".
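As a sketch of definition (3), the example below computes the P-value for a z-test of a normal mean with known standard deviation. The data summary is invented, the function names and the `alternative` labels are hypothetical, and the `rejects` helper shows how the P-value connects to definitions (1) and (2).

```python
from statistics import NormalDist


def z_test_p_value(xbar, mu0, sigma, n, alternative="two-sided"):
    """P-value for H0: mean = mu0, using the standard normal reference distribution."""
    z = (xbar - mu0) / (sigma / n ** 0.5)  # test statistic
    std = NormalDist()
    if alternative == "greater":
        return 1 - std.cdf(z)          # extreme means large z
    if alternative == "less":
        return std.cdf(z)              # extreme means small z
    return 2 * (1 - std.cdf(abs(z)))   # extreme means large |z|


def rejects(p_value, alpha):
    """Accept-reject decision at level alpha: reject when the P-value is at most alpha."""
    return p_value <= alpha


# Hypothetical summary: sample mean 10.5 from n = 64, sigma = 2, testing mu0 = 10.
p_two = z_test_p_value(10.5, 10.0, 2.0, 64)
p_upper = z_test_p_value(10.5, 10.0, 2.0, 64, alternative="greater")

print(p_two, p_upper)
print(rejects(p_two, 0.05), rejects(p_two, 0.01))
```

Any level of significance at or above the P-value leads to rejection and any level below it does not, which is why the P-value is the boundary described in (1) and (2).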

Frequentist inference is usually based on a P-value, which is the probability of getting what was observed or a result more extreme than that, given that the hypothesis is true. The weakness in the logic is that we are computing probabilities of events that didn't happen, assuming that the hypothesis is true, which it may not be. The alternative hypothesis is used only to determine what events are more extreme than the observed event and need not be completely specified.

Bayesian inference uses Bayes' formula to compute the conditional posterior probability that the hypothesis is true, given what was observed. The calculation involves the unconditional prior probability that the hypothesis is true, as well as the conditional probabilities of getting what was observed given the hypothesis and given the alternative. The logic of Bayesian inference is sound, but there are practical difficulties in its application. In particular, the prior probability that the hypothesis is true may have to be determined subjectively, the alternative hypothesis must be completely specified, and the computations are very difficult in all but the simplest applications.
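For the simplest case of a fully specified hypothesis against a fully specified alternative, Bayes' formula can be sketched in a few lines. The prior and the two conditional probabilities below are invented numbers, and `posterior` is a hypothetical function name.

```python
def posterior(prior, p_data_given_h, p_data_given_alt):
    """Posterior probability that the hypothesis is true, given the observed data."""
    joint_h = prior * p_data_given_h             # hypothesis true and data observed
    joint_alt = (1 - prior) * p_data_given_alt   # alternative true and data observed
    return joint_h / (joint_h + joint_alt)


# Hypothetical inputs: even prior, data ten times more likely under the alternative.
post = posterior(prior=0.5, p_data_given_h=0.02, p_data_given_alt=0.2)
print(post)
```

Changing `prior` changes the answer, which is the subjective element mentioned above.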

A statistical method is said to be robust if it does what it is supposed to do even when the assumptions on which it is based are not satisfied. (For example, the z-test for a normal mean when the variance is known is robust against non-normality, but not against dependent data or an incorrectly specified variance.)

In the simple linear regression model, the conditional mean of the dependent variable (also called the Y-variable or response variable or predicted variable) is a linear function of a single independent variable (also called the X-variable or explanatory variable or predictor variable).

The term regression comes from breeding experiments. If inheritance were perfect, plotting a characteristic of an offspring against the same characteristic in the parent would give points along the diagonal. In reality, offspring tend to "regress" towards the population mean, so that offspring of superior parents tend to be less superior than their parents and offspring of inferior parents tend to be less inferior than their parents, hence the points will lie along a line with slope less than 1. This was called the "regression line" and fitted by least squares. Now, any model fitting with least squares is called "regression".

In the multiple linear regression model, the conditional mean of the dependent variable is a linear function of more than one independent variable.

More generally, in a regression model, the conditional mean of the dependent variable is a function of one or more independent variables.

A categorical independent variable is called a factor. The categories are called the levels of the factor.

Replications are experimental observations made under the same conditions, that is, under the same combination of factor levels.

An experimental design is said to be balanced if each combination of factor levels is replicated the same number of times.

In analysis of variance, or ANOVA, the sum of squared deviations of the dependent variable about its mean is broken down into a sum of terms, each term a sum of squared deviations representing the variation attributable to an explanatory variable, and the residual, or unexplained, variation.

The degrees of freedom of a sum of squared deviations is the number of squares in the sum, minus the number of fitted parameters in the expected values about which the deviations are computed.

A mean square is a sum of squares divided by its degrees of freedom.

Residual variance is the mean squared deviation about a model. The residual mean square is an estimate of residual variance.

The variance of observations made under identical conditions is called pure error.
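The breakdown of the total sum of squares, with its degrees of freedom and mean squares, can be sketched for a balanced one-factor design. The data are invented; with a single factor, the within-level variation is both the residual and the pure error.

```python
# Hypothetical balanced design: one factor with three levels, three replications each.
groups = {"a": [4.0, 5.0, 6.0], "b": [7.0, 8.0, 9.0], "c": [1.0, 2.0, 3.0]}


def mean(values):
    return sum(values) / len(values)


all_obs = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_obs)

# Total variation about the overall mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
# Variation attributable to the factor: level means about the overall mean.
ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
# Residual variation: observations about their own level mean.
ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

# Degrees of freedom: number of squares minus number of fitted parameters.
df_total = len(all_obs) - 1
df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within   # residual mean square
f_ratio = ms_between / ms_within

print(ss_total, ss_between, ss_within)
print(df_total, df_between, df_within, f_ratio)
```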

A contingency table gives the number of subjects in each category of a cross-classification formed by two or more factors. The word "contingency" comes from the Latin word for "touching"; the table shows how the different factors touch each other. If there are just two factors, the first factor being "rows" with R levels and the second factor being "columns" with C levels, the table is a rectangular array called an R by C contingency table. The test for independence in an R by C contingency table is based on the weighted difference between the observed counts and the counts that would be expected in a table having the same marginal totals but with independent row and column factors.
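The expected counts under independence follow from the marginal totals alone. The sketch below uses an invented 2 by 2 table and assumes the weighted difference is Pearson's chi-square statistic, the sum of (observed - expected)^2 / expected, which is the usual choice but is not named in the text.

```python
# Hypothetical 2 by 2 table of observed counts (rows by columns).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if rows and columns were independent,
# keeping the same marginal totals.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's chi-square: squared differences weighted by 1 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for an R by C table: (R - 1)(C - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected, chi_square, df)
```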

## RELATIONS BETWEEN SOME DISTRIBUTIONS...

The Bin(n, p) distribution can be approximated by the Pois(np) distribution when n is large and p is small.
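How close the approximation is can be checked directly from the two probability functions. In this sketch the values of n and p are arbitrary illustrations, and `max_gap` is a hypothetical helper giving the largest pointwise difference over the first few counts.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def pois_pmf(k, m):
    """P(X = k) for X ~ Pois(m)."""
    return math.exp(-m) * m ** k / math.factorial(k)


def max_gap(n, p, k_max=20):
    """Largest |Bin(n, p) - Pois(np)| probability difference for k = 0..min(n, k_max)."""
    return max(
        abs(binom_pmf(k, n, p) - pois_pmf(k, n * p))
        for k in range(min(n, k_max) + 1)
    )


# Same mean np = 3 in both cases; only the first has large n and small p.
gap_large_n = max_gap(1000, 0.003)
gap_small_n = max_gap(10, 0.3)
print(gap_large_n, gap_small_n)
```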

The Bin(n, p) distribution can be approximated by the N(np, np(1-p)) distribution when p < 0.5 and np > 5, or when q < 0.5 and nq > 5, where q = 1-p. The continuity correction is recommended.
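The effect of the continuity correction can be seen by comparing an exact binomial probability with the normal approximation evaluated at x and at x + 0.5. The values n = 30, p = 0.4 and x = 10 are arbitrary illustrations chosen to satisfy the rule of thumb above.

```python
import math
from statistics import NormalDist


def binom_cdf(x, n, p):
    """Exact P(X <= x) for X ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))


n, p, x = 30, 0.4, 10   # p < 0.5 and np = 12 > 5

exact = binom_cdf(x, n, p)

# Normal distribution with the same mean and variance as the binomial.
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
without_correction = approx.cdf(x)
with_correction = approx.cdf(x + 0.5)

print(exact, without_correction, with_correction)
```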

The Pois(m) distribution can be approximated by the N(m, m) distribution when m > 5. The continuity correction is recommended.

A process in time (or space) where events happen one at a time, at random, independently of each other, at a constant average rate λ, is called a Poisson process. The number of events in a fixed time interval of length t follows a Poisson distribution with mean λt. The time between events, or from an arbitrary time to the next event, follows an exponential distribution with mean 1/λ.
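The link between exponential waiting times and Poisson counts can be illustrated by simulation. This sketch builds each realisation from exponential gaps and counts the events that fall in an interval of length t; the rate, the interval length, the number of runs and the seed are arbitrary, and `count_events` is a hypothetical helper.

```python
import random


def count_events(rate, t, rng):
    """Number of events in [0, t] when gaps between events are exponential with mean 1/rate."""
    clock = rng.expovariate(rate)   # time of the first event
    count = 0
    while clock <= t:
        count += 1
        clock += rng.expovariate(rate)   # wait for the next event
    return count


rng = random.Random(1)
rate, t, runs = 2.0, 5.0, 5000

counts = [count_events(rate, t, rng) for _ in range(runs)]
mean_count = sum(counts) / runs

print(mean_count)  # should be close to rate * t = 10
```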

The relations between the Normal, Chi-square, t and F distributions can be illustrated with the following identities, which you should verify in the tables:

$$z_p = t_{\infty,\,p}$$

$$(z_{1-p/2})^2 = (t_{\infty,\,1-p/2})^2 = \chi^2_{1,\,1-p} = F_{1,\,\infty,\,1-p} = 1/F_{\infty,\,1,\,p}$$

$$(t_{d,\,1-p/2})^2 = F_{1,\,d,\,1-p} = 1/F_{d,\,1,\,p}$$

$$\chi^2_{d,\,1-p}/d = F_{d,\,\infty,\,1-p} = 1/F_{\infty,\,d,\,p}$$