An experiment that can result in different outcomes, even though it is repeated in the same manner every time, is called a **random experiment**.

The set of all possible outcomes of a random experiment is called the **sample space** of the experiment.

An **event** is a subset of the sample space of a random experiment.

A sample space is **discrete** if it consists of a finite (or countably infinite) set of outcomes.

**Probability** is a measure of certainty on a scale of 0 to 1. The probability of an impossible event is 0; the probability of an inevitable event is 1. If A and B are events, then P(A+B) = P(A) + P(B) - P(A.B), where A+B denotes set union and A.B denotes set intersection. Any one of the following three definitions of probability can be used to assign a probability to an event that is neither impossible nor inevitable.

The **relative frequency definition of probability** applies when the sample space consists of elementary outcomes which, through physical symmetry, are recognized as being equally likely. The probability of an event E is the number of elementary outcomes in E divided by the number of elementary outcomes in the sample space.

The **limiting frequency definition of probability** applies when you can envisage a sequence of independent trials. Consider the number of trials that result in an event E, divided by the total number of trials. The probability of E is the hypothetical limit to which any such series of trials will tend.

The **subjective definition of probability** defines your personal probability of an event E as the maximum amount of money you are willing to bet in order to win $1 if E occurs.
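As an illustration of the equally-likely-outcomes definition and of the addition rule P(A+B) = P(A) + P(B) - P(A.B), here is a minimal Python sketch. The two-dice experiment and the events `A` and `B` are assumptions chosen for the example; they do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 ordered outcomes of rolling two fair dice
# (an assumed experiment, chosen because the outcomes are equally likely).
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Number of outcomes in the event divided by the number in the sample space."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] == 6}      # first die shows 6
B = {o for o in sample_space if sum(o) >= 10}   # total is at least 10

lhs = prob(A | B)                          # P(A+B), the union
rhs = prob(A) + prob(B) - prob(A & B)      # P(A) + P(B) - P(A.B)
print(lhs, rhs)  # 1/4 1/4
```

Exact fractions are used so that the two sides can be compared without rounding error.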

The **odds** for an event A are computed as the probability that A will happen divided by the probability that A will not happen.

A **random variable** is a function that assigns a real number to each outcome in the sample space of a random experiment.

A **discrete random variable** is a random variable with a finite (or countably infinite) range.

A **continuous random variable** is a random variable with an interval (either finite or infinite) of real numbers for its range.

The **probability density function** for a random variable X is a non-negative function f(x) which gives the relative probability of each point x in the sample space of X. It integrates to 1 over the whole sample space. The integral over any subset of the sample space gives the probability that X will fall in that subset.

A **parameter** is a scalar or vector that indexes a family of probability distributions.

The **expected value** or **mean** or **average** of a random variable is computed as a sum (or integral) over all possible values of the random variable, each weighted by the probability of getting that value. It can be interpreted as the centre of mass of the probability distribution.

The **variance** of a random variable is the expected squared deviation from the mean.

The **covariance** of two random variables is the expected product of their deviations from their respective means.

The **correlation coefficient** is a dimensionless measure of association between two random variables. Pearson's correlation coefficient is computed as their covariance divided by the product of their standard deviations.
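The definitions of mean, variance, covariance and Pearson's correlation can be sketched for a small discrete joint distribution. The distribution `joint` below is hypothetical, invented only to make the arithmetic easy to follow.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint distribution of (X, Y): maps (x, y) to its probability.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (1, 2): Fraction(1, 4),
}

def expect(g):
    """Expected value of g(X, Y): each value weighted by its probability."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)     # expected squared deviation
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Pearson's correlation: covariance over the product of standard deviations.
corr = float(cov_xy) / sqrt(float(var_x) * float(var_y))
print(mean_x, mean_y, var_x, var_y, cov_xy)  # 1/2 1 1/4 1/2 1/4
```

Here the correlation works out to 1/sqrt(2), roughly 0.71, which is dimensionless as the definition requires.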

The **sensitivity** of a screening test is the conditional probability that the subject has the symptom, given that they have the disease.

The **specificity** of a screening test is the conditional probability that the subject does not have the symptom, given that they do not have the disease.

The **predictive value positive** PV^{+} of a screening test is the conditional probability that the subject has the disease, given that they have tested positive.

The **predictive value negative** PV^{-} of a screening test is the conditional probability that the subject does not have the disease, given that they have tested negative.
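The predictive values depend on how common the disease is, as well as on sensitivity and specificity. The sketch below applies Bayes' formula to obtain them; the function name `predictive_values` and the numerical inputs are assumptions for illustration, and the prevalence (defined further on in this glossary) is taken as known.

```python
from fractions import Fraction

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PV+, PV-) from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence               # diseased and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    true_neg = specificity * (1 - prevalence)         # healthy and test negative
    false_neg = (1 - sensitivity) * prevalence        # diseased but test negative
    pv_pos = true_pos / (true_pos + false_pos)
    pv_neg = true_neg / (true_neg + false_neg)
    return pv_pos, pv_neg

# Hypothetical test: 90% sensitive, 95% specific, disease prevalence 1%.
pv_pos, pv_neg = predictive_values(Fraction(9, 10), Fraction(95, 100), Fraction(1, 100))
print(pv_pos, pv_neg)  # 2/13 1881/1883
```

With these assumed inputs only about 15% of positives actually have the disease, even though the test itself looks accurate, because the disease is rare.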

The **receiver operating characteristic** (ROC) curve for a screening test is a plot of sensitivity on the ordinate against (1-specificity) on the abscissa, where the different points on the curve correspond to different cut-off points used to designate test positive. The area under the ROC curve is the probability that, given two subjects, one with the disease and the other without the disease, the one with the disease will have the higher score on the test.
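The two descriptions of the ROC curve above can be checked against each other numerically. The following sketch uses hypothetical test scores; counting a tie as half a win is an added convention, not something stated in the text.

```python
# Hypothetical test scores (invented for illustration).
diseased = [0.9, 0.8, 0.6, 0.4]
healthy = [0.7, 0.5, 0.3, 0.2, 0.1]

def auc_pairwise(pos, neg):
    """Proportion of (diseased, healthy) pairs in which the diseased subject
    scores higher; a tie counts as half (an assumed convention)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_points(pos, neg):
    """(1 - specificity, sensitivity) for each cut-off, calling a score positive if >= cut-off."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(pos + neg), reverse=True):
        sensitivity = sum(s >= cutoff for s in pos) / len(pos)
        false_pos_rate = sum(s >= cutoff for s in neg) / len(neg)
        points.append((false_pos_rate, sensitivity))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

area_pairs = auc_pairwise(diseased, healthy)
area_curve = auc_trapezoid(roc_points(diseased, healthy))
print(area_pairs, area_curve)
```

For these scores 17 of the 20 pairs are ordered correctly, so both calculations give an area of about 0.85.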

The **prevalence** of a disease is the proportion of people in the study population who currently have the disease.

The **(cumulative) incidence** of a disease is the probability that a person who has never had the disease will develop a new case of the disease over a given period of time.

**Risk ratio** is the probability of the disease in the risk group divided by the probability of the disease in the non-risk group.

**Odds ratio** is the odds for the disease in the risk group divided by the odds for the disease in the non-risk group, or the odds for the risk factor in the disease group divided by the odds for the risk factor in the non-disease group, or the risk ratio for the disease divided by the risk ratio for not having the disease.
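The three descriptions of the odds ratio can be verified to agree on a 2 by 2 table. The counts below are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical counts (invented for illustration):
#                  disease   no disease
# risk group        a = 30      b = 70
# non-risk group    c = 10      d = 90
a, b, c, d = 30, 70, 10, 90

def odds(p):
    """Probability that it happens divided by probability that it does not."""
    return p / (1 - p)

p_risk = Fraction(a, a + b)       # P(disease | risk group)
p_nonrisk = Fraction(c, c + d)    # P(disease | non-risk group)

risk_ratio = p_risk / p_nonrisk

# (i) odds for the disease, risk group versus non-risk group
or_disease = odds(p_risk) / odds(p_nonrisk)
# (ii) odds for the risk factor, disease group versus non-disease group
or_exposure = odds(Fraction(a, a + c)) / odds(Fraction(b, b + d))
# (iii) risk ratio for the disease over risk ratio for not having the disease
or_from_rr = risk_ratio / ((1 - p_risk) / (1 - p_nonrisk))

print(risk_ratio, or_disease, or_exposure, or_from_rr)  # 3 27/7 27/7 27/7
```

All three routes reduce to the cross-product ratio ad/bc, which is why the odds ratio can be estimated even when subjects are sampled by disease status rather than by risk group.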

When approximating a discrete distribution (defined on integer values) by a continuous distribution, the sum of probabilities up to and including the probability at a point *x* is usually approximated by the area under the continuous distribution up to *x* + 0.5; the 0.5 is called the **continuity correction**.

A **population** consists of the totality of the observations with which we are concerned.

A **sample** is a subset of observations selected from the population.

A **statistic** is any function of the observations in a sample. It must not depend on any unknown parameters.

The distribution of a statistic is called a **sampling distribution**. It describes how the statistic will vary from one sample to another.

A **pivotal quantity** is a function of a statistic and the parameter of interest that follows a standard distribution. The distribution must not depend on any unknown parameters.

A **confidence interval** is a random interval which includes the true value of the parameter of interest with probability 1-α. When we have computed a confidence interval from data, it is fixed by the data and no longer random, so we say that we are 100(1-α)% *confident* that it includes the true value of the parameter.
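As a concrete case, a confidence interval for a normal mean with known standard deviation follows from the pivotal quantity (sample mean - mean) / (sigma / sqrt(n)), which is standard normal. The sketch below assumes that setting; the function name `z_confidence_interval`, the data and the value of `sigma` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(data, sigma, alpha=0.05):
    """100(1 - alpha)% interval for a normal mean, assuming sigma is known."""
    n = len(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    half_width = z * sigma / sqrt(n)
    centre = mean(data)
    return centre - half_width, centre + half_width

# Hypothetical measurements with an assumed known standard deviation of 0.3.
sample = [9.8, 10.4, 10.1, 9.6, 10.3, 10.0, 9.9, 10.2, 10.5]
low, high = z_confidence_interval(sample, sigma=0.3)
print(round(low, 3), round(high, 3))
```

The interval is centred on the sample mean, and its half-width here is about 1.96 × 0.3 / 3 ≈ 0.196. A different sample would give a different interval, which is the sense in which the interval is random before the data are seen.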

In parametric statistical inference, an **hypothesis** is a statement about the parameters of a probability distribution.

A **null hypothesis** states that there is no difference between the hypothesized value of a parameter and its true value.

The **alternative hypothesis** is an hypothesis that applies when the null hypothesis is false.

A **simple hypothesis** specifies a single value for a parameter; a **composite hypothesis** specifies more than one value, or a range of values.

Any statistic which follows a different distribution under the null hypothesis than it does under the alternative hypothesis can be used as a **test statistic**. A test statistic can be derived from a pivotal quantity by replacing the unknown parameter by its hypothesized value.

The distribution of a test statistic when the null hypothesis is true is called the **reference distribution** for the test.

Rejecting the null hypothesis when it is true is defined as a **type I error**. Failing to reject the null hypothesis when it is false is defined as a **type II error**.

In an accept-reject test of hypothesis, the conditional probability of committing a type I error, given that the null hypothesis is true, is called the **level of significance** of the test.

In an accept-reject test of hypothesis, the conditional probability of rejecting the null hypothesis, given that the alternative hypothesis is true, is called the **power** of the test. **The type II error rate** is computed as (1-power).

There are three definitions of **P-value**. Satisfy yourself that all three mean exactly the same thing.

(1) **P-value** is the *smallest* level of significance that will lead to *rejection* of the null hypothesis with the given data.

(2) **P-value** is the *largest* level of significance that will lead to *acceptance* of the null hypothesis with the given data.

(3) **P-value** is the probability of getting a value of the test statistic as extreme as, or more extreme than, the value observed, *if the null hypothesis were true*. The alternative hypothesis determines the direction of "extreme".
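The third definition translates directly into a calculation once a reference distribution is known. The sketch below assumes a z-test for a normal mean with known standard deviation; the function name `z_test_p_value`, its `alternative` argument and the numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """P-value for H0: mean = mu0, assuming normal data with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # test statistic
    reference = NormalDist()                      # reference distribution under H0
    if alternative == "greater":                  # extreme means large z
        return 1 - reference.cdf(z)
    if alternative == "less":                     # extreme means small z
        return reference.cdf(z)
    if alternative == "two-sided":                # extreme means large |z|
        return 2 * (1 - reference.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# Hypothetical data summary: mean 10.2 from n = 9 observations, sigma = 0.3.
p_two = z_test_p_value(10.2, 10.0, 0.3, 9)
p_upper = z_test_p_value(10.2, 10.0, 0.3, 9, alternative="greater")
print(round(p_two, 4), round(p_upper, 4))
```

Here the test statistic is about 2, so the two-sided P-value is roughly 0.046: a test at the 5% level rejects and a test at the 1% level does not, in line with the first two definitions.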

**Frequentist inference** is usually based on a **P-value**, which is the probability of getting what was observed or a result more extreme than that, given that the hypothesis is true. The weakness in the logic is that we are computing probabilities of events that didn't happen, assuming that the hypothesis is true, which it may not be. The alternative hypothesis is used only to determine what events are more extreme than the observed event and need not be completely specified.

**Bayesian inference** uses Bayes' formula to compute the conditional **posterior probability** that the hypothesis is true, given what was observed. The calculation involves the unconditional **prior probability** that the hypothesis is true, as well as the conditional probabilities of getting what was observed given the hypothesis and given the alternative. The logic of Bayesian inference is sound, but there are practical difficulties in its application. In particular, the prior probability that the hypothesis is true may have to be determined subjectively, the alternative hypothesis must be completely specified, and the computations are very difficult in all but the simplest applications.
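For a hypothesis with a single, completely specified alternative, Bayes' formula reduces to a one-line calculation. The sketch below assumes that simplest case; the function name `posterior_probability` and the input values are hypothetical.

```python
from fractions import Fraction

def posterior_probability(prior, likelihood_h, likelihood_alt):
    """P(hypothesis | data) by Bayes' formula, for a hypothesis versus one alternative.

    prior          -- unconditional probability that the hypothesis is true
    likelihood_h   -- probability of the observed data given the hypothesis
    likelihood_alt -- probability of the observed data given the alternative
    """
    numerator = prior * likelihood_h
    return numerator / (numerator + (1 - prior) * likelihood_alt)

# Hypothetical inputs: even prior odds, data five times likelier under the alternative.
posterior = posterior_probability(Fraction(1, 2), Fraction(2, 100), Fraction(10, 100))
print(posterior)  # 1/6
```

Changing the prior changes the answer, which is the practical difficulty noted above when the prior has to be chosen subjectively.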

A statistical method is said to be **robust** if it does what it is supposed to do even when the assumptions on which it is based are not satisfied. (For example, the z-test for a normal mean when the variance is known is robust against non-normality, but not against dependent data or an incorrectly specified variance.)

In the **simple linear regression** model, the conditional mean of the **dependent variable** (also called the **Y-variable** or **response variable** or **predicted variable**) is a linear function of a single **independent variable** (also called the **X-variable** or **explanatory variable** or **predictor variable**).

The term **regression** comes from breeding experiments. If inheritance were perfect, plotting a characteristic of an offspring against the same characteristic in the parent would give points along the diagonal. In reality, offspring tend to "regress" towards the population mean, so that offspring of superior parents tend to be less superior than their parents and offspring of inferior parents tend to be less inferior than their parents, hence the points will lie along a line with slope less than 1. This was called the "regression line" and fitted by least squares. Now, any model fitting with least squares is called "regression".

In the **multiple linear regression** model, the conditional mean of the **dependent variable** is a linear function of more than one **independent variable**.

More generally, in a **regression** model, the conditional mean of the **dependent variable** is a function of one or more **independent variables**.
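For the simple linear regression model, the least-squares fit mentioned above has a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of the independent variable. The following is a minimal sketch; the function name `least_squares_line` and the data are hypothetical.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # the fitted line passes through (x_bar, y_bar)
    return intercept, slope

# Hypothetical observations (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
intercept, slope = least_squares_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

For these data the fitted conditional mean is approximately 0.09 + 1.97 x.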

A categorical independent variable is called a **factor**. The categories are called the **levels** of the factor.

**Replications** are experimental observations made under the same conditions, that is, under the same combination of factor levels.

An experimental design is said to be **balanced** if each combination of factor levels is replicated the same number of times.

In **analysis of variance**, or **ANOVA**, the **sum of squared deviations** of the dependent variable about its mean is broken down into a sum of terms, each term a sum of squared deviations representing the variation attributable to an explanatory variable, and the **residual**, or **unexplained**, **variation**.

The **degrees of freedom** of a sum of squared deviations is the number of squares in the sum, minus the number of fitted parameters in the expected values about which the deviations are computed.

A **mean square** is a sum of squares divided by its degrees of freedom.

**Residual variance** is the mean squared deviation about a model. The **residual mean square** is an estimate of residual variance.
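The decomposition of the sum of squares, with its degrees of freedom and mean squares, can be sketched for a balanced one-factor design. The data below are hypothetical: one factor with three levels, replicated three times each.

```python
from statistics import mean

# Hypothetical balanced design: one factor, three levels, three replications each.
groups = {"A": [4.0, 5.0, 6.0], "B": [7.0, 8.0, 9.0], "C": [1.0, 2.0, 3.0]}

all_obs = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_obs)

# Total variation about the overall mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
# Variation attributable to the factor: level means about the overall mean.
ss_factor = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
# Residual variation: observations about their own level mean.
ss_residual = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

df_total = len(all_obs) - 1               # 9 squares, 1 fitted mean
df_factor = len(groups) - 1               # 3 level means, 1 fitted overall mean
df_residual = len(all_obs) - len(groups)  # 9 squares, 3 fitted level means

ms_factor = ss_factor / df_factor
ms_residual = ss_residual / df_residual   # estimate of the residual variance

print(ss_total, ss_factor, ss_residual)    # 60.0 54.0 6.0
print(ms_factor, ms_residual)              # 27.0 1.0
```

Both the sums of squares and the degrees of freedom add up: 60 = 54 + 6 and 8 = 2 + 6.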

The variance of observations made under identical conditions is called **pure error**.

A **contingency table** gives the number of subjects in each category of a cross-classification formed by two or more **factors**. The word "contingency" comes from the Latin word for "touching"; the table shows how the different factors touch each other. If there are just two factors, the first factor being "rows" with R **levels** and the second factor being "columns" with C **levels**, the table is a rectangular array called an **R by C contingency table**. The test for independence in an R by C contingency table is based on the weighted difference between the observed counts and the counts that would be expected in a table having the same marginal totals but with independent row and column factors.
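The expected counts under independence are obtained from the marginal totals alone. The sketch below computes them and then Pearson's chi-square statistic, which is one common choice of "weighted difference" (an assumption here, since the text does not name the statistic). The observed table is hypothetical.

```python
# Hypothetical 2 by 2 table of observed counts (rows by columns).
observed = [[30, 70],
            [10, 90]]

def expected_counts(table):
    """Counts expected under independence, given the same marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

statistic = pearson_chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)   # (R-1)(C-1)
print(expected_counts(observed), statistic, degrees_of_freedom)
```

Here every expected count is 20 or 80, the statistic is 12.5, and it would be referred to a chi-square distribution with 1 degree of freedom.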

The Bin(n, p) distribution can be approximated by the Pois(np) distribution when n is large and p is small.
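How close the Poisson approximation is can be checked by comparing the two probability functions directly. The values n = 1000 and p = 0.003 below are an arbitrary example of "n large and p small".

```python
from math import comb, exp, factorial

n, p = 1000, 0.003   # assumed example values: n large, p small
m = n * p            # mean of the approximating Poisson distribution

def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-m) * m**k / factorial(k)

# Largest pointwise discrepancy over the values that carry almost all the probability.
max_abs_diff = max(abs(binomial_pmf(k) - poisson_pmf(k)) for k in range(21))
for k in range(6):
    print(k, round(binomial_pmf(k), 4), round(poisson_pmf(k), 4))
```

For these values the two probability functions agree to about three decimal places at every point.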

The Bin(n, p) distribution can be approximated by the N(np, np(1-p)) distribution when p < 0.5 and np > 5, or when q < 0.5 and nq > 5, where q = 1-p. The continuity correction is recommended.
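The effect of the continuity correction can be seen by comparing an exact binomial probability with its normal approximation, with and without the added 0.5. The values n = 20, p = 0.4 and x = 8 are an assumed example that satisfies p < 0.5 and np > 5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 20, 0.4, 8   # assumed example: np = 8 > 5 and p < 0.5

# Exact P(X <= x) for X ~ Bin(n, p).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

# Approximating normal distribution N(np, np(1-p)); NormalDist takes the
# standard deviation, so the square root of the variance is passed.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
without_correction = approx.cdf(x)
with_correction = approx.cdf(x + 0.5)

print(round(exact, 4), round(without_correction, 4), round(with_correction, 4))
```

Because x equals the mean here, the uncorrected approximation is exactly 0.5, while the corrected value lies much closer to the exact probability of about 0.596.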

The Pois(m) distribution can be approximated by the N(m, m) distribution when m > 5. The continuity correction is recommended.

A process in time (or space) where events happen one at a time, at random, independently of each other, at a constant average rate λ, is called a **Poisson process**. The number of events in a fixed time interval of length t follows a **Poisson distribution** with mean λt. The time between events, or from an arbitrary time to the next event, follows an **exponential distribution** with mean 1/λ.
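The link between exponential gaps and Poisson counts can be illustrated by simulation. The sketch below builds a process from exponential gaps and counts the events in an interval of length t; the rate, interval length, number of repetitions and seed are arbitrary assumptions.

```python
import random
from statistics import mean, variance

def count_events(rate, t, rng):
    """Count events in [0, t] when gaps between events are exponential with mean 1/rate."""
    elapsed = rng.expovariate(rate)   # time of the first event
    count = 0
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(rate)
    return count

rng = random.Random(0)    # fixed seed so the run is repeatable
rate, t = 2.0, 1.0        # assumed rate (events per unit time) and interval length
counts = [count_events(rate, t, rng) for _ in range(5000)]

average_count = mean(counts)
count_variance = variance(counts)
print(average_count, count_variance)
```

Both the sample mean and the sample variance of the counts should come out close to rate × t = 2, since a Poisson distribution has variance equal to its mean.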

The relations between the Normal, Chi-square, t and F distributions can be illustrated with the following identities, which you should verify in the tables:

$$z_{p} = t_{\infty,\,p}$$

$$(z_{1-p/2})^{2} = (t_{\infty,\,1-p/2})^{2} = \chi^{2}_{1,\,1-p} = F_{1,\,\infty,\,1-p} = 1/F_{\infty,\,1,\,p}$$

$$(t_{d,\,1-p/2})^{2} = F_{1,\,d,\,1-p} = 1/F_{d,\,1,\,p}$$

$$\chi^{2}_{d,\,1-p}/d = F_{d,\,\infty,\,1-p} = 1/F_{\infty,\,d,\,p}$$