SOME DEFINITIONS...

Updated 2000-12-11

An experiment that can result in different outcomes, even though it is repeated in the same manner every time, is called a random experiment.

The set of all possible outcomes of a random experiment is called the sample space of the experiment.

An event is a subset of the sample space of a random experiment.

A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes.

A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.

A discrete random variable is a random variable with a finite (or countably infinite) range.

A continuous random variable is a random variable with an interval (either finite or infinite) of real numbers for its range.

A parameter is a scalar or vector that indexes a family of probability distributions.

The expected value or mean or average of a random variable is computed as a sum (or integral) over all possible values of the random variable, each weighted by the probability of getting that value. It can be interpreted as the centre of mass of the probability distribution.

The variance of a random variable is the expected squared deviation from the mean.
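
For a small discrete example, the following Python sketch computes the mean and variance directly from these two definitions; the values and probabilities are invented for illustration.

    # Mean and variance of a discrete random variable, straight from the definitions.
    values = [0, 1, 2, 3]
    probs = [0.1, 0.3, 0.4, 0.2]        # must sum to 1

    mean = sum(x * p for x, p in zip(values, probs))                # E[X]
    var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))   # E[(X - mean)^2]

    print(mean, var)    # 1.7 and 0.81 for these values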

When approximating a discrete distribution (defined on integer values) by a continuous distribution, the sum of probabilities up to and including the probability at a point x is usually approximated by the area under the continuous distribution up to x + 0.5; the 0.5 is called the continuity correction.
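
A minimal Python sketch of the continuity correction, assuming scipy.stats is available and using arbitrary illustrative values of n, p and x:

    # P(X <= x) for X ~ Bin(n, p), compared with the normal area up to x + 0.5.
    from math import sqrt
    from scipy.stats import binom, norm

    n, p, x = 40, 0.4, 18
    exact = binom.cdf(x, n, p)                                          # P(X <= x)
    approx = norm.cdf(x + 0.5, loc=n * p, scale=sqrt(n * p * (1 - p)))  # area up to x + 0.5
    print(exact, approx)    # the two values should be close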

A population consists of the totality of the observations with which we are concerned.

A sample is a subset of observations selected from the population.

A statistic is any function of the observations in a random sample. It must not involve any unknown parameters.

The distribution of a statistic is called a sampling distribution. It describes how the statistic will vary from one sample to another.
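
A sampling distribution can be pictured by simulation. The sketch below (numpy assumed available; the population and sample size are arbitrary choices) draws many samples from one population and records the sample mean of each.

    # Simulating the sampling distribution of the sample mean.
    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 25, 10000
    samples = rng.exponential(scale=2.0, size=(reps, n))   # 10000 samples of size 25
    means = samples.mean(axis=1)                           # one statistic per sample
    print(means.mean(), means.std())   # centre near 2.0, spread near 2.0/sqrt(25) = 0.4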

A pivotal quantity is a function of a statistic and the parameter of interest that follows a standard distribution; that distribution must not depend on any unknown parameters.

A confidence interval is a random interval which includes the true value of the parameter of interest with probability 1 − α. When we have computed a confidence interval from data, it is fixed by the data and no longer random, so we say that we are 100(1 − α)% confident that it includes the true value of the parameter.
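
As a standard example of both definitions: if X1, ..., Xn are normal with known standard deviation sigma, then Z = (Xbar − μ)/(sigma/sqrt(n)) is a pivotal quantity with a N(0, 1) distribution, and inverting it gives the interval Xbar ± z_{α/2} sigma/sqrt(n). A minimal Python sketch, with invented data and scipy assumed available:

    # 95% confidence interval for a normal mean with known sigma,
    # from the pivotal quantity Z = (xbar - mu)/(sigma/sqrt(n)).
    from math import sqrt
    from scipy.stats import norm

    data = [9.8, 10.4, 10.1, 9.9, 10.6, 10.2]   # invented observations
    sigma = 0.3                                  # assumed known
    n = len(data)
    xbar = sum(data) / n
    z = norm.ppf(0.975)                          # z_{alpha/2} for alpha = 0.05
    print(xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))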

In parametric statistical inference, an hypothesis is a statement about the parameters of a probability distribution.

A null hypothesis states that there is no difference between the hypothesized value of a parameter and its true value.

The alternative hypothesis is an hypothesis that applies when the null hypothesis is false.

A simple hypothesis specifies a single value for a parameter; a composite hypothesis specifies more than one value, or a range of values.

A test statistic can be derived from a pivotal quantity by replacing the unknown parameter by its hypothesized value.

The distribution of a test statistic when the null hypothesis is true is called the reference distribution for the test.

Rejecting the null hypothesis when it is true is defined as a type I error. Failing to reject the null hypothesis when it is false is defined as a type II error.

There are three definitions of P-value. Satisfy yourself that all three mean exactly the same thing.

(1) P-value is the smallest level of significance that will lead to rejection of the null hypothesis with the given data.

(2) P-value is the largest level of significance that will lead to acceptance of the null hypothesis with the given data.

(3) P-value is the probability of getting a value of the test statistic as extreme as, or more extreme than, the value observed, if the null hypothesis were true. The alternative hypothesis determines the direction of "extreme".
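
To illustrate definition (3) with a concrete (invented) case, scipy assumed available: testing H0: μ = 10 against the two-sided alternative for a normal mean with known sigma, the test statistic is z0 = (xbar − 10)/(sigma/sqrt(n)), its reference distribution is N(0, 1), and the two-sided P-value takes "extreme" to mean both tails.

    # Test statistic, reference distribution and two-sided P-value; values invented.
    from math import sqrt
    from scipy.stats import norm

    xbar, mu0, sigma, n = 10.18, 10.0, 0.3, 6
    z0 = (xbar - mu0) / (sigma / sqrt(n))   # test statistic
    pvalue = 2 * norm.sf(abs(z0))           # both tails of the N(0, 1) reference distribution
    print(z0, pvalue)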

A statistical method is said to be robust if it does what it is supposed to do even when the assumptions on which it is based are not satisfied. (For example, the z-test for a normal mean when the variance is known is robust against non-normality, but not against dependent data or an incorrectly specified variance.)

Simple linear regression means fitting a model where the conditional mean of the dependent variable (also called the Y-variable or response variable) is a linear function of a single independent variable (also called the X-variable or explanatory variable).
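
A minimal sketch of a least-squares fit of this model, with invented data and numpy assumed available:

    # Simple linear regression by least squares: E[Y | x] = b0 + b1 * x.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)    # fitted intercept and slope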

In analysis of variance, or ANOVA, the sum of squared deviations of the dependent variable about its mean is broken down into a sum of terms, each term a sum of squared deviations representing the variation attributable to an explanatory variable, plus the residual, or unexplained, variation.

The degrees of freedom of a sum of squared deviations is the number of squares in the sum, minus the number of fitted parameters in the expected values about which the deviations are computed.

A mean square is a sum of squares divided by its degrees of freedom.
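
A small one-way ANOVA sketch ties these three definitions together; the three groups of observations are invented and only the standard Python library is used.

    # One-way ANOVA: SS_total = SS_treatment + SS_error, with df and mean squares.
    groups = [[12.1, 11.8, 12.5, 12.0],
              [13.0, 12.7, 13.4, 12.9],
              [11.5, 11.9, 11.2, 11.6]]

    all_obs = [y for g in groups for y in g]
    grand_mean = sum(all_obs) / len(all_obs)
    ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
    ss_treat = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_error = ss_total - ss_treat

    df_treat = len(groups) - 1               # number of group means minus one
    df_error = len(all_obs) - len(groups)    # observations minus fitted group means
    ms_treat = ss_treat / df_treat           # a mean square = sum of squares / its df
    ms_error = ss_error / df_error
    print(ms_treat / ms_error)               # the F statistic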


RELATIONS BETWEEN SOME DISTRIBUTIONS...

The Bin(n, p) distribution can be approximated by the Pois(np) distribution when n is large and p is small.
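
For instance, with illustrative values and scipy assumed available:

    # Pois(np) approximation to Bin(n, p), n large and p small.
    from scipy.stats import binom, poisson

    n, p = 500, 0.01
    for k in range(4):
        print(k, binom.pmf(k, n, p), poisson.pmf(k, n * p))   # the two columns agree closely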

The Bin(n, p) distribution can be approximated by the N(np, np(1-p)) distribution when p < 0.5 and np > 5, or when q = 1 - p < 0.5 and nq > 5. The continuity correction is recommended.

The Pois(μ) distribution can be approximated by the N(μ, μ) distribution when μ > 5. The continuity correction is recommended.
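
A quick numerical check of this approximation, with illustrative values and scipy assumed available:

    # N(mu, mu) approximation to Pois(mu), with the continuity correction.
    from math import sqrt
    from scipy.stats import poisson, norm

    mu, x = 20, 24
    print(poisson.cdf(x, mu),                          # exact P(X <= x)
          norm.cdf(x + 0.5, loc=mu, scale=sqrt(mu)))   # normal area up to x + 0.5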

A process in time (or space) where events happen one at a time, at random, independently of each other, at a constant average rate λ, is called a Poisson process. The number of events in a fixed time interval of length t follows a Poisson distribution with mean λt. The time between events, or from an arbitrary time to the next event, follows an exponential distribution with mean 1/λ.
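
A simulation sketch of both facts (numpy assumed available; the rate and interval length are arbitrary): generate exponential gaps with mean 1/λ and count how many events fall in an interval of length t.

    # Simulate a Poisson process of rate lam via its exponential inter-event times.
    import numpy as np

    rng = np.random.default_rng(2)
    lam, t, reps = 3.0, 5.0, 10000
    counts = []
    for _ in range(reps):
        total, n_events = 0.0, 0
        while True:
            total += rng.exponential(scale=1.0 / lam)   # gap with mean 1/lam
            if total > t:
                break
            n_events += 1
        counts.append(n_events)
    print(np.mean(counts), lam * t)    # empirical mean count should be near lam * t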


Statistics 3N03