Statistics: Making Sense of Data
Glossary

accuracy The extent to which a measurement is close to its true value. additive two-way ANOVA model A two-way ANOVA model that assumes that no interaction is present. alternative hypothesis A hypothesis that specifies the values of the population parameter that are possible when the null hypothesis fails; can be one-sided or two-sided. analysis of variance A body of methods for determining the influences of various postulated variables or factors on the variability of a response variable, such as the influences of wheat variety and soil type on crop yield. ANOVA Stands for analysis of variance. ANOVA table The standard tabular format for presenting the results of an analysis of variance. Also called analysis of variance table. average The sum of a set of numbers divided by the number of numbers in the set; same as the mean of the set of numbers. believability The fact that a valid statistical inference can be trusted and relied on, as contrasted with the fact that an invalid statistical inference cannot be trusted. between-samples sum of squares In a one-way ANOVA, the sum of squared deviations, each such deviation being a sample mean minus the overall sample mean of all the observations, where each such squared deviation is first multiplied by its sample size before the addition of the squared deviations. between-sample variability In analysis of variance, variability of observations between different samples caused by differences among the population means of the samples. bias Systematic error of measurement in repeated measurements, as distinct from random errors, which vary unsystematically between measurements. biased estimator An estimator of a population parameter that systematically underestimates or systematically overestimates the parameter. For example, $S^2$ systematically underestimates $\sigma^2$. bimodal data set A data set with two clusters of high concentration.
binomial distribution The probability model for the number of successes x in a fixed number n of two-outcome ("success," "failure") trials that are independent and have the same success probability p; the distribution is computed as $p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$ for $x = 0, 1, \ldots, n$.
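As an illustration of the binomial entry, the following minimal Python sketch evaluates the standard binomial formula, p(x) = C(n, x) p^x (1 - p)^(n - x). The function name `binomial_pmf` and the four-toss coin example are assumptions made for illustration and do not come from the glossary.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials,
    each with success probability p (standard binomial formula)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Illustrative example: number of heads in 4 tosses of a fair coin.
probs = [binomial_pmf(x, 4, 0.5) for x in range(5)]
for x, prob in enumerate(probs):
    print(f"p({x}) = {prob:.4f}")
```

The probabilities over x = 0, ..., n sum to 1, which is a quick sanity check on any such table.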
bivariate
data Data involving two variables, such
as height and weight, or amount of smoking and a
measure of health; often graphed in a scatter plot. blocking Converting a one-way ANOVA into a two-way ANOVA by introducing a second factor into the experimental design judged likely to contribute substantially to the observed variability, such as soil type in a study of wheat yields of different varieties of wheat. Levels of the first factor are randomly assigned within homogeneous "blocks" of the second factor. Bonferroni method In hypothesis testing of more than two populations, a method to simultaneously compare pairs of population parameters (often population means) after the hypothesis of the joint equality of all the parameters has been rejected. The comparisons produced by the method satisfy a specified overall statistical confidence percentage of being simultaneously correct. bootstrap The technique of random resampling from a given sample (usually with replacement, but sometimes without replacement using an extensive replication of the given sample to resample from) and computation of the statistic of interest after each resampling. Because the given sample should be shaped approximately like the unknown population, the bootstrap-produced distribution of the statistic of interest (as given by its relative frequency histogram) should be a good estimate of the unknown theoretical distribution of the statistic of interest. Many nonparametric estimation and hypothesis testing procedures can be constructed using bootstrapping to estimate the unknown distribution of the statistic of interest. box-and-whisker
plot A graphical way of displaying data that shows the median as center
and the quartiles and extreme values as spread; also called a box plot. box model A method of generating simulated data by randomly drawing a number from a box repeatedly either with or without replacement between draws, depending on the application. The box may contain any number of real numbers, some appearing more than once; the box containing 1, 2, 3, 4, 5, 6 simulates the throwing of a fair die. box plot Same as box-and-whisker plot. categorical data Data that fall into a finite number of categories; the emphasis is on the frequency in each category (an example is the number of men and women in a college class). causation A relationship between bivariate variables X and Y that holds when it is known that varying X causes a change in the value of Y; a correlation, even when large, between X and Y does not imply causation. center of data A value that indicates the middle, or center, of a set of data. Also called measure of central tendency. Important measures of central tendency include the mean and the median. central
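limit theorem The empirical and theoretical result that the distribution of the sample mean (or sum) of a large random sample is approximately normal, whatever the shape of the population distribution.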
chi-square
density The theoretical curve used to calculate chi-square probabilities in
chi-square hypothesis testing; also called the smooth chi-square curve. chi-square statistic $X^2 = \sum (O - E)^2 / E$, where the sum is over all categories, O is the observed frequency, and E is the expected frequency under the null hypothesis. circle graph A graph for categorical data. The proportion of elements belonging to each category is represented by a proportionally sized pie-shaped sector of a circle. Sometimes called a pie chart. cluster A portion of high concentration in a data set. Can be of inferential importance, as in the example of an unusually high number of cases of bacterial meningitis in a college town in a data set from many towns, indicating an epidemic. complementary events Given an event A, the complement of A ("not A") consists of all the outcomes not in A. For example, if A is the event that two of three children are boys, then not A is the event that there are either zero, one, or three boys. The equations P(not A) = 1 - P(A) and p(not A) = 1 - p(A) are important laws regarding probabilities of complementary events. conditional
probability The probability that one event will occur given that another has
occurred. The conditional probability that event A will occur given that
event B has occurred is given by $P(A \mid B) = P(A \text{ and } B) / P(B)$, provided that $P(B) > 0$.
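A small worked example may help with the conditional probability entry. The sketch below assumes the usual definition, P(A given B) = P(A and B) / P(B), and applies it to one throw of a fair die; the events A and B and the helper names `prob` and `conditional_prob` are hypothetical choices for illustration.

```python
from fractions import Fraction

# Equally likely outcomes for one throw of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the number showing is even
B = {4, 5, 6}  # the number showing is greater than 3


def prob(event: set) -> Fraction:
    """Theoretical probability under equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional_prob(a: set, b: set) -> Fraction:
    """P(a given b) = P(a and b) / P(b); requires P(b) > 0."""
    return prob(a & b) / prob(b)


p_a_given_b = conditional_prob(A, B)
print(p_a_given_b)
```

Using exact fractions rather than floating point keeps the result directly comparable with a hand calculation.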
confidence interval An interval computed from a random sample that is used to estimate some population parameter; the "confidence" probability that the confidence interval contains the parameter must be stated with the confidence interval. confidence level The probability (typically 0.90, 0.95, or 0.99) that the statistician's confidence interval contains the true, unknown population parameter. Sometimes stated as a percentage. confounding Said of two or more possible causes of an observed statistically significant effect when statistical analysis cannot determine which is the cause. convenience sample A sample of observations that is convenient and low cost to gather, but not necessarily representative of the population. continuous distribution A probability law for a continuous variable; the probability that the variable falls within any given interval is given by the area of the interval under a specified curve called a density. Examples include the normal and chi-square distributions. control group The group of individuals or units not receiving the treatment, allowing for a comparison between the effect of a treatment and that of no treatment. correction factor A factor that multiplies the usual sampling-with-replacement standard error to adjust for the fact that randomly sampling without replacement from a finite population reduces the standard error of the sample mean from its sampling-with-replacement value. correlation
coefficient, r A measure of how close two variables of a
scatter plot are to being perfectly linearly related; computed by dividing the
covariance by the product of the two variables' standard deviations. Also called
the Pearson correlation coefficient. covariance
A measure of how closely two variables of a scatter plot are linearly related. A
closely related and more easily interpreted concept is correlation. data
Numerical information, usually about the real world. degrees
of freedom (1) For the chi-square distribution: An integer that
determines which chi-square density, and hence which row of the Appendix C
chi-square table, to use. (2) For the t distribution: An integer
that determines which t density, and hence which row of the Appendix F Student's
t table, to use. (3) For the F distribution: The integer that
divides an ANOVA sum of squares to produce a corresponding index (such as the
between-samples variability or the within-samples variability) that is
independent of the number of observations and the number of populations being
sampled from. For the within-samples sum of squares and the between-samples sum
of squares, such division allows use of the F distribution to do ANOVA in the
case of sampling from normal populations. The F distribution has a numerator and
a denominator degrees of freedom that determine which F density, and hence which
entries of the Appendix H F tables to use. density
See continuous distribution. descriptive
statistics Graphical and
numerical techniques for describing or summarizing data that capture the essence
of the data. deviation
The signed distance of a data
point from the mean of the data. discrete
probability model Assigns a
positive probability to each of a finite number of outcomes in the sample
space of a model. distribution
The probability law for a
statistic of interest; for example, the height of a person chosen at random may
follow the normal distribution. dot
plot Same as line plot. double-blind
study A study in which neither
the subjects nor the expert conducting the study knows who has received the
treatment and who has received the placebo. equally
likely outcomes A theoretical
probability model, such as that for a fair die, in which every outcome has the
same probability; also called discrete uniform distribution. See uniform
distribution (discrete). error
sum of squares Same as the within-samples
sum of squares. error
variance The mean squared
error for the best-fitting least squares regression line; also called the
residual variance. estimation
Using data to make an educated
guess about the magnitude of a population
or probability model parameter. event
A set of outcomes of interest
in a probability model; for example, the event E that the number showing on a
fair die is even is given as E = {2,4, 6}. Usually the goal is to compute or
estimate p(E). exclusive
events Events with no outcomes
in common. expected
value (experimental)
The average value of a random quantity that has been repeatedly observed in
replications of an experiment; possibly obtained through the five-step method. expected
value (theoretical)
Given a discrete random variable X with probability law p(x), the expected value, E(X), is E(X) = sum of xp(x).
Also called the theoretical mean of the random variable X. For a continuous random variable, the
summation is replaced by a calculus integration. experimental
probability The estimated
probability of an event; obtained by dividing the number of successful trials by
the total number of trials. It is often obtained by applying the five-step
method. exponential
distribution The continuous
probability distribution describing the time until the failure of a component if
the probability of a failure does not change with the component's age. F distribution
The continuous probability law of the ratio of two mean squares (such as the
between-samples mean square divided by the within-samples mean square for a
one-way ANOVA) used to carry out ANOVAs when sampling from normal populations. five-step
method Basic simulation
method of this book; the five steps are the choice of a model, the definition of
a trial, the definition of the statistic of interest associated with the trial
(often the number of successful trials), the repetition of trials, and the
calculation of the average of the statistic of interest (often an experimental
probability or probability estimate). frequency
Number of data points, usually
in an interval. frequency
interpretation of probability
The interpretation of a theoretical probability as approximately predicting the
proportion of occurrences of the event of interest (such as the proportion of
heads being approximately 1/2 for a fair coin) in a large number of trials.
frequency
table A table giving the
number of data points in a data set falling in each of a set of given intervals. gapping
Occurrence of an interval containing no data, caused by the fact that the
probability model generating the data assigns a probability of 0 to the
interval. For example, a study of tall and short men might randomly sample only
men shorter than 5'6" and men taller than 6'0". histogram
A bar graph presenting the
frequencies of occurrence of data points. Sometimes called a frequency histogram. hypergeometric
distribution A discrete
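distribution used to assess the probability of a certain number of
"defectives" in a random sample; given by $p(x) = \binom{a}{x} \binom{b}{r - x} / \binom{n}{r}$,
where x is the number of defectives in the sample, a is the number of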
defectives in the population, b is the number of nondefectives in the
population, r is the sample size, and n=a + b is the population
size. hypothesis testing Deciding which of two realities about a population, such as whether a coin is fair or biased, is true on the basis of a random sample from the population. independence
Property of two events that holds if the occurrence of one does not affect
the probability of the occurrence of the other; see also law of experimental
independence and law of theoretical independence. inferential statistics Techniques used to draw conclusions from data. interquartile range The difference between the third quartile and the first quartile of a set of data. influence
The ability of a single data
point to have a major effect on a statistical inference. A desirable property of
a robust procedure is that every data point has little influence. For example,
the sample median is not influenced by the magnitude of very large data points. interaction
In a two-way ANOVA, the
influence of the value of one variable on how the values of the other variable
affect the response variable. judgment sampling Choosing a sample according to expert knowledge rather than a random mechanism. Kendall's tau test A particular hypothesis test used to detect a (possibly nonlinear) trend in bivariate (X, Y) data. law
of experimental independence The fact that if the events A and B are
independent, then P(A and B) is approximately equal to P(A)P(B), where P denotes experimental probability.
law of theoretical independence The fact that if the events A and B are independent, then p(A and B) = p(A)p(B), where p denotes theoretical probability. least
mean error regression The method of locating the best-fitting regression
line by minimizing the mean absolute error for all sloped lines passing through
($\bar{X}$, $\bar{Y}$), the point of means.
least
squares regression The method of locating the best-fitting regression line
to a scatter plot by minimizing the sum of the squared vertical distances
between the points of the scatter plot and the line; amounts to choosing the
best slope among all lines passing through ($\bar{X}$, $\bar{Y}$), the point of means.
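The least squares entry can be sketched in a few lines of Python. The code below assumes the standard closed-form solution, in which the slope is the sum of cross-deviations divided by the sum of squared X deviations and the line passes through the point of means; the function name `least_squares_line` and the data values are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line.

    The line minimizes the sum of squared vertical distances to the
    points and always passes through the point of means.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # forces the line through the means
    return slope, intercept


# Hypothetical scatter plot data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = least_squares_line(xs, ys)
print(f"Y' = {intercept:.2f} + {slope:.2f} X")
```

Because the intercept is computed from the means, only the slope has to be chosen, which matches the entry's description of picking the best slope among lines through the point of means.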
level of significance User-specified probability of rejecting the null hypothesis when it is true; usually set at 0.05, sometimes at 0.01 or 0.1. linear relationship A relationship between two variables whose scatter plot is well fit by a straight line; the equation Y = mX + C is often used, where m is the slope and C is the vertical axis intercept value. line plot A line graph that orders the data along a real number line. Also called a dot plot. mean
(sample) The sum of a set of numbers divided by the number of numbers in the
set; same as the average, and often defined by $\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$.
mean (theoretical) The population or probability distribution mean. In the case of a discrete distribution given by p(x), it equals the sum of xp(x). Same as the theoretical expected value of a random variable sampled from the population. mean absolute error Also called mean error: the average of the vertical distances between the points of a scatter plot and the fitted regression line. mean
deviation For regression, the measure of spread (variation) that is the
average of the absolute deviations of the data points from the sample mean; also called
the mean absolute deviation. mean square An ANOVA sum of squares divided by its degrees of freedom. mean square error For regression, the average of the squared vertical distances between the points of a scatter plot and a proposed fitted regression line; equals $S_e^2$ for the best-fitting least squares line. median
(sample) The number that is
in the middle of a set of numbers when they are arranged in order. By
convention, the median is the average of the two middle numbers if the number of
numbers is even. median (population) The number in the middle of the population values. median test A hypothesis test used to assess the null hypothesis that two populations have the same centers as measured by the population medians. mode The number in a set of data that occurs the most frequently; a seldom used measure of the center of a data set. model Also probability model: a mathematical set of probability rules, a random physical mechanism, a box model, or a random-number-based simulation for producing data that are as similar as possible to actual real-world data. Monte Carlo method A simulation method for solving probability problems by repeatedly doing an experiment, such as tossing a coin repeatedly, rolling a die repeatedly, repeatedly drawing from a box model, or repeatedly choosing random digits. multiple comparisons Simultaneous comparisons of pairs of population parameters to judge which are distinct, with some measure of overall statistical confidence attached to the results. multiple
correlation coefficient, $R^2$ A measure of the degree of
fit of the best-fitting curve when the regression is nonlinear or linear with
multiple X's; reduces to the ordinary squared correlation $r^2$ when the
regression line is Y = A + BX. negative relationship A relationship between two variables in which one decreases as the other increases; in the special case of a straight line, a negative relationship means that the slope of the regression line is negative. negative slope The slope of a line on which the Y value decreases as the X value increases. nonlinear regression Regression that seeks the relationship between bivariate (X, Y) pairs when the relationship is not linear, as in the case of $Y = A + BX + CX^2$ + random error. nonparametric Said of a statistical procedure that does not require detailed assumptions about the shape of the population distribution; such detailed assumptions are usually specified by various population parameters. nonresponse bias Polling bias resulting from the fact that, even if the intended sample is selected by random sampling, the subset of those actually responding to the survey may be quite different from those not responding. normal distribution Also called the bell-shaped curve or the Gaussian curve: the most widely used continuous distribution; often used to model biological measurements and errors of measurement. Probabilities are computed using the table of Appendix E. null hypothesis The presumed model (such as that of a fair coin) in hypothesis testing. The data provide a measure of how weak or strong the evidence for or against this null hypothesis is; it is the model of step 1 if the six-step method is being used. observational study A study in which data were collected without using randomization with respect to the treatment of interest and often collected for another purpose. Such data are usually not very useful for statistical inference. observed level of significance The level of significance, or strength of evidence, against the null hypothesis as determined by the data.
For example, if the observed sample mean is large, then when sampling assuming the null hypothesis of a population mean of 0, the probability (in a new experiment) of obtaining a sample mean that is larger than this observed mean will be small. This small probability is the observed level of significance. one-sided hypothesis test A hypothesis test in which the alternative hypothesis is that the population parameter lies to one (specified) side of its null hypothesis value. one-way ANOVA Analysis of variance in which the population means vary because of the influence of a single variable or factor, as when wheat yields vary because of differing wheat varieties. outlier A data value that is extreme relative to the rest of the data; lying farther than ±3 standard deviations from the sample mean is the usual criterion for single-variable data.
parameter A number describing a characteristic of a population or a probability model, such as the theoretical mean, μ. pie chart See circle graph.
placebo A fake treatment that is administered to appear just like a treatment so that the human subject cannot tell whether he or she has received the treatment, for example, a shot of saline solution instead of a drug.
point estimate An estimate of a population parameter that is one specific number (as opposed, say, to an interval).
Poisson
distribution The discrete
probability law of the number of occurrences of a randomly occurring event in a
fixed time interval when the rate of occurrence is fixed across the interval,
separate occurrences are independent, and simultaneous occurrences are
precluded; an example is the number of phone calls arriving in a given period of
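time. The distribution is computed as $p(x) = e^{-\lambda} \lambda^x / x!$ for $x = 0, 1, 2, \ldots$, where $\lambda$ is the mean number of occurrences in the interval.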
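For the Poisson entry, a minimal Python sketch of the standard formula p(x) = e^(-λ) λ^x / x! is given below. The symbol λ (here `lam`, the mean number of occurrences per interval), the function name `poisson_pmf`, and the rate of 2 phone calls per period are assumptions chosen for illustration.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x occurrences when the mean number of
    occurrences in the interval is lam (standard Poisson formula)."""
    return exp(-lam) * lam**x / factorial(x)


# Illustrative example: phone calls arriving at a mean rate of 2 per period.
lam = 2.0
for x in range(6):
    print(f"p({x}) = {poisson_pmf(x, lam):.4f}")
```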
population The entire collection of objects or people under consideration for statistical study. Often the statistical goal is to use a random sample to make inferences about the population, such as about its center or spread; the population is modeled by a probability distribution. A population can be real and finite or conceptual and infinite.
positive
relationship A relationship between two variables in which one increases as
the other increases; in the special case of a straight line, a positive
relationship means that the slope of the regression line is positive. positive
slope The slope of a line on which the Y value increases as the X value increases. power The probability of rejecting the null hypothesis when it is false. It usually increases as the distance of the true value of the parameter of interest from its null hypothesis value increases.
precision The degree of smallness of the variation of a set of measurements; high-precision measurements display little variation. Precision ignores the vital issue of measurement bias.
probability See experimental probability; frequency interpretation of probability; theoretical probability. quartile The third quartile is roughly the value above which 1/4 of the data lie; the first quartile is roughly the value below which 1/4 of the data lie.
quota sampling Choosing a sample by guaranteeing that certain groups (like men and women) are represented by the same proportion in the sample as they have in the population; otherwise any individuals who fit the quotas can be selected.
r Symbol for the correlation coefficient.
random Not predictable, occurring by chance. For example, the outcome of tossing a fair coin is random. random digits Also random numbers: digits (most commonly 0, 1, ..., 9) that occur in equally likely and random fashion, as when produced by using a spinner having 10 equal sectors; often produced by a computer program. randomization Choosing by a totally random mechanism which units receive which treatments in an experiment or which individuals to sample from a population, instead of having experts decide, for example. randomization test In the context of hypothesis testing concerning the effects of various treatments on units, repeated random reassignment of the treatments to the obtained experimental data. The statistic of interest is recomputed in each case to obtain a null hypothesis distribution of the statistic of interest, thus allowing the determination of the significance of the original random assignment of treatments to units that produced the observed value of the statistic of interest. random sample A set of data chosen from a population in such a way that each member of the population has an equal probability of being selected. random walk The path taken along a line or in a two-dimensional rectangular grid by an object moving at random. range The measure of spread (variation) that is the difference between the largest number and the smallest number in a set of data; not a robust measure of spread. regression equation The equation of the regression line. regression line A straight line used to estimate the relationship between two variables, based on the points of a scatter plot; often determined by a least squares analysis. relative
frequency (of an event) Same as the experimental probability. relative frequency histogram A frequency histogram vertically scaled so that the sum of the areas of its rectangles is 1. relative frequency polygon A piecewise linear graph obtained by joining the midpoints of the tops of the rectangles of a relative frequency histogram. Its shape approximates the probability law producing the data. residual
The distance, or error, between an actual data point and that predicted
by the statistically inferred model; in the case of regression, the
distance between the actual Y value and the regression line Y' value; sometimes
just $Y - Y'$.
residual variance See error variance. robust A statistic computed from a set of data is robust if it is not overly influenced by the location of a single number in the data set, a desirable property! $S_e^2$ The average of the squared errors for the best-fitting least squares
regression line, given by the average of the squared vertical distances
between the scatter plot points and the best-fitting line. The subscript e denotes
error. $S_Y^2$ The sample variance of a set of data whose values are denoted by Y. sample Same as a random sample unless it is known that the sample was not randomly selected. Nonrandom samples cannot be trusted! sample space The set of all possible outcomes of a probability experiment. For example, the probability experiment of throwing a die once has the sample space S = {1, 2, 3, 4, 5, 6}. sample survey A survey of a population, usually human, made by taking a sample judged to be representative of the population. Use of a random mechanism for choosing the sample is essential. sample variance See variance (sample). sampling with replacement Random sampling in which each item drawn is recorded, then put back before another item is drawn. scatter plot A graph of two-variable (bivariate) data in which each point is located by its coordinates (X, Y). selection bias Polling bias resulting from a sampling method that makes some population members more likely to be sampled than others. For example, the Literary Digest poll mentioned in Chapter 10 favored the more affluent. significance, statistical See statistically significant. sign test A hypothesis test to assess a hypothesized value for a population median. simple random sample Same as a random sample. This expression is used to distinguish simple random sampling from more complex methods of random sampling, such as stratified random sampling. simulation Any method for generating data from a given probability model; methods used include box models, physical models such as coins or dice, and simulation based on random number generation, often on the computer. single-blind study A study in which the subjects do not know who has received the treatment and who has received the placebo. six-step method A modification of the five-step method to do hypothesis testing in which a sixth decision-making step about the truth or falsity of the step 1 null hypothesis model is included.
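The six-step method described above can be sketched as a short simulation. The Python code below is only one possible reading of the steps listed in this glossary; the fair-coin null model, the observed count of 16 heads in 20 tosses, the 0.05 cutoff, and the function name `simulated_p_value` are all assumptions made for illustration.

```python
import random


def simulated_p_value(observed_heads, n_tosses, n_trials=10_000, seed=0):
    """Estimate the probability, under a fair-coin null model, of seeing
    at least observed_heads heads in n_tosses tosses."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    extreme = 0
    # Step 1: model = fair coin (null hypothesis).
    # Step 2: one trial = n_tosses simulated tosses.
    # Step 3: statistic of interest = number of heads in the trial.
    # Step 4: repeat the trial many times.
    for _ in range(n_trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if heads >= observed_heads:
            extreme += 1
    # Step 5: experimental probability of a result this extreme.
    return extreme / n_trials


p_value = simulated_p_value(observed_heads=16, n_tosses=20)
# Step 6: decision about the null model, using an assumed 0.05 level.
reject_null = p_value < 0.05
print(p_value, reject_null)
```

The estimated probability plays the role of the observed level of significance; with more trials it settles closer to its theoretical value.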
skewed Said of an asymmetric density that is stretched out in one direction. For example, the density of an exponential distribution is skewed to the right. slope
The rate of change of the Y values with respect to the X values along a line;
that is, the change in Y per unit change in X. spread
of data The degree to which data are spread out around their center.
Measures of spread include the mean deviation, variance, standard
deviation, and interquartile range. Also called variation. standard deviation (sample) The square root of the sample variance. It is the most widely used measure of the amount of spread (variation) in a set of data. standard deviation (theoretical) The standard deviation of the population or distribution, given by the square root of the theoretical variance. standard
error of a sample mean A measure of the typical variation of the sample mean
from the population mean; given by $\sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation and n is the sample size.
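A short Python sketch shows how the standard error of a sample mean shrinks as the sample size grows. It assumes the usual formula σ / √n for sampling with replacement; the function name `standard_error_of_mean` and the numbers used are hypothetical.

```python
from math import sqrt


def standard_error_of_mean(sigma: float, n: int) -> float:
    """Standard error of the sample mean for a population standard
    deviation sigma and sample size n (sampling with replacement)."""
    return sigma / sqrt(n)


# Quadrupling the sample size halves the standard error.
se_25 = standard_error_of_mean(10.0, 25)
se_100 = standard_error_of_mean(10.0, 100)
print(se_25, se_100)
```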
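standard
error of a sample proportion A measure of the typical variation of the sample proportion from the population proportion p; given by $\sqrt{p(1 - p)/n}$.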
standard
normal density Same as standard normal distribution.
standard normal distribution A normal distribution with a mean of 0 and a variance of 1. statistic
A piece of numerical information computed from a sample, such as the sample mean.
statistical experiment The collection of data following accepted statistical randomization practices, allowing one to make a valid statistical inference. statistically significant Said of data behavior that is too unusual to be attributable to chance under the probability model presumed to be producing the data. For example, if the sample mean is so large that its value cannot be reasonably attributable to chance under the null hypothesis of a population mean of 0, the difference of the sample mean from 0 is said to be statistically significant. statistical
regularity The empirical (real-world) fact that the experimental probability
becomes closer and closer to a number, called the theoretical probability,
as the number of trials becomes large; this law is the foundation of statistical
reasoning. statistics (1) The science of gathering, describing, and drawing conclusions from data; (2) reported numerical information, such as in a newspaper. stem-and-leaf plot A graphical display of data using certain digits (such as those in the 10s place) as stems and the remaining digits (such as those in the 1s place) as leaves. It is a special kind of histogram. sum of squares The sum of the squares of observed quantities computed in an ANOVA to estimate the contributions from various sources, such as the effect of one variable in a two-way ANOVA. symmetric Said of a density that is symmetric about its theoretical mean, such as a normal density. theoretical probability The true probability of an event; what the experimental probability approaches in a very large number of trials. In the special case of an equally likely outcomes probability model, the theoretical probability is obtained by dividing the number of outcomes that produce the event of interest by the total number of possible outcomes. totally randomized design Assignment of treatments to units in a completely random manner. transformation
of data Use of $X^a$, log X, or another transformation to
produce a better fitting and simpler model for two-variable (X, Y) data. Can
transform Y, too. treatment
group The group of individuals or units in an experiment that receive the
treatment being studied. t test A special
hypothesis test about the population mean used when the population is known to
be normally distributed, the sample size is small, and the population standard
deviation is unknown. two-sample
problem Any inference problem
about two populations in which the data consist of a random sample from each population. two-sided
hypothesis test A hypothesis
test in which the alternative hypothesis is that the population parameter may
lie on either side of
its null hypothesis value. two-way
ANOVA Analysis of variance in which the population means vary because of the
influence of two variables or
factors, as when wheat yields vary because of differing wheat varieties and
differing soil types. type
I (hypothesis test) error
The error of incorrectly
rejecting a null hypothesis when it is true. type
II (hypothesis test) error The
error of incorrectly accepting a
null hypothesis when it is false. uniform
distribution A probability
model in which each number has the same chance of occurring; can be discrete or
continuous. uniform
distribution (continuous) The
continuous probability law in which all intervals of equal length within the range of possible values have the same probability.
uniform
distribution (discrete) A
distribution in which all equally spaced numbers have the same probability of
occurrence; an example is the distribution computed by $p(x) = 1/n$ for $x = 1, 2, \ldots, n$.
validity
The correctness of a statistical inference. variable
A quantity that varies, often randomly.
For example, the weight of a
randomly chosen member of a
football team is such a variable. Variables are usually represented by letters. variance
(sample) A measure of
the amount of spread (variation) in a set of data. It is the average of the squared distances of
all the data values from their mean. variance
(theoretical) The variance of
the population or distribution. In the case of a discrete distribution
given by p(x), it equals the sum of
$(x - \mu)^2 p(x)$,
where $\mu$
is the population mean. variation
See spread. visual
method of linear regression
Visually choosing the line through ($\bar{X}$, $\bar{Y}$) that appears to fit the scatter plot best.
volunteer
effect Bias that arises when
response to a survey is voluntary; it happens because the sample of
volunteers is not representative of
the population, even if the survey is sent to a representative random
sample. within-samples
sum of squares
In a one-way ANOVA, the sum over all the samples of each sample's sum of
squares (its variance × its sample size). within-sample variability In analysis of variance, the variability of observations within a sample that is caused by the population variance. without replacement Describes random drawings from a set of objects, or from a box model, in which each drawn object is not replaced before the next random drawing. with replacement Describes random drawings from a set of objects, or from a box model, in which each drawn object is replaced before the next random drawing. z score Same as z statistic. z
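statistic Also called a standardized score: $z = (X - \mu)/\sigma$, the number of standard deviations by which a value X lies above or below the mean $\mu$.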
z
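test A hypothesis test about the population mean $\mu$
(or about two population means) using either the fact that the population is
normal or the central limit theorem result that $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ is approximately standard normal when the sample size n is large.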
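To make the z test entry concrete, the following Python sketch computes a z statistic for a sample mean and a one-sided observed level of significance from the standard normal distribution. It assumes the population standard deviation is known; the function name `z_statistic` and all numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """Standardized distance of the sample mean from the null value mu0,
    measured in standard errors sigma / sqrt(n)."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Hypothetical data: n = 36 observations with sample mean 103, testing a
# null mean of 100 with an assumed known standard deviation of 15.
z = z_statistic(sample_mean=103.0, mu0=100.0, sigma=15.0, n=36)

# One-sided observed level of significance: P(Z > z) for standard normal Z.
p_one_sided = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_one_sided, 4))
```

A table lookup such as the book's normal table would serve the same purpose as the `NormalDist().cdf` call used here.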