-
- Fig. 5.1 A histogram of daily Average Wind Speed for every day in 1989; it is unimodal and
skewed to the right, with a possible high outlier
-
- Is the maximum unusually windy, or just the windiest day of the year?
Boxplots
- 5-number summary of a (quantitative) variable -> boxplot (page 81)
-
Draw a single vertical axis spanning the extent of the data; draw short horizontal lines at
the lower and upper quartiles and at the median -> form a box
-
Erect ‘fences’ around the main part of the data; we place the upper fence 1.5 IQRs above
the upper quartile and the lower fence 1.5 IQRs below the lower quartile; never include
the fences in your boxplot
-
We use the fences to grow ‘whiskers’; draw lines from the ends of the box up and down
to the most extreme data values found within the fences
-
We add the outliers by displaying any data values beyond the fences with special
symbols
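The boxplot steps above can be sketched in Python. This is a minimal sketch, not the book's procedure: the function name `boxplot_summary`, the wind speeds, and the quartile convention (`method="inclusive"`) are assumptions made for illustration, and other software may compute slightly different quartiles.

```python
import statistics


def boxplot_summary(values):
    """Return the pieces needed to draw a boxplot for a list of numbers."""
    data = sorted(values)
    # Quartile conventions differ between textbooks and software;
    # method="inclusive" is just one common choice (an assumption here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data values still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    # Anything beyond the fences is displayed individually as an outlier.
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "five_number_summary": (data[0], q1, median, q3, data[-1]),
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical wind speeds (mph), invented for illustration only.
wind_speeds = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.6, 4.0, 8.5]
summary = boxplot_summary(wind_speeds)
print(summary["whiskers"], summary["outliers"])
```

The fences are used only to decide where the whiskers stop and which points count as outliers; they are returned here for inspection but would not be drawn.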
Comparing Groups with Histograms
-
- Is it windier in the winter or the summer?
-
- Use the same scale
-
- Spring/summer and fall/winter
-
- In the colder months the shape is less strongly skewed and more spread out; wind speed is
higher, several high values
Comparing Groups with Boxplots
-
- E.g. are some months windier than others?
-
- Do some months show more variation? (spread)
-
- Group observations by month -> side by side (fig. 5.4)
-
- Easily see which groups have higher medians, which have the greater IQRs, where the central
50% of the data is located and which have the greater overall range
-
- Wind speeds tend to decrease in the summer
-
- The months in which the winds are both strongest and most variable are November through
March
-
- Many outliers -> that windy day in July certainly wouldn’t stand out in November or
December, but for July, it was remarkable
Outliers
-
- An outlier is a value that doesn’t fit with the rest of the data
-
- Boxplots provide a rule of thumb to highlight these unusual points
-
- Try to understand them in the context of the data
-
- Histogram gives a better idea of how the outlier fits in with the rest of the data
-
- Look at the gap between that case and the rest of the data (maybe error in the data)
-
- Never leave an outlier in place and proceed as if nothing were unusual
-
- Never drop an outlier from the analysis without comment just because it’s unusual
Timeplots: Order, Please!
- Comparing boxplots – when comparing groups with boxplots
o Compare the shapes – do the boxes look symmetric or skewed? Are there
differences between groups?
o Compare the medians. Which group has the higher center? Is there any pattern to
the medians?
o Compare the IQRs – which group is more spread out? Is there any pattern to how the
IQRs change?
o Using the IQRs as a background measure of variation, do the medians seem to be
different, or do they just vary much as you’d expect from the overall variation?
o Check for possible outliers – identify them if you can and discuss why they might be
unusual; of course, correct them if you find that they are errors
- Timeplot – displays data that change over time; often, successive values are connected with
lines to show trends more clearly; sometimes a smooth curve is added to the plot to help
show long-term patterns and trends
Chapter 6 The Standard Deviation as a Ruler and the Normal Model
- Women’s heptathlon in the Olympics – seven events measured in different units – how to compare the
scores?
The Standard Deviation as a Ruler
-
- Tells us how the whole collection of the values varies
-
- Fig. 6.1 Stem-and-leaf displays for both the long jump and the shot put
-
- Klüft’s 6.78-m long jump is 0.62 meter longer than the mean jump of 6.16 m -> 0.62/0.23 =
2.70 standard deviations better than the mean // Skujyté’s winning shot put is only 2.51 standard
deviations better than the mean
Standardizing with z-Scores
-
- Expressing the distance in standard deviations standardized the performances
-
- To standardize a value, we simply subtract the mean performance in that event and then
divide this difference by the standard deviation:
-
- z = (y − ȳ) / s
-
- These values are called standardized values, and are commonly denoted with the letter z (call
them z-scores)
-
- A z-score of 2 tells us that a data value is 2 standard deviations above the mean
-
- The farther a data value is from the mean, in either direction, the more unusual it is, so a
z-score of -1.3 is more extraordinary than a z-score of 1.2
-
- Klüft: 2.70+1.19=3.89
-
- Skujyté: 0.61+2.51=3.12
-
- Klüft won
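The comparison above can be reproduced with a few lines of Python. This is a small sketch using only the numbers quoted in these notes; the helper name `z_score` is an assumption, and the three z-scores that the notes give without raw data (1.19, 0.61, 2.51) are entered directly.

```python
def z_score(value, mean, sd):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / sd


# Klüft's long jump, using the mean (6.16 m) and SD (0.23 m) quoted above.
kluft_long_jump = z_score(6.78, 6.16, 0.23)

# The remaining z-scores are taken from the notes as given.
kluft_shot_put = 1.19
skujyte_long_jump = 0.61
skujyte_shot_put = 2.51

# z-scores have no units, so they can be added across events.
kluft_total = kluft_long_jump + kluft_shot_put
skujyte_total = skujyte_long_jump + skujyte_shot_put

print(round(kluft_long_jump, 2), round(kluft_total, 2), round(skujyte_total, 2))
```

Adding z-scores treats each event as equally important, which is a choice made for this comparison rather than the official heptathlon scoring.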
-
- When we standardize data to get a z-score, we do two things – first, we shift the data by
subtracting the mean; then we rescale the values by dividing by their standard deviation
Shifting Data
-
- Histogram and boxplot for the men’s weight – some of the men are heavier than the
recommended weight (74kg) -> subtracting 74 kg shifts the entire histogram down but leaves
the spread and the shape exactly the same
-
- When we shift the data by adding (or subtracting) a constant to each value, all measures of
position (center, percentiles, min, max) will increase (or decrease) by the same constant
-
- Adding (or subtracting) a constant to every data value adds (or subtracts) the same constant
to measures of position, but leaves measures of spread unchanged
Rescaling Data
-
- Suppose we want to look at the weights in pounds instead
-
- 2.2 pounds in every kilogram, we’d convert the weights by multiplying each value by 2.2 ->
changes the measurement units
-
- Shape does not change
-
- Mean also multiplied by 2.2 (like all measures of position)
-
- Spread is also 2.2 times larger
- When we multiply (or divide) all the data values by any constant, all measures of position
(such as the mean, median and percentiles) and measures of spread (such as the range, the
IQR, and the standard deviation) are multiplied (or divided) by that same constant
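A quick numerical check of the shifting and rescaling rules, as a sketch. The weights below are hypothetical values invented for illustration; only the 74 kg reference weight and the factor 2.2 come from the notes.

```python
import statistics


def summarize(values):
    """A few measures of position and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "range": max(values) - min(values),
    }


# Hypothetical weights in kilograms (not data from the book).
weights_kg = [62.0, 70.0, 74.0, 81.0, 93.0]

# Shifting: subtract the recommended weight of 74 kg from every value.
shifted = [w - 74 for w in weights_kg]

# Rescaling: convert kilograms to pounds by multiplying by 2.2.
rescaled = [w * 2.2 for w in weights_kg]

original = summarize(weights_kg)
after_shift = summarize(shifted)
after_rescale = summarize(rescaled)

print(original)
print(after_shift)    # position moves by -74, spread is unchanged
print(after_rescale)  # position and spread are both multiplied by 2.2
```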
Back to z-Scores
- When we subtract the mean of the data from every data value, we shift the mean to zero
(shifts don’t change the standard deviation)
- Each shifted value is then divided by s -> the standard deviation is divided by s as well (the SD
was s) -> the new standard deviation becomes 1
- Standardizing into z-scores does not change the shape of the distribution of a variable
- Standardizing into z-scores changes the center by making the mean 0
- Standardizing into z-scores changes the spread by making the standard deviation 1
When Is a z-Score BIG?
- How far from 0 does a z-score have to be to be interesting or unusual?
- To say more about how big we expect a z-score to be, we need to model the data’s
distribution (a model of reality, not reality itself)
- ‘Bell-shaped curves’ (normal models) -> normal models are appropriate for distributions
whose shapes are unimodal and roughly symmetric
- There is a normal model for every possible combination of mean and standard deviation:
N(μ, σ) with a mean of μ and a standard deviation of σ
- This mean and standard deviation are not numerical summaries of data -> they are parameters of the
model
- The normal model with mean 0 and standard deviation 1 is called the standard normal model
(or the standard normal distribution)
- Normality assumption
- Nearly normal condition -> the shape of the data’s distribution is unimodal and symmetric:
check this by making a histogram (or a normal probability plot, which we’ll explain later)
- It turns out that in a normal model, about 68% of the values fall within 1 standard deviation
of the mean, about 95% of the values fall within 2 standard deviations of the mean, and
about 99.7% – almost all – of the values fall within 3 standard deviations of the mean (fig. 6.6)
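The 68-95-99.7 figures can be checked against the standard normal model with Python's standard library. A minimal sketch; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

# The standard normal model N(0, 1).
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal model lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The rule of thumb rounds these values (roughly 68.3%, 95.4% and 99.7%) to easy-to-remember numbers.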
The First Three Rules for Working with Normal Models
-
Make a picture
-
Make a picture
-
Make a picture
-
- Sketch pictures to help think about normal models
-
- Make a histogram or check the Nearly Normal Condition
The worst-case scenario: Tchebycheff’s Inequality
-
- 5 standard deviations above the mean
-
- But 68-95-99.7 rule applies only to normal models
-
- In any distribution, at least 1 − 1/k² of the values must lie within ±k standard deviations of the
mean
-
- For k = 1, 1 − 1/1² = 0; if the distribution is far from Normal, it is possible that none of the
values lie within 1 standard deviation of the mean
-
- For k = 2, 1 − 1/2² = 3/4; no matter how strange the shape of the distribution, at least 75% of
the values must be within 2 standard deviations of the mean
-
- For k = 3, 1 − 1/3² = 8/9; in any distribution, at least 89% of the values lie within 3 standard
deviations of the mean
- Values beyond 3 standard deviations from the mean are uncommon, normal model or not
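Tchebycheff's bound, 1 − 1/k², is easy to tabulate and to compare with a deliberately strange data set. A sketch under stated assumptions: the function names and the two-valued data set are invented for illustration, and the population standard deviation (`statistics.pstdev`) is used for the check.

```python
import statistics


def tchebycheff_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k**2)


def observed_share_within(values, k):
    """Fraction of the data actually lying within k SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


# A far-from-Normal, hypothetical data set: ninety 1s and ten 100s.
strange = [1] * 90 + [100] * 10

for k in (1, 2, 3):
    print(k, round(tchebycheff_lower_bound(k), 3))

print(observed_share_within(strange, 2))
```

Even for this lopsided data set the observed share within 2 standard deviations stays above the guaranteed 75%.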
Finding Normal Percentiles
Finding Normal Percentiles Using Technology
From Percentiles to Scores: z in Reverse
Are You Normal? Find Out with a Normal Probability Plot
- The normal probability plot – if the distribution of the data is roughly normal, the plot is
roughly a diagonal straight line; deviations from a straight line indicate that the distribution is
not normal
How Does a Normal Probability Plot Work?
Chapter 7 Scatterplots, Association, and Correlation
-
- Figure 7.1 scatterplot of the average error in nautical miles of the predicted position of
Atlantic hurricanes, plotted against the Year in which the predictions were made
-
- Predictions have improved -> decline in the average error
-
- This timeplot is an example of a more general kind of display called a scatterplot. Scatterplots
may be the most common displays for data. By just looking at them, you can see patterns,
-
- Points in the upper left and lower right quadrants tend to weaken the positive association
-
- Points with z-scores of zero on either variable don’t vote either way, because their product zx·zy = 0 (see
also figure 7.4)
-
- To turn these products into a measure of the strength of the association, just add up the zx·zy
products for every point in the scatterplot:
∑ zx zy
This summarizes the direction and strength of the association for all the points
-
- To adjust for the fact that the size of the sum gets bigger the more data we have, we divide
the sum by n − 1 -> correlation coefficient:
r = ∑ zx zy / (n − 1)
(see also page 155/156)
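The correlation coefficient can be computed exactly as described: standardize both variables, multiply the z-scores point by point, add the products, and divide by n − 1. A minimal sketch; the function name and the five data points are hypothetical.

```python
import statistics


def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)


# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))

# Changing the units or baseline of x leaves r unchanged.
x_rescaled = [2.2 * v + 10 for v in x]
print(round(correlation(x_rescaled, y), 4))
```

Because the calculation only ever sees z-scores, shifting or rescaling either variable cannot change the result.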
Correlation Conditions
-
- Correlation measures the strength of the linear association between two quantitative
variables
-
- Before you use correlation, you must check several conditions:
o Quantitative Variables Condition – correlation applies only to quantitative variables
o Straight Enough Condition
o Outlier Condition – when you see an outlier, it is often a good idea to report the
correlation with and without the point
Correlation Properties
-
- The sign of a correlation coefficient gives the direction of the association
-
- Correlation is always between -1 and +1 – correlation can be exactly equal to -1 and +1 but
these values are unusual in real data
-
- Correlation treats x and y symmetrically – the correlation of x and y is the same as the
correlation of y with x
-
- Correlation has no units (but don’t use percentages)
-
- Correlation is not affected by changes in the center or scale of either variable – changing the
units or baseline of either variable has no effect on the correlation coefficient – correlation
depends only on the z-scores, and they are unaffected by changes in center or scale
-
- Correlation measures the strength of the linear association between the two variables
-
- Correlation is sensitive to outliers – a single outlying value can make a small correlation large
or make a large one small
Warning: Correlation ≠ Causation
-
- Figure 7.5 – the two variables are obviously related to each other but that doesn’t prove that
storks bring babies
-
- A hidden variable that stands behind a relationship and determines it by simultaneously
affecting the other two variables is called a lurking variable
-
- Scatterplots and correlation coefficients never prove causation
Correlation Tables
- The rows and columns of the table name the variables, and the cells hold the correlations
-
- But: without any checks for linearity and outliers, the correlation table risks showing truly
small correlations that have been inflated by outliers, truly large correlations that are hidden
by outliers, and correlations of any size that may be meaningless because the underlying
form is not linear
-
- Table 7.1: the diagonal cells of a correlation table always show correlations of exactly 1
*Measuring Trend: Kendall’s Tau
-
- Scales of the sort that attempt to measure attitudes numerically are called Likert scales
-
- Likert scales have order (e.g. assessing the pace of a course on a scale from 1 to 5)
-
- But the correlation coefficient might not be the appropriate measure -> use an alternative
measure of association: Kendall’s tau
-
- Kendall’s tau is a statistic designed to assess how close the relationship between two
variables is to being monotone – a monotone relationship is one that consistently increases
or decreases, but not necessarily in a linear fashion
-
- Kendall’s tau measures monotonicity directly – for each pair of points in a scatterplot, it
records only whether the slope of a line between those two points is positive, negative, or
zero
*Nonparametric Association: Spearman’s Rho
-
- Spearman’s rho can deal with the two problems of outliers and bends in the data (that make
it impossible to interpret correlation)
-
- Rho replaces the original data values with their ranks within each variable
-
- It replaces the lowest value in x by the number 1 ...
-
- The same ranking method is applied to the y-variable
-
- Spearman’s rho is the correlation of the two rank variables – it must be between -1 and 1
-
- Both (Spearman and Kendall) are examples of what are called nonparametric or distribution-
free methods
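Both rank-based measures are short enough to sketch in plain Python. The code below is an illustration under assumptions, not a reference implementation: the function names are invented, tied values receive average ranks, and the Kendall version shown is the simple "tau-a" form that makes no correction for ties. The data are hypothetical and chosen to be monotone but strongly bent.

```python
import math


def ranks(values):
    """Ranks starting at 1 for the lowest value; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Ordinary correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman's rho: the correlation of the two rank variables."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau(xs, ys):
    """Kendall's tau-a: looks only at the sign of the slope for each pair."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[j] - xs[i]) * (ys[j] - ys[i])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: always increasing, but far from linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 1000]

print(round(pearson(x, y), 3), spearman_rho(x, y), kendall_tau(x, y))
```

For these data both rank-based measures report a perfect monotone relationship, while the ordinary correlation is pulled well below 1 by the bend and the extreme value.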
Straightening Scatterplots
- Square of one variable -> more linear relationship
Chapter 8 Linear Regression
-
- Burger King: the scatterplot of the Fat (in grams) versus the Protein (in grams) for food sold
at Burger King shows a positive, moderately strong, linear relationship
-
- The correlation between Fat and Protein is 0.83 (fairly strong relationship)
-
- We can model the relationship with a line and give its equation with two parameters, the
intercept and the slope (much as a Normal model is specified by its mean μ and standard deviation σ) ->
linear model (an equation of a straight line through the data; like every model, it is wrong in the
sense that it can’t match reality exactly)
Residuals
-
- Figure page 179
-
- The line might suggest that BK Broiler chicken sandwich with 30 grams of protein should
have 36 grams of fat when, in fact, it actually has only 25 grams of fat
- We call the estimate made from a model the predicted value, and write it as ŷ (“y-hat”) to distinguish
it from the true value, y
- The difference between the observed value and its associated predicted value is called the
residual – the residual value tells us how far off the model’s prediction is at that point
- BK Broiler chicken residual: y − ŷ = 25 − 36 = −11 g of fat -> actual fat content is about 11 grams
less than the model predicts
- To find the residuals, we always subtract the predicted value from the observed one
“Best Fit” Means Least Squares
-
- Square all the residuals and add them up
-
- The sum indicates how well the line we drew fits the data – the smaller the sum, the better
the fit
-
- The line of best fit is the line for which the sum of the squared residuals is smallest, the least
squares line
The Linear Model
-
- Straight line: y = mx + b
-
- Linear model (statistics): ŷ = b0 + b1x (predicted value = intercept + slope × x)
-
- The b’s are called the coefficients of the linear model – the coefficient b1 is the slope, which
tells how rapidly ŷ changes with respect to x – the coefficient b0 is the intercept, which tells
where the line hits (intercepts) the y-axis
- Burger King: predicted Fat = 6.8 + 0.97 Protein (one more gram of protein -> 0.97 more grams of fat
on average; no protein -> 6.8 grams of fat? Not reasonable -> the intercept serves only as a starting
value for our predictions)
The Least Squares Line
-
- The correlation (tells us the strength of the linear association), the standard deviations (give
us the units), and the means (tell us where to put the line)
-
- Slope b1 = r* sy/sx
-
- Changing the units of x and y affects their standard deviations directly
-
- Units of the slope are always the units of y per unit of x
-
- Intercept: b0 = ȳ − b1·x̄
-
- Example page 182
-
- Regression almost always means “the linear model fit by least squares”
-
- To use a regression model, we should check the same conditions for regressions as we did for
correlation: the Quantitative Variables Condition, the Straight Enough Condition, and the
Outlier Condition
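The slope and intercept formulas can be tried out on a small data set. A sketch only: the function names and the five data points are hypothetical, while the Burger King coefficients (6.8 and 0.97) and the BK Broiler values (30 g protein, 25 g fat) are the ones quoted in these notes.

```python
import statistics


def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = mean(y) - b1 * mean(x)."""
    r = correlation(xs, ys)
    b1 = r * statistics.stdev(ys) / statistics.stdev(xs)
    b0 = statistics.mean(ys) - b1 * statistics.mean(xs)
    return b0, b1


# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(round(b0, 2), round(b1, 2))


def predicted_fat(protein_g):
    """Burger King model from the notes: predicted Fat = 6.8 + 0.97 Protein."""
    return 6.8 + 0.97 * protein_g


# BK Broiler: 30 g of protein, 25 g of fat observed.
broiler_residual = 25 - predicted_fat(30)
print(round(predicted_fat(30), 1), round(broiler_residual, 1))
```

The slope carries the units of y per unit of x, so the same data expressed in other units would give a different slope but the same correlation.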
Correlation and the Line
- Figure 8.3: scatterplot for the BK items of zy (standardized Fat) vs. zx (standardized Protein)
along with their least squares line
-
- Equation: ẑy = r·zx
-
- It says that in moving one standard deviation from the mean in x, we can expect to move
about r standard deviations away from the mean in y
- BK: if we standardize both protein and fat, we can write ẑFat = 0.83·zProtein
- It tells us that for every standard deviation above (or below) the mean a menu item is in
protein, we’d predict that its fat content is 0.83 standard deviations above (or below) the
mean fat content
- In general, menu items that are one standard deviation away from the mean in x are, on
average, r standard deviations away from the mean in y
How Big Can Predicted Values Get?
- Each predicted y tends to be closer to its mean (in standard deviations) than its
corresponding x was. This property of the linear model is called regression to the mean, and
that’s where we got the term regression line.
Residuals Revisited
-
- Data = Model + Residual
-
- Residual = Data - Model
-
- e = y − ŷ
-
- A scatterplot of the residuals versus the x-values should be the most boring scatterplot
you’ve ever seen – it shouldn’t have any interesting features, like a direction or shape – it
should stretch horizontally, with about the same amount of scatter throughout. It should
show no bends, and it should have no outliers.
The Residual Standard Deviation
-
- The standard deviation of the residuals, se, gives us a measure of how much the points
spread around the regression line
-
- New assumption: Equal Variance Assumption with the associated Does the Plot Thicken?
Condition – spread is about the same throughout
-
- se = √(∑e² / (n − 2))
R² – The Variation Accounted For
-
- -0.5 is doing as well as 0.5 (correlation) but different direction
-
- If we square the correlation coefficient, we’ll get a value between 0 and 1, and the direction
won’t matter
-
- The squared correlation, r², gives the fraction of the data’s variation accounted for by the
model, and 1 − r² is the fraction of the data’s variation left in the residuals
-
- BK: 31% of the variability in total Fat has been left in the residuals / 69% of the variability in
the fat content of BK sandwiches is accounted for by variation in the protein content
-
- All regression analyses include this statistic, although by tradition, it is written with a capital
letter, R², and pronounced “R-squared”
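Residuals, the residual standard deviation, and R² can all be computed from one small fit. A sketch under assumptions: the data and function name are hypothetical, and the residual standard deviation uses the usual n − 2 divisor.

```python
import math
import statistics


def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# Residual = Data - Model
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

# Residual standard deviation, dividing by n - 2.
sum_squared_residuals = sum(e**2 for e in residuals)
s_e = math.sqrt(sum_squared_residuals / (len(x) - 2))

# R-squared: fraction of the variation in y accounted for by the model.
mean_y = statistics.mean(y)
total_variation = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sum_squared_residuals / total_variation

print([round(e, 2) for e in residuals], round(s_e, 3), round(r_squared, 2))
```

Here 60% of the variation in y is accounted for by the line and 40% is left in the residuals, which matches squaring the correlation of about 0.77 for these data.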
How Big Should R² Be?
-
- R² depends on the kind of data you are analyzing
-
- Data from scientific experiments often have high percentages
-
- Data from observational studies and surveys often show weak associations -> 50%-30% can
provide evidence of a useful regression
A Tale of Two Regressions
-
- Solving our equation for Protein to get a model for predicting Protein from Fat does not work
-
- Protein = 0.55+0.709Fat
Regression Assumptions and Conditions
-
- Reasonable?
-
- Check Quantitative Variables Condition to be sure a regression is appropriate
-
- Linear model
o Linearity Assumption
o Straight Enough Condition
o Does the Plot Thicken? Condition
o Outlier Condition
-
- For the standard deviation of the residuals to summarize the scatter, all the residuals should
share the same spread
Reality Check: Is the Regression Reasonable?
-
- Direction right?
-
- Size reasonable?
Chapter 18 Sampling Distribution Models
The Central Limit Theorem for Sample Proportions
-
- True proportion: p = 0.45 (45% of all American adults believe in ghosts) (fig. 18.1)
-
- 2000 simulated independent samples of 808 adults (p=0.45); we don’t get the same
proportion for each sample we draw
-
- p = parameter of the model (the probability of a success)
-
- ^p for the observed proportion in a sample
-
- q for the probability of a failure (q = 1 − p) and ^q for its observed value
-
- P = general probability
-
- The histogram (Fig.18.1) is a simulation of what we’d get if we could see all the proportions
from all possible samples; that distribution has a special name; it is called the sampling
distribution of the proportions
-
- A sampling distribution model for how a sample proportion varies from sample to sample
allows us to quantify that variation and to talk about how likely it is that we’d observe a
sample proportion in any particular interval
-
- To use a normal model, we need to specify two parameters: its mean and standard
deviation; the center of the histogram is at p, so we’ll put μ, the mean of the Normal, at p
- Standard deviation of the proportion of successes, ^p -> ^p is the number of successes
divided by the number of trials, n, so the standard deviation of the number of successes, √(npq),
is also divided by n:
SD(^p) = √(npq) / n = √(pq / n)
- Averaging 3 or 4 dice -> Law of Large Numbers: as the sample size (number of dice) gets larger,
each sample average is more likely to be closer to the population mean, and the distribution of
the averages is becoming bell-shaped and approaching the Normal model
The Central Limit Theorem: The Fundamental Theorem of Statistics
-
- For sampling distributions, we had to check a few conditions
-
- For means, there are almost no conditions at all
-
- The sampling distribution of any mean becomes more nearly Normal as the sample size
grows; all we need is for the observations to be independent and collected with
randomization; we don’t even care about the shape of the population distribution
This surprising fact is the result Laplace proved -> Central Limit Theorem (CLT)
-
- Not only does the distribution of means of many random samples get closer and closer to a
Normal model as the sample size grows, this is true regardless of the shape of the population
distribution
-
- Even skewed or bimodal population -> CLT: means of repeated random samples will tend to
follow a Normal model as the sample size grows
-
- Works better and faster the closer the population distribution is to a Normal model
-
- Works better for larger samples
Assumptions and Conditions (for the CLT)
-
- Independence & Sample Size Assumption
-
- Randomization Condition
10% Condition
Large Enough Sample Condition
But Which Normal?
-
- For proportions, the sampling distribution is centered at the population proportion
-
- For means, it’s centered at the population mean
-
- Means have smaller standard deviations than individuals
-
- The standard deviation of ȳ falls as the sample size grows
- But it only goes down by the square root of the sample size:
SD(ȳ) = σ / √n
- When we have categorical data, we calculate a sample proportion, ^p; the sampling
distribution of this random variable has a Normal model with a mean at the true proportion
p and a standard deviation of SD(^p) = √(pq / n)
- When we have quantitative data, we calculate a sample mean, ȳ; the sampling distribution
of this random variable has a Normal model with a mean at the true mean, μ, and a
standard deviation of SD(ȳ) = σ / √n
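The two standard deviation formulas can be compared with a small simulation in the spirit of Fig. 18.1 (2000 samples of 808 adults with p = 0.45). This is a sketch: the function names and the random seed are assumptions, and a different seed gives slightly different simulated values.

```python
import math
import random
import statistics


def sd_of_sample_proportion(p, n):
    """Model standard deviation of a sample proportion: sqrt(pq / n)."""
    return math.sqrt(p * (1 - p) / n)


def sd_of_sample_mean(sigma, n):
    """Model standard deviation of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulate_proportions(p, n, samples, rng):
    """Draw `samples` independent samples of size n; return each sample proportion."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(samples)]


model_sd = sd_of_sample_proportion(0.45, 808)

rng = random.Random(1)  # fixed seed so the run is repeatable (an arbitrary choice)
simulated = simulate_proportions(0.45, 808, 2000, rng)

print(round(model_sd, 4))
print(round(statistics.mean(simulated), 4), round(statistics.stdev(simulated), 4))
```

The simulated proportions should center near 0.45 with a spread close to the model value of about 0.0175; quadrupling n would halve that spread.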
About variation
-
- Means vary less than individual data values
-
- Variability of sample means decreases as the sample size increases
- 10% Condition
Sample Size Assumption
- Whether the sample is large enough to make the sampling model for the sample proportions
approximately Normal
- Success/Failure Condition: we must expect at least 10 successes and at least 10 failures
Choosing Your Sample Size
-
- Suppose a candidate is planning a poll and wants to estimate voter support within 3% with
95% confidence. How large a sample does she need?
-
- ME = z* √(^p^q / n)
-
- 0.03 = 1.96 √(^p^q / n)
-
- For ^p we can guess a value – the worst case is 0.50 (it makes ^p^q, and therefore n, largest)
-
- 0.03 = 1.96 √(0.5 × 0.5 / n)
- √n = 1.96 √(0.5 × 0.5) / 0.03 ≈ 32.67
-
- n ≈ 1067.1
-
- We need at least 1068 respondents to keep the margin of error as small as 3% with a
confidence level of 95%
-
- To cut the standard error (and thus the ME) in half, we must quadruple the sample size
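The sample size calculation generalizes to any margin of error. A minimal sketch; the function name and parameter names are assumptions, and the default guess of 0.5 is the worst case described above.

```python
import math


def required_sample_size(margin_of_error, z_star=1.96, p_guess=0.5):
    """Smallest n for which z* * sqrt(p(1 - p) / n) is at most the margin of error."""
    n = (z_star / margin_of_error) ** 2 * p_guess * (1 - p_guess)
    # Always round up: rounding down would give a margin of error that is too large.
    return math.ceil(n)


print(required_sample_size(0.03))   # the poll in the notes
print(required_sample_size(0.015))  # half the margin of error
```

Halving the margin of error from 3% to 1.5% roughly quadruples the required sample, in line with the last bullet above.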
Terms
-
- Standard error – when we estimate the standard deviation of a sampling distribution using
statistics found from the data, the estimate is called a standard error
-
- Confidence interval – a level C confidence interval for a model parameter is an interval of
values usually of the form
Estimate ± margin of error
Found from data in such a way that C% of all random samples will yield intervals that capture
the true parameter value
-
- One-proportion z-interval – a confidence interval for the true value of a proportion. The
confidence interval is
^p ± z*SE(^p)
Where z* is a critical value from the Standard Normal model corresponding to the specified
confidence level
-
- Margin of error – in a confidence interval the extent of the interval on either side of the
observed statistic value is called the margin of error. A margin of error is typically the product
of a critical value from the sampling distribution and a standard error from the data. A small
margin of error corresponds to a confidence interval that pins down the parameter precisely.
A large margin of error corresponds to a confidence interval that gives relatively little
information about the estimated parameter. For a proportion, ME = z* √(^p^q / n)
-
- Critical value – the number of standard errors to move away from the mean of the sampling
distribution to correspond to the specified level of confidence. The critical value, denoted z*,
is usually found from a table or with technology
-
- When the P-value is low enough, it says that it’s very unlikely we’d observe data like these if
our null hypothesis were true
-
- We fail to reject the null hypothesis
What to Do with an ‘Innocent’ Defendant
-
- If there is insufficient evidence to convict the defendant, the jury does not decide that H0 is true
and declare the defendant innocent – juries can only fail to reject the null hypothesis and declare
the defendant ‘not guilty’
-
- And we never declare the null hypothesis to be true because we simply do not know whether
it’s true or not
The Reasoning of Hypothesis Testing
1. Hypotheses
-
- To assess how unlikely our data may be, we need a null model
-
- The null hypothesis specifies a particular parameter value to use in our model. In the usual
shorthand, we write H0: parameter = hypothesized value. The alternative hypothesis, HA,
contains the values of the parameter we consider plausible when we reject the null
2. Model
-
- Specify the model you will use to test the null hypothesis and the parameter of interest
-
- State assumptions and check any corresponding conditions
-
- “Because the conditions are satisfied, I can model the sampling distribution of the proportion
with a Normal model.”
-
- “Because the conditions are not satisfied, I can’t proceed with the test.”
-
- The test about proportions is called a one-proportion z-test
o We test the hypothesis H0: p = p0 using the statistic z = (^p − p0) / SD(^p)
o We use the hypothesized proportion to find the standard deviation, SD(^p) = √(p0q0 / n)
3. Mechanics
-
- Actual calculation
-
- Obtain a P-value – the probability that the observed statistic value (or an even more extreme
value) occurs if the null model is correct
4. Conclusion
-
- Statement about the null hypothesis – either that we reject it or that we fail to reject it
-
- The size of the effect is always a concern when we test hypotheses – a good way to look at
the effect size is to examine a confidence interval
Alternative Alternatives
-
- Old cracking rate: 20%
-
- H0:p=0.20
-
- Someone might be interested in any change in the cracking rate -> HA: p ≠ 0.20
-
- An alternative hypothesis such as this is known as a two-sided alternative because we are
equally interested in deviations on either side of the null hypothesis value. For two-sided
alternatives, the P-value is the probability of deviating in either direction from the null
hypothesis value
- But only interested in lowering the cracking rate below 20% -> HA: p < 0.20
-
- An alternative hypothesis that focuses on deviations from the null hypothesis value in only
one direction is called a one-sided alternative
-
- For a hypothesis test with a one-sided alternative, the P-value is the probability of deviating
only in the direction of the alternative away from the null hypothesis value
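The difference between one-sided and two-sided P-values can be seen by running a one-proportion z-test both ways. A sketch under assumptions: the function name and the `alternative` labels are invented here, and the sample (17% cracked out of 400) is a hypothetical illustration; only the null value p0 = 0.20 comes from the notes.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(p_hat, n, p0, alternative="two-sided"):
    """Return (z, P-value) for H0: p = p0 under a Normal sampling model."""
    # The standard deviation uses the hypothesized proportion p0, not p_hat.
    sd = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd
    normal = NormalDist()
    if alternative == "less":        # HA: p < p0
        p_value = normal.cdf(z)
    elif alternative == "greater":   # HA: p > p0
        p_value = 1 - normal.cdf(z)
    elif alternative == "two-sided": # HA: p != p0
        p_value = 2 * (1 - normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value


# Hypothetical sample: 17% of 400 items cracked, tested against p0 = 0.20.
z, p_one_sided = one_proportion_z_test(0.17, 400, 0.20, alternative="less")
_, p_two_sided = one_proportion_z_test(0.17, 400, 0.20, alternative="two-sided")

print(round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```

For the same data the two-sided P-value is twice the one-sided one, so the choice of alternative should be made before looking at the data.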
P-Values and Decisions: What to Tell About a Hypothesis Test
-
- How small should the P-value be in order for you to reject the null hypothesis? -> highly
context-dependent
-
- Examples page 487
-
- The conclusion about any null hypothesis should be accompanied by the P-value of the test
-
- To complete the analysis, follow your test with a confidence interval for the parameter of
interest, to report the size of the effect
Terms
-
- Null hypothesis – the claim being assessed in a hypothesis test is called the null hypothesis.
Usually, the null hypothesis is a statement of “no change from the traditional value”, “no
effect”, “no difference” or “no relationship”. For a claim to be a testable null hypothesis, it
must specify a value for some population parameter that can form the basis for assuming a
sampling distribution for a test statistic
-
- Alternative hypothesis – the alternative hypothesis proposes what we should conclude if we
find the null hypothesis to be unlikely
-
- P-value – the probability of observing a value for a test statistic at least as far from the
hypothesized value as the statistic value actually observed if the null hypothesis is true. A
small P-value indicates either that the observation is improbable or that the probability
calculation was based on incorrect assumptions. The assumed truth of the null hypothesis is
the assumption under suspicion
- One-proportion z-test – a test of the null hypothesis that the proportion of a single sample
equals a specified value (H0: p = p0) by referring the statistic z = (^p − p0) / SD(^p) to a Standard
Normal model
-
- Effect size – the difference between the null hypothesis value and the true value of a model
parameter
-
- Two-sided alternative – an alternative hypothesis is two-sided (HA: p ≠ p0) when we are
interested in deviations in either direction away from the hypothesized parameter value
-
- One-sided alternative – an alternative hypothesis is one-sided (e.g. HA: p > p0 or HA: p < p0)
when we are interested in deviations in only one direction away from the hypothesized
parameter value
Chapter 21 More About Tests and Intervals
-
- Florida: no longer are riders 21 and older required to wear helmets
-
- Police reports of motorcycle accidents: Before the change in the helmet law, 60% of youths
involved in a motorcycle accident had been wearing their helmets; three years following the
law change, considering these riders to be a representative sample of the larger population –
they observed 781 young riders who were involved in accidents – of these, 50.7% (396) were
wearing helmets
Zero In on the Null
-
- One good way to identify both the null and alternative hypotheses is to think about the Why
of the situation
-
- The null hypothesis for the Florida study could be that the true rate of helmet use remained
the same among young riders after the law changed
-
- It makes more sense to use what you want to show as the alternative
How to Think About P-Values
-
- A P-value actually is a conditional probability. It tells us the probability of getting results at
least as unusual as the observed statistic, given that the null hypothesis is true
-
- The P-value is not the probability that the null hypothesis is true – it is a probability about the
data
-
- All we can say is that, given the null hypothesis, there is a 3% chance (P-value of 0.03) of
observing the statistic value that we have actually observed (or one even more unlike the null value)
What to do with a High P-value
-
- 0.793 ?
-
- Big P-values just mean that what we’ve observed isn’t surprising
-
- A big P-value doesn’t prove that the null hypothesis is true, but it certainly offers no evidence
that it’s not true
-
- When we see a large P-value, all we can say is that we ‘don’t reject the null hypothesis’
Alpha Levels
-
- Sometimes we have to decide whether or not to reject the null hypothesis
-
- We can define ‘rare event’ arbitrarily by setting a threshold for our P-value. If our P-value
falls below that point, we’ll reject the null hypothesis. We call such results statistically
significant. The threshold is called an alpha level
-
- Common alpha levels are 0.1, 0.05, 0.01 and 0.001
-
- E.g. assessing safety of air bags -> low alpha level
-
- E.g. if folks prefer their pizza with or without pepperoni -> alpha = 0.1
-
- We often choose 0.05
-
- Choose the alpha level before you look at the data
-
- The alpha level is also called the significance level – when we reject the null hypothesis, we
say that the test is ‘significant at that level’
-
- E.g. we might say that we reject the null hypothesis ‘at the 5% level of significance’
-
- If the P-value does not fall below alpha -> the data have failed to provide sufficient evidence
to reject the null hypothesis.
-
- If the P-value is too high -> “we fail to reject the null hypothesis” (-> there is insufficient
evidence to conclude that the practitioners are performing better than they would if they
were just guessing)
Significant vs. Important
-
- Statistically significant -> P-value lower than our alpha level
-
- Don’t be lulled into thinking that statistical significance carries with it any sense of practical
importance or impact
Confidence Intervals and Hypothesis Tests
-
- For the motorcycle helmet example, a 95% confidence interval would give 0.507 ± 1.96 *
0.0179 = (0.472, 0.542), or 47.2% to 54.2% -> had the previous rate been 50%, it would lie in the
interval and we would not be able to reject the null hypothesis (the actual previous rate of 60%
lies outside the interval)
-
- In general, a confidence interval with a confidence level of C% corresponds to a two-sided
hypothesis test with an alpha level of (100-C)% (e.g. 95% confidence interval -> two-sided
hypothesis test at alpha 5%)
-
- For a one-sided test with alpha 5%, the corresponding confidence interval has a confidence
level of 90% – that’s 5% in each tail. In general, a confidence interval with a confidence
level of C% corresponds to a one-sided hypothesis test with an alpha level of ½(100-C)%
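As a rough check of the interval arithmetic and of the link between intervals and tests, the sketch below recomputes the helmet interval from the counts (396 of 781). The function name `proportion_ci` is hypothetical; a hypothesized value inside the interval would not be rejected by the matching two-sided test.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """One-proportion z-interval: p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # z* leaves (1 - confidence)/2 in each tail of the standard Normal model
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

low, high = proportion_ci(396, 781)
for p0 in (0.50, 0.60):
    verdict = "cannot reject" if low <= p0 <= high else "reject"
    print(f"H0: p = {p0:.2f} -> {verdict} at alpha = 0.05 (two-sided)")
```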
A Confidence Interval for Small Samples
- When the Success/failure Condition fails, all is not lost – a simple adjustment to the
calculation lets us make a 95% confidence interval anyway
-
- Add four phony observations – two to the successes, two to the failures
-
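- Adjusted proportion: $\tilde{p} = \frac{y + 2}{n + 4}$, where y is the number of successes, and, for convenience, we write $\tilde{n} = n + 4$
- Adjusted interval: $\tilde{p} \pm z^* \sqrt{\tilde{p}(1 - \tilde{p}) / \tilde{n}}$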
- Called the Agresti-Coull interval or the ‘plus-four’ interval
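A minimal sketch of the plus-four adjustment, assuming a small hypothetical sample of 3 successes in 20 trials (these counts are invented for illustration and fail the Success/Failure Condition). The function name `plus_four_interval` is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

def plus_four_interval(successes, n, confidence=0.95):
    """Add two phony successes and two phony failures, then build the usual z-interval."""
    n_tilde = n + 4
    p_tilde = (successes + 2) / n_tilde
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - margin, p_tilde + margin

# Hypothetical data: 3 successes out of 20
low, high = plus_four_interval(3, 20)
print(f"plus-four 95% interval: ({low:.3f}, {high:.3f})")
```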
-
- The null hypothesis is true, but we mistakenly reject it (Type I error) – e.g. a healthy person is
diagnosed as with disease (the null hypothesis is usually the assumption that a person is
healthy)
-
- The null hypothesis is false, but we fail to reject it (Type II error) – e.g. an infected person is
diagnosed as disease free
-
- Which of these errors is more serious, depends on the situation, the cost, and your point of
view
-
- Page 512
-
- When you choose level alpha, you’re setting the probability of a Type I error to alpha
-
- We assign the letter β to the probability of a Type II error
-
- We could reduce β for all alternative parameter values by increasing alpha – but we’d make
more Type I errors -> tension between Type I and Type II errors
-
- The only way to reduce both types of error is to collect more evidence or, in statistical terms,
to collect more data
Power
-
- The power of a test is the probability that it correctly rejects a false null
-
- When the power is high, we can be confident that we’ve looked hard enough
-
- We know that β is the probability that a test fails to reject a false null hypothesis, so the
power of the test is the probability that it does reject: 1-β
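To see how power, β, alpha and sample size pull against each other, here is a sketch for a one-sided (lower-tail) one-proportion z-test. The true proportion of 0.55 is a hypothetical alternative chosen for illustration, and `power_lower_tail` is an assumed name.

```python
from math import sqrt
from statistics import NormalDist

def power_lower_tail(p0, p_true, n, alpha=0.05):
    """Power of a z-test of H0: p = p0 vs HA: p < p0 when the true proportion is p_true."""
    std_normal = NormalDist()
    sd_null = sqrt(p0 * (1 - p0) / n)
    # We reject H0 whenever p-hat falls below this cutoff
    cutoff = p0 + std_normal.inv_cdf(alpha) * sd_null
    sd_true = sqrt(p_true * (1 - p_true) / n)
    # Probability that p-hat lands in the rejection region under the true proportion
    return std_normal.cdf((cutoff - p_true) / sd_true)

power = power_lower_tail(0.60, 0.55, 781)   # hypothetical true rate of 55%
beta = 1 - power                            # probability of a Type II error
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

Raising alpha or n in the call increases the power, which matches the tension between the two error types described above.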
The variance of the sum or difference of two independent random variables is the sum of
their variances
Var(X – Y) = Var(X) + Var(Y), so
$SD(X - Y) = \sqrt{SD^2(X) + SD^2(Y)} = \sqrt{Var(X) + Var(Y)}$
- Only applies when X and Y are independent
The Standard Deviation of the Difference between Two Proportions
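- The standard deviations of the sample proportions are $SD(\hat{p}_1) = \sqrt{p_1 q_1 / n_1}$ and $SD(\hat{p}_2) = \sqrt{p_2 q_2 / n_2}$, so the variance of the difference in the proportions is
$Var(\hat{p}_1 - \hat{p}_2) = \left(\sqrt{p_1 q_1 / n_1}\right)^2 + \left(\sqrt{p_2 q_2 / n_2}\right)^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$
- The standard deviation is the square root of that variance
$SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$
- Standard error (using the sample proportions):
$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$
- Example page 527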
Assumptions and Conditions
-
- Independence Assumption: within each group, the data should be based on results for
independent individuals
Randomization condition
10% condition
-
- Independent Groups Assumption: the two groups we’re comparing must also be
independent of each other
Sample Size Assumption
-
- Success/failure condition: both groups are big enough that at least 10 successes and at least
10 failures have been observed in each
The Sampling Distribution
A two-proportion z-interval:
Confidence interval: $(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)$
Where we find the standard error of the difference
$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$
The critical value z* depends on the particular confidence level, C, that we specify
Example page 529!
Will I Snore When I’m 64?
Of the 995 respondents, 37% of adults reported that they snored at least a few nights a week
during the past year
-
- Split into two age categories, 26% of the 184 people under 30 snored, compared with 39% of
the 811 in the older group
-
- Is this difference of 13% real or due only to natural fluctuations in the sample we’ve chosen?
-
- Null hypothesis? -> we hypothesize that there is no difference in the proportions
H0: p1-p2 = 0
Everyone into the Pool
-
- $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$
-
- But to do a hypothesis test, we assume that the null hypothesis is true (proportions are
equal) -> so there should be just a single value of ^p in the SE formula (and, of course, ^q is
just 1-^p)
-
- Snoring example: overall we saw 48+318 = 366 snorers out of a total of 184+811 = 995 adults
who responded to this question -> 366/995 = 0.3678
-
- Combining the counts like this to get an overall proportion is called pooling
-
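- Pooled proportion (for success): $\hat{p}_{pooled} = \frac{Success_1 + Success_2}{n_1 + n_2}$, where Success1 is the number of
successes in group 1 ($Success_1 = n_1 \hat{p}_1$)
- We then put this pooled value into the formula, substituting it for both sample proportions in
the standard error formula:
$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_1} + \frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_2}}$
$= \sqrt{\frac{0.3678 \times 0.6322}{184} + \frac{0.3678 \times 0.6322}{811}} = 0.039$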
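The pooling arithmetic for the snoring counts can be checked with a few lines. This is a sketch only; `pooled_se` is a hypothetical helper name.

```python
from math import sqrt

def pooled_se(success1, n1, success2, n2):
    """Pooled proportion and the pooled standard error of p-hat1 - p-hat2."""
    p_pooled = (success1 + success2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    se = sqrt(p_pooled * q_pooled / n1 + p_pooled * q_pooled / n2)
    return p_pooled, se

# Snoring example: 48 of 184 under 30, 318 of 811 aged 30 and over
p_pooled, se = pooled_se(48, 184, 318, 811)
print(round(p_pooled, 4), round(se, 3))   # 0.3678 0.039
```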
Improving the Success/Failure Condition
-
- We should not refuse to test the effectiveness just because it failed the success/failure
condition
-
- For that reason, in a two-proportion z-test, the proper success/failure test uses the expected
frequencies, which we can find from the pooled proportion
-
- Only 1 case of HPV was diagnosed among 7897 women who received the vaccine, compared
to 91 cases diagnosed among 7899 who received a placebo
-
- $\hat{p}_{pooled} = \frac{1 + 91}{7897 + 7899} = 0.0058$
$n_1 \hat{p}_{pooled} = 7899(0.0058) = 46$
$n_2 \hat{p}_{pooled} = 7897(0.0058) = 46$
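A small sketch of the expected-frequency check for the HPV counts, assuming the usual threshold of 10 expected successes and 10 expected failures per group. Both function names are hypothetical.

```python
def expected_counts(success1, n1, success2, n2):
    """Expected (successes, failures) in each group under the pooled proportion."""
    p_pooled = (success1 + success2) / (n1 + n2)
    return [(n * p_pooled, n * (1 - p_pooled)) for n in (n1, n2)]

def success_failure_ok(counts, minimum=10):
    """True when every expected count reaches the minimum."""
    return all(expected >= minimum for pair in counts for expected in pair)

# 1 case among 7897 vaccinated women, 91 cases among 7899 on placebo
counts = expected_counts(1, 7897, 91, 7899)
for expected_cases, expected_non_cases in counts:
    print(f"expected cases: {expected_cases:.0f}, expected non-cases: {expected_non_cases:.0f}")
print("condition satisfied:", success_failure_ok(counts))
```

With the observed count of 1 the condition would fail, but the expected counts of about 46 per group pass it.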
Compared to What?
-
- We’ll reject our null hypothesis if we see a large enough difference in the two proportions
-
- Large? We just compare it to its standard deviation (standard error, pooled)
-
- Since the sampling distribution is Normal, we can divide the observed difference by its
standard error to get a z-score -> tells us how many SE the observed difference is away from
0
-
- Then we can use the 68-95-99.7 Rule
-
- Result: two proportion z-test
-
- $z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{pooled}(\hat{p}_1 - \hat{p}_2)}$
- When the conditions are met and the null hypothesis is true, this statistic follows the
standard Normal model, so we can use that model to obtain a P-value
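Putting the pieces together for the snoring data (48 of 184 versus 318 of 811), a two-proportion z-test might be sketched as follows. The two-sided alternative and the function name `two_proportion_z_test` are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success1, n1, success2, n2):
    """z statistic and two-sided P-value for H0: p1 - p2 = 0, using the pooled SE."""
    p1, p2 = success1 / n1, success2 / n2
    p_pooled = (success1 + success2) / (n1 + n2)
    se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = ((p1 - p2) - 0) / se_pooled
    p_value = 2 * NormalDist().cdf(-abs(z))   # both tails of the standard Normal model
    return z, p_value

z, p_value = two_proportion_z_test(48, 184, 318, 811)
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

A z-score more than 3 standard errors from 0 is rare under the 68-95-99.7 Rule, so a difference this large is unlikely to be natural sampling fluctuation.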
Chapter 23 Inferences About Means
-
- Motor vehicle crashes resulted in 119 deaths each day
-
- Speeding is a contributing factor in 31% of all fatal accidents
-
- Triphammer Road – exceeding 30 miles per hour?
-
- Interested both in estimating the true mean speed and in testing whether it exceeds the
posted speed limit
-
- Quantitative data usually report a value for each individual -> follow the three rules of data
analysis and plot the data
-
- Quantitative data -> means and standard deviations; inferences -> sampling distributions
-
- Confidence intervals: start with an estimate, then add and subtract a margin of error; for
proportions: ^p ± ME
-
- Margin of error –> ^p ± z*SE(^p)
-
- CLT: SD(ȳ) = σ/√n (example page 552)
-
- If we don’t know σ -> estimate the population parameter σ with s, the sample standard
deviation based on the data; the resulting standard error is SE(ȳ) = s/√n
-
- Gosset: we need not only to allow for the extra variation with larger margins of error and P-
values, but we even need a new sampling distribution model; in fact we need a whole family
of models, depending on the sample size, n; these models are unimodal, symmetric, bell-
shaped models, but the smaller our sample, the more we must stretch out the tails
Gosset’s t
-
- With s/√n, an estimate of the standard deviation, the shape of the sampling model changes
-> t-distribution (Student’s t)
-
- Gosset’s model is always bell-shaped, but the details change with different sample sizes
-
- So the Student’s t-models form a whole family of related distributions that depend on a
parameter known as degrees of freedom (df); we write the model as $t_{df}$
A Confidence Interval for Means
- To make confidence intervals or test hypotheses for means, we use df = n-1
A Practical Sampling Distribution Model for Means
-
- $t = \frac{\bar{y} - \mu}{SE(\bar{y})}$
-
- df = n-1
-
- SE(ȳ) = s/√n
One-Sample t-Interval for the mean
-
- $\bar{y} \pm t^*_{n-1} \times SE(\bar{y})$
-
- Critical value depends on the particular confidence level, C, and the number of degrees of
freedom, n-1
-
- Example page 554
-
- Table T: the tables run down the page for as many degrees of freedom as can fit; as the
degrees of freedom increase, the t-model gets closer and closer to the Normal, so the tables
give a final row with the critical value from the Normal model and label it ∞ df
-
- If you cannot find a row for the df you need, just use the next smaller df in the table
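A sketch of the one-sample t-interval from summary statistics. The numbers below (n = 23, mean 31.0, s = 4.25) are hypothetical, and the critical value t* = 2.074 for 22 df is supplied by hand from a t-table because the Python standard library has no t-distribution.

```python
from math import sqrt

def one_sample_t_interval(mean, s, n, t_star):
    """y-bar +/- t*_(n-1) x SE(y-bar), with SE(y-bar) = s / sqrt(n)."""
    se = s / sqrt(n)
    margin = t_star * se
    return mean - margin, mean + margin

# Hypothetical summary statistics; t* = 2.074 is the 95% critical value for df = 22
low, high = one_sample_t_interval(mean=31.0, s=4.25, n=23, t_star=2.074)
print(f"95% t-interval: ({low:.2f}, {high:.2f})")
```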
Significance and Importance
-
- Statistically significant does not mean actually important or meaningful
-
- It is always a good idea when we test a hypothesis to also check the confidence interval and
think about the likely values for the mean
Intervals and Tests
-
- The confidence interval contains all the null hypothesis values we can’t reject with these data
-
- More precisely, a level C confidence interval contains all of the plausible null hypothesis
values that would not be rejected by a two-sided hypothesis test at alpha level 1-C; so a 95%
confidence interval matches a 1-0.95 = 0.05 level two-sided test for these data
-
- Confidence intervals are naturally two-sided, so they match exactly with two-sided
hypothesis tests; when the hypothesis is one-sided, the corresponding alpha level is (1-C)/2
Sample Size
- If we need great precision, however, we’ll want a smaller ME -> larger sample size
- We can solve this equation for n: $ME = t^*_{n-1} \times s/\sqrt{n}$
- Without knowing n, we don’t know the degrees of freedom and we can’t find the critical
value, t*n-1 -> use the corresponding z* value
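Solving ME = z* × s/√n for n gives n = (z* s / ME)², rounded up. The sketch below uses a hypothetical pilot standard deviation of 4.25 and a target margin of error of 1; the function name is an assumption.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(s, margin_of_error, confidence=0.95):
    """Smallest n with z* x s / sqrt(n) <= margin_of_error (z* stands in for t*)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * s / margin_of_error) ** 2)

n = sample_size_for_mean(s=4.25, margin_of_error=1.0)   # hypothetical inputs
print(n)
```

Halving the margin of error roughly quadruples the required sample size, since n grows with 1/ME².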
*The Sign Test – Back to Yes and No
-
- Yes (1) and no (0)
-
- Null hypothesis says that the median is 30; if that null hypothesis were true, we’d expect the
proportion of cars driving faster than 30 mph to be 0.50; on the other hand, if the median
speed were greater than 30 mph, we’d expect to see more cars driving faster than 30
-
- If we test a median by counting the number of values above and below that value, it’s called
a sign test – the sign test is a distribution free method (example page 567)
-
- Simpler, fewer assumptions
-
- It works even when the data have outliers or a skewed distribution
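A sketch of the sign test as an exact Binomial calculation: under the null hypothesis each value is above the hypothesized median with probability 0.5. The speeds are invented for illustration, and dropping values equal to the hypothesized median is a common convention assumed here, not something these notes state.

```python
from math import comb

def sign_test_upper(values, hypothesized_median):
    """P(at least this many values above the median | true median = hypothesized)."""
    above = sum(v > hypothesized_median for v in values)
    below = sum(v < hypothesized_median for v in values)
    n = above + below                      # ties carry no sign and are dropped
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return above, n, p_value

# Hypothetical speeds in mph
speeds = [27, 31, 33, 35, 29, 36, 32, 34, 28, 37]
above, n, p_value = sign_test_upper(speeds, 30)
print(f"{above} of {n} above 30 mph, P-value = {p_value:.3f}")
```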
Comparing Means
-
- Generic or brand-name batteries?
-
- Difference in mean lifetimes?
Plot the Data
-
- Boxplots of the data for two groups, placed side by side
-
- Figure 24.1 -> difference large enough? Random fluctuation? -> statistical inference
Comparing Two Means
-
- Difference between the mean battery lifetimes of the two brands, μ1-μ2
-
- Confidence interval, standard deviation, sampling model
-
- For independent random variables, the variance of their difference is the sum of their
individual variances, Var (Y-X)=Var(Y)+Var(X)
- The confidence interval we build is called a two-sample t-interval (for the difference in
means). The corresponding hypothesis test is called a two-sample t-test.
Assumptions and Conditions
-
- Independence Assumption – the data in each group must be drawn independently and at
random from a homogeneous population, or generated by a randomized comparative
experiment
o Randomization Condition
o 10% Condition
-
- Normal Population Assumption
o Nearly Normal Condition – we must check this for both groups; a violation by either
one violates the condition
-
n<15 – you should not use these methods if the histogram or Normal
probability plot shows severe skewness
-
n’s closer to 40 – mildly skewed histogram is OK
-
n>40 -> the CLT applies
-
- Independent Groups Assumption – to use the two-sample t methods, the two groups we are
comparing must be independent of each other
Two-Sample t-interval for the difference between means
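- Confidence interval: $(\bar{y}_1 - \bar{y}_2) \pm t^*_{df} \times SE(\bar{y}_1 - \bar{y}_2)$
- Standard error of the difference of the means: $SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$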
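A sketch of the two-sample t-interval from summary statistics. The battery numbers are hypothetical, and the degrees-of-freedom formula is the standard Welch-Satterthwaite approximation, which these notes do not spell out. The critical value is supplied by hand from a t-table, using the next smaller df as suggested for Table T.

```python
from math import sqrt

def two_sample_se(s1, n1, s2, n2):
    """SE of the difference between two independent sample means."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (assumed formula)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

def two_sample_t_interval(mean1, s1, n1, mean2, s2, n2, t_star):
    diff = mean1 - mean2
    margin = t_star * two_sample_se(s1, n1, s2, n2)
    return diff - margin, diff + margin

# Hypothetical lifetimes in minutes: generic vs. brand-name batteries, 6 of each
se = two_sample_se(10.3, 6, 14.6, 6)
df = welch_df(10.3, 6, 14.6, 6)
# df is just under 9, so use the df = 8 row: t* = 2.306 for 95% confidence
low, high = two_sample_t_interval(206.0, 10.3, 6, 187.4, 14.6, 6, t_star=2.306)
print(f"SE = {se:.2f}, df = {df:.2f}, interval = ({low:.1f}, {high:.1f})")
```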
A Test for the Difference between Two Means
-
- Two-sample t-test for the difference between means
-
- Hypothesized difference Δ0 = 0
-
- We then compare the difference in the means with the standard error of that difference
-
- Example page 588/589
Back into the Pool
- For means, there is also a pooled t-test (but knowing that two means are equal doesn’t say
anything about whether their variances are equal)
-
- Regression -> straight line; but: not all men who have 38-inch waists have the same %Body
Fat (the distribution of %Body Fat for 38-inch men is unimodal and symmetric -> fig. 27.2/27.3)
-
- We want a model -> therefore, an idealized regression line – the model assumes that the
means of the distribution of %Body Fat for each Waist size fall along the line, even though
the individuals are scattered around it
-
- μy = β0 + β1x (model = intercept + slope × x)
-
- Model makes errors (ε) – some individuals lie above and some below the line
-
- y = β0 + β1x + ε
-
- We estimate the β’s by finding a regression line, ŷ = b0 + b1x; the residuals, e = y - ŷ, are the
sample-based versions of the errors, ε
Assumptions and Conditions
- Linearity Assumption
o Straight Enough Condition – scatterplot looks straight (check by looking at a scatterplot of
the residuals against x or against the predicted values, ŷ)
o Quantitative Data Condition
-
- Independence Assumption – the errors in the true underlying regression model must be
mutually independent
o Randomization Condition
-
- Equal Variance Assumption – the variability of y should be about the same for all values of x
o Does the Plot Thicken? Condition – check that the spread around the line is nearly
constant
-
- Normal Population Assumption – the errors around the idealized regression line at each
value of x follow a Normal model
o Nearly Normal Condition – at each value of x there is a distribution of y-values that
follows a Normal model, and each of these Normal models is centered on the line and
has the same standard deviation
o Outlier Condition
Which Comes First: The Conditions or the Residuals?
1. Make a scatterplot of the data to check the Straight Enough Condition.
2. If the data are straight enough, fit a regression and find the residuals, e, and predicted
values, ŷ.
3. Make a scatterplot of the residuals against x or against the predicted values. This plot
should have no pattern. Check in particular for any bend (which would suggest that the
data weren’t all that straight after all), for any thickening (or thinning), and, of course, for
any outliers.
4. If the data are measured over time, plot the residuals against time to check for evidence
of patterns that might suggest they are not independent.
5. If the scatterplots look OK, then make a histogram and Normal probability plot of the
residuals to check the Nearly Normal Condition.
6. If all the conditions seem to be reasonably satisfied, go ahead with inference.
Intuition About Regression Inference
-
- The sample-to-sample variation is what generates the sampling distribution for the
coefficients
-
- 3 aspects of the scatterplot affect the standard error of the regression slope:
o Spread around the line – less scatter around the line means the slope will be more
consistent from sample to sample. The spread around the line is measured with the
residual standard deviation, se. You can always find se in the regression output, often
just labeled s.
$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
The less scatter around the line, the smaller the residual standard deviation and the
stronger the relationship between x and y
o Spread of the x’s – if sx, the standard deviation of x, is large, it provides a more stable
regression
o Sample size – having a larger sample size, n, gives more consistent estimates from
sample to sample
Standard Error for the Slope
- $SE(b_1) = \frac{s_e}{\sqrt{n - 1}\, s_x}$
- When we standardize the slopes by subtracting the model mean and dividing by their
standard error, we get a Student’s t-model, this time with n-2 degrees of freedom
- $\frac{b_1 - \beta_1}{SE(b_1)} \sim t_{n-2}$
What About the Intercept?
- $\frac{b_0 - \beta_0}{SE(b_0)} \sim t_{n-2}$
Regression Inference
-
- We can test a hypothesis about the slope and make confidence intervals for it
-
- Usual null hypothesis about the slope is that it’s equal to 0 (would say that y doesn’t tend to
change linearly when x changes = no linear association)
- To test H0: β1 = 0, we find $t_{n-2} = \frac{b_1 - 0}{SE(b_1)}$
- A 95% confidence interval for β1 is: b1 ± t*n-2 × SE(b1)
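The slope, its standard error and the t-ratio can be computed directly from the formulas for se, sx and SE(b1). The sketch below uses a tiny invented data set; `slope_inference` is a hypothetical name, and finding the P-value or t* would still need a t-table with n-2 degrees of freedom.

```python
from math import sqrt

def slope_inference(x, y):
    """Least squares slope b1, its standard error, and the t-ratio for H0: beta1 = 0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))   # residual standard deviation
    s_x = sqrt(sxx / (n - 1))                              # standard deviation of x
    se_b1 = s_e / (sqrt(n - 1) * s_x)
    t = (b1 - 0) / se_b1
    return b1, se_b1, t

# Invented data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1, t = slope_inference(x, y)
print(f"b1 = {b1:.2f}, SE(b1) = {se_b1:.4f}, t = {t:.1f} with {len(x) - 2} df")
```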
Another Example
-
- Contest in which participants try to guess the exact minute that a wooden tripod placed on
the frozen Tanana River will fall through the breaking ice
-
- We cannot use regression to tell the causes of any change – but we can estimate the rate of
change (if any) and use it to make better predictions
-
- Example page 686-689
Standard Errors for Predicted Values
- A confidence interval can tell us how precise that prediction will be
- We can predict the mean %Body Fat for all men whose Waist size is 38 inches with a lot more
precision than we can predict the %Body Fat of a particular individual whose Waist size
happens to be 38 inches
- We are predicting the value for a new individual, one that was not part of the original data
set -> “x sub new” (xν)
-
- Regression equation predicts %Body Fat as ŷν = b0 + b1xν
-
- Now that we have the predicted value, we construct both intervals around this same
number; both intervals take the form: ŷν ± t*n-2 × SE (t* is the same for both)
- Easier to predict a data point near the middle of the data set than far from the center
- For the mean at xν: $SE(\hat{\mu}_\nu) = \sqrt{SE^2(b_1)(x_\nu - \bar{x})^2 + \frac{s_e^2}{n}}$
- For an individual at xν: $SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}$
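To compare the two standard errors numerically, here is a sketch using the standard formulas for the predicted mean and for a predicted individual value. All input numbers are hypothetical, chosen only to be plausible for a waist-size regression, and both function names are assumptions.

```python
from math import sqrt

def se_predicted_mean(se_b1, s_e, n, x_new, x_bar):
    """SE for the mean response at x_new."""
    return sqrt(se_b1 ** 2 * (x_new - x_bar) ** 2 + s_e ** 2 / n)

def se_predicted_individual(se_b1, s_e, n, x_new, x_bar):
    """SE for a single new individual at x_new: adds the individual scatter s_e^2."""
    return sqrt(se_b1 ** 2 * (x_new - x_bar) ** 2 + s_e ** 2 / n + s_e ** 2)

# Hypothetical regression summary values
inputs = dict(se_b1=0.074, s_e=4.713, n=250, x_bar=36.3)
se_mean = se_predicted_mean(x_new=38, **inputs)
se_indiv = se_predicted_individual(x_new=38, **inputs)
print(f"SE for the mean: {se_mean:.2f}, SE for an individual: {se_indiv:.2f}")
```

The individual standard error is dominated by the s_e² term, which is why the prediction interval is so much wider than the confidence interval for the mean.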
Confidence Intervals for Predicted Values
-
- Example all men and individual page 690
-
- The narrower interval is a confidence interval for the predicted mean value at xν, and the
wider interval is a prediction interval for an individual with that x-value
Logistic Regression
-
- Researchers investigating factors for increased risk of diabetes examined data on 768 adult
women of Pima Indian heritage (BMI = weight/height²)
-
- From the boxplots, we see that the group with diabetes has a higher mean BMI
-
- The boxplots display BMI as the response and Diabetes as the predictor – but the researchers
are interested in predicting the increased risk of Diabetes due to increased BMI
-
- Fig. 27.13 dichotomous variable
-
- Fig. 27.14 treating like quantitative data -> regression line
-
- Setting all negative probabilities to 0 and all probabilities greater than 1 to 1
-
- Fig. 27.16 smooth curve models
-
- There are many curves in mathematics with shapes like this that we might use for our model.
One of the most common is the logistic curve -> logistic regression
-
- $\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x$
-
- When p is a probability, p/(1-p) is the odds in favor of a success
When the probability of success, p, = 1/3, we’d get the ratio (1/3)/(2/3) = 1/2
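The odds and log-odds relationships can be sketched in a few lines. The coefficients b0 = -4.0 and b1 = 0.1 are made up for illustration and are not estimates from the diabetes data; the function names are also assumptions.

```python
from math import exp, log

def odds(p):
    """Odds in favor of a success with probability p."""
    return p / (1 - p)

def log_odds(p):
    return log(odds(p))

def predicted_probability(b0, b1, x):
    """Invert ln(p / (1 - p)) = b0 + b1 * x to recover p."""
    return 1 / (1 + exp(-(b0 + b1 * x)))

print(odds(1 / 3))                               # probability 1/3 corresponds to odds of about 1/2
p_hat = predicted_probability(-4.0, 0.1, 30.0)   # hypothetical coefficients and BMI
print(f"predicted probability: {p_hat:.3f}")
```

Because the logistic curve maps any value of b0 + b1x into (0, 1), there is no need to clip negative probabilities to 0 or large ones to 1.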
Chapter 30 Multiple Regression
- Height
Just do it
-
- A regression with two or more predictor variables is called a multiple regression
-
- For simple regression, we found the Least Squares solution, the one whose coefficients made
the sum of the squared residuals as small as possible. For multiple regression, we’ll do the
same thing but this time with more coefficients
-
- R2 gives the fraction of the variability of %Body Fat accounted for by the multiple regression
- Summary of checking conditions
o Check the Straight Enough Condition with scatterplots of the y-variable against each
x-variable
o If the scatterplots are straight enough, fit a multiple regression model to the data
o Find the residuals and predicted values
o Make a scatterplot of the residuals against the predicted values. This plot should look
patternless. Check in particular for any bend and for any thickening
o Suitable randomization used? Representative of some identifiable population?
Check whether the data are independent by plotting the residuals against time to look
for patterns
o Interpretation and prediction
o If you wish to test hypotheses about the coefficients or about the overall regression,
then make a histogram and Normal probability plot of the residuals to check the
Nearly Normal Condition
Multiple Regression Inference I: I Thought I Saw an ANOVA Table...
-
- Is this multiple regression model any good at all?
-
- If all the coefficients (except the intercept) were zero, we’d have
$\hat{y} = b_0 + 0x_1 + ... + 0x_k$
And we’d just set $b_0 = \bar{y}$
H0:β1 =β2 =...=βk =0
- We can test this hypothesis with a statistic that is labeled with the letter F – bigger F-values
mean smaller P-values
Multiple Regression Inference II: Testing the Coefficients
-
- Only if we reject the null hypothesis can we move on to check the test statistics for the
individual coefficients
-
- For each coefficient, we test H0: βj = 0 against the (two-sided) alternative that it isn’t zero; the
regression table gives a standard error for each coefficient and the ratio of the estimated
coefficient to its standard error
-
- If the assumptions and conditions are met, these ratios follow a Student’s t-distribution
$t_{n-k-1} = \frac{b_j - 0}{SE(b_j)}$
-
- The degrees of freedom is the number of data values minus the number of predictors, minus
one for the intercept (n-k-1)
-
- CI in the usual way (estimate ± margin of error); margin of error is just the product of the
standard error and a critical value
CI for βj: bj ± t*n-k-1 SE (bj)
How’s That, Again?
-
- y=β0+β1x1+...+βkxk +ε
-
- Wrong conclusion that each βj tells us the effect of its associated predictor, xj, on the
response variable, y
Another Example: Modeling Infant Mortality
- Variables available: child deaths, percent of teens who drop out of high school, percent of
low-birth-weight babies, teen births, and teen deaths by accident, homicide, and suicide