Course: Data Analysis, book: Stats: Data and Models (Richard D. De Veaux)
third edition
Chapter 2 Data
But What Are Data?
- Data do not have to be numerical
- Sometimes values look numerical but are just numerals serving as labels (e.g. Amazon Standard Item Number)
- Data values are useless without their context
- The W’s: WHO, WHAT (essential) (AND IN WHAT UNITS), WHEN, WHERE, WHY, HOW -> context for data values
Who
- The rows of a data table correspond to individual cases about Whom (or which) we record some characteristics
- Respondents – individuals who answer a survey
- Subjects/participants – people on whom we experiment
- Experimental units – like subjects, but animals, plants, websites and other inanimate subjects
- Records – rows in a database
- Cases, e.g. Amazon table: individual CD orders
- Cases are often a sample selected from some larger population that we’d like to understand
- A sample should be representative of the population (snapshot image)
What and Why
- Variables – characteristics recorded about each individual (usually columns)
- Variables play different roles, and you can’t tell a variable’s role just by looking at it
- Start by counting how many cases belong in each category
- Some variables have measurement units; units tell how each value has been measured (miles per hour or degrees Celsius tell us the scale of measurement)
- Categorical variable – a variable that names categories and answers questions about how cases fall into those categories (usually we think about the counts of cases that fall into each category; the exception is the identifier variable)
- Quantitative variable – a measured variable with units that answers questions about the quantity of what is measured (quantitative variables must have units)
- Some variables can answer both kinds of questions
- E.g. educational value (1 = worthless, ...) -> a teacher might just count the number of students who gave each response for the course (categorical variable), or, to see whether the course is improving, she might treat the responses as the amount of perceived value (quantitative variable); then the teacher has to imagine that it has ‘educational value units’
- ‘Ordinal’ variables – variables that report order without natural units
Counts Count
- Using counts in two ways: when we count the cases in each category of a categorical
variable, the category labels are the What and the individuals are the Who of our data; or
when we want to measure the amount of something (by counting)
Identifying Identifiers
- E.g. student ID number -> numerical, but not a quantitative variable -> a special categorical variable (as many categories as individuals) -> not interesting for analysis, just for identification -> identifier variables -> they make it possible to combine data from different sources, to protect confidentiality and to provide unique labels (e.g. ASIN)
Where, When, and How
- Who (whom each row of your data table refers to -> cases), What (what the variables or the columns of the table record -> variables) and Why (why you are examining the data / what you want to know) are essential
- Where and When are also important/helpful
- How the data are collected can make the difference between insight and nonsense (e.g. a voluntary survey on the Internet is often worthless)
- The design of sound methods for collecting data is important
Terms
- Context – the context ideally tells Who was measured, What was measured, How the data were collected, Where the data were collected, and When and Why the study was performed
- Data – systematically recorded information, whether numbers or labels, together with its context
- Data table – an arrangement of data in which each row represents a case and each column represents a variable
- Case – a case is an individual about whom or which we have data
- Sample – the cases we actually examine in seeking to understand the much larger population
- Population – all the cases we wish we knew about
- Variable – a variable holds information about the same characteristic for many cases
- Units – a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams
- Categorical variable – a variable that names categories (whether with words or numerals) is called categorical
- Quantitative variable – a variable in which the numbers act as numerical values is called quantitative; quantitative variables always have units
- Identifier variable – a variable holding a unique name, ID number, or other identification for a case. Identifiers are particularly useful in matching data from two different databases or relations
-
Chapter 3 Displaying and Describing Categorical Data
The Three Rules of Data Analysis
- Make a picture – think clearly about patterns and relationships hiding in the data
- Make a picture – a display shows important features: extraordinary (possibly wrong) data values or unexpected patterns
- Make a picture – a well-chosen picture tells others about your data
Frequency Tables: Making Piles
- Putting 2201 people on the Titanic into piles -> by ticket Class, counting up how many had each kind of ticket -> frequency table, which records the totals and the category names
- Ticket Class: ‘First’, ‘Second’, ‘Third’ and ‘Crew’
- Counts are useful, but sometimes we want to know the fraction or proportion of the data in each category, so we divide the counts by the total number of cases; usually we multiply by 100 to express the proportions as percentages
- A relative frequency table displays the percentages, rather than the counts, of the values in each category; both types of tables show how the cases are distributed across the categories; in this way they describe the distribution of a categorical variable, because they name the possible categories and tell how frequently each occurs
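The pile-making described above is easy to sketch in code. This is a minimal illustration with a small made-up list of ticket-class labels (not the real 2201-case Titanic data): count the cases per category, then divide by the total and multiply by 100 for the relative frequency table.

```python
from collections import Counter

# Hypothetical ticket-class labels standing in for the data set;
# the real Titanic data has 2201 cases.
classes = ["First", "Second", "Third", "Third", "Crew", "Crew", "Crew", "Second"]

counts = Counter(classes)                                   # frequency table
total = sum(counts.values())
percents = {c: 100 * n / total for c, n in counts.items()}  # relative frequency table

for c in counts:
    print(f"{c:7s} {counts[c]:3d} {percents[c]:6.2f}%")
```

Both tables describe the same distribution; only the scale of the second column changes.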
The Area Principle
- Figure 3.2 -> a bad picture can distort our understanding
- We are more impressed by the area than by other aspects of each ship image
- Wrong impression -> the crew is only about 40%
- The best data displays observe a fundamental principle of graphing data called the area principle – the area occupied by a part of the graph should correspond to the magnitude of the value it represents
- A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison
- Relative frequency bar chart -> shows the relative proportion of passengers falling into each of these classes
Pie Charts
- Pie charts show the whole group of cases as a circle; they slice the circle into pieces whose sizes are proportional to the fraction of the whole in each category
- Before you make a bar chart or a pie chart, always check the Categorical Data Condition: the data are counts or percentages of individuals in categories

Class    Count   %
First    325     14.77
Second   285     12.95
Third    706     32.08
Crew     885     40.21
- If you want to make a relative frequency bar chart or a pie chart, you’ll need to also make sure that the categories don’t overlap so that no individual is counted twice
Contingency Tables: Children and First-Class Ticket Holders First?
- Was there a relationship between the kind of ticket a passenger held and the passenger’s chances of making it into a lifeboat? -> two-way table
- Table 3.4 – because the table shows how the individuals are distributed along each variable, contingent on the value of the other variable, such a table is called a contingency table
- When presented like this, in the margins of a contingency table, the frequency distribution of one of the variables (survival/class) is called its marginal distribution (example page 24)
- Each cell of the table gives the count for a combination of values of the two variables
- Percentages can be of the row, of the column or of the table (e.g. table 3.6)
- Be careful – always ask “percentage of what?”
Conditional Distributions
- Interesting questions are contingent, e.g. whether the chance of surviving the Titanic sinking depended on ticket class
- First, ask how the distribution of ticket Class changes between survivors and non-survivors -> row percentages
- We restrict the Who first to survivors and make a pie chart for them; then we refocus the Who on the non-survivors and make their pie chart -> the pie charts show the distribution of ticket classes for each row (survivors and non-survivors) – the distributions we create this way are called conditional distributions because they show the distribution of one variable for just those cases that satisfy a condition on another variable (figure 3.6)
- Or we could look at the distribution of Survival for each category of ticket Class (table 3.8)
- Fig. 3.7: Side-by-side bar chart – showing the conditional distribution of Survival for each ticket class; it can be simplified by dropping one category (only two values, dead or alive; knowing the percentage that survived tells us the percentage that died)
- In a contingency table, when the distribution of one variable is the same for all categories of another, we say that the variables are independent
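A conditional distribution is just a row of the contingency table turned into row percentages. This sketch uses the survival-by-class counts commonly reported for the Titanic; treat them as illustrative rather than quoted from the book’s table.

```python
# Two-way table: Survival (rows) by ticket Class (columns), as counts.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

def conditional(row):
    """Distribution of Class conditional on Survival = row (row percentages)."""
    counts = table[row]
    total = sum(counts.values())
    return {c: round(100 * n / total, 1) for c, n in counts.items()}

print(conditional("Alive"))  # class distribution among survivors
print(conditional("Dead"))   # class distribution among non-survivors
```

Comparing the two dictionaries shows the distributions differ, so Survival and Class are not independent.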
Segmented Bar Charts
- A segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable
Simpson’s Paradox
- Sometimes averages can be misleading or don’t make sense at all
- When using averages of proportions across several different groups, it is important to make sure that the groups really are comparable
- Table 3.10 – Moe is better overall, but Jill is better both during the day and at night -> Simpson’s Paradox
- The problem is unfair averaging over different groups -> Jill has more difficult night flights, so her overall average is heavily influenced by her nighttime average, while Moe benefits from more and easier day flights -> no fair comparison
- The moral of Simpson’s paradox is to be careful when you average across different levels of a second variable; it’s always better to compare percentages or other averages within each level of the other variable; the overall average can be misleading
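The paradox is easy to reproduce numerically. The on-time counts below are made up (they are not the figures from Table 3.10), but they show the same pattern: Jill beats Moe in both the day and the night group, yet Moe’s overall average is higher because he flies more of the easier day flights.

```python
# (on-time flights, total flights) per pilot and per group -- invented numbers.
flights = {
    "Jill": {"day": (19, 20), "night": (60, 80)},
    "Moe":  {"day": (72, 80), "night": (14, 20)},
}

def rate(on_time, total):
    return on_time / total

for pilot, groups in flights.items():
    on = sum(ot for ot, _ in groups.values())
    tot = sum(t for _, t in groups.values())
    within = {g: round(rate(*v), 2) for g, v in groups.items()}
    print(pilot, within, "overall", round(on / tot, 2))
```

Jill: 0.95 day, 0.75 night, 0.79 overall; Moe: 0.90 day, 0.70 night, 0.86 overall, so the within-group and overall comparisons point in opposite directions.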
Terms
- Frequency table (relative frequency table) – lists the categories in a categorical variable and gives the count (or percentage) of observations for each category
- Distribution – the distribution of a variable gives the possible values of the variable and the relative frequency of each value
- Area principle – in a statistical display, each data value should be represented by the same amount of area
- Bar chart (relative frequency bar chart) – shows bars whose areas represent the count (or percentage) of observations in each category of a categorical variable
- Pie chart – shows how a ‘whole’ divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
- Categorical data condition – the methods in this chapter are appropriate for displaying and describing categorical data; be careful not to use them with quantitative data
- Contingency table – a contingency table displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables; the table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other
- Marginal distribution – in a contingency table, the distribution of either variable alone is called the marginal distribution; the counts or percentages are the totals found in the margins (last row or column) of the table
- Conditional distribution – the distribution of a variable restricting the Who to consider only a smaller group of individuals is called a conditional distribution
- Independence – variables are said to be independent if the conditional distribution of one variable is the same for each category of the other (but we can’t conclude that one variable has no effect whatsoever on another; all we know is that little effect was observed in our study)
- Segmented bar chart – a segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable
- Simpson’s paradox – when averages are taken across different groups, they can appear to contradict the overall averages
Chapter 4 Displaying and Summarizing Quantitative Data
Histograms
- Usually we slice up all the possible values into equal-width bins; we then count the number of cases that fall into each bin; the bins, together with these counts, give the distribution of the quantitative variable and provide the building blocks for the histogram; by representing the counts as bars and plotting them against the bin values, the histogram displays the distribution at a glance
- Fig. 4.1 – e.g. 230 earthquakes with magnitudes between 7.0 and 7.2 (each bin has a width of 0.2)
- The standard rule for a value that falls exactly on a bin boundary is to put it into the next higher bin
- Most earthquakes are between 5.5 and 8.5 -> an earthquake of 9 is extraordinary
- The bins slice up all the values of the quantitative variable, so any spaces in a histogram are actual gaps in the data, indicating a region where there are no values
- Relative frequency histogram – replacing the counts on the vertical axis with the percentage of the total number of cases falling in each bin
Stem-and-Leaf Displays
- Histograms don’t show the data values themselves
- A stem-and-leaf display is like a histogram, but it shows the individual values
- To display the scores 83, 76 and 88 together, we could write
8 | 3 8
7 | 6
- Because the leaves show the individual values, we can sometimes see even more in the data than the distribution’s shape
- If you have scores of 432, 540, 571 and 638 -> truncate (or round) the numbers to two places, using the first digit as the stem and the second as the leaf (indicating that 6 | 3 means 630-639)
6 | 3
5 | 4 7
4 | 3
Dotplots
- A simple display that places a dot along an axis for each case in the data
- Shows basic facts about the distribution
- Possible clusters (two different race distances)
- Fig. 4.4
- Some dotplots stretch out horizontally, like a histogram, or run vertically, like a stem-and-leaf display
Think Before You Draw
- Think carefully to decide which type of graph to make
- Check the Categorical Data Condition before making a pie chart or a bar chart
- Before making a stem-and-leaf display, a histogram, or a dotplot, you need to check the Quantitative Data Condition: the data are values of a quantitative variable whose units are known
- You can’t display categorical data in a histogram or quantitative data in a bar chart
- When you describe a distribution, you should always tell about three things: its shape, center and spread
The Shape of a Distribution
- Does the histogram have a single, central hump or several separated humps? These humps are called modes; a histogram with one peak, such as the earthquake magnitudes, is dubbed unimodal; histograms with two peaks are bimodal and those with three or more are called multimodal. A histogram that doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform
- Is the histogram symmetric? Can you fold it along a vertical line through the middle and have the edges match pretty closely, or are more of the values on one side? A symmetric histogram can fold in the middle so that the two sides almost match. The thinner ends of a distribution are called the tails – if one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail
- Do any unusual features stick out? You should always mention any stragglers, or outliers, that stand away from the body of the distribution (either very important or an error)
- Gaps help us see multiple modes and encourage us to notice when the data may come from different sources or contain more than one group
The Center of the Distribution: The Median
- When we think of a typical value, we usually look for the center of the distribution (easy with a unimodal, symmetric distribution)
- When the distribution is skewed or possibly multimodal, it’s not immediately clear
- One natural choice of typical value is the value that is literally in the middle, with half the values below it and half above it
- The middle value that divides the histogram into two equal areas is called the median
- For the tsunamis (page 52), there are 176 earthquakes, so the median is found at the (176+1)/2 = 88.5th place in the sorted data
- If n is odd, the median is the middle value; counting in from the ends, we find this value in the (n+1)/2 position
- When n is even, there are two middle values – so, in this case, the median is the average of the two values in positions n/2 and n/2 + 1
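The odd/even rule above translates directly into code. A minimal sketch (the function name is ours; Python’s `statistics.median` does the same thing):

```python
def median(values):
    """Median by the (n+1)/2 rule: middle value if n is odd,
    average of the two middle values (positions n/2 and n/2 + 1) if n is even."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # (n+1)/2-th value, converted to 0-indexing
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of the two middle values

print(median([5, 1, 9]))     # odd n -> 5
print(median([5, 1, 9, 7]))  # even n -> (5 + 7) / 2 = 6.0
```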
Spread: Home on the Range
- The more the data vary, however, the less the median alone can tell us
- We need to measure how spread out the data values are
- When we describe a distribution numerically, we always report a measure of its spread along with its center
- The range of the data is defined as the difference between the maximum and minimum values: Range = max – min
- The maximum magnitude of these earthquakes is 9.0 and the minimum is 3.7 -> the range is 5.3
- Disadvantage of the range: a single extreme value can make it very large, giving a value that doesn’t really represent the data overall
Spread: The Interquartile Range
- Ignore the extremes and concentrate on the middle of the data
- Divide the data in half at the median, then divide both halves in half again, cutting the data into four quarters -> quartiles
- One quarter of the data lies below the lower quartile, and one quarter of the data lies above the upper quartile, so half the data lies between them; the quartiles border the middle half of the data
- When n is odd, some statisticians include the median in both halves; others omit it
- The difference between the quartiles tells us how much territory the middle half of the data covers and is called the interquartile range; it’s commonly abbreviated IQR: IQR = upper quartile – lower quartile. E.g. the IQR of the earthquakes is 1.0 -> the middle half of the earthquake magnitudes extends across an (interquartile) range of 1.0 Richter scale units
- The IQR is almost always a reasonable summary of the spread of a distribution
- One exception is when the data are strongly bimodal
- For any percentage there is a corresponding percentile that cuts off that percentage of the data below it. The 10th and 90th percentiles, for example, identify the values below which 10% and 90% (respectively) of the data lie. The median, of course, is the 50th percentile
- The lower and upper quartiles are also known as the 25th and 75th percentiles of the data, respectively, since the lower quartile falls above 25% of the data and the upper quartile falls above 75% of the data
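The “split at the median” construction can be sketched as follows. This version takes the quartiles to be the medians of the lower and upper halves and, when n is odd, omits the overall median from both halves; as the text notes, statisticians differ on that convention, so other software may give slightly different quartiles. The magnitudes below are made-up illustrative values.

```python
def quartiles(values):
    """Lower and upper quartile as medians of the two halves of the sorted data."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower, upper = s[:half], s[half + n % 2:]  # omit the median when n is odd
    def med(v):
        m = len(v)
        return v[m // 2] if m % 2 else (v[m // 2 - 1] + v[m // 2]) / 2
    return med(lower), med(upper)

data = [3.7, 4.4, 5.0, 5.6, 6.0, 6.6, 7.0, 9.0]  # hypothetical magnitudes
q1, q3 = quartiles(data)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```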
5-Number Summary
- The 5-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum)
- E.g. earthquake Magnitudes
- Also report the number of data values and the identity of the cases (the Who)
Summarizing Symmetric Distributions: The Mean
- ȳ = Total/n = Σy/n
- The formula says to add up all the values of the variable and divide that sum by the number of data values
- A median is also a kind of average
- The value we calculated is called the mean, y-bar
- The mean feels like the center because it is the point where the histogram balances
Mean or Median?
- The mean makes sense only with symmetric data (not if the distribution is skewed or has outliers)
- For an asymmetric distribution, the median is a better summary of the center
- Because the median considers only the order of the values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the ‘big ones’ or the ‘small ones’ and ignores their distance from the center
- If the histogram is skewed or has outliers, we’re usually better off with the median
What About Spread? The Standard Deviation
-
- The IQR ignores how individual values vary
- The standard deviation takes into account how far each value is from the mean
- Like the mean, the standard deviation is appropriate only for symmetric data
- Examine how far each data value is from the mean -> this distance is called a deviation
- Square each deviation
- When we add up these squared deviations and find their average (almost), we call the result the variance
s² = Σ(y − ȳ)² / (n − 1)
- To get back to the original units, we take the square root of s² – the result, s, is the standard deviation
s = √( Σ(y − ȳ)² / (n − 1) )
- Example (mean = 17):

Original Values   Deviations       Squared Deviations
14                14 − 17 = −3     (−3)² = 9
13                13 − 17 = −4     (−4)² = 16
20                20 − 17 = 3      3² = 9
22                22 − 17 = 5      5² = 25
18                18 − 17 = 1      1² = 1
19                19 − 17 = 2      2² = 4
13                13 − 17 = −4     (−4)² = 16

Add up the squared deviations: 9 + 16 + 9 + 25 + 1 + 4 + 16 = 80
Now divide by n − 1: 80/6 ≈ 13.33
Finally, take the square root: s = √13.33 ≈ 3.65
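The worked example above can be replayed step by step in code: deviations from the mean, squared, summed, divided by n − 1 (the variance), then square-rooted.

```python
import math

values = [14, 13, 20, 22, 18, 19, 13]
mean = sum(values) / len(values)             # 119 / 7 = 17
sq_devs = [(y - mean) ** 2 for y in values]  # 9, 16, 9, 25, 1, 4, 16
variance = sum(sq_devs) / (len(values) - 1)  # 80 / 6 ≈ 13.33
s = math.sqrt(variance)                      # ≈ 3.65
print(round(s, 2))
```

`statistics.stdev(values)` gives the same sample standard deviation in one call.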
Think About Variation
- If many data values are scattered far from the center, the IQR and the standard deviation will be large
- Measures of spread tell how well other summaries describe the data
What to Tell About a Quantitative Variable
- Start by making a histogram or stem-and-leaf display, and discuss the shape of the distribution
- Next, discuss the center and spread
  o Always pair the median with the IQR and the mean with the standard deviation
  o Skewed shape -> median and IQR
  o Symmetric shape -> mean and standard deviation (maybe median and IQR as well)
- Discuss unusual features
  o Reason for multiple modes (e.g. gender) -> split data into separate groups
  o Pointing out outliers (mean and standard deviation once with outliers, once without)
- Example page 62: if there is just a small outlier, and the median and the mean are close, the outlier does not seem to be a problem -> use mean and standard deviation
Terms
- Distribution – the distribution of a quantitative variable slices up all the possible values of the variable into equal-width bins and gives the number of values (or counts) falling into each bin
- Histogram (relative frequency histogram) – a histogram uses adjacent bars to show the distribution of a quantitative variable. Each bar represents the frequency (or relative frequency) of values falling in each bin
- Gap – a region of the distribution where there are no values
- Stem-and-leaf display – shows quantitative data values in a way that sketches the distribution of the data. It’s best described in detail by example
- Dotplot – a dotplot graphs a dot for each case against a single axis
- Shape – to describe the shape of a distribution, look for single vs. multiple modes, symmetry vs. skewness, and outliers and gaps
- Mode – a hump or local high point in the shape of the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed
- Unimodal (bimodal) – having one mode. This is a useful term for describing the shape of a histogram when it’s generally mound-shaped. Distributions with two modes are called bimodal. Those with more than two are multimodal
- Uniform – a distribution that’s roughly flat is said to be uniform
- Symmetric – a distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other
Chapter 5 Understanding and Comparing Distributions
The Big Picture
- Fig. 5.1 A histogram of daily Average Wind Speed for every day in 1989; it is unimodal and skewed to the right, with a possible high outlier
- Was the maximum unusually windy or just the windiest day of the year?
Boxplots
- Draw a single vertical axis spanning the extent of the data; draw short horizontal lines at the lower and upper quartiles and at the median -> they form a box
- Erect ‘fences’ around the main part of the data; we place the upper fence 1.5 IQRs above the upper quartile and the lower fence 1.5 IQRs below the lower quartile; never include the fences in your boxplot
- We use the fences to grow ‘whiskers’; draw lines from the ends of the box up and down to the most extreme data values found within the fences
- We add the outliers by displaying any data values beyond the fences with special symbols
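The fence rule from the steps above is simple arithmetic. A minimal sketch (the function name and the wind-speed values are made up; Q1 and Q3 are passed in directly to keep it short):

```python
def nominate_outliers(values, q1, q3):
    """Points beyond the boxplot fences: more than 1.5 IQRs outside the quartiles."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

speeds = [1.2, 2.0, 2.5, 3.1, 3.4, 4.0, 9.5]        # hypothetical daily wind speeds
print(nominate_outliers(speeds, q1=2.0, q3=3.4))     # 9.5 lies above the upper fence
```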
- Is it windier in the winter or the summer?
- Use the same scale
- Spring/summer and fall/winter
- In the colder months the shape is less strongly skewed and more spread out; wind speeds are higher, with several high values
Comparing Groups with Boxplots
- E.g. are some months windier than others?
- Do some months show more variation? (spread)
- Group observations by month -> side-by-side boxplots (fig. 5.4)
- We can easily see which groups have higher medians, which have the greater IQRs, where the central 50% of the data is located and which have the greater overall range
- Wind speeds tend to decrease in the summer
- The months in which the winds are both strongest and most variable are November through March
- Many outliers -> that windy day in July certainly wouldn’t stand out in November or December, but for July it was remarkable
Outliers
- An outlier is a value that doesn’t fit with the rest of the data
- Boxplots provide a rule of thumb to highlight these unusual points
- Try to understand them in the context of the data
- A histogram gives a better idea of how the outlier fits in with the rest of the data
- Look at the gap between that case and the rest of the data (maybe an error in the data)
- Never leave an outlier in place and proceed as if nothing were unusual
- Never drop an outlier from the analysis without comment just because it’s unusual
Timeplots: Order, Please!
- A display of values against time is sometimes called a timeplot
- Without the monthly division, we can see a calm period during the summer
- Winds are more variable and stronger during the early and late parts of the year
*Smoothing Timeplots
- You could draw a smooth curve or trace through a timeplot (page 89)
- A smooth trace can highlight long-term patterns and help us see them through the more local variation
Looking into the Future
- For example, seasonal patterns -> probably safe to predict a less windy June next year and a windier November
- But we wouldn’t predict another storm on November 21
- This doesn’t work in every case, e.g. stock prices
- Stock prices, unemployment rates, and other economic, social or psychological concepts are much harder to predict than physical quantities
Re-expressing Data: A First Look
Re-expressing to Improve Symmetry
- Skewed data -> difficult to summarize with a center and spread
- Fig. 5.9 -> some CEOs received extraordinarily high compensations, while the majority received relatively ‘little’ -> the mean is $10,307,000 while the median is only $4,700,000
- One approach is to re-express, or transform, the data by applying a simple function to make the skewed distribution more symmetric – we could take the square root or logarithm of each compensation value -> more symmetric; you can identify real outliers
- Variables that are skewed to the right often benefit from a re-expression by square roots, logs, or reciprocals
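The effect of a log re-expression shows up in the mean/median comparison. The “compensation” values below are invented right-skewed numbers (not the CEO data): on the raw scale the mean is pulled far above the median by the one huge value; after taking logs the two are much closer, i.e. the distribution is more symmetric.

```python
import math

pay = [200_000, 400_000, 600_000, 900_000, 1_500_000, 40_000_000]  # made-up, right-skewed

def mean(v):
    return sum(v) / len(v)

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

logs = [math.log10(y) for y in pay]
print(round(mean(pay) / median(pay), 2))    # mean is many times the median
print(round(mean(logs) / median(logs), 2))  # close to 1 after re-expression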
Re-expressing to Equalize Spread Across Groups
- Fig. 5.11 the nicotine levels for both nonsmoking groups are too low to be seen (they can’t be negative and are skewed to the high end)
- Re-expressing -> logarithm
Terms
-
- Boxplot – displays the 5-number summary as a central box with whiskers that extend to the
non-outlying data values; boxplots are particularly effective for comparing groups and for
displaying outliers
-
- Outlier – any point more than 1.5 IQR from either end of the box in a boxplot is nominated as
an outlier
-
- Far Outlier – if a point is more than 3.0 IQR from either end of the box in a boxplot
-
- Comparing distributions – when comparing the distributions of several groups using
histograms or stem-and-leaf displays, consider their o Shape
o Center o Spread
-
- Data skewed -> difficult to summarize with a center and spread
-
- For example seasonal patterns -> probably safe to predict a less windy June next year and a
windier November
- Comparing boxplots – when comparing groups with boxplots
o Compare the shapes – do the boxes look symmetric or skewed? Are there
differences between groups?
o Compare the medians. Which group has the higher center? Is there any pattern to
the medians?
o Compare the IQRs – which group is more spread out? Is there any pattern to how
the IQRs change?
o Using the IQRs as a background measure of variation, do the medians seem to be
different, or do they just vary much as you’d expect from the overall variation?
o Check for possible outliers – identify them if you can and discuss why they might be
unusual; of course, correct them if you find that they are errors
- Timeplot – displays data that change over time; often, successive values are connected with
lines to show trends more clearly; sometimes a smooth curve is added to the plot to help
show long-term patterns and trends
Chapter 6 The Standard Deviation as a Ruler and the Normal Model
- Women’s heptathlon in the Olympics – seven events – different units – how to compare the scores?
The Standard Deviation as a Ruler
-
- Tells us how the whole collection of the values varies
-
- Fig. 6.1 Stem-and-leaf displays for both the long jump and the shot put
-
- Klüft’s 6.78-m long jump is 0.62 meter longer than the mean jump of 6.16 m -> 0.62/0.23 =
2.70 standard deviations better than the mean // Skujyté’s winning shot is only 2.51 standard
deviations better than the mean
Standardizing with z-Scores
-
- Expressing the distance in standard deviations standardized the performances
-
- To standardize a value, we simply subtract the mean performance in that event and then
divide this difference by the standard deviation:
-
- z = (y − ȳ)/s
-
- These values are called standardized values, and are commonly denoted with the letter z (call
them z-scores)
-
- A z-score of 2 tells us that a data value is 2 standard deviations above the mean
-
- The farther a data value is from the mean, the more unusual it is; the sign of a z-score only
tells which side of the mean the value is on, so a z-score of -1.2 is just as extraordinary as a
z-score of 1.2
-
- Klüft: 2.70+1.19=3.89
-
- Skujyté: 0.61+2.51=3.12
-
- Klüft won
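The heptathlon comparison above can be reproduced directly; every number below comes from the text:

```python
# Klüft's long jump: 6.78 m against a mean of 6.16 m and an SD of 0.23 m.
z_kluft_lj = (6.78 - 6.16) / 0.23
print(f"Klüft long-jump z-score: {z_kluft_lj:.2f}")  # 2.70

# Totals of the standardized scores across the two events (values from the text):
kluft_total = 2.70 + 1.19     # long jump + shot put
skujyte_total = 0.61 + 2.51
print(kluft_total > skujyte_total)  # True: Klüft's combined performance wins
```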
-
- When we standardize data to get a z-score, we do two things – first, we shift the data by
subtracting the mean; then we rescale the values by dividing by their standard deviation
Shifting Data
-
- Histogram and boxplot for the men’s weight – some of the men are heavier than the
recommended weight (74kg) -> subtracting 74 kg shifts the entire histogram down but leaves
the spread and the shape exactly the same
-
- When we shift the data by adding (or subtracting) a constant to each value, all measures of
position (center, percentiles, min, max) will increase (or decrease) by the same constant
-
- Adding (or subtracting) a constant to every data value adds (or subtracts) the same constant
to measures of position, but leaves measures of spread unchanged
Rescaling Data
-
- Suppose we want to look at the weights in pounds instead
-
- 2.2 pounds in every kilogram, we’d convert the weights by multiplying each value by 2.2 ->
changes the measurement units
-
- Shape does not change
-
- Mean also multiplied by 2.2 (like all measures of position)
-
- Spread is also 2.2 times larger
-
- When we multiply (or divide) all the data values by any constant, all measures of position
(such as the mean, median, and percentiles) and measures of spread (such as the range, the IQR, and the standard deviation) are multiplied (or divided) by that same constant
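Both rules are easy to verify on a small sample; the weights below are made up, while the 74 kg reference weight and the 2.2 lb/kg conversion come from the text:

```python
import statistics

kg = [68, 74, 80, 91, 102]           # hypothetical weights in kilograms

shifted = [w - 74 for w in kg]       # shift: subtract the recommended weight
pounds = [w * 2.2 for w in kg]       # rescale: convert kg -> lb

# Shifting changes measures of position but leaves spread unchanged:
print(statistics.mean(kg), statistics.mean(shifted))    # differ by exactly 74
print(statistics.stdev(kg), statistics.stdev(shifted))  # identical

# Rescaling multiplies position AND spread by the same constant:
print(statistics.mean(pounds) / statistics.mean(kg))    # ≈ 2.2
print(statistics.stdev(pounds) / statistics.stdev(kg))  # ≈ 2.2
```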
Back to z-Scores
-
- When we subtract the mean of the data from every data value, we shift the mean to zero
(shifts don’t change the standard deviation)
-
- Each shifted value is divided by s -> the SD should be divided by s as well (the SD was s) ->
the new standard deviation becomes 1
-
- Standardizing into z-scores does not change the shape of the distribution of a variable
-
- Standardizing into z-scores changes the center by making the mean 0
-
- Standardizing into z-scores changes the spread by making the standard deviation 1
When Is a z-Score BIG?
-
- How far from 0 does a z-score have to be to be interesting or unusual?
-
- To say more about how big we expect a z-score to be, we need to model the data’s
distribution (a model of reality, not reality itself)
-
- ‘bell-shaped curves’ (normal models) -> normal models are appropriate for distributions
whose shapes are unimodal and roughly symmetric
-
- There is a normal model for every possible combination of mean and standard deviation
-
- N(μ, σ) with a mean of μ and a standard deviation of σ
-
- This mean and standard deviation are not numerical summaries of data -> they are
parameters of the model
-
- z = (y − μ)/σ
-
- The normal model with mean 0 and standard deviation 1 is called the standard normal model
(or the standard normal distribution)
-
- Normality Assumption
-
- Nearly Normal Condition -> the shape of the data’s distribution is unimodal and symmetric:
check this by making a histogram (or a normal probability plot, which we’ll explain later)
The 68-95-99.7 Rule
- It turns out that in a normal model, about 68% of the values fall within 1 standard deviation
of the mean, about 95% of the values fall within 2 standard deviations of the mean, and
about 99.7% – almost all – of the values fall within 3 standard deviations of the mean (fig. 6.6)
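A quick simulation (a sketch, not the book's figure) confirms these percentages for a normal model:

```python
import random

random.seed(2)

# Empirical check of the 68-95-99.7 Rule on simulated Normal data
# (mean 0, SD 1 here; any μ and σ would give the same percentages).
n = 100_000
draws = [random.gauss(0, 1) for _ in range(n)]

def within(k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x) <= k for x in draws) / n

print(f"within 1 SD: {within(1):.3f}")  # ≈ 0.683
print(f"within 2 SD: {within(2):.3f}")  # ≈ 0.954
print(f"within 3 SD: {within(3):.3f}")  # ≈ 0.997
```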
The First Three Rules for Working with Normal Models
- Make a picture
- Make a picture
- Make a picture
-
- Sketch pictures to help think about normal models
-
- Make a histogram or check the Nearly Normal Condition
The worst-case scenario: Tchebycheff’s Inequality
-
- 5 standard deviations above the mean
-
- But 68-95-99.7 rule applies only to normal models
-
- In any distribution, at least 1 − 1/k² of the values must lie within ±k standard deviations of
the mean
-
- For k = 1: 1 − 1/1² = 0; the inequality guarantees nothing if the distribution is far from
Normal
-
- For k = 2: 1 − 1/2² = 3/4; no matter how strange the shape of the distribution, at least 75% of
the values must be within 2 standard deviations of the mean
-
- For k = 3: 1 − 1/3² = 8/9; in any distribution, at least 89% of the values lie within 3 standard
deviations of the mean
Finding Normal Percentiles
Finding Normal Percentiles Using Technology
From Percentiles to Scores: z in Reverse
Are You Normal? Find Out with a Normal Probability Plot
- The normal probability plot – if the distribution of the data is roughly normal, the plot is roughly a diagonal straight line; deviations from a straight line indicate that the distribution is not normal
How Does a Normal Probability Plot Work?
Chapter 7 Scatterplots, Association, and Correlation
-
- Figure 7.1 scatterplot of the average error in nautical miles of the predicted position of
Atlantic hurricanes, plotted against the Year in which the predictions were made
-
- Predictions have improved -> decline in the average error
-
- This timeplot is an example of a more general kind of display called a scatterplot. Scatterplots
may be the most common displays for data. By just looking at them, you can see patterns,
trends, relationships, and even the occasional extraordinary value sitting apart from the
others
-
- Scatterplots show the relationship between two quantitative variables
Looking at Scatterplots
-
- Direction of the association is important
-
- A pattern like this that runs from the upper left to the lower right is said to be negative
-
- A pattern running the other way is called positive
-
- The second thing to look for in a scatterplot is its form: if there is a straight line relationship,
it will appear as a cloud or swarm of points stretched out in a generally consistent, straight
form
-
- E.g. the scatterplot of Prediction Error vs. Year has such an underlying linear form, although
some points stray away from it
-
- Straight, curved, something exotic, or no pattern?
-
- The third feature to look for in a scatterplot is how strong the relationship is
-
- At one extreme, do the points appear tightly clustered in a single stream (whether straight,
curved, or bending all over the place)?
-
- Or at the other extreme, does the swarm of points seem to form a vague cloud through
which we can barely discern any trend or pattern?
-
- Fourth: look for unusual and unexpected features. Often the most interesting thing to see in
a scatterplot is something you never thought to look for, e.g. an outlier standing away from
the overall pattern of the scatterplot
-
- Also look for clusters or subgroups that stand away from the rest of the plot or that show a
trend in a different direction
Roles for Variables
Correlation
-
- Height (x-axis, explanatory variable) and Weight (y-axis, response variable) -> taller students
tend to weigh more
-
- Figure 7.2: form is fairly straight, although there seems to be a high outlier, as the plot
shows; pattern looks straight, clearly positive
-
- The units shouldn’t matter to our measure of strength, we can remove them by
standardizing each variable for each point, instead of the value (x, y) we will have the standardized coordinates (zx, zy) to standardize values, we subtract the mean of each variable and then divide by its standard deviation:
(zx, zy) = ((x − x̄)/sx, (y − ȳ)/sy)
-
- Figure 7.3: scatterplot of standardized heights and weights – scale on both axes are standard
deviation units the underlying linear pattern seems steeper in the standardized plot (due
to the scales of the axes are now the same) equal scaling gives a neutral way of drawing
the scatterplot and a fairer impression of the strength of the association
-
- In a positive association, y tends to increase as x increases points in the upper right and
lower left strengthen that impression
-
- Points in the upper left and lower right quadrants tend to weaken the positive association
-
- Points with a z-score of zero on either variable don’t vote either way, because then zx·zy = 0
(see also figure 7.4)
-
- To turn these products into a measure of the strength of the association, just add up the
zx·zy products for every point in the scatterplot:
Σ zx·zy
This summarizes the direction and strength of the association for all the points
-
- To adjust for the fact that the size of the sum gets bigger the more data we have, we divide
the sum by n − 1 -> correlation coefficient: r = Σ zx·zy / (n − 1)
(see also page 155/156)
Correlation Conditions
-
- Correlation measures the strength of the linear association between two quantitative
variables
-
- Before you use correlation, you must check several conditions:
o Quantitative Variables Condition – correlation applies only to quantitative variables
o Straight Enough Condition
o Outlier Condition – when you see an outlier, it is often a good idea to report the
correlation with and without the point
Correlation Properties
-
- Correlation measures the strength of the linear association between two quantitative
variables
-
- The sign of a correlation coefficient gives the direction of the association
-
- Correlation is always between -1 and +1 – correlation can be exactly equal to -1 and +1 but
these values are unusual in real data
-
- Correlation treats x and y symmetrically – the correlation of x and y is the same as the
correlation of y with x
-
- Correlation has no units (but don’t use percentages)
-
- Correlation is not affected by changes in the center or scale of either variable – changing the
units or baseline of either variable has no effect on the correlation coefficient – correlation
depends only on the z-scores, and they are unaffected by changes in center or scale
-
- Correlation measures the strength of the linear association between the two variables
-
- Correlation is sensitive to outliers – a single outlying value can make a small correlation large
or make a large one small
Warning: Correlation ≠ Causation
-
- Figure 7.5 – the two variables are obviously related to each other but that doesn’t prove that
storks bring babies
-
- A hidden variable that stands behind a relationship and determines it by simultaneously
affecting the other two variables is called a lurking variable
-
- Scatterplots and correlation coefficients never prove causation
Correlation Tables
- The rows and columns of the table name the variables, and the cells hold the correlations
-
- But: without any checks for linearity and outliers, the correlation table risks showing truly
small correlations that have been inflated by outliers, truly large correlations that are hidden
by outliers, and correlations of any size that may be meaningless because the underlying
form is not linear
-
- Table 7.1: the diagonal cells of a correlation table always show correlations of exactly 1
*Measuring Trend: Kendall’s Tau
-
- Scales of the sort that attempt to measure attitudes numerically are called Likert scales
-
- Likert scales have order (e.g. assessing the pace of a course on a scale from 1 to 5)
-
- But the correlation coefficient might not be the appropriate measure -> use an alternative
measure of association: Kendall’s tau
-
- Kendall’s tau is a statistic designed to assess how close the relationship between two
variables is to being monotone – a monotone relationship is one that consistently increases
or decreases, but not necessarily in a linear fashion
-
- Kendall’s tau measures monotonicity directly – for each pair of points in a scatterplot, it
records only whether the slope of a line between those two points is positive, negative, or zero
*Nonparametric Association: Spearman’s Rho
-
- Spearman’s rho can deal with the two problems of outliers and bends in the data (that make
it impossible to interpret correlation)
-
- Rho replaces the original data values with their ranks within each variable
-
- It replaces the lowest value in x by the number 1 ...
-
- The same ranking method is applied to the y-variable
-
- Spearman’s rho is the correlation of the two rank variables – it must be between -1 and 1
-
- Both (Spearman and Kendall) are examples of what are called nonparametric or
distribution-free methods
Straightening Scatterplots
Chapter 8 Linear Regression
-
- Burger King: the scatterplot of the Fat (in grams) versus the Protein (in grams) for food sold
at Burger King shows a positive, moderately strong, linear relationship
-
- The correlation between Fat and Protein is 0.83 (fairly strong relationship)
-
- We can model the relationship with a line and give its equation -> a linear model (an
equation of a straight line through the data; but wrong in the sense that it can’t match reality exactly)
Residuals
-
- Figure page 179
-
- The line might suggest that a BK Broiler chicken sandwich with 30 grams of protein should
have 36 grams of fat when, in fact, it actually has only 25 grams of fat
- We call the estimate made from a model the predicted value, and write it as ŷ to distinguish
it from the true value, y
- The difference between the observed value and its associated predicted value is called the
residual – the residual value tells us how far off the model’s prediction is at that point
- BK Broiler chicken residual: y − ŷ = 25 − 36 = −11 g of fat -> the actual fat content is about
11 grams less than the model predicts
- To find the residuals, we always subtract the predicted value from the observed one
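The residual calculation can be written out directly, using the chapter's Burger King numbers (the fitted line ŷ = 6.8 + 0.97·Protein comes from the text):

```python
# Residual = observed - predicted, for the BK Broiler chicken sandwich.
def predicted_fat(protein):
    """The text's fitted line: predicted fat (g) from protein (g)."""
    return 6.8 + 0.97 * protein

protein, observed_fat = 30, 25                    # values from the text
residual = observed_fat - predicted_fat(protein)  # observed minus predicted
print(f"predicted: {predicted_fat(protein):.1f} g, residual: {residual:.1f} g")
# The sandwich has about 11 g less fat than the model predicts.
```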
“Best Fit” Means Least Squares
-
- Squaring all residuals and add them up
-
- The sum indicates how well the line we drew fits the data – the smaller the sum, the better
the fit
-
- The line of best fit is the line for which the sum of the squared residuals is smallest, the least
squares line
The Linear Model
-
- Straight line: y = mx + b
-
- Linear model (statistics): ŷ = b0 + b1x (b1 is the slope and b0 the intercept of the line)
-
- The b’s are called the coefficients of the linear model – the coefficient b1 is the slope, which
tells how rapidly ŷ changes with respect to x – the coefficient b0 is the intercept, which tells
where the line hits (intercepts) the y-axis
- Burger King: ŷ = 6.8 + 0.97·Protein (one more gram of protein -> 0.97 more grams of fat;
no protein -> 6.8 grams of fat? Not reasonable -> then the intercept serves only as a starting value for our predictions)
The Least Squares Line
-
- The correlation (tells us the strength of the linear association), the standard deviations (give
us the units), and the means (tell us where to put the line)
-
- Slope b1 = r* sy/sx
-
- Changing the units of x and y affects their standard deviations directly
-
- Units of the slope are always the units of y per unit of x
-
- Intercept: b0 = ȳ − b1·x̄
-
- Example page 182
-
- Regression almost always means “the linear model fit by least squares”
-
- To use a regression model, we should check the same conditions for regressions as we did for
correlation: the Quantitative Variables Condition, the Straight Enough Condition, and the Outlier Condition
Correlation and the Line
- Figure 8.3: scatterplot for the BK items of zy (standardized Fat) vs. zx (standardized Protein)
along with their least squares line
-
- Equation: ẑy = r·zx
-
- It says that in moving one standard deviation from the mean in x, we can expect to move
about r standard deviations away from the mean in y
- BK: if we standardize both protein and fat, we can write ẑfat = 0.83·zprotein
- It tells us that for every standard deviation above (or below) the mean a menu item is in
protein, we’d predict that its fat content is 0.83 standard deviations above (or below) the
mean fat content
- In general, menu items that are one standard deviation away from the mean in x are, on
average, r standard deviations away from the mean in y
How Big Can Predicted Values Get?
- Each predicted y tends to be closer to its mean (in standard deviations) than its
corresponding x was. This property of the linear model is called regression to the mean, and
that’s where we got the term regression line.
Residuals Revisited
-
- Data = Model + Residual
-
- Residual = Data - Model
-
- e = y − ŷ
-
- A scatterplot of the residuals versus the x-values should be the most boring scatterplot
you’ve ever seen – it shouldn’t have any interesting features, like a direction or shape – it should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers.
The Residual Standard Deviation
-
- The standard deviation of the residuals, se, gives us a measure of how much the points
spread around the regression line
-
- New assumption: Equal Variance Assumption with the associated Does the Plot Thicken?
Condition – spread is about the same throughout
-
- se = √(Σe² / (n − 2))
R² – The Variation Accounted For
-
- A correlation of −0.5 links the variables just as strongly as a correlation of 0.5 – only the
direction differs
-
- If we square the correlation coefficient, we’ll get a value between 0 and 1, and the direction
won’t matter
-
- The squared correlation, r2, gives the fraction of the data’s variation accounted for by the
model, and 1-r2 is the fraction of the data’s variation left in the residuals
-
- BK: 31% of the variability in total Fat has been left in the residuals / 69% of the variability in
the fat content of BK sandwiches is accounted for by variation in the protein content
-
- All regression analyses include this statistic, although by tradition, it is written with a capital
letter, R2, and pronounced “R-squared”
-
- R2 depends on the kind of data you are analyzing
-
- Data from scientific experiments often have high percentages
-
- Data from observational studies and surveys often show weak associations -> even an R² of
30% to 50% can provide evidence of a useful regression
A Tale of Two Regressions
-
- Solving our fat equation for Protein does not give the model for predicting Protein from Fat
-
- Instead, fit a new least squares regression: protein-hat = 0.55 + 0.709·Fat
Regression Assumptions and Conditions
-
- Reasonable?
-
- Check Quantitative Variables Condition to be sure a regression is appropriate
-
- Linear model
o Linearity Assumption
o Straight Enough Condition
o Does the Plot Thicken? Condition
o Outlier Condition
-
- For the standard deviation of the residuals to summarize the scatter, all the residuals should
share the same spread
Reality Check: Is the Regression Reasonable?
-
- Direction right?
-
- Size reasonable?
Chapter 18 Sampling Distribution Models
The Central Limit Theorem for Sample Proportions
-
- True proportion: p = 0.45 (45% of all American adults believe in ghosts) (fig. 18.1)
-
- 2000 simulated independent samples of 808 adults (p=0.45); we don’t get the same
proportion for each sample we draw
-
- p = parameter of the model (the probability of a success)
-
- ^p for the observed proportion in a sample
-
- q for the probability of a failure (q = 1 − p) and q̂ for its observed value
-
- P = general probability
-
- The histogram (Fig.18.1) is a simulation of what we’d get if we could see all the proportions
from all possible samples; that distribution has a special name; it is called the sampling
distribution of the proportions
-
- A sampling distribution model for how a sample proportion varies from sample to sample
allows us to quantify that variation and to talk about how likely it is that we’d observe a
sample proportion in any particular interval
-
- To use a normal model, we need to specify two parameters: its mean and standard
deviation; the center is p, so we’ll put μ, the mean of the Normal, at p
-
- The standard deviation of the proportion of successes, p̂: p̂ is the number of successes
divided by the number of trials, n, so its standard deviation is
σ(p̂) = SD(p̂) = √(pq/n)
-
- Model: N(p, √(pq/n))
-
- p = 0.45 -> SD(p̂) = √(0.45·0.55/808) = 0.0175
- The model becomes a better and better representation of the distribution of the sample
proportions as the sample size gets bigger
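The sampling distribution can be simulated just as the text describes: many samples of n = 808 adults from a population with p = 0.45 (this is a sketch of the idea behind Fig. 18.1, not the book's own simulation):

```python
import math
import random
import statistics

random.seed(4)

# 2000 simulated independent samples of n = 808, each yielding a p-hat.
p, n, samples = 0.45, 808, 2000
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(samples)]

sd_theory = math.sqrt(p * (1 - p) / n)   # SD(p-hat) = sqrt(p*q/n) ≈ 0.0175
print(f"theory SD:    {sd_theory:.4f}")
print(f"simulated SD: {statistics.stdev(phats):.4f}")   # close to theory
print(f"simulated mean: {statistics.mean(phats):.3f}")  # close to p = 0.45
```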
-
- scale for Normal model:
-
- Normal model -> 68-95-99.7 Rule
-
- Since 2*1.75% = 3.5%, we see that the CBS poll estimating belief in ghosts at 48% is
consistent with our guess of 45%
-
- This is what we mean by sampling error – it’s not really an error at all, but just variability
you’d expect to see from one sample to another -> sampling variability
Assumptions and Conditions
-
- The Independence Assumption: The sampled values must be independent of each other
The Sample Size Assumption: The sample size, n, must be large enough
-
- Randomization Condition: the data values should have been sampled (or assigned) at random
10% Condition: the sample size, n, must be no larger than 10% of the population Success/Failure Condition: p*n > 10 and q*n > 10
A Sampling Distribution Model
-
- Laplace: the larger the sample size, the better the model works
-
- No longer is a proportion something we just compute for a set of data; we now see it as a
random variable quantity that has a probability distribution, and thanks to Laplace we have a
model for that distribution sampling distribution model for the proportion
-
- They inform us about the amount of variation we should expect when we sample
-
- They act as a bridge from the real world of data to the imaginary model of the statistic and
enable us to say something about the population when all we have is data from the real
world
-
- Margin of error
What about Quantitative Data?
-
- The means also have a sampling distribution that we can model with a Normal model
-
- Laplace theoretical result applies to means, too
Simulating the Sampling Distribution of a Mean
-
- Toss a fair die 10000 times (fig.18.5)
-
- Toss a pair of dice and record the average of the two (fig.18.6) -> more likely to get an
average near 3.5 -> triangular distribution
[Figure: Normal model scale for p̂, with ticks 0.3975 to 0.5025 marking −3σ to +3σ around p = 0.45]
- Average 3 or 4 dice -> Law of Large Numbers: as the sample size (number of dice) gets
larger, each sample average is more likely to be closer to the population mean & the
distribution is becoming bell-shaped and approaching the Normal model
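The dice experiment above is easy to run in code; this is a sketch of the idea behind figs. 18.5–18.6, with the number of repetitions chosen arbitrarily:

```python
import random
import statistics

random.seed(5)

def sample_means(n_dice, reps=10_000):
    """Simulate `reps` rolls of `n_dice` fair dice; return the per-roll averages."""
    return [statistics.mean(random.randint(1, 6) for _ in range(n_dice))
            for _ in range(reps)]

for n_dice in (1, 2, 4):
    means = sample_means(n_dice)
    print(f"{n_dice} dice: mean ≈ {statistics.mean(means):.2f}, "
          f"SD ≈ {statistics.stdev(means):.3f}")
# All three are centered near 3.5, but the SD shrinks roughly like 1.71/sqrt(n):
# about 1.71 for one die, 1.21 for two, 0.85 for four.
```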
The Central Limit Theorem: The Fundamental Theorem of Statistics
-
- For sampling distributions, we had to check a few conditions
-
- For means, there are almost no conditions at all
-
- The sampling distribution of any mean becomes more nearly Normal as the sample size
grows; all we need is for the observations to be independent and collected with randomization; we don’t even care about the shape of the population distribution This surprising fact is the result Laplace proved -> Central Limit Theorem (CLT)
-
- Not only does the distribution of means of many random samples get closer and closer to a
Normal model as the sample size grows, this is true regardless of the shape of the population
distribution
-
- Even skewed or bimodal population -> CLT: means of repeated random samples will tend to
follow a Normal model as the sample size grows
-
- Works better and faster the closer the population distribution is to a Normal model
-
- Works better for larger samples
Assumptions and Conditions (for the CLT)
-
- Independence & Sample Size Assumption
-
- Randomization Condition
10% Condition
Large Enough Sample Condition
-
- For proportions, the sampling distribution is centered at the population proportion
-
- For means, it’s centered at the population mean
-
- Means have smaller standard deviations than individuals
-
- The standard deviation of ȳ falls as the sample size grows
-
- But it only goes down by the square root of the sample size:
SD(ȳ) = σ/√n
-
- When we have categorical data, we calculate a sample proportion, p̂; the sampling
distribution of this random variable has a Normal model with a mean at the true proportion
p and a standard deviation of SD(p̂) = √(pq/n)
-
- When we have quantitative data, we calculate a sample mean, ȳ; the sampling distribution
of this random variable has a Normal model with a mean at the true mean, μ, and a
standard deviation of SD(ȳ) = σ/√n
About Variation
-
- Means vary less than individual data values
-
- Variability of sample means decreases as the sample size increases
-
- If only we had a much larger sample, we could get the standard deviation of the sampling
distribution really under control so that the sample mean could tell us still more about the
unknown population mean
-
- The square root limits how much we can make a sample tell about the population (law of
diminishing returns)
The Real World and the Model World
-
- Real world distribution of the sample (histogram, bar chart, table)
-
- Math world sampling distribution model of the statistic, a Normal model based on the CLT
-
- Don’t think the CLT says that the data are Normally distributed as long as the sample is large
enough
-
- The CLT doesn’t talk about the distribution of the data from the sample; it talks about the
sample means and sample proportions of many different random samples drawn from the same population
Sampling Distribution Models
-
- Statistic itself is a random variable
-
- Shows us the distribution of possible values that the statistic could have had
-
- Sample-to-sample variability generates the sampling distribution
-
- Sampling distributions arise because samples vary – each random sample will contain
different cases and, so, a different value of the statistic
-
- Although we can always simulate a sampling distribution, the CLT saves us the trouble for
means and proportions Terms
-
- Sampling distribution model – different random samples give different values for a statistic;
the sampling distribution model shows the behavior of the statistic over all the possible
samples for the same size n
-
- Sampling variability – the variability we expect to see from one random sample to another
-
- Sampling error – sampling variability
-
- Sampling distribution model for a proportion – if assumptions of independence and random
sampling are met, and we expect at least 10 successes and 10 failures, then the sampling distribution of a proportion is modeled by a Normal model with a mean equal to the true
proportion value, p, and a standard deviation equal to √(pq/n)
-
- Central Limit Theorem – CLT states that the sampling distribution model of the sample mean
(and proportion) from a random sample is approximately Normal for large n, regardless of
the distribution of the population, as long as the observations are independent
-
- Sampling distribution model for a mean – if assumptions of independence and random
sampling are met, and the sample size is large enough, the sampling distribution of the
sample mean is modeled by a Normal model with a mean equal to the population mean, μ,
and a standard deviation equal to σ/√n
Chapter 19 Confidence Intervals for Proportions (sea fans)
A Confidence Interval
-
- We know it’s approximately Normal and that its mean is the proportion of all infected sea
fans on the Las Redes Reef
-
- ^p=51.9% (centered at p)
-
- SD(^p) = √(pq/n)
-
- We don’t know p
-
- Whenever we estimate the standard deviation of a sampling distribution, we call it a
standard error
SE(^p) = √(^p^q/n)
-
- Sea fans: 4.9%
-
- Because it’s Normal, it says that about 68% of all samples of 104 sea fans will have ^p’s
within 1 SE (0.049), of p; about 95% of all these samples will be within p ± 2 SEs if I reach out 2 SEs, or 2* 0.049, away from me on both sides, I’m 95% sure that p will be within my grasp
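This reasoning can be sketched numerically with the sea fan values given in the text (^p = 51.9%, n = 104), using the precise 95% critical value 1.96 in place of the informal 2 SEs:

```python
# Sketch of the one-proportion z-interval for the sea fan data in the text.
from math import sqrt

p_hat, n = 0.519, 104
q_hat = 1 - p_hat
se = sqrt(p_hat * q_hat / n)            # standard error of p-hat
z_star = 1.96                           # critical value for 95% confidence
me = z_star * se                        # margin of error
low, high = p_hat - me, p_hat + me
print(round(se, 3))                     # 0.049
print(round(low, 3), round(high, 3))    # about (0.423, 0.615)
```

So we can be 95% confident that between roughly 42% and 62% of all sea fans on the reef are infected.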
- The interval calculated and interpreted here is sometimes called a one-proportion z-interval
What Does “95% Confidence” Really Mean?
- Formally, what we mean is that “95% of samples of this size will produce confidence intervals that capture the true proportion.” This is correct, but a little long winded, so we sometimes say, “we are 95% confident that the true proportion lies in our interval.” Our uncertainty is about whether the particular sample we have at hand is one of the successful ones or one of the 5% that fail to produce an interval that captures the true value.
Margin of Error: Certainty vs. Precision
-
- Our confidence interval had the form ^p ± 2 SE (^p)
-
- The extent of the interval on either side of ^p is called the margin of error (ME)
Critical Values
-
- To change the confidence level, we’d need to change the number of SEs so that the size of
the margin of error corresponds to the new level
-
- This number of SE is called the critical value -> z* -> Table Z (Table D)
-
- For a 95% confidence interval, you’ll find the precise critical value is z* = 1.96 -> 1.96
standard deviations from the mean
Assumptions and Conditions
Independence Assumption
-
- Whether you decide that the Independence Assumption is plausible depends on your
knowledge of the situation
-
- Randomization condition
- 10% Condition
Sample Size Assumption
- Whether the sample is large enough to make the sampling model for the sample proportions approximately Normal
- Success/Failure Condition: we must expect at least 10 successes and at least 10 failures
Choosing Your Sample Size
-
- Suppose a candidate is planning a poll and wants to estimate voter support within 3% with
95% confidence. How large a sample does she need?
-
- ME = z* √(^p^q/n)
-
- 0.03 = 1.96 √(^p^q/n)
-
- For ^p we can guess a value – the worst case is 0.50, which makes ^p^q (and hence n) largest
-
- 0.03 = 1.96 √(0.5 ∗ 0.5/n)
- 0.03 √n = 1.96 √(0.5 ∗ 0.5) = 0.98, so √n ≈ 32.67
-
- n ≈ 1067.1
-
- We need at least 1068 respondents to keep the margin of error as small as 3% with a
confidence level of 95%
-
- To cut the standard error (and thus the ME) in half, we must quadruple the sample size
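The sample-size calculation above can be sketched directly by solving ME = z*√(pq/n) for n, using the worst case p = q = 0.5 from the text:

```python
# Sketch of the poll sample-size calculation: ME = 3%, 95% confidence.
from math import ceil

z_star, me = 1.96, 0.03
p = q = 0.5                      # worst case: maximizes p*q, hence n
n = (z_star / me) ** 2 * p * q   # ME = z*sqrt(pq/n) solved for n
print(ceil(n))                   # 1068 respondents
```

Rounding up is deliberate: rounding down would give a margin of error slightly larger than 3%.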
Terms
-
- Standard error – when we estimate the standard deviation of a sampling distribution using
statistics found from the data, the estimate is called a standard error
-
- Confidence interval – a level C confidence interval for a model parameter is an interval of
values usually of the form
Estimate ± margin of error
Found from data in such a way that C% of all random samples will yield intervals that capture the true parameter value
-
- One-proportion z-interval – a confidence interval for the true value of a proportion. The
confidence interval is
^p ± z*SE(^p)
Where z* is a critical value from the Standard Normal model corresponding to the specified confidence level
-
- Margin of error – in a confidence interval the extent of the interval on either side of the
observed statistic value is called the margin of error. A margin of error is typically the product
of a critical value from the sampling distribution and a standard error from the data. A small
margin of error corresponds to a confidence interval that pins down the parameter precisely.
A large margin of error corresponds to a confidence interval that gives relatively little
information about the estimated parameter. For a proportion, ME = z* √(^p^q/n)
-
- Critical value – the number of standard errors to move away from the mean of the sampling
distribution to correspond to the specified level of confidence. The critical value, denoted z*, is usually found from a table or with technology
Chapter 20 Testing Hypotheses about Proportions
- Cracking ingots: in one plant only about 80% of the ingots have been free of cracks -> changes to reduce the cracking proportion -> since then, 400 ingots have been cast and only 17% of them have cracked
Natural sampling variability or evidence to assure management that the true cracking rate now is really below 20%?
Test hypotheses about models
Hypotheses
-
- Hypotheses are working models that we adopt temporarily
-
- We assume that they have in fact made no difference and that apparent improvement is just
random fluctuation (sampling error) -> called the null hypothesis
-
- Null hypotheses (H0), specifies a population model parameter of interest and proposes a
value for that parameter
H0: parameter = hypothesized value
Ingots: H0: p = 0.20
-
- The alternative hypothesis, which we denote HA, contains the values of the parameter that
we consider plausible if we reject the null hypothesis
Ingots: management interested in reducing the cracking rate, so their alternative is HA: p<0.20
-
- 400 new ingots have been cast -> success/failure condition satisfied and independent ->
normal sampling distribution model
-
- SD(^p) = √(pq/n) = √(0.2 ∗ 0.8/400) = 0.02
- With p (0.2) and SD(^p) (0.02) -> we can find out how likely it would be to see the observed value of ^p=17%
z = (0.17-0.20)/0.02 = -1.5
How likely is it to observe a value at least 1.5 standard deviations below the mean of a Normal model? -> 0.067 (table A) (probability of observing a cracking rate of 17%)
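The ingot calculation above can be sketched as a one-sided one-proportion z-test, using the numbers given in the text:

```python
# Sketch of the one-proportion z-test for the ingot example:
# H0: p = 0.20 vs HA: p < 0.20, with p-hat = 0.17 and n = 400.
from math import sqrt
from statistics import NormalDist

p0, p_hat, n = 0.20, 0.17, 400
sd = sqrt(p0 * (1 - p0) / n)        # SD(p-hat) computed under H0
z = (p_hat - p0) / sd
p_value = NormalDist().cdf(z)       # one-sided: lower tail
print(round(sd, 3), round(z, 2))    # 0.02 -1.5
print(round(p_value, 3))            # about 0.067
```

A cracking rate this low would happen by chance alone about 6.7% of the time if the true rate were still 20%; whether that is convincing evidence is a judgment call.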
A Trial as a Hypothesis Test
-
- Evaluating the evidence in light of the presumption of innocence and judges whether the
evidence against the defendant would be plausible if the defendant were in fact innocent
-
- You must judge for yourself in each situation whether the probability of observing your data
is small enough to constitute ‘reasonable doubt’
P-Values
-
- We want to find the probability of seeing data like these given that the null hypothesis is true
-> P-value
-
- When the P-value is high, we haven’t seen anything unlikely or surprising at all -> we fail to
reject the null hypothesis
-
- When the P-value is low enough, it says that it’s very unlikely we’d observe data like these if
our null hypothesis were true -> we reject the null hypothesis
What to Do with an ‘Innocent’ Defendant
-
- Insufficient evidence to convict the defendant, the jury does not decide that H0 is true and
declare the defendant innocent – juries can only fail to reject the null hypothesis and declare
the defendant ‘not guilty’
-
- And we never declare the null hypothesis to be true because we simply do not know whether
it’s true or not
The Reasoning of Hypothesis Testing
-
- To assess how unlikely our data may be, we need a null model
-
- The null hypothesis specifies a particular parameter value to use in our model. In the usual
shorthand, we write H0: parameter = hypothesized value. The alternative hypothesis, HA,
contains the values of the parameter we consider plausible when we reject the null
-
- Specify the model you will use to test the null hypothesis and the parameter of interest
-
- State assumptions and check any corresponding conditions
-
- “Because the conditions are satisfied, I can model the sampling distribution of the proportion
with a Normal model.”
-
- “Because the conditions are not satisfied, I can’t proceed with the test.”
-
- The test about proportions is called a one-proportion z-test
o We test the hypothesis H0: p = p0 using the statistic z = (^p - p0)/SD(^p)
o We use the hypothesized proportion to find the standard deviation, SD(^p) = √(p0q0/n)
3. Mechanics
-
- Actual calculation
-
- Obtain a P-value – the probability that the observed statistic value occurs if the null model is
correct
-
- Statement about the null hypothesis – either reject or that we fail to reject
-
- The size of the effect is always a concern when we test hypotheses – a good way to look at
the effect size is to examine a confidence interval
Alternative Alternatives
-
- Old cracking rate: 20%
-
- H0:p=0.20
-
- Someone might be interested in any change in the cracking rate -> HA: p ≠ 0.20
-
- An alternative hypothesis such as this is known as a two-sided alternative because we are
equally interested in deviations on either side of the null hypothesis value. For two-sided alternatives, the P-value is the probability of deviating in either direction from the null hypothesis value
- But only interested in lowering the cracking rate below 20% -> HA: p < 0.20
-
- An alternative hypothesis that focuses on deviations from the null hypothesis value in only
one direction is called a one-sided alternative
-
- For a hypothesis test with a one-sided alternative, the P-value is the probability of deviating
only in the direction of the alternative away from the null hypothesis value
P-Values and Decisions: What to Tell About a Hypothesis Test
-
- How small should the P-value be in order for you to reject the null hypothesis? -> highly
context-dependent
-
- Examples page 487
-
- The conclusion about any null hypothesis should be accompanied by the P-value of the test
-
- To complete the analysis, follow your test with a confidence interval for the parameter of
interest, to report the size of the effect
Terms
-
- Null hypothesis – the claim being assessed in a hypothesis test is called the null hypothesis.
Usually, the null hypothesis is a statement of “no change from the traditional value”, “no
effect”, “no difference”, or “no relationship”. For a claim to be a testable null hypothesis, it
must specify a value for some population parameter that can form the basis for assuming a
sampling distribution for a test statistic
-
- Alternative hypothesis – the alternative hypothesis proposes what we should conclude if we
find the null hypothesis to be unlikely
-
- P-value – the probability of observing a value for a test statistic at least as far from the
hypothesized value as the statistic value actually observed if the null hypothesis is true. A
small P-value indicates either that the observation is improbable or that the probability
calculation was based on incorrect assumptions. The assumed truth of the null hypothesis is
the assumption under suspicion
- One-proportion z-test – a test of the null hypothesis that the proportion of a single sample
equals a specified value (H0: p = p0) by referring the statistic z = (^p - p0)/SD(^p) to a
Standard Normal model
-
- Effect size – the difference between the null hypothesis value and the true value of a model
parameter
-
- Two-sided alternative – an alternative hypothesis is two-sided (HA: p ≠ p0) when we are
interested in deviations in either direction away from the hypothesized parameter value
-
- One-sided alternative – an alternative hypothesis is one-sided (e.g. HA: p > p0 or HA: p < p0)
when we are interested in deviations in only one direction away from the hypothesized parameter value
Chapter 21 More About Tests and Intervals
-
- Florida: no longer are riders 21 and older required to wear helmets
-
- Police reports of motorcycle accidents: before the change in the helmet law, 60% of youths
involved in a motorcycle accident had been wearing their helmets; three years following the
law change, considering these riders to be a representative sample of the larger population,
they observed 781 young riders who were involved in accidents – of these, 50.7% (396) were
wearing helmets
Zero In on the Null
-
- One good way to identify both the null and alternative hypotheses is to think about the Why
of the situation
-
- The null hypotheses for the Florida study could be that the true rate of helmet use remained
the same among young riders after the law changed
-
- It makes more sense to use what you want to show as the alternative
-
- A P-value actually is a conditional probability. It tells us the probability of getting results at
least as unusual as the observed statistic, given that the null hypothesis is true
-
- The P-value is not the probability that the null hypothesis is true – it is a probability about the
data
-
- All we can say is that, given the null hypothesis, there is a 3% chance (P-value of 0.03) of
observing the statistic value that we have actually observed
What to do with a High P-value
-
- 0.793 ?
-
- Big P-values just mean that what we’ve observed isn’t surprising
-
- A big P-value doesn’t prove that the null hypothesis is true, but it certainly offers no evidence
that it’s not true
-
- When we see a large P-value, all we can say is that we ‘don’t reject the null hypothesis’
Alpha Levels
-
- Sometimes we have to decide whether or not to reject the null hypothesis
-
- We can define ‘rare event’ arbitrarily by setting a threshold for our P-value. If our P-value
falls below that point, we’ll reject the null hypothesis. We call such results statistically
significant. The threshold is called an alpha level
-
- Common alpha levels are 0.1, 0.05, 0.01 and 0.001
-
- E.g. assessing safety of air bags -> low alpha level
-
- E.g. if folks prefer their pizza with or without pepperoni -> alpha = 0.1
-
- We often choose 0.05
-
- Assess alpha level before you look at the data
-
- The alpha level is also called the significance level – when we reject the null hypothesis, we
say that the test is ‘significant at that level’
-
- E.g. we might say that we reject the null hypothesis ‘at the 5% level of significance’
-
- If the P-value does not fall below alpha -> the data have failed to provide sufficient evidence
to reject the null hypothesis.
-
- If the P-value is too high -> “we fail to reject the null hypothesis” (-> there is insufficient
evidence to conclude that the practitioners are performing better than they would if they were just guessing)
Significant vs. Important
-
- Statistically significant -> P-value lower than our alpha level
-
- Don’t be lulled into thinking that statistical significance carries with it any sense of practical
importance or impact
Confidence Intervals and Hypothesis Tests
-
- For the motorcycle helmet example, a 95% confidence interval would give 0.507 ± 1.96 ∗
0.0179 = (0.472, 0.542) or 47.2% to 54.2% -> previous rate would be 50% -> in the interval ->
not able to reject the null hypothesis
-
- In general, a confidence interval with a confidence level of C% corresponds to a two-sided
hypothesis test with an alpha level of (100-C)% (e.g. 95% confidence interval -> two-sided
hypothesis test at alpha 5%)
-
- For a one-sided test with alpha 5%, the corresponding confidence interval has a confidence
level of 90% – that’s 5% in each tail. In general, a confidence interval with a confidence
level of C% corresponds to a one-sided hypothesis test with an alpha level of ½(100-C)%
A Confidence Interval for Small Samples
-
- Add four phony observations – two to the successes, two to the failures
-
- Adjusted proportion: p̃ = (y + 2)/(n + 4) and, for convenience, we write ñ = n + 4
- Adjusted interval: p̃ ± z* √(p̃(1 - p̃)/ñ)
- Called the Agresti-Coull interval or the ‘plus-four’ interval
Making Errors
-
- The null hypothesis is true, but we mistakenly reject it (Type I error) – e.g. a healthy person is
diagnosed as with disease (the null hypothesis is usually the assumption that a person is
healthy)
-
- The null hypothesis is false, but we fail to reject it (Type II error) – e.g. an infected person is
diagnosed as disease free
-
- Which of these errors is more serious, depends on the situation, the cost, and your point of
view
-
- Page 512
-
- When you choose level alpha, you’re setting the probability of a Type I error to alpha
-
- We assign the letter ß to the probability of this mistake
-
- We could reduce ß for all alternative parameter values by increasing alpha – but we’d make
more Type I errors -> tension between Type I and Type II errors
-
- The only way to reduce both types of error is to collect more evidence or, in statistical terms,
to collect more data
Power
-
- The power of a test is the probability that it correctly rejects a false null
-
- When the power is high, we can be confident that we’ve looked hard enough
-
- We know that ß is the probability that a test fails to reject a false null hypothesis, so the
power of the test is the probability that it does reject: 1-ß
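The relation power = 1 - ß can be sketched for a one-proportion z-test. The numbers here are illustrative, not from the book: H0: p = 0.20 vs HA: p < 0.20 at alpha = 0.05 with n = 400, assuming the true proportion is 0.15.

```python
# Sketch: power of a one-sided one-proportion z-test (illustrative numbers).
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
p0, p_true, n, alpha = 0.20, 0.15, 400, 0.05

sd0 = sqrt(p0 * (1 - p0) / n)              # SD of p-hat under the null
p_crit = p0 + nd.inv_cdf(alpha) * sd0      # reject H0 if p-hat < p_crit
sd_true = sqrt(p_true * (1 - p_true) / n)  # SD of p-hat under the true model
power = nd.cdf((p_crit - p_true) / sd_true)  # P(reject H0 | p = p_true)
beta = 1 - power                           # Type II error probability
print(round(power, 2))                     # about 0.83
```

Rerunning this with a larger n (or a larger effect size p0 - p_true) shows the power rising, exactly as the discussion below describes.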
Effect Size
-
- We call the distance between the null hypothesis value (for example), p0, and the truth, p,
the effect size
-
- Not knowing the true value, we estimate the effect size as the difference between the null
and observed value
-
- Small effects -> more Type II errors -> lower power
-
- The power of a test depends on the size of the effect and the standard deviation
A Picture Worth a Thousand Words
-
- The power of a test is the probability that it rejects a false null hypothesis. The upper figure
shows the null hypothesis model. We’d reject the null in a one-sided test if we observed a
value of ^p in the red region to the right of the critical value, p*.
The lower figure shows the true model. If the true value of p is greater than p0, then we’re more likely to observe a value that exceeds the critical value and make the correct decision to reject the null hypothesis. The power of the test is the purple region on the right of the lower figure. Of course, even drawing samples whose observed proportions are distributed around p, we’ll sometimes get a value in the red region on the left and make a Type II error of failing to reject the null.
-
- Power = 1-ß
-
- Reducing alpha to lower the chance of committing a Type I error will move the critical value,
p*, to the right (in this example). This will have the effect of increasing ß, the probability of a
Type II error, and correspondingly reducing the power.
-
- The larger the real difference between the hypothesized value, p0, and the true population
value, p, the smaller the chance of making a Type II error and the greater the power of the test. If the two proportions are very far apart, the two models will barely overlap, and we will not be likely to make any Type II errors at all – but then, we are unlikely to really need a formal hypothesis-testing procedure to see such an obvious difference.
Reducing Both Type I and Type II Errors
-
- If we can make both curves narrower (fig. 21.4), then both the probability of Type I errors
and the probability of Type II errors will decrease, and the power of the test will increase
-
- The only way is to reduce the standard deviations by increasing the sample size (pictures of
sampling distribution models!) the standard deviation of the sampling distribution model
decreases only as the square root of the sample size, so to halve the standard deviations we
must quadruple the sample size
Terms
-
- Alpha level – the threshold P-value that determines when we reject a null hypothesis. If we
observe a statistic whose P-value based on the null hypothesis is less than alpha, we reject
that null hypothesis
-
- Statistically significant – when the P-value falls below the alpha level, we say that the test is
“statistically significant” at that alpha level
-
- Significance level – the alpha level is also called the significance level, most often in a phrase
such as a conclusion that a particular test is ‘significant at the 5% significance level’
-
- Type I error – the error of rejecting a null hypothesis when in fact it is true (also called a false
positive). The probability of a Type I error is alpha
-
- Type II error – the error of failing to reject a null hypothesis when in fact it is false (false
negative). The probability of a Type II error is commonly denoted beta and depends on the
effect size
-
- Power – the probability that a hypothesis test will correctly reject a false null hypothesis is
the power of the test. To find power, we must specify a particular alternative parameter
value as the true value. For any specific value in the alternative, the power is 1-ß
-
- Effect size – the difference between the null hypothesis value and the true value of a model
parameter is called the effect size
Chapter 22 Comparing Two Proportions
-
- Male drivers wear seat belts less often than women do
-
- Men’s belt-wearing jumped more than 16 percentage points when they had a female
passenger
-
- Female drivers wore belts more than 70% of the time, regardless of the sex of their
passengers
-
- Of 4208 male drivers with female passengers, 2777 (66%) were belted
-
- Among 2763 male drivers with male passengers only, 1363 (49.3%) wore seat belts
-
- Shift in men’s risk-taking behavior when women are present?
-
- What would we estimate the true size of that gap to be?
Another Ruler
-
- Difference in the sample: 16.7%
-
- True difference?
-
- Difference between the two proportions and its standard deviation?
Pythagorean Theorem of Statistics (chapter 16):
The variance of the sum or difference of two independent random variables is the sum of
their variances:
Var(X – Y) = Var(X) + Var(Y), so
SD(X – Y) = √(Var(X) + Var(Y)) = √(SD²(X) + SD²(Y))
- Only applies when X and Y are independent
The Standard Deviation of the Difference between Two Proportions
- The standard deviations of the sample proportions are SD(^p1) = √(p1q1/n1) and SD(^p2) =
√(p2q2/n2), so the variance of the difference in the proportions is
Var(^p1 - ^p2) = (√(p1q1/n1))² + (√(p2q2/n2))² = p1q1/n1 + p2q2/n2
- The standard deviation is the square root of that variance:
SD(^p1 - ^p2) = √(p1q1/n1 + p2q2/n2)
- Since we usually don’t know the true proportions, we estimate them from the samples and
call the result a standard error:
SE(^p1 - ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
- Example page 527!
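The standard error of the difference can be sketched with the seat-belt data from the chapter opening: 2777 of 4208 male drivers belted with female passengers, versus 1363 of 2763 with male passengers only (the 95% interval itself uses z* = 1.96, as introduced earlier).

```python
# Sketch: SE of the difference in two proportions, seat-belt data from the text.
from math import sqrt

p1, n1 = 2777 / 4208, 4208   # belted, with female passengers
p2, n2 = 1363 / 2763, 2763   # belted, with male passengers only

diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE(p1-hat - p2-hat)
z_star = 1.96                                       # 95% confidence
low, high = diff - z_star * se, diff + z_star * se
print(round(diff, 3))                               # 0.167, the 16.7% gap
print(round(low, 3), round(high, 3))
```

The observed 16.7-percentage-point gap is many standard errors away from 0, so the interval for the true difference stays well above zero.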
Assumptions and Conditions
-
- Independence Assumption: within each group, the data should be based on results for
independent individuals
Randomization condition
10% condition
-
- Independent Groups Assumption: the two groups we’re comparing must also be
independent of each other
Sample Size Assumption
-
- Success/failure condition: both groups are big enough that at least 10 successes and at least
10 failures have been observed in each
The Sampling Distribution
A two-proportion z-interval:
Confidence interval: (^p1 - ^p2) ± z* × SE(^p1 - ^p2)
Where we find the standard error of the difference
SE(^p1 - ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
The critical value z* depends on the particular confidence level, C, that we specify
Example page 529!
Will I Snore When I’m 64?
Of the 995 respondents, 37% of adults reported that they snored at least a few nights a week during the past year
-
- Split into two age categories, 26% of the 184 people under 30 snored, compared with 39% of
the 811 in the older group
-
- Is this difference of 13% real or due only to natural fluctuations in the sample we’ve chosen?
-
- Null hypothesis? -> we hypothesize that there is no difference in the proportions
H0: p1 - p2 = 0
Everyone into the Pool
-
- SE(^p1 - ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
-
- But to do a hypothesis test, we assume that the null hypothesis is true (proportions are
equal) -> so there should be just a single value of ^p in the SE formula (and, of course, ^q is
just 1 - ^p)
-
- Snoring example: overall we saw 48 + 318 = 366 snorers out of a total of 184 + 811 = 995
adults who responded to this question -> ^p = 366/995 = 0.3678
-
- Combining the counts like this to get an overall proportion is called pooling
-
- Pooled proportion (for success): ^p_pooled = (Success1 + Success2)/(n1 + n2), where Success1
is the number of successes in group 1 (Success1 = n1 ∗ ^p1)
- We then put this pooled value into the formula, substituting it for both sample proportions in
the standard error formula:
SE_pooled(^p1 - ^p2) = √(^p_pooled ^q_pooled/n1 + ^p_pooled ^q_pooled/n2)
= √(0.3678 ∗ 0.6322/184 + 0.3678 ∗ 0.6322/811) = 0.039
Improving the Success/Failure Condition
-
- We should not refuse to test the effectiveness just because it failed the success/failure
condition
-
- For that reason, in a two-proportion z-test, the proper success/failure test uses the expected
frequencies, which we can find from the pooled proportion
-
- Only 1 case of HPV was diagnosed among 7897 women who received the vaccine, compared
to 91 cases diagnosed among 7899 who received a placebo
-
- ^p_pooled = (1 + 91)/(7897 + 7899) = 92/15796 = 0.0058
n1 ∗ ^p_pooled = 7899 ∗ 0.0058 ≈ 46; n2 ∗ ^p_pooled = 7897 ∗ 0.0058 ≈ 46
Compared to What?
-
- We’ll reject our null hypothesis if we see a large enough difference in the two proportions
-
- Large? We just compare it to its standard deviation (standard error, pooled)
-
- Since the sampling distribution is Normal, we can divide the observed difference by its
standard error to get a z-score -> tells us how many SE the observed difference is away from
0
-
- Then we can use the 68-95-99.7 Rule
-
- Result: two proportion z-test
-
- z = (^p1 - ^p2 - 0)/SE_pooled(^p1 - ^p2)
- When the conditions are met and the null hypothesis is true, this statistic follows the
standard Normal model, so we can use that model to obtain a P-value
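The two-proportion z-test can be sketched with the snoring data from this chapter: 48 of 184 under-30s snored, versus 318 of 811 in the older group.

```python
# Sketch of the pooled two-proportion z-test for the snoring example.
from math import sqrt
from statistics import NormalDist

y1, n1 = 48, 184     # snorers under 30
y2, n2 = 318, 811    # snorers 30 and older

p_pooled = (y1 + y2) / (n1 + n2)          # 366/995 ≈ 0.3678
q_pooled = 1 - p_pooled
se = sqrt(p_pooled * q_pooled / n1 + p_pooled * q_pooled / n2)

diff = y1 / n1 - y2 / n2                  # observed difference in proportions
z = (diff - 0) / se                       # H0: p1 - p2 = 0
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided alternative
print(round(se, 3), round(z, 2))          # 0.039 -3.33
print(round(p_value, 4))
```

A difference more than 3 standard errors from 0 gives a tiny two-sided P-value, so we reject the null hypothesis that the two age groups snore at the same rate.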
Chapter 23 Inferences About Means
Getting Started
-
- Motor vehicle crashes resulted in 119 deaths each day
-
- Speeding is a contributing factor in 31% of all fatal accidents
-
- Triphammer Road – exceeding 30 miles per hour?
-
- Interested in both in estimating the true mean speed and in testing whether it exceeds the
posted speed limit
-
- Quantitative data usually report a value for each individual -> three rules of data analysis ->
plot the data
-
- Quantitative data -> means and standard deviations; inferences -> sampling distributions
-
- Confidence intervals, then we add and subtract a margin of error; for proportions: ^p ± ME
-
- Margin of error -> ME = z* × SE(^p), so the interval is ^p ± z* × SE(^p)
-
- CLT: SD(ȳ) = σ/√n (example page 552)
-
- If we don’t know σ estimate the population parameter σ with s, the sample standard
deviation based on the data; the resulting standard error is SE( = s/√n
-
- Gosset: we need not only to allow for the extra variation with larger margins of error and P-
values, but we even need a new sampling distribution model; in fact we need a whole family
of models, depending on the sample size, n; these models are unimodal, symmetric, bell-
shaped models, but the smaller our sample, the more we must stretch out the tails
Gosset’s t
-
- With s/√n, an estimate of the standard deviation, the shape of the sampling model changes
-> t-distribution
Student’s t
-
- Gosset’s model is always bell-shaped, but the details change with different sample sizes
-
- So the Student’s t-models form a whole family of related distributions that depend on a
parameter known as degrees of freedom (df tdf) A Confidence Interval for Means
-
- t = (ȳ - μ)/SE(ȳ)
-
- df=n-1
-
- SE(ȳ) = s/√n
One-Sample t-Interval for the mean
-
- ȳ ± t*n-1 × SE(ȳ)
-
- Critical value depends on the particular confidence level, C, and the number of degrees of
freedom, n-1
-
- Example page 554
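The one-sample t-interval can be sketched in code. The 20 speed readings below are hypothetical (the book's Triphammer data are not reproduced here), and t* = 2.093 for df = 19 at 95% confidence is taken from Table T.

```python
# Sketch of a one-sample t-interval for a mean (hypothetical speed data).
from math import sqrt
from statistics import fmean, stdev

speeds = [29, 34, 34, 28, 30, 29, 38, 31, 29, 34,
          32, 31, 27, 37, 29, 26, 24, 34, 36, 31]   # hypothetical readings, mph
n = len(speeds)
y_bar = fmean(speeds)
se = stdev(speeds) / sqrt(n)      # SE(y-bar) = s / sqrt(n)
t_star = 2.093                    # Table T: df = n - 1 = 19, 95% confidence
me = t_star * se                  # margin of error
low, high = y_bar - me, y_bar + me
print(round(low, 1), round(high, 1))   # about (29.4, 32.9)
```

Note the only change from the z-interval for proportions: s replaces σ, and the critical value comes from the t-model with n - 1 degrees of freedom instead of the Normal.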
-
- Figure 23.2: the t-model (solid curve) on 2 degrees of freedom has fatter tails than the
Normal model (dashed curve); so the 68-95-99.7 Rule doesn’t work for t-models with only a
few degrees of freedom
-
- Student’s t-models are unimodal, symmetric, and bell-shaped, just like the Normal
-
- But t-models with only a few degrees of freedom have much fatter tails than the Normal
-
- As the degrees of freedom increase, the t-models look more and more like the Normal
-
- If you know σ, use z; whenever you use s to estimate σ, use t
Assumptions and Conditions (Student’s t-models)
-
- Independence Assumption – the data values should be independent
o Randomization condition – the data arise from a random sample or suitably
randomized experiment
o 10% Condition – the sample is no more than 10% of the population
-
- Normal Population Assumption – Student’s t-models won’t work for data that are badly
skewed
o Nearly Normal Condition – the data come from a distribution that is unimodal and
symmetric (make a histogram or probability plot); the normality depends on the
sample size:
-
n<15 or so – the data should follow a Normal model pretty closely
-
n between 15 and 40 or so – data unimodal and reasonably symmetric
-
larger than 40 or 50 – the t-methods are safe to use unless the data are
extremely skewed
-
- Table T: to find a critical value, locate the row of the table corresponding to the degrees of
freedom and the column corresponding to the probability you want
More Cautions About Interpreting Confidence Intervals
- “90% of intervals that could be found in this way would cover the true value” or “I am 90% confident that the true mean speed is between 29.5 and 32.5 mph”
Make A Picture ...
- Make a histogram of the data and verify that its distribution is unimodal and symmetric and that it has no outliers
- Make a Normal probability plot to see that it’s reasonably straight
A Test for the Mean
-
- Hypothesis test called the one-sample t-test for the mean (example: true mean speed in fact
greater than the 30 mph speed limit?)
-
- The assumptions and conditions for the one-sample t-test for the mean are the same as for
the one-sample t-interval
We test the hypothesis H0: μ = μ0 using the statistic tn-1 = (ȳ - μ0)/SE(ȳ)
-
- Example page 560
Finding t-Values by Hand
-
- Table T: the tables run down the page for as many degrees of freedom as can fit; as the
degrees of freedom increase, the t-model gets closer and closer to the Normal, so the tables
give a final row with the critical value from the Normal model and label it ∞ df
-
- If you cannot find a row for the df you need, just use the next smaller df in the table
Significance and Importance
-
- Statistically significant does not mean actually important or meaningful
-
- It is always a good idea when we test a hypothesis to also check the confidence interval and
think about the likely values for the mean
Intervals and Tests
-
- The confidence interval contains all the null hypothesis values we can’t reject with these data
-
- More precisely, a level C confidence interval contains all of the plausible null hypothesis values that would not be rejected by a two-sided hypothesis test at alpha level 1 − C; so a 95% confidence interval matches a 1 − 0.95 = 0.05 level two-sided test for these data
-
- Confidence intervals are naturally two-sided, so they match exactly with two-sided hypothesis tests; when the hypothesis is one-sided, the corresponding alpha level is (1 − C)/2
Sample Size
-
- If we need great precision, however, we'll want a smaller ME -> larger sample size
-
- We can solve this equation for n (ME = t*_(n-1) × s/√n)
-
- Without knowing n, we don't know the degrees of freedom and we can't find the critical value t*_(n-1) -> use the corresponding z* value
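Following the z*-for-t* advice above, a rough sample-size calculation can be sketched like this (the s and ME values are hypothetical):

```python
from math import ceil
from statistics import NormalDist

def sample_size(s, me, conf=0.95):
    """Smallest n with z* * s / sqrt(n) <= me, using z* in place of t*
    (as suggested when n, and hence the df, is not yet known)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z* for the confidence level
    return ceil((z * s / me) ** 2)

# Hypothetical: s = 4.25 mph, want an ME of 1 mph at 95% confidence
n = sample_size(s=4.25, me=1.0)
```

Because z* is slightly smaller than t*, it is common to add a few observations to the answer as a safety margin.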
*The Sign Test – Back to Yes and No
-
- Yes (1) and no (0)
-
- Null hypothesis says that the median is 30; if that null hypothesis were true, we’d expect the
proportion of cars driving faster than 30 mph to be 0.50; on the other hand, if the median
speed were greater than 30 mph, we’d expect to see more cars driving faster than 30
-
- If we test a median by counting the number of values above and below that value, it’s called
a sign test – the sign test is a distribution free method (example page 567)
-
- Simpler, fewer assumptions
-
- Works even when the data have outliers or a skewed distribution
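A minimal sketch of the sign test, using the exact Binomial(n, 0.5) null distribution (the counts here are hypothetical):

```python
from math import comb

def sign_test_p(above, below):
    """One-sided P-value: P(X >= above) for X ~ Binomial(above+below, 0.5).
    Values exactly at the hypothesized median are dropped before counting."""
    n = above + below
    return sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n

# Hypothetical: 12 of 15 cars were faster than the median claimed by H0
p = sign_test_p(above=12, below=3)
```

A small P-value here says that seeing this many values above the hypothesized median would be unlikely if the median were correct.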
Comparing Means
-
- Generic or brand-name batteries?
-
- Difference in mean lifetimes?
Plot the Data
-
- Boxplots of the data for two groups, placed side by side
-
- Figure 24.1 -> difference large enough? Random fluctuation? statistical inference
Comparing Two Means
-
- Difference between the mean battery lifetimes of the two brands, μ1-μ2
-
- Confidence interval, standard deviation, sampling model
-
- For independent random variables, the variance of their difference is the sum of their
individual variances, Var (Y-X)=Var(Y)+Var(X)
- SD(ȳ1 − ȳ2) = √(σ1²/n1 + σ2²/n2)
- SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)
- The confidence interval we build is called a two-sample t-interval (for the difference in
means). The corresponding hypothesis test is called a two-sample t-test.
(ȳ1 − ȳ2) ± ME
where ME = t* × SE(ȳ1 − ȳ2)
Assumptions and Conditions
-
- Independence Assumption – the data in each group must be drawn independently and at
random from a homogeneous population, or generated by a randomized comparative
experiment
o Randomization Condition
o 10% Condition
-
- Normal Population Assumption
o Nearly Normal Condition – we must check this for both groups; a violation by either one violates the condition
-
n<15 – you should not use these methods if the histogram or Normal
probability plot shows severe skewness
-
n’s closer to 40 – mildly skewed histogram is OK
-
n>40 CLT
-
- Independent Groups Assumption – to use the two-sample t methods, the two groups we are comparing must be independent of each other
Two-Sample t-Interval for the Difference Between Means
- Standard error of the difference of the means: SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)
A Test for the Difference between Two Means
-
- Two-sample t-test for the difference between means
-
- Hypothesized difference Δ0 = 0
-
- We then compare the difference in the means with the standard error of that difference
-
- Example page 588/589
- For means, there is also a pooled t-test (but knowing that two means are equal doesn’t say anything about whether their variances are equal)
-
- If we were willing to assume that their variances are equal, we could pool the data from two
groups to estimate the common variance; we’d estimate this pooled variance from the data,
so we’d still use a Student’s t-model pooled t-test (for the difference between means)
-
- But this is difficult to assume; therefore, use a two-sample t-test instead
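The two-sample t statistic itself is easy to compute directly; a sketch with hypothetical battery lifetimes:

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(y1, y2):
    """t = (ybar1 - ybar2) / SE, with SE = sqrt(s1^2/n1 + s2^2/n2)."""
    se = sqrt(variance(y1) / len(y1) + variance(y2) / len(y2))
    return (mean(y1) - mean(y2)) / se

# Hypothetical battery lifetimes (minutes) for two brands
brand   = [194, 205, 199, 172, 184, 169]
generic = [190, 203, 203, 206, 222, 209]
t = two_sample_t(brand, generic)   # negative here: the generic sample lasted longer
```

The degrees of freedom for this unpooled statistic come from a separate (messy) approximation formula, which software computes for you.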
The Pooled t-test
-
- Equal variance assumption – the variances of the two populations from which the samples
have been drawn are equal: σ12 = σ22
-
- Similar Spread Condition – look at the boxplots to check that the spreads are not wildly different
-
- s²_pooled = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
-
- SE_pooled(ȳ1 − ȳ2) = s_pooled × √(1/n1 + 1/n2)
-
- df = n1+n2-2
-
- substitute the pooled-t estimate of the standard error and its degrees of freedom into the steps of the confidence interval or hypothesis test, and you'll be using the pooled-t method
Tukey's Quick Test
-
- 7, 10 and 13
-
- Basis for the test: boxplots don’t overlap
-
- To use Turkey’s test, one group must have the highest value and the other, the lowest. We
just count how many values in the high group are higher than all the values of the lower group. Add to this the number of values in the low group that are lower than all the values of the higher group (count ties as 1⁄2)
-
- Now if this total is 7 or more, we can reject the null hypothesis of equal means at alpha =
0.05
-
- The “critical values” of 10 and 13 give us alpha’s of 0.01 and 0.001
-
- Only assumption: two samples are independent
- See lecture slides!
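The counting rule can be sketched directly (the two groups below are hypothetical):

```python
def tukey_quick_count(high, low):
    """Tukey's quick test count: high-group values above the low group's max,
    plus low-group values below the high group's min; ties count as 1/2.
    Requires the overall max in `high` and the overall min in `low`."""
    count = 0.0
    for v in high:
        count += 1 if v > max(low) else (0.5 if v == max(low) else 0)
    for v in low:
        count += 1 if v < min(high) else (0.5 if v == min(high) else 0)
    return count

# Hypothetical data: a count of 10 reaches the alpha = 0.01 critical value
c = tukey_quick_count([12, 14, 15, 16, 18], [5, 6, 7, 8, 11])
```

Compare the returned count against the critical values 7, 10 and 13 for alpha = 0.05, 0.01 and 0.001.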
Chapter 25 Paired Samples and Blocks
-
- Speed-skating races are run in pairs
-
- Some fans thought there might have been an advantage to starting on the outside
-
- The data for the races run two at a time are not independent
Paired Data
-
- Data such as these are called paired
-
- We can focus on the difference in times for each racing pair
-
- When pairs arise from an experiment, the pairing is a type of blocking
-
- When they arise from an observational study, it is a form of matching
-
- There is no test to determine whether the data are paired – you must determine that from
understanding how they were collected and what they mean
-
- Pairwise differences -> because it is the difference we care about, we’ll treat them as if they
were the data, ignoring the original two columns. Now that we have only one column of
values to consider, we can use a simple one-sample t-test. Mechanically, a paired t-test is
just a one-sample t-test for the means of these pairwise differences (the sample size is the
number of pairs)
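Mechanically, then, the paired t-test is just a one-sample t on the differences; a sketch with hypothetical paired race times:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """One-sample t on the pairwise differences; df = number of pairs - 1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

# Hypothetical paired race times (seconds): inner vs. outer starting lane
inner = [35.1, 34.8, 35.5, 34.9, 35.2]
outer = [35.3, 34.9, 35.6, 35.2, 35.4]
t, df = paired_t(inner, outer)
```

Note that the original two columns never enter the calculation once the differences are formed.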
Assumptions and Conditions
-
- Paired Data Assumption – the data must be paired (two-sample t methods aren't valid without independent groups, and paired groups aren't independent)
-
- Independence Assumption – the difference must be independent of each other
o Randomization Condition – focus our attention on where the randomness should be
o 10% Condition – doesn't apply to randomized experiments, where no sampling takes place
-
- Normal Population Assumption – the population of differences follows a Normal model
o Nearly Normal Condition – can be checked with a histogram or Normal probability plot of the differences – but not of the individual groups
-
- Example paired t-test page 615
Confidence Intervals for Matched Pairs
-
- Married couples, husbands tend to be slightly older than wives
-
- Data paired, couples at random
-
- Interested in the mean age difference within couples
-
- Confidence interval for the true mean difference in ages?
-
- Example page 618
Blocking
-
- A paired design is an example of blocking
-
- The fact of the pairing determines how many degrees of freedom are available
-
- Matching pairs generally removes so much extra variation that it more than compensates for
having only half the degrees of freedom
-
- We record a 0 for every paired difference that’s negative and a 1 for each positive difference,
ignoring pairs for which the difference is exactly 0
-
- We test the associated proportion p=0.5 using a z-test
-
- As with other distribution-free tests, the advantage of the sign test for matched pairs is that
we don’t require the Nearly Normal Condition for the paired difference
Chapter 26 Comparing Counts
-
- Zodiac signs of 256 heads of the largest 400 companies
-
- Successful people more likely to be born under some signs than others?
Goodness-of-Fit
-
- Uniformly distributed? -> 1/12 of them under each sign? (256/12 -> 21.3 births per sign)
-
- How closely do the observed numbers of births per sign fit this simple “null” model?
-
- A hypothesis test to address this question is called a test of “goodness-of-fit” – it involves
testing a hypothesis
-
- Confidence interval doesn’t make sense
-
- We need a test that includes all 12 hypothesized proportions
Assumptions and Conditions
-
- Counted Data Condition – the data must be counts for the categories of a categorical variable
-
- Independence Assumption – the counts in the cells should be independent of each other
o Randomization Condition – the individuals who have been counted should be a random sample from the population of interest
-
- Sample Size Assumption
o Expected Cell Frequency Condition – we should expect to see at least 5 individuals in each cell
Calculations
-
- Difference between these observed and expected counts, denoted (Obs-Exp)
-
- We divide each squared difference by the expected count for that cell
-
- The test statistic, called the chi-square statistic, is found by adding up the sum of the squares
of the deviations between the observed and expected counts divided by the expected counts
-
- χ² = Σ (Obs − Exp)² / Exp
-
- The number of degrees of freedom for a goodness-of-fit test is n-1
-
- n is not the sample size, but instead is the number of categories (12 signs -> 11 df)
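The calculation above can be sketched directly; the observed zodiac counts here are hypothetical, while the null model expects 256/12 births per sign:

```python
def chi_square_gof(observed, expected):
    """Chi-square statistic: sum of (Obs - Exp)^2 / Exp over the cells."""
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, len(observed) - 1       # df = number of categories - 1

# Hypothetical counts of 256 executives across the 12 signs
obs = [23, 20, 18, 23, 24, 19, 18, 21, 19, 22, 24, 25]
x2, df = chi_square_gof(obs, [256 / 12] * 12)
```

The statistic would then be compared with the χ² table (Table X) on 11 degrees of freedom.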
Chi-Square P-Values
-
- Chi-square statistic is used only for testing hypotheses
-
- If the observed counts don’t match the expected, the statistic will be large
-
- This chi-square test is always one-sided
-
- If the calculated statistic value is large enough, we’ll reject the null hypothesis
-
- Read the χ² table (Table X): just find the row for the correct number of degrees of freedom and read across to find where your calculated χ² value falls
-
- There is no direction to the rejection of the null model; all we know is that it doesn't fit
-
- Example page 637
The Chi-Square Calculation
1. Find the expected values
2. Compute the residuals
3. Square the residuals
4. Compute the components
5. Find the sum of the components
6. Find the degrees of freedom
7. Test the hypothesis
Table columns: Sign | Observed | Expected | Residual (Obs − Exp) | (Obs − Exp)² | Component = (Obs − Exp)²/Exp
But I Believe the Model...
-
- The hypothesis-testing procedure allows us only to reject the null or fail to reject it
-
- If you choose uniform as the null hypothesis, you can only fail to reject it
Comparing Observed Distributions
-
- Example page 641 – whether the plans of students are the same at different colleges
-
- Two-way table – each cell of the table shows how many students from a particular college
made a certain choice
-
- We want to test whether the students' choices are the same across all four colleges; the z-test for two proportions generalizes to a chi-square test of homogeneity
-
- Here we are asking whether choices are the same among different groups, so we find the
expected counts for each category directly from the data
-
- Homogeneity means that things are the same -> we ask whether the post-graduation choices
made by students are the same for these four colleges
-
- The homogeneity test comes with a built-in null hypothesis: we hypothesize that the distribution does not change from group to group
Assumptions and Conditions
-
- Counted Data Condition – the data must be counts
-
- As long as we don’t want to generalize, we don’t have to check the Randomization Condition
or the 10% Condition
-
- Expected Cell Frequency Condition – expected count in each cell must be at least 5
Calculations
-
- The expected counts are those proportions applied to the number of students in each graduating class -> fill in expected values for each cell -> check conditions -> calculate the component for each cell -> sum all components across all cells
-
- Degrees of freedom = (R-1)(C-1)
-
- Example page 643
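The expected-count step and the summing of components can be sketched as follows (the 2×3 table of counts is hypothetical):

```python
def expected_counts(table):
    """Expected counts under homogeneity/independence:
    (row total * column total) / grand total for each cell."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]

def chi_square_stat(table):
    """Sum the (Obs - Exp)^2 / Exp components over every cell."""
    return sum((o - e) ** 2 / e
               for orow, erow in zip(table, expected_counts(table))
               for o, e in zip(orow, erow))

# Hypothetical two-way table of counts (2 groups x 3 choices)
obs = [[20, 30, 50],
       [30, 30, 40]]
x2 = chi_square_stat(obs)
df = (len(obs) - 1) * (len(obs[0]) - 1)   # (R-1)(C-1)
```

The same calculation serves both the homogeneity and independence tests; only the interpretation differs.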
Examining the Residuals
-
- Whenever we reject the null hypothesis, it is a good idea to examine residuals
-
- We need to know each residual’s standard deviation
-
- To standardize a cell's residual, we just divide by the square root of its expected value: c = (Obs − Exp) / √Exp
-
- Notice that these standardized residuals are just the square roots of the components we
calculated for each, and their sign indicates whether we observed more cases than we
expected, or fewer
- Now that we have subtracted the mean (zero) and divided by their standard deviations,
these are z-scores (null hypothesis true? -> CLT and 68-95-99.7 Rule)
Independence
-
- Example: whether the risk of hepatitis C was related to whether people had tattoos and to
where they got their tattoos (two-way table)
-
- These data differ from the kinds of data we’ve considered before in this chapter because
they categorize subjects from a single group on two categorical variables rather than on only
one
-
- Contingency tables categorize counts on two (or more) variables so that we can see whether
the distribution of counts on one variable is contingent on the other
-
- Independence means that the probability that a randomly selected patient has hepatitis C should not change when we learn the patient's tattoo status -> if Hepatitis Status is independent of Tattoo Status, we'd expect the proportion of people testing positive for hepatitis to be the same for the three levels of Tattoo Status -> a chi-square test for independence
Are the variables independent?
Assumptions and Conditions
-
- We still need counts and enough data so that the expected values are at least 5 in each cell
-
- In case of independence we want to generalize -> check if it is a representative random
sample from, and fewer than 10% of, that population
-
- Example page 648
Examine the Residuals
-
- We should examine the residuals because we have rejected the null hypothesis
-
- Standardize each residual -> the sum of their squares = the chi-square value
-
- Figure 26.6 (standardized residuals) large and positive value (tattoos obtained in a tattoo
parlor who have hepatitis C), indicating there are more people in that cell than the null hypothesis of independence would predict / a negative value says that there are fewer people in this cell than independence would expect
Chi-Square and Causation
-
- Just as correlation between quantitative variables does not demonstrate causation, a failure
of independence between two categorical variables does not show a cause-and-effect
relationship between them, nor should we say that one variable depends on the other
-
- Lurking variables can be responsible for the observed lack of independence
Chapter 27 Inferences for Regression
-
- %Body Fat plotted against Waist size for a sample of 250 males of various ages (fig. 27.1)
-
- Equation of the least squares line: %Body Fat = −42.7 + 1.7 Waist
-> on average, %Body Fat is greater by 1.7 percent for each additional inch around the waist
The Population and the Sample
-
- Regression -> straight line; but: not all men who have 38-inch waists have the same %Body
Fat (the distribution of 38-inch men is unimodal and symmetric -> fig. 27.2/27.3)
-
- We want a model -> therefore, an idealized regression line – the model assumes that the
means of the distribution of %Body Fat for each Waist size fall along the line, even though
the individuals are scattered around it
-
- μy = β0 + β1x (model = intercept + slope)
-
- Model makes errors (ε) – some individuals lie above and some below the line
-
- y=β0+β1x+ε
-
- We estimate the β's by finding a regression line, ŷ = b0 + b1x; the residuals, e = y − ŷ, are the sample-based versions of the errors, ε
Assumptions and Conditions
- Linearity Assumption
o Straight Enough Condition – scatterplot looks straight (also look at a scatterplot of the residuals against x or against the predicted values, ŷ)
o Quantitative Data Condition
-
- Independence Assumption – the errors in the true underlying regression model must be
mutually independent
o RandomizationCondition
-
- Equal Variance Assumption – the variability of y should be about the same for all values of x
o Does the Plot Thicken? Condition – check that the spread around the line is nearly constant
-
- Normal Population Assumption – the errors around the idealized regression line at each value of x follow a Normal model
o OutlierCondition
Which Comes First: The Conditions or the Residuals?
1. Make a scatterplot of the data to check the Straight Enough Condition.
2. If the data are straight enough, fit a regression and find the residuals, e, and predicted values, ŷ.
3. Make a scatterplot of the residuals against x or against the predicted values. This plot should have no pattern. Check in particular for any bend (which would suggest that the data weren't all that straight after all), for any thickening (or thinning), and, of course, for any outliers.
4. If the data are measured over time, plot the residuals against time to check for evidence of patterns that might suggest they are not independent.
5. If the scatterplots look OK, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition.
6. If all the conditions seem to be reasonably satisfied, go ahead with inference.
-
- The sample-to-sample variation is what generates the sampling distribution for the
coefficients
-
- 3 aspects of the scatterplot affect the standard error of the regression slope:
o Spread around the line – measured with the residual standard deviation, se; you can always find se in the regression output, often just labeled s:
s_e = √(Σ(y − ŷ)² / (n − 2))
The less scatter around the line, the smaller the residual standard deviation and the stronger the relationship between x and y
o Spread of the x's – if sx, the standard deviation of x, is large, it provides a more stable regression
o Sample size – having a larger sample size, n, gives more consistent estimates from sample to sample
Standard Error for the Slope
- SE(b1) = s_e / (s_x √(n − 1))
-
- When we standardize the slopes by subtracting the model mean and dividing by their standard error, we get a Student's t-model, this time with n − 2 degrees of freedom:
(b1 − β1) / SE(b1) ~ t_(n-2)
What About the Intercept?
- (b0 − β0) / SE(b0) ~ t_(n-2)
Regression Inference
-
- We can test a hypothesis about it and make confidence intervals
-
- Usual null hypothesis about the slope is that it’s equal to 0 (would say that y doesn’t tend to
change linearly when x changes = no linear association)
- To test H0: β1 = 0, we find t_(n-2) = (b1 − 0) / SE(b1)
- A 95% confidence interval for β1 is: b1 ± t*_(n-2) × SE(b1)
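These formulas can be sketched end to end; the waist and body-fat numbers below are hypothetical, not the book's data:

```python
from math import sqrt
from statistics import mean, stdev

def slope_inference(x, y):
    """Least-squares slope b1, SE(b1) = se / (sx * sqrt(n-1)),
    and the t-ratio b1 / SE(b1) with n-2 degrees of freedom."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se = sqrt(sse / (n - 2))                  # residual standard deviation
    se_b1 = se / (stdev(x) * sqrt(n - 1))
    return b1, se_b1, b1 / se_b1

# Hypothetical waist sizes (in) and %body fat values
waist = [32, 34, 36, 38, 40, 42]
fat   = [12, 15, 21, 25, 30, 32]
b1, se_b1, t = slope_inference(waist, fat)
```

Regression software reports exactly these three quantities in its coefficient table.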
Another Example
-
- Contest in which participants try to guess the exact minute that a wooden tripod placed on
the frozen Tanana River will fall through the breaking ice
-
- We cannot use regression to tell the causes of any change – but we can estimate the rate of
change (if any) and use it to make better predictions
-
- Example page 686-689
Standard Errors for Predicted Values
- We can predict the mean %Body Fat for all men whose Waist size is 38 inches with a lot more precision than we can predict the %Body Fat of a particular individual whose Waist size happens to be 38 inches
- We are predicting the value for a new individual, one that was not part of the original data
set -> “x sub new” (xv)
-
- Regression equation predicts %Body Fat as ŷ_v = b0 + b1x_v
-
- Now that we have the predicted value, we construct both intervals around this same number; both intervals take the form: ŷ_v ± t*_(n-2) × SE (t* is the same for both)
- Easier to predict a data point near the middle of the data set than far from the center
- SE(ŷ_v) = √(SE²(b1) × (x_v − x̄)² + s_e²/n + s_e²)
Confidence Intervals for Predicted Values
-
- Example all men and individual page 690
-
- The narrower interval is a confidence interval for the predicted mean value at x_v, and the wider interval is a prediction interval for an individual with that x-value
Logistic Regression
-
- Researchers investigating factors for increased risk of diabetes examined data on 768 adult women of Pima Indian heritage (BMI (weight/height²))
-
- From the boxplots, we see that the group with diabetes has a higher mean BMI
-
- BMI as the response and Diabetes as the predictor displayed – but researchers are interested in predicting the increased risk of Diabetes due to increased BMI
-
- Fig. 27.13 dichotomous variable
-
- Fig. 27.14 treating like quantitative data -> regression line
-
- Setting all negative probabilities to 0 and all probabilities greater than 1 to 1
-
- Fig. 27.16 smooth curve models
-
- There are many curves in mathematics with shapes like this that we might use for our model. One of the most common is the logistic curve -> logistic regression
-
- ln(p̂ / (1 − p̂)) = b0 + b1x
-
- When p is a probability, p/(1 − p) is the odds in favor of a success
- When the probability of success p = 1/3, we'd get the ratio (1/3)/(2/3) = 1/2
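The log-odds arithmetic is easy to check in code; a quick sketch:

```python
from math import exp, log

def logit(p):
    """Log-odds ln(p / (1 - p)) -- the left-hand side of the logistic model."""
    return log(p / (1 - p))

def inv_logit(z):
    """Invert the log-odds back to a probability."""
    return 1 / (1 + exp(-z))

odds = (1 / 3) / (1 - 1 / 3)        # p = 1/3 gives odds of 1/2
p_back = inv_logit(logit(1 / 3))    # round-trips back to 1/3
```

A fitted logistic regression reports b0 and b1 on the log-odds scale; inv_logit converts its predictions back into probabilities between 0 and 1.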
Chapter 30 Multiple Regression
-
- A regression with two or more predictor variables is called a multiple regression
-
- For simple regression, we found the Least Squares solution, the one whose coefficients made
the sum of the squared residuals as small as possible. For multiple regression, we’ll do the
same thing but this time with more coefficients
-
- R2 gives the fraction of the variability of %Body Fat accounted for by the multiple regression model
-
- Degrees of freedom is the number of observations minus 1 for each coefficient estimated
-
- %Body Fat = −3.10 + 1.77 Waist − 0.60 Height
-
- Residual = %Body Fat − predicted %Body Fat
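A least-squares fit with two predictors can be sketched via the normal equations; the data below are hypothetical, constructed to lie exactly on the plane %Body Fat = −3 + 1.7 Waist − 0.6 Height so the coefficients are recovered exactly:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def multiple_regression(X, y):
    """Coefficients (intercept first) from the normal equations (X'X)b = X'y."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical (Waist, Height) rows; responses lie on -3 + 1.7*w - 0.6*h
X = [(34, 70), (36, 72), (38, 68), (40, 74)]
y = [12.8, 15.0, 20.8, 20.6]
b0, b_waist, b_height = multiple_regression(X, y)
```

With real (noisy) data the same code returns the least-squares estimates; statistical software adds the standard errors and t-ratios discussed later in the chapter.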
So, What’s New?
-
- The meaning of the coefficients in the regression model has changed in a subtle but
important way
-
- Multiple regression is an extraordinarily versatile calculation, underlying many widely used
Statistics methods
-
- Offers first glimpse into statistical models that use more than two quantitative variables
What Multiple Regression Coefficients Mean
-
- Fig. 30.1 scatterplot of %Body Fat against Height -> little relationship between these variables
-
- The multiple regression coefficient of Height takes account of the other predictor, Waist size,
in the regression model
-
- Only looking at all men whose waist size is about 37 inches -> negative relationship between
Height and %Body Fat because taller men probably have less body fat than shorter men who
have the same waist size
-
- For men with that waist size, an extra inch of height is associated with a decrease of about
0.60% in body fat
-
- Looking on all waist sizes at the same time? -> plotting the residuals of %Body Fat after a
regression on Waist size against the residuals of Height after regressing it on Waist size
(“partial regression plot”) -> showing the relationship of %Body Fat to Height after removing
the linear effects of Waist size
-
- A partial regression plot for a particular predictor has a slope that is the same as the multiple
regression coefficient for that predictor. It also has the same residuals as the full multiple
regression, so you can spot any outliers or influential points and tell whether they’ve
affected the estimation of this particular coefficient
The Multiple Regression Model
Assumptions and Conditions
-
- Linearity Assumption
o Straight Enough Condition for each of the predictors
-
- Independence Assumption
o Randomization Condition
-
- Equal Variance Assumption – the variability of the errors should be the same for all values of
each predictor
o Does the Plot Thicken? Condition – scatterplots of the regression residuals against each x or against the predicted values, ŷ, offer a visual check
-
- Normality Assumption – errors around the idealized regression model at any specified values
of the x-variables follow a Normal model
o Nearly Normal Condition
- Summary of checking conditions
o Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable
o If the scatterplots are straight enough, fit a multiple regression model to the data
o Find the residuals and predicted values
o Make a scatterplot of the residuals against the predicted values. This plot should look patternless. Check in particular for any bend and for any thickening
o Suitable randomization used? Representative of some identifiable population? Check if they are not independent by plotting the residuals against time to look for patterns
o Interpretation and prediction
o If you wish to test hypotheses about the coefficients or about the overall regression, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition
Multiple Regression Inference I: I Thought I Saw an ANOVA Table...
-
- Is this multiple regression model any good at all?
-
- If all the coefficients (except the intercept) were zero, we'd have
ŷ = b0 + 0x1 + ... + 0xk
and we'd just set b0 = ȳ
H0: β1 = β2 = ... = βk = 0
- We can test this hypothesis with a statistic that is labeled with the letter F – bigger F-values
mean smaller P-values
Multiple Regression Inference II: Testing the Coefficients
-
- Only if we reject the null hypothesis can we move on to check the test statistics for the individual coefficients
-
- For each coefficient, we test H0: βj = 0 against the (two-sided) alternative that it isn't zero; the regression table gives a standard error for each coefficient and the ratio of the estimated coefficient to its standard error
-
- If the assumptions and conditions are met, these ratios follow a Student’s t-distribution
t_(n-k-1) = (bj − 0) / SE(bj)
-
- The degrees of freedom is the number of data values minus the number of estimated coefficients: n − k − 1
-
- CI in the usual way (estimate ± margin of error); margin of error is just the product of the
standard error and a critical value CI for βj: bj ± t*n-k-1 SE (bj)
How’s That, Again?
-
- y = β0 + β1x1 + ... + βkxk + ε
-
- Wrong conclusion that each βj tells us the effect of its associated predictor, xj, on the
response variable, y
Another Example: Modeling Infant Mortality
-
- All variables no outliers and Nearly Normal distributions
-
- One useful way to check many of our conditions is with a scatterplot matrix (fig. 30.6) -> array of scatterplots set up so that the plots in each row have the same variable on their y-axis and those in each column have the same variable on their x-axis
-
- On the diagonal, rather than plotting a variable against itself, you’ll usually find either a
Normal probability plot or a histogram of the variable to help us assess the Nearly Normal
Condition
-
- Example page 797
Comparing Multiple Regression Models
-
- How do we know that some other choice of predictors might not provide a better model?
-
- Many people look at the R2 value, and certainly we are not likely to be happy with a model
that accounts for only a small fraction of the variability of y
-
- Keep in mind that the meaning of a regression coefficient depends on all the other predictors
in the model, so it is best to keep the number of predictors as small as possible
-
- Predictors that are easy to understand are usually better choices than obscure variables
Adjusted R2
-
- The adjusted R2 statistic is a rough attempt to adjust for the simple fact that when we add
another predictor to a multiple regression, the R2 can’t go down and will most likely go up
-
- We can write a formula for R2 using the sums of squares in the ANOVA table portion of the regression output table:
R² = SSRegression / SSTotal = 1 − SSResidual / SSTotal
-
- Adjusted R2 simply substitutes the corresponding Mean Squares for the SS's:
R²_adj = 1 − MSResidual / MSTotal
-
- Because the Mean Squares are Sums of Squares divided by degrees of freedom, they are
adjusted for the number of predictors in the model
-
- As a result, the adjusted R2 value won’t necessarily increase when a new predictor is added
-
- It no longer tells the fraction of variability accounted for by the model
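The two formulas can be sketched side by side (the observed and fitted values below are hypothetical):

```python
def r_squared(y, yhat, k):
    """R^2 = 1 - SSE/SST, and adjusted R^2 = 1 - MSE/MST, where
    MSE = SSE/(n - k - 1) and MST = SST/(n - 1) for k predictors."""
    n = len(y)
    ybar = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst, 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Hypothetical observed vs. fitted values from a 2-predictor model
y    = [12, 15, 21, 25, 30, 32]
yhat = [13, 16, 20, 24, 29, 33]
r2, r2_adj = r_squared(y, yhat, k=2)   # r2_adj is never larger than r2
```

Because the penalty grows with k, adding a weak predictor can lower the adjusted R² even though the plain R² rises.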