Tuesday, January 24, 2017

STABB 22 Statistics Notes University of Toronto


Course: Data Analysis, book: Stats: Data and Models (Richard D. De Veaux) third edition
Chapter 2 Data
But What Are Data?
  • -  Data does not have to be numerical
  • -  Sometimes values look like numerical values but are just numerals serving as labels (e.g.
    Amazon Standard Item Number)
  • -  Data values are useless without their context
  • -  The W’s: WHO, WHAT (essential, and in what units), WHEN, WHERE, WHY, HOW -> context
    for data values
    Who
  • -  The rows of a data table correspond to individual cases about Whom (or which) we record some characteristics
  • -  Respondents – individuals who answer a survey
  • -  Subjects/participants – people on whom we experiment
  • -  Experimental units - like subjects but animals, plants, websites and other inanimate subjects
  • -  Records – rows in a database
  • -  Cases, e.g. Amazon table. Individual CD orders
  • -  Cases are often a sample of cases selected from some larger population that we’d like to
    understand
  • -  Sample should be representative of the population (snapshot image)
    What and Why
    • -  Variables – characteristics recorded about each individual (usually columns)
    • -  Variables play different roles, and you can’t tell a variable’s role just by looking at it
    • -  Start by counting how many cases belong in each category
    • -  Some variables have measurements units; units tell how each value has been measured
      (miles per hour, or degrees Celsius tell us the scale of measurement)
    • -  Categorical variable – when a variable names categories and answers questions about how
      cases fall into those categories (usually, we think about the counts of cases that fall into each
      category; except the identifier variable)
    • -  Quantitative variable – when a measured variable with units answers questions about the
      quantity of what is measured (they must have units)
    • -  Some variables can answer both kinds of questions
    • -  E.g. educational value (1=worthless, ...) -> a teacher might just count the number of students
      who gave each response for the course (categorical variable); or, if the teacher wants to see whether the course is improving, she might treat the responses as the amount of perceived value (quantitative variable); then the teacher has to imagine that it has ‘educational value units’
    • -  ‘Ordinal’ variables – variables that report order without natural units
Counts Count

- Using counts in two ways: when we count the cases in each category of a categorical variable, the category labels are the What and the individuals are the Who of our data; or when we want to measure the amount of something (by counting)
Identifying Identifiers
- E.g. student ID number -> numerical, but not a quantitative variable -> special categorical variable (as many categories as individuals) -> not interesting in itself, just for identification -> identifier variables -> not useful for analysis, but they make it possible to combine data from different sources, to protect confidentiality and to provide unique labels (e.g. ASIN)
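As a minimal sketch of how an identifier variable lets us combine data from different sources, the Python snippet below matches two small tables on a shared ID. The student IDs, names and grades are made up for illustration.

```python
# Two hypothetical data sources keyed by the same identifier (student ID).
names = {1001: "Ana", 1002: "Ben", 1003: "Chen"}
grades = {1001: 82, 1003: 91}

# Combine the sources: keep only the cases whose ID appears in both tables.
combined = {
    student_id: (names[student_id], grades[student_id])
    for student_id in names
    if student_id in grades
}

print(combined)
```

The ID itself is never analysed; it only tells us which rows belong to the same case.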
Where, When, and How
  • -  Who (whom each row of your data table refers to -> cases), What (what the variables or the columns of the table record -> variables) and Why (why you are examining the data/what you want to know) are essential
  • -  Where and When also important/helpful
  • -  How the data are collected can make the difference between insight and nonsense (e.g.
    voluntary survey on the Internet often worthless)
  • -  The design of sound methods for collecting data is important
    Terms
  • -  Context – the context ideally tells Who was measured, What was measured, How the data were collected, Where the data were collected, and When and Why the study was performed
  • -  Data – systematically recorded information, whether numbers or labels, together with its context
  • -  Data table – an arrangement of data in which each row represents a case and each column represents a variable
  • -  Case – a case is an individual about whom or which we have data
  • -  Sample – the cases we actually examine in seeking to understand the much larger
    population
  • -  Population – all the cases we wish we knew about
  • -  Variable – a variable holds information about the same characteristic for many cases
  • -  Units – a quantity or amount adopted as a standard of measurement, such as dollars, hours,
    or grams
  • -  Categorical variable – a variable that names categories (whether with words or numerals) is
    called categorical
  • -  Quantitative variable – a variable in which the numbers act as numerical values is called
    quantitative; quantitative variables always have units
  • -  Identifier variable – a variable holding a unique name, ID number, or other identification for
    a case. Identifiers are particularly useful in matching data from two different databases or relations


Chapter 3 Displaying and Describing Categorical Data
    The Three Rules of Data Analysis
    • -  Make a picture – think clearly about patterns and relationships hiding in the data
    • -  Make a picture – a display shows important features: the extraordinary (possibly wrong) data
      values or unexpected patterns
    • -  Make a picture – well-chosen picture to tell others about your data
      Frequency Tables: Making Piles
      • -  Putting 2201 people on the Titanic into piles -> by ticket Class, counting up how many had each kind of ticket -> frequency table, which records the totals and the category names
      • -  Ticket Class: ‘First’, ‘Second’, ‘Third’ and ‘Crew’
    • -  Counts are useful, but sometimes we want to know the fraction or proportion of the data in each category, so we divide the counts by the total number of cases; usually we multiply by 100 to express the proportions as percentages. A relative frequency table displays the percentages, rather than the counts, of the values in each category. Both types of tables show how the cases are distributed across the categories; in this way they describe the distribution of a categorical variable, because they name the possible categories and tell how frequently each occurs
      The Area Principle
    • -  Figure 3.2 -> a bad picture can distort our understanding
    • -  We are more impressed by the area than by other aspects of each ship image
    • -  This gives the wrong impression -> the crew made up only about 40% of those aboard
    • -  The best data displays observe a fundamental principle of graphing data called the area
      principle – the area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents
    Bar Charts
    • -  A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison
    • -  Relative frequency bar chart -> shows the relative proportion of passengers falling into each of these classes
      Pie Charts
    • -  Pie charts show the whole group of cases as a circle; they slice the circle into pieces whose sizes are proportional to the fraction of the whole in each category
    • -  Before you make a bar chart or a pie chart, always check the Categorical Data Condition: The data are counts or percentages of individuals in categories
    Class     Count      %
    First       325  14.77
    Second      285  12.95
    Third       706  32.08
    Crew        885  40.21
    - If you want to make a relative frequency bar chart or a pie chart, you’ll need to also make sure that the categories don’t overlap so that no individual is counted twice
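The percentages in the table above can be recomputed from the counts. The following is a small Python sketch of a relative frequency table; the variable names are illustrative only.

```python
# Counts by ticket Class, as listed in the table above.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

total = sum(counts.values())  # total number of cases

# Relative frequency as a percentage: count / total * 100
percentages = {cls: round(100 * n / total, 2) for cls, n in counts.items()}

for cls, pct in percentages.items():
    print(f"{cls:<6} {counts[cls]:>4} {pct:>6.2f}%")
```

Because the categories do not overlap, the percentages add up to 100 (apart from rounding).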
    Contingency Tables: Children and First-Class Ticket Holders First?
    • -  Was there a relationship between the kind of ticket a passenger held and the passenger’s chances of making it into the lifeboat? -> two-way table
    • -  Table 3.4 – because the table shows how the individuals are distributed along each variable, contingent on the value of the other variable, such a table is called a contingency table
    • -  When presented like this, in the margins of a contingency table, the frequency distribution of one of the variables (survival/class) is called its marginal distribution (example page 24)
    • -  Each cell of the table gives the count for a combination of values of the two variables
    • -  Percentages can be taken of the row total, the column total or the table total (e.g. table 3.6)
    • -  Be careful – always ask “percentage of what?”
      Conditional Distributions
      • -  Interesting questions are contingent, e.g. whether the chance of surviving the Titanic sinking depended on ticket class
      • -  First, ask how the distribution of ticket Class changes between survivors and non-survivors -> row percentages
      • -  We restrict the Who first to survivors and make a pie chart for them; then we refocus the Who on the non-survivors and make their pie chart -> the pie charts show the distribution of ticket classes for each row (survivors and non-survivors) – the distributions we create this way are called conditional distributions because they show the distribution of one variable for just those cases that satisfy a condition on another variable (figure 3.6)
      • -  Or we could look at the distribution of Survival for each category of ticket Class (table 3.8)
      • -  Fig. 3.7: Side-by-side bar chart – showing the conditional distribution of Survival for each
        ticket class
        Can be simplified by dropping one category (Survival has only two categories, dead or alive; knowing the percentage that survived tells us the percentage that died)

      • -  In a contingency table, when the distribution of one variable is the same for all categories of another, we say that the variables are independent
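A rough Python sketch of marginal and conditional distributions for a two-way table follows. Note the assumption: the survival counts per class are not given in these notes, so the cell values below are assumed, chosen so that the class totals match the 325/285/706/885 counts listed earlier; check them against Table 3.4 before relying on them.

```python
# Survival counts by ticket Class. ASSUMPTION: these cell counts are not in
# the notes; they are chosen so the class totals match 325/285/706/885.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Marginal distribution of Class: the column totals.
class_totals = {c: table["Alive"][c] + table["Dead"][c] for c in classes}

# Conditional distribution of Survival within each Class (column percentages).
pct_alive = {c: 100 * table["Alive"][c] / class_totals[c] for c in classes}

for c in classes:
    print(f"{c:<6} survived: {pct_alive[c]:.1f}%")

# If Survival were independent of Class, these percentages would be
# roughly equal across the four classes.
```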
        Segmented Bar Charts
    - Treats each bar as the ‘whole’ and divides it proportionally into segments corresponding to the percentage in each group (fig. 3.9 -> the distributions of ticket Class are different, so the variables are not independent)
    Simpson’s Paradox
    • -  Sometimes averages can be misleading or don’t make sense at all
    • -  When using averages of proportions across several different groups, it is important to make
      sure that the groups really are comparable
    • -  Table 3.10 – Moe is better overall, but Jill is better both during the day and at night ->
      Simpson’s Paradox
    • -  The problem is unfair averaging over different groups -> Jill has more of the difficult night flights, so her overall average is heavily influenced by her nighttime average / Moe benefits from more and easier day flights -> no fair comparison
    • -  The moral of Simpson’s paradox is to be careful when you average across different levels of a second variable; it’s always better to compare percentages or other averages within each level of the other variable; overall average is misleading
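The reversal can be reproduced with a few lines of Python. The counts below are illustrative assumptions that show the pattern described (one pilot better within each group, the other better overall); they are not copied from Table 3.10.

```python
# (on-time flights, total flights) per pilot and time of day.
# ASSUMPTION: illustrative counts, not taken from the notes.
flights = {
    "Moe":  {"day": (90, 100), "night": (10, 20)},
    "Jill": {"day": (19, 20),  "night": (75, 100)},
}

def rate(on_time, total):
    """On-time percentage for one group."""
    return 100 * on_time / total

def overall_rate(groups):
    """On-time percentage pooled over all groups of one pilot."""
    on_time = sum(on for on, _ in groups.values())
    total = sum(tot for _, tot in groups.values())
    return rate(on_time, total)

for pilot, groups in flights.items():
    print(pilot,
          f"day {rate(*groups['day']):.0f}%",
          f"night {rate(*groups['night']):.0f}%",
          f"overall {overall_rate(groups):.1f}%")
```

The pooled percentage weights each pilot's day and night rates by how many flights of each kind they flew, which is why it can disagree with both within-group comparisons.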
      Terms
    • -  Frequency table (relative frequency table) – lists the categories of a categorical variable and gives the count (or percentage) of observations for each category
    • -  Distribution – the distribution of a variable gives the possible values of the variable and the relative frequency of each value
    • -  Area principle – in a statistical display, each data value should be represented by the same amount of area
    • -  Bar chart (relative frequency bar chart) – shows a bar whose area represents the count (or percentage) of observations in each category of a categorical variable
    • -  Pie chart – shows how a ‘whole’ divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
    • -  Categorical data condition – the methods in this chapter are appropriate for displaying and describing categorical data; be careful not to use them with quantitative data
    • -  Contingency table – a contingency table displays counts, and sometimes, percentages of individuals falling into named categories on two or more variables; the table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other
    • -  Marginal distribution – in a contingency table, the distribution of either variable alone is called the marginal distribution; the counts or percentages are the totals found in the margins (last row or column) of the table
    • -  Conditional distribution – the distribution of a variable restricting the Who to consider only a smaller group of individuals is called a conditional distribution
    • -  Independence – variables are said to be independent if the conditional distribution of one variable is the same for each category of the other (but we can’t conclude that one variable has no effect whatsoever on another; all we know is that little effect was observed in our study)
    • -  Segmented bar chart – a segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable
    • -  Simpson’s paradox – when averages are taken across different groups, they can appear to contradict the overall averages
    Chapter 4 Displaying and Summarizing Quantitative Data
    Histograms
    • -  Usually we slice up all the possible values into equal-width bins; we then count the number of cases that fall into each bin; the bins, together with these counts, give the distribution of the quantitative variable and provide the building blocks for the histogram; by representing the counts as bars and plotting them against the bin values, the histogram displays the distribution at a glance
    • -  Fig. 4.1 – e.g. 230 earthquakes with magnitudes between 7.0 and 7.2 (each bin has a width of 0.2)
    • -  The standard rule for a value that falls exactly on a bin boundary is to put it into the next higher bin
    • -  Most earthquakes are between 5.5 and 8.5 -> earthquake of 9 is extraordinary
    • -  The bins slice up all the values of the quantitative variable, so any spaces in a histogram are
      actual gaps in the data, indicating a region where there are no values
    • -  Relative frequency histogram – replacing the counts on the vertical axis with the percentage
      of the total number of cases falling in each bin
      Stem-and-Leaf Displays
      • -  Histograms don’t show the data values themselves
      • -  A stem-and-leaf display is like a histogram, but it shows the individual values
      • -  To display the scores 83, 76 and 88 together, we could write
        8|38
        7|6
      • -  Because the leaves show the individual values, we can sometimes see even more in the data
        than the distribution’s shape
      • -  If you have scores of 432, 540, 571 and 638 -> truncate (or round) the numbers to two places,
        using the first digit as the stem and the second as the leaf (indicating that 6 | 3 means 630-639)
        6|3
        5|47
        4|3
        Dotplots
      • -  Simple display that places a dot along an axis for each case in the data
      • -  Show basic facts about the distribution
      • -  Possible clusters (two different race distances)
      • -  Fig. 4.4
      • -  Some dotplots stretch out horizontally, like a histogram, or run vertically, like a stem-and-leaf display
        Think Before You Draw
      • -  Think carefully to decide which type of graph to make
      • -  Check Categorical Data Condition before making a pie chart or a bar chart
    • -  Before making a stem-and-leaf display, a histogram, or a dotplot, you need to check the Quantitative Data Condition: The data are values of a quantitative variable whose units are known
    • -  You can’t display categorical data in a histogram or quantitative data in a bar chart
    • -  When you describe a distribution, you should always tell about three things: its shape, center
      and spread
      The Shape of a Distribution

    • -  Does the histogram have a single, central hump or several separated humps?
      These humps are called modes; a histogram with one peak, such as the earthquake magnitudes, is dubbed unimodal; histograms with two peaks are bimodal and those with three or more are called multimodal
      A histogram that doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform

    • -  Is the histogram symmetric? Can you fold it along a vertical line through the middle and have the edges match pretty closely, or are more of the values on one side?
      A symmetric histogram can fold in the middle so that the two sides almost match
      The thinner ends of a distribution are called the tails – if one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail

    • -  Do any unusual features stick out?
      You should always mention any stragglers, or outliers, that stand away from the body of the distribution (either very important or an error)

    • -  Gaps help us see multiple modes and encourage us to notice when the data may come from different sources or contain more than one group
      The Center of the Distribution: The Median
    • -  When we think of a typical value, we usually look for the center of the distribution (easy with unimodal, symmetric distribution)
    • -  When the distribution is skewed or possibly multimodal, it’s not immediately clear
    • -  One natural choice of typical value is the value that is literally in the middle, with half the
      values below it and half above it
    • -  The middle value that divides the histogram into two equal areas is called the median
    • -  For the tsunami data (page 52), there are 176 earthquakes, so the median is found at the
      (176+1)/2 = 88.5th place in the sorted data, i.e. halfway between the 88th and 89th values
    How do medians work?
    • -  If n is odd, the median is the middle value
      Counting in from the ends, we find this value in the (n+1)/2 position

    • -  When n is even, there are two middle values – so, in this case, the median is the average of the two values in positions n/2 and n/2 + 1
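The counting-in rule can be written out in a few lines of Python. This is a minimal sketch; the data values are made up for illustration.

```python
def median(values):
    """Median by the counting-in rule described above."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        # Odd n: the middle value sits at position (n + 1) / 2 (1-based).
        return data[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1 (1-based).
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median([14, 13, 20, 22, 18]))      # odd number of values
print(median([14, 13, 20, 22, 18, 19]))  # even number of values
```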
      Spread: Home on the Range
      • -  The more the data vary, however, the less the median alone can tell us
      • -  We need to measure how spread out the data values are

    • -  When we describe a distribution numerically, we always report a measure of its spread along with its center
    • -  The range of the data is defined as the difference between the maximum and minimum values
      Range = max – min
    • -  The maximum magnitude of these earthquakes is 9.0 and the minimum is 3.7 -> the range is 5.3
    • -  Disadvantage of the range: a single extreme value can make it very large, giving a value that doesn’t really represent the data overall
      Spread: The Interquartile Range
    • -  Ignore the extremes and concentrate on the middle of the data
    • -  Divide the data in half at the median, now divide both halves in half again, cutting the data
      into four quarters -> quartiles
    • -  One quarter of the data lies below the lower quartile, and one quarter of the data lies above
      the upper quartile, so half the data lies between them; the quartiles border the middle half
      of the data
    • -  When n is odd, some statisticians include the median in both halves; others omit it
    • -  The difference between the quartiles tells us how much territory the middle half of the data
      covers and is called the interquartile range; it’s commonly abbreviated IQR
      IQR = upper quartile – lower quartile
      E.g. IQR of the earthquakes: 1.0 -> the middle half of the earthquake magnitudes extends across a (interquartile) range of 1.0 Richter scale units

    • -  The IQR is almost always a reasonable summary of the spread of a distribution
    • -  One exception is when the data are strongly bimodal
    • -  For any percentage there is a corresponding percentile that cuts off that percentage of the
      data below it. The 10th and 90th percentiles, for example, identify the values below which
      10% and 90% (respectively) of the data lie. The median, of course, is the 50th percentile.
    • -  The lower and upper quartiles are also known as the 25th and 75th percentiles of the data,
      respectively, since the lower quartile falls above 25% of the data and the upper quartile falls above 75% of the data
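A short Python sketch of the range, quartiles and IQR follows. It assumes one particular convention, splitting the sorted data at the median and omitting the median from both halves when n is odd; software packages often use other conventions and may give slightly different quartiles. The data values are made up for illustration.

```python
def _median(sorted_data):
    """Median of data that is already sorted."""
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 1:
        return sorted_data[mid]
    return (sorted_data[mid - 1] + sorted_data[mid]) / 2

def five_numbers(values):
    """Min, lower quartile, median, upper quartile, max.

    ASSUMPTION: the median is omitted from both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # lower half of the data
    upper = data[n - half:]    # upper half of the data
    return data[0], _median(lower), _median(data), _median(upper), data[-1]

values = [14, 13, 20, 22, 18, 19, 13]
summary = five_numbers(values)
minimum, q1, med, q3, maximum = summary

data_range = maximum - minimum   # Range = max - min
iqr = q3 - q1                    # IQR = upper quartile - lower quartile

print("summary:", summary)
print("range:", data_range, "IQR:", iqr)
```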
      5-Number Summary
      • -  The 5-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum)
      • -  E.g. earthquake Magnitudes:







  • -  Also report the number of data values and the identity of the cases (the Who)
Summarizing Symmetric Distributions: The Mean

  • -  $\bar{y} = \frac{Total}{n} = \frac{\sum y}{n}$
  • -  The formula says to add up all the values of the variable and divide that sum by the number of data values
  • -  A median is also a kind of average
  • -  The value we calculated is called the mean, y-bar
  • -  The mean feels like the center because it is the point where the histogram balances
    Mean or Median?
    • -  The mean only makes sense with symmetric data (not if the distribution is skewed or has outliers)
    • -  For an asymmetric distribution, the median is a better summary of the center
    • -  Because the median considers only the order of the values, it is resistant to values that are
      extraordinarily large or small; it simply notes that they are one of the ‘big ones’ or the ‘small
      ones’ and ignores their distance from the center
If the histogram is symmetric and there are no outliers, we’ll prefer the mean. However, if the
histogram is skewed or has outliers, we’re usually better off with the median.
What About Spread? The Standard Deviation
  • -  IQR ignores how individual values vary
  • -  Standard deviation takes into account how far each value is from the mean
  • -  Like the mean, the standard deviation is appropriate only for symmetric data
  • -  Examine how far each data value is from the mean -> called deviation
  • -  Square each deviation
  • -  When we add up these squared deviations and find their average (almost), we call the result
    the variance
  • -  $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
  • -  To get back to the original units, we take the square root of $s^2$; the result, s, is the standard
    deviation
  • -  $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

- Example:
Original Values   Deviations    Squared Deviations
14                14-17=-3      (-3)^2=9
13                13-17=-4      (-4)^2=16
20                20-17=3       3^2=9
22                22-17=5       5^2=25
18                18-17=1       1^2=1
19                19-17=2       2^2=4
13                13-17=-4      (-4)^2=16
Add up the squared deviations: 9+16+9+25+1+4+16=80
Now divide by n-1: 80/6=13.33
Finally, take the square root: s=√13.33=3.65
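The worked example above can be checked step by step in Python. This is only a sketch that mirrors the hand calculation; the variable names are illustrative.

```python
import math

values = [14, 13, 20, 22, 18, 19, 13]

n = len(values)
mean = sum(values) / n                    # 119 / 7
deviations = [y - mean for y in values]   # distance of each value from the mean
squared = [d ** 2 for d in deviations]    # squared deviations

variance = sum(squared) / (n - 1)         # divide by n - 1, not n
std_dev = math.sqrt(variance)             # back to the original units

print(f"mean = {mean:.0f}, variance = {variance:.2f}, s = {std_dev:.2f}")
```

Python's standard library function `statistics.stdev` also divides by n - 1, so it should agree with this calculation.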





Think About Variation
  • -  If many data values are scattered far from the center, the IQR and the standard deviation will be large
  • -  Measures of spread tell how well other summaries describe the data
    What to Tell About a Quantitative Variable
    • -  Start by making a histogram or stem-and-leaf display, and discuss the shape of the distribution
    • -  Next, discuss the center and spread
      o Always pair the median with the IQR and the mean with the standard deviation
      o Skewed shape -> median and IQR
      o Symmetric shape -> mean and standard deviation (maybe median and IQR as well)
    • -  Discuss unusual features
      o Reason for multiple modes (e.g. gender) -> split data into separate groups
      o Pointing out outliers (mean and standard deviation once with outliers, once without)
    • -  Example page 62: if there is just a small outlier, and the median and the mean are close, the outlier does not seem to be a problem -> using mean and standard deviation
      Terms
    • -  Distribution – the distribution of a quantitative variable slices up all the possible values of the variable into equal-width bins and gives the number of values (or counts) falling into each bin
    • -  Histogram (relative frequency histogram) – a histogram uses adjacent bars to show the distribution of a quantitative variable. Each bar represents the frequency (or relative frequency) of values falling in each bin
    • -  Gap – a region of the distribution where there are no values
    • -  Stem-and-leaf display – shows quantitative data values in a way that sketches the
      distribution of the data. It’s best described in detail by example
    • -  Dotplot – a dotplot graphs a dot for each case against a single axis
    • -  Shape – to describe the shape of a distribution, look for single vs. multiple modes, symmetry
      vs. skewness and outliers and gaps
    • -  Mode – a hump or local high point in the shape of the distribution of a variable. The apparent
      location of modes can change as the scale of a histogram is changed
    • -  Unimodal (bimodal) – having one mode. This is a useful term for describing the shape of a
      histogram when it’s generally mound-shaped. Distributions with two modes are called
      bimodal. Those with more than two are multimodal
    • -  Uniform – a distribution that’s roughly flat is said to be uniform
    • -  Symmetric – a distribution is symmetric if the two halves on either side of the center look
      approximately like mirror images of each other
      Chapter 5 Understanding and Comparing Distributions
- Exploring different ways of examining the relationship between two variables when one is quantitative and the other is categorical and indicates groups to compare
The Big Picture
  • -  Fig. 5.1 A histogram of daily Average Wind Speed for every day in 1989; it is unimodal and skewed to the right, with a possible high outlier
  • -  Is the maximum unusually windy or just the windiest day of the year?
Boxplots
- 5-number summary of a (quantitative) variable -> boxplot (page 81)
  1. Draw a single vertical axis spanning the extent of the data; draw short horizontal lines at
    the lower and upper quartiles and at the median -> form a box
  2. Erect ‘fences’ around the main part of the data; we place the upper fence 1.5 IQRs above
    the upper quartile and the lower fence 1.5 IQRs below the lower quartile; never include
    the fences in your boxplot
  3. We use the fences to grow ‘whiskers’; draw lines from the ends of the box up and down
    to the most extreme data values found within the fences
  4. We add the outliers by displaying any data values beyond the fences with special
    symbols
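Steps 2 to 4 can be sketched in Python as follows. The function name, the data values and the hand-supplied quartiles are assumptions made for illustration; the quartiles would normally come from the 5-number summary.

```python
def boxplot_stats(values, q1, q3):
    """Fences, whisker ends and outliers for given quartiles q1 and q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # 1.5 IQRs below the lower quartile
    upper_fence = q3 + 1.5 * iqr   # 1.5 IQRs above the upper quartile
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        # Whiskers reach the most extreme values found within the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical data with quartiles computed by hand (median of each half).
data = [10, 12, 13, 14, 15, 16, 18, 30]
result = boxplot_stats(data, q1=12.5, q3=17)
print(result)
```

The fences are only used to decide where the whiskers stop and which points are flagged; they are not drawn.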
Comparing Groups with Histograms
  • -  Is it windier in the winter or the summer?
  • -  Use the same scale
  • -  Spring/summer and fall/winter
  • -  In the colder months the shape is less strongly skewed and more spread out; wind speed is
    higher, several high values
Comparing Groups with Boxplots
  • -  E.g. are some months windier than others?
  • -  Do some months show more variation? (spread)
  • -  Group observations by month -> side by side (fig. 5.4)
  • -  Easily see which groups have higher medians, which have the greater IQRs, where the central
    50% of the data is located and which have the greater overall range
  • -  Wind speeds tend to decrease in the summer
  • -  The months in which the winds are both strongest and most variable are November through
    March
  • -  Many outliers -> that windy day in July certainly wouldn’t stand out in November or
    December, but for July, it was remarkable
Outliers
  • -  An outlier is a value that doesn’t fit with the rest of the data
  • -  Boxplots provide a rule of thumb to highlight these unusual points
  • -  Try to understand them in the context of the data
  • -  Histogram gives a better idea of how the outlier fits in with the rest of the data
  • -  Look at the gap between that case and the rest of the data (maybe error in the data)
  • -  Never leave an outlier in place and proceed as if nothing were unusual
  • -  Never drop an outlier from the analysis without comment just because it’s unusual
Timeplots: Order, Please!
  • -  A display of values against time is sometimes called a timeplot
  • -  Without monthly division, we can see a calm period during the summer
  • -  More variable and stronger during the early and late parts of the year
    *Smoothing Timeplots
  • -  You could draw a smooth curve or trace through a timeplot (page 89)
  • -  A smooth trace can highlight long-term patterns and help us see them through the more local
    variation
    Looking into the Future
    • -  For example seasonal patterns -> probably safe to predict a less windy June next year and a windier November
    • -  But we wouldn’t predict another storm on November 21
    • -  But patterns do not always continue, e.g. a rising stock price
    • -  Stock prices, unemployment rates, and other economic, social or psychological concepts are
      much harder to predict than physical quantities
      Re-expressing Data: A First Look
      Re-expressing to Improve Symmetry
      • -  Data skewed -> difficult to summarize with a center and spread
      • -  Fig. 5.9 -> some CEOs received extraordinarily high compensations, while the majority
        received relatively ‘little’ -> mean value is 10307000 while the median is only 4700000
      • -  One approach is to re-express, or transform, the data by applying a simple function to make
        the skewed distribution more symmetric – we could take the square root or logarithm of
        each compensation value -> more symmetric; you can identify real outliers
      • -  Variables that are skewed to the right often benefit from a re-expression by square roots,
        logs, or reciprocals
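As a rough illustration of re-expression, the Python sketch below takes base-10 logarithms of some right-skewed values and compares the mean and median before and after. The values are hypothetical, and the gap between mean and median is used only as a crude indicator of skewness.

```python
import math
import statistics

# Hypothetical right-skewed values (e.g. compensation in thousands of dollars).
pay = [800, 1200, 1500, 2000, 3000, 4700, 9000, 25000, 120000]

# Re-express each value on a log scale.
logged = [math.log10(x) for x in pay]

print("raw:   mean =", round(statistics.mean(pay)),
      " median =", statistics.median(pay))
print("log10: mean =", round(statistics.mean(logged), 2),
      " median =", round(statistics.median(logged), 2))
```

On the raw scale a few very large values pull the mean far above the median; on the log scale the two summaries sit much closer together.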
        Re-expressing to Equalize Spread Across Groups
      • -  Fig. 5.11 the nicotine levels for both nonsmoking groups are too low to be seen (can’t be
        negative and are skewed to the high end)
      • -  Re-expressing -> logarithm
        Terms
      • -  Boxplot – displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values; boxplots are particularly effective for comparing groups and for displaying outliers
      • -  Outlier – any point more than 1.5 IQR from either end of the box in a boxplot is nominated as an outlier
      • -  Far Outlier – if a point is more than 3.0 IQR from either end of the box in a boxplot, it is nominated as a far outlier
      • -  Comparing distributions – when comparing the distributions of several groups using
        histograms or stem-and-leaf displays, consider their
        o Shape
        o Center
        o Spread
- Comparing boxplots – when comparing groups with boxplots
o Compare the shapes – do the boxes look symmetric or skewed? Are there
differences between groups?
o Compare the medians. Which group has the higher center? Is there any pattern to
the medians?
o Compare the IQRs – which group is more spread out? Is there any pattern to how the
IQRs change?
o Using the IQRs as a background measure of variation, do the medians seem to be
different, or do they just vary much as you’d expect from the overall variation?
o Check for possible outliers – identify them if you can and discuss why they might be
unusual; of course, correct them if you find that they are errors
- Timeplot – displays data that change over time; often, successive values are connected with
lines to show trends more clearly; sometimes a smooth curve is added to the plot to help show long-term patterns and trends
Chapter 6 The Standard Deviation as a Ruler and the Normal Model
- Women’s heptathlon in the Olympics – seven events measured in different units – how to compare the scores?
The Standard Deviation as a Ruler
  • -  Tells us how the whole collection of the values varies
  • -  Fig. 6.1 Stem-and-leaf displays for both the long jump and the shot put
  • -  Klüft’s 6.78-m long jump is 0.62 meter longer than the mean jump of 6.16 m -> 0.62/0.23 =
    2.70 standard deviations better than the mean // Skujyté’s winning shot is only 2.51 standard deviations better than the mean
    Standardizing with z-Scores
  • -  Expressing the distance in standard deviations standardizes the performances
  • -  To standardize a value, we simply subtract the mean performance in that event and then
    divide this difference by the standard deviation:
  • -  $z = \frac{y - \bar{y}}{s}$

  • -  These values are called standardized values, and are commonly denoted with the letter z (call them z-scores)
  • -  A z-score of 2 tells us that a data value is 2 standard deviations above the mean
  • -  The farther a data value is from the mean, the more unusual it is, so a z-score of -1.3 is more
    extraordinary than a z-score of 1.2
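  • -  Klüft: 2.70 (long jump) + 1.19 (shot put) = 3.89
  • -  Skujyté: 0.61 (long jump) + 2.51 (shot put) = 3.12
  • -  Klüft has the higher combined z-score over these two events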
  • -  When we standardize data to get a z-score, we do two things – first, we shift the data by
    subtracting the mean; then we rescale the values by dividing by their standard deviation
    Shifting Data
  • -  Histogram and boxplot for the men’s weight – some of the men are heavier than the recommended weight (74kg) -> subtracting 74 kg shifts the entire histogram down but leaves the spread and the shape exactly the same
  • -  When we shift the data by adding (or subtracting) a constant to each value, all measures of position (center, percentiles, min, max) will increase (or decrease) by the same constant
  • -  Adding (or subtracting) a constant to every data value adds (or subtracts) the same constant to measures of position, but leaves measures of spread unchanged
    Rescaling Data
    • -  Suppose we want to look at the weights in pounds instead
    • -  2.2 pounds in every kilogram, we’d convert the weights by multiplying each value by 2.2 ->
      changes the measurement units
    • -  Shape does not change
    • -  Mean also multiplied by 2.2 (like all measures of position)
    • -  Spread is also 2.2 times larger
    • -  When we multiply (or divide) all the data values by any constant, all measures of position (such as the mean, median and percentiles) and measures of spread (such as the range, the IQR, and the standard deviation) are multiplied (or divided) by that same constant
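The shifting and rescaling rules above can be checked numerically. The sketch below uses made-up weights (not the textbook data set) and Python's standard `statistics` module; the variable names are illustrative assumptions.

```python
import statistics

# Hypothetical weights in kilograms (made-up values, not the textbook data).
weights_kg = [68.0, 72.5, 74.0, 77.0, 80.0, 85.5, 91.0]

def summarize(values):
    """Return the mean, median, standard deviation and range of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "range": max(values) - min(values),
    }

original = summarize(weights_kg)

# Shifting: subtract the recommended weight of 74 kg from every value.
shifted = summarize([w - 74 for w in weights_kg])

# Rescaling: convert kilograms to pounds by multiplying by 2.2.
rescaled = summarize([w * 2.2 for w in weights_kg])

# Standardizing: shift by the mean, then rescale by the standard deviation.
z_scores = [(w - original["mean"]) / original["sd"] for w in weights_kg]
standardized = summarize(z_scores)

for label, stats in [("original", original), ("shifted", shifted),
                     ("rescaled", rescaled), ("z-scores", standardized)]:
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Shifting moves the mean and median by 74 but leaves the standard deviation and range alone; rescaling multiplies all four by 2.2; the z-scores end up with mean 0 and standard deviation 1.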
Back to z-Scores
  • -  When we subtract the mean of the data from every data value, we shift the mean to zero (shifts don’t change the standard deviation)
  • -  Each shifted value is then divided by s -> the standard deviation is divided by s as well (the SD was s) -> the new standard deviation becomes 1
  • -  Standardizing into z-scores does not change the shape of the distribution of a variable
  • -  Standardizing into z-scores changes the center by making the mean 0
  • -  Standardizing into z-scores changes the spread by making the standard deviation 1
When Is a z-Score BIG?
  • -  How far from 0 does a z-score have to be to be interesting or unusual?
  • -  To say more about how big we expect a z-score to be, we need to model the data’s distribution (a model of reality, not reality itself)
  • -  ‘Bell-shaped curves’ (normal models) -> normal models are appropriate for distributions whose shapes are unimodal and roughly symmetric
  • -  There is a normal model for every possible combination of mean and standard deviation
  • -  N(μ, σ) with a mean of μ and a standard deviation of σ
  • -  This mean and standard deviation are not numerical summaries of data -> they are parameters of the model
  • -  Standardizing with the parameters: $z = \frac{y - \mu}{\sigma}$
  • -  The normal model with mean 0 and standard deviation 1 is called the standard normal model (or the standard normal distribution)
  • -  Normality Assumption
  • -  Nearly Normal Condition -> the shape of the data’s distribution is unimodal and symmetric: check this by making a histogram (or a normal probability plot, which we’ll explain later)

The 68-95-99.7 Rule
- It turns out that in a normal model, about 68% of the values fall within 1 standard deviation of the mean, about 95% of the values fall within 2 standard deviations of the mean, and about 99.7% – almost all – of the values fall within 3 standard deviations of the mean (fig. 6.6)
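The three percentages in the rule can be reproduced from the standard normal model. This is a minimal sketch that builds the normal cumulative probability from `math.erf`; the helper names are assumptions made for illustration.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal model N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fraction_within(k):
    """Fraction of a normal model lying within k standard deviations of the mean."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.4f}")
```

The printed fractions round to 68%, 95% and 99.7%, which is where the rule gets its name.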
The First Three Rules for Working with Normal Models
  1. Make a picture
  2. Make a picture
  3. Make a picture
  • -  Sketch pictures to help think about normal models
  • -  Make a histogram or check the Nearly Normal Condition
    The worst-case scenario: Tchebycheff’s Inequality
  • -  5 standard deviations above the mean
  • -  But 68-95-99.7 rule applies only to normal models
  • -  In any distribution, at least $1 - \frac{1}{k^2}$ of the values must lie within ±k standard deviations of the
    mean
  • -  For k = 1, $1 - \frac{1}{1^2} = 0$; if the distribution is far from Normal, it is possible that none of the values lie within 1 standard deviation of the mean
  • -  For k = 2, $1 - \frac{1}{2^2} = \frac{3}{4}$; no matter how strange the shape of the distribution, at least 75% of
    the values must be within 2 standard deviations of the mean
  • -  For k = 3, $1 - \frac{1}{3^2} = \frac{8}{9}$; in any distribution, at least 89% of the values lie within 3 standard
    deviations of the mean
  • -  Values beyond 3 standard deviations from the mean are uncommon, Normal model or not
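Tchebycheff’s bound is simple enough to tabulate directly. A small sketch, with a hypothetical helper name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean,
    valid for any distribution (Tchebycheff's inequality): 1 - 1/k^2."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k**2)

for k in (1, 2, 3, 5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the values")
```

For k = 5 the bound is 96%, so even in the worst case a value 5 standard deviations from the mean is rare.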
Finding Normal Percentiles
Finding Normal Percentiles Using Technology
From Percentiles to Scores: z in Reverse
Are You Normal? Find Out with a Normal Probability Plot

- The normal probability plot – if the distribution of the data is roughly normal, the plot is roughly a diagonal straight line; deviations from a straight line indicate that the distribution is not normal
How Does a Normal Probability Plot Work?
Chapter 7 Scatterplots, Association, and Correlation
  • -  Figure 7.1 scatterplot of the average error in nautical miles of the predicted position of Atlantic hurricanes, plotted against the Year in which the predictions were made
  • -  Predictions have improved -> decline in the average error
  • -  This timeplot is an example of a more general kind of display called a scatterplot. Scatterplots
    may be the most common displays for data. By just looking at them, you can see patterns,
    trends, relationships, and even the occasional extraordinary value sitting apart from the
    others
  • -  A scatterplot shows the relationship between two quantitative variables

Looking at Scatterplots
  • -  Direction of the association is important
  • -  A pattern like this that runs from the upper left to the lower right is said to be negative
  • -  A pattern running the other way is called positive
  • -  The second thing to look for in a scatterplot is its form: if there is a straight line relationship,
    it will appear as a cloud or swarm of points stretched out in a generally consistent, straight
    form
  • -  E.g. the scatterplot of Prediction Error vs. Year has such an underlying linear form, although
    some points stray away from it
  • -  Straight, curved, something exotic, or no pattern?
  • -  The third feature to look for in a scatterplot is how strong the relationship is
  • -  At one extreme, do the point appear tightly clustered in a single stream (whether straight,
    curved, or bending all over the place)
  • -  Or at the other extreme, does the swarm of points seem to form a vague cloud through
    which we can barely discern any trend or pattern?
  • -  Fourth: look for unusual features and the unexpected. Often the most interesting thing to see in
    a scatterplot is something you never thought to look for, e.g. an outlier standing away from
    the overall pattern of the scatterplot
  • -  Also look for clusters or subgroups that stand away from the rest of the plot or that show a
    trend in a different direction
    Roles for Variables
- We will call the variable of interest the response variable and the other the explanatory or predictor variable. We will continue our practice of naming the variable of interest y (on the y-axis) and place the explanatory variable on the x-axis (-> x-and y-variables)
Correlation
  • -  Height (x-axis, explanatory variable) and Weight (y-axis, response variable) -> taller students tend to weigh more
  • -  Figure 7.2: form is fairly straight, although there seems to be a high outlier, as the plot
    shows; pattern looks straight, clearly positive
  • -  The units shouldn’t matter to our measure of strength; we can remove them by
    standardizing each variable -> for each point, instead of the value (x, y) we will have the standardized coordinates (zx, zy) -> to standardize values, we subtract the mean of each variable and then divide by its standard deviation:
    $(z_x, z_y) = \left(\frac{x - \bar{x}}{s_x}, \frac{y - \bar{y}}{s_y}\right)$
  • -  Figure 7.3: scatterplot of standardized heights and weights – the scales on both axes are standard deviation units -> the underlying linear pattern seems steeper in the standardized plot (because the scales of the axes are now the same) -> equal scaling gives a neutral way of drawing the scatterplot and a fairer impression of the strength of the association
  • -  In a positive association, y tends to increase as x increases -> points in the upper right and lower left strengthen that impression
  • -  Points in the upper left and lower right quadrants tend to weaken the positive association
  • -  Points with z-scores of zero on either variable don’t vote either way, because zx zy = 0 (see
    also figure 7.4)
  • -  To turn these products into a measure of the strength of the association, just add up the zx zy
    products for every point in the scatterplot:
    $\sum z_x z_y$
    This summarizes the direction and strength of the association for all the points
  • -  To adjust for the fact that the size of the sum gets bigger the more data we have, we divide
    the sum by n-1 -> correlation coefficient: $r = \frac{\sum z_x z_y}{n - 1}$
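The construction of r from standardized coordinates can be sketched in a few lines. The heights and weights below are made-up illustration values, not the data behind Figure 7.2, and the function names are assumptions.

```python
import statistics

def z_scores(values):
    """Standardize a list: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Correlation coefficient r = sum(zx * zy) / (n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [62, 64, 66, 68, 70, 72, 74]
weights_lb = [115, 130, 128, 150, 160, 158, 185]

r = correlation(heights_in, weights_lb)

# Changing units (inches -> centimetres, pounds -> kilograms) leaves r unchanged,
# because r depends only on the z-scores.
r_metric = correlation([h * 2.54 for h in heights_in],
                       [w / 2.2 for w in weights_lb])

print(f"r = {r:.3f}, r in metric units = {r_metric:.3f}")
```

Because the units cancel when each variable is standardized, both calls return the same value, and swapping the two arguments does not change it either.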

    (see also page 155/156)
    Correlation Conditions
    • -  Correlation measures the strength of the linear association between two quantitative variables
    • -  Before you use correlation, you must check several conditions:
      o Quantitative Variables Condition – correlation applies only to quantitative variables
      o Straight Enough Condition
      o Outlier Condition – when you see an outlier, it is often a good idea to report the
      correlation with and without the point
    Correlation Properties
  • -  The sign of a correlation coefficient gives the direction of the association
  • -  Correlation is always between -1 and +1 – correlation can be exactly equal to -1 and +1 but
    these values are unusual in real data
  • -  Correlation treats x and y symmetrically – the correlation of x and y is the same as the
    correlation of y with x
  • -  Correlation has no units (but don’t use percentages)
  • -  Correlation is not affected by changes in the center or scale of either variable – changing the
    units or baseline of either variable has no effect on the correlation coefficient – correlation
    depends only on the z-scores, and they are unaffected by changes in center or scale
  • -  Correlation measures the strength of the linear association between the two variables
  • -  Correlation is sensitive to outliers – a single outlying value can make a small correlation large
    or make a large one small
    Warning: Correlation ≠ Causation
  • -  Figure 7.5 – the two variables are obviously related to each other but that doesn’t prove that storks bring babies
  • -  A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable
  • -  Scatterplots and correlation coefficients never prove causation
    Correlation Tables
  • -  The rows and columns of the table name the variables, and the cells hold the correlations
  • -  But: without any checks for linearity and outliers, the correlation table risks showing truly small correlations that have been inflated by outliers, truly large correlations that are hidden by outliers, and correlations of any size that may be meaningless because the underlying form is not linear
  • -  Table 7.1: the diagonal cells of a correlation table always show correlations of exactly 1
    *Measuring Trend: Kendall’s Tau
  • -  Scales of the sort that attempt to measure attitudes numerically are called Likert scales
  • -  Likert scales have order (e.g. assessing the pace of a course on a scale from 1 to 5)
  • -  But the correlation coefficient might not be the appropriate measure -> use an alternative
    measure of association: Kendall’s tau
  • -  Kendall’s tau is a statistic designed to assess how close the relationship between two
    variables is to being monotone – a monotone relationship is one that consistently increases
    or decreases, but not necessarily in a linear fashion
  • -  Kendall’s tau measures monotonicity directly – for each pair of points in a scatterplot, it
    records only whether the slope of a line between those two points is positive, negative, or zero
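The pair-by-pair idea can be written out directly. The sketch below assumes the simplest version of the statistic (often called tau-a), in which pairs with a zero or undefined slope count as neither positive nor negative; the notes do not say which variant the book uses, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """(pairs with positive slope - pairs with negative slope) / all pairs.

    Assumption: this is the tau-a form; tied pairs contribute zero.
    """
    points = list(zip(xs, ys))
    pairs = list(combinations(points, 2))
    positive = negative = 0
    for (x1, y1), (x2, y2) in pairs:
        # The sign of this product is the sign of the slope between the points.
        product = (x2 - x1) * (y2 - y1)
        if product > 0:
            positive += 1
        elif product < 0:
            negative += 1
    return (positive - negative) / len(pairs)

# A curved but consistently increasing relationship is perfectly monotone.
xs = [1, 2, 3, 4, 5]
print(kendall_tau(xs, [x**2 for x in xs]))    # increasing
print(kendall_tau(xs, [-(x**3) for x in xs]))  # decreasing
```

Only the sign of each slope is used, so a bend in the relationship does not lower tau as long as the direction never reverses.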
    *Nonparametric Association: Spearman’s Rho
    • -  Spearman’s rho can deal with the two problems of outliers and bends in the data (that make it impossible to interpret correlation)
    • -  Rho replaces the original data values with their ranks within each variable
    • -  It replaces the lowest value in x by the number 1 ...
    • -  The same ranking method is applied to the y-variable
    • -  Spearman’s rho is the correlation of the two rank variables – it must be between -1 and 1
    • -  Both (Spearman and Kendall) are examples of what are called nonparametric or distribution-free methods
    Straightening Scatterplots
    • -  Re-expressing one variable (e.g. squaring it) can straighten a curved scatterplot -> more linear relationship
Chapter 8 Linear Regression
  • -  Burger King: the scatterplot of the Fat (in grams) versus the Protein (in grams) for food sold at Burger King shows a positive, moderately strong, linear relationship
  • -  The correlation between Fat and Protein is 0.83 (fairly strong relationship)
  • -  Just as the Normal model is specified by its two parameters (mean μ and standard deviation σ), we can model the relationship with a line and give its equation -> linear model (an equation of a straight line through the data; but wrong in the sense that it can’t match reality exactly)
    Residuals
    • -  Figure page 179
    • -  The line might suggest that the BK Broiler chicken sandwich with 30 grams of protein should have 36 grams of fat when, in fact, it actually has only 25 grams of fat
    • -  We call the estimate made from a model the predicted value, and write it as $\hat{y}$ to distinguish it from the true value, y
    • -  The difference between the observed value and its associated predicted value is called the residual – the residual value tells us how far off the model’s prediction is at that point
    • -  BK Broiler chicken residual: $y - \hat{y} = 25 - 36 = -11$ g of fat -> the actual fat content is about 11 grams less than the model predicts
    • -  To find the residuals, we always subtract the predicted value from the observed one
    “Best Fit” Means Least Squares
  • -  Square all the residuals and add them up
  • -  The sum indicates how well the line we drew fits the data – the smaller the sum, the better
    the fit
  • -  The line of best fit is the line for which the sum of the squared residuals is smallest, the least
    squares line
    The Linear Model
  • -  Straight line: y = mx + b
  • -  Linear model (statistics): $\hat{y} = b_0 + b_1 x$ (predicted value = intercept + slope times x)
  • -  The b’s are called the coefficients of the linear model – the coefficient b1 is the slope, which
    tells how rapidly $\hat{y}$ changes with respect to x – the coefficient b0 is the intercept, which tells
    where the line hits (intercepts) the y-axis
  • -  Burger King: $\widehat{Fat} = 6.8 + 0.97\,Protein$ (one more gram of protein -> 0.97 more grams of fat predicted, on average;
    no protein -> 6.8 grams of fat? Not reasonable; the intercept then serves only as a starting value for our predictions)
The Least Squares Line
  • -  The least squares line is built from the correlation (tells us the strength of the linear association), the standard deviations (give us the units), and the means (tell us where to put the line)
  • -  Slope: $b_1 = r \frac{s_y}{s_x}$
  • -  Changing the units of x and y affects their standard deviations directly
  • -  Units of the slope are always the units of y per unit of x
  • -  Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
  • -  Example page 182
  • -  Regression almost always means “the linear model fit by least squares”
  • -  To use a regression model, we should check the same conditions for regressions as we did for
    correlation: the Quantitative Variables Condition, the Straight Enough Condition, and the Outlier Condition
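The slope and intercept formulas above translate directly into code. In this sketch the function names and the four-item data set are assumptions for illustration; only the Burger King equation (6.8 + 0.97 Protein) and the BK Broiler values come from the notes.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = mean_y - b1 * mean_x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - mean_x) / sx * (y - mean_y) / sy
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * sy / sx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def predicted_fat(protein_g, b0=6.8, b1=0.97):
    """Predicted fat (grams) from the Burger King model quoted in the notes."""
    return b0 + b1 * protein_g

# BK Broiler chicken sandwich: 30 g of protein, 25 g of fat observed.
prediction = predicted_fat(30)
residual = 25 - prediction  # observed minus predicted
print(f"predicted fat = {prediction:.1f} g, residual = {residual:.1f} g")

# Hypothetical menu items (made-up numbers) to exercise the formulas.
protein = [10, 18, 25, 30, 42]
fat = [15, 26, 28, 40, 47]
b0, b1 = least_squares_line(protein, fat)
print(f"fitted line: fat-hat = {b0:.2f} + {b1:.2f} * protein")
```

The unrounded prediction is 35.9 g, which the notes round to 36 g, giving the residual of about -11 g.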
    Correlation and the Line
- Figure 8.3: scatterplot for the BK items of zy (standardized Fat) vs. zx (standardized Protein) along with their least squares line
  • -  Equation: $\hat{z}_y = r z_x$
  • -  It says that in moving one standard deviation from the mean in x, we can expect to move about r standard deviations away from the mean in y
  • -  BK: if we standardize both protein and fat, we can write $\hat{z}_{Fat} = 0.83\, z_{Protein}$
  • -  It tells us that for every standard deviation above (or below) the mean a menu item is in protein, we’d predict that its fat content is 0.83 standard deviations above (or below) the mean fat content
  • -  In general, menu items that are one standard deviation away from the mean in x are, on average, r standard deviations away from the mean in y
How Big Can Predicted Values Get?
- Each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean, and that’s where we got the term regression line.
Residuals Revisited
  • -  Data = Model + Residual
  • -  Residual = Data - Model
  • -  $e = y - \hat{y}$
  • -  A scatterplot of the residuals versus the x-values should be the most boring scatterplot
    you’ve ever seen – it shouldn’t have any interesting features, like a direction or shape – it should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers.
    The Residual Standard Deviation
    • -  The standard deviation of the residuals, se, gives us a measure of how much the points spread around the regression line
    • -  New assumption: Equal Variance Assumption with the associated Does the Plot Thicken? Condition – spread is about the same throughout
    • -  $s_e = \sqrt{\frac{\sum e^2}{n - 2}}$
      R² – The Variation Accounted For
  • -  A correlation of -0.5 does as well as a correlation of 0.5; only the direction of the association differs
  • -  If we square the correlation coefficient, we’ll get a value between 0 and 1, and the direction
    won’t matter
  • -  The squared correlation, r2, gives the fraction of the data’s variation accounted for by the
    model, and 1-r2 is the fraction of the data’s variation left in the residuals
  • -  BK: 31% of the variability in total Fat has been left in the residuals / 69% of the variability in
    the fat content of BK sandwiches is accounted for by variation in the protein content
  • -  All regression analyses include this statistic, although by tradition, it is written with a capital
    letter, R2, and pronounced “R-squared”
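The claim that the squared correlation is the fraction of variation accounted for can be checked on any data set. The sketch below uses made-up protein and fat values (not the Burger King menu) and hypothetical function names; it also computes the residual standard deviation with the usual n - 2 divisor.

```python
import math
import statistics

def fit_line(xs, ys):
    """Least squares intercept and slope (algebraically the same as r * sy / sx)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def regression_summary(xs, ys):
    """Return r, R^2 computed from the residuals, and the residual SD s_e."""
    b0, b1 = fit_line(xs, ys)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)        # variation left in the residuals
    sst = sum((y - mean_y) ** 2 for y in ys)    # total variation in y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * sst)
    return r, 1 - sse / sst, math.sqrt(sse / (len(xs) - 2))

# Hypothetical data, for illustration only.
protein = [10, 18, 25, 30, 42, 12, 35]
fat = [15, 26, 28, 40, 47, 22, 31]
r, r_squared, s_e = regression_summary(protein, fat)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R^2 from residuals = {r_squared:.3f}")
print(f"residual standard deviation s_e = {s_e:.2f}")

# The Burger King figure quoted in the notes: r = 0.83.
print(f"0.83 squared = {0.83 ** 2:.4f}")
```

The two ways of computing R² agree, and 0.83² = 0.6889 matches the 69% quoted for the Burger King model.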
How Big Should R2 Be?
  • -  R2 depends on the kind of data you are analyzing
  • -  Data from scientific experiments often have high percentages
  • -  Data from observational studies and surveys often show weak associations -> an R² of 30% to 50% (or even lower) can provide evidence of a useful regression
A Tale of Two Regressions
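  • -  Solving our equation for Protein to get a model for predicting Protein from Fat does not work; least squares minimizes the errors in the predicted variable, so we have to fit a new regression with Protein as the response
  • -  $\widehat{Protein} = 0.55 + 0.709\,Fat$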
    Regression Assumptions and Conditions
  • -  Reasonable?
  • -  Check Quantitative Variables Condition to be sure a regression is appropriate
  • -  Linear model
    o Linearity Assumption
    o Straight Enough Condition
    o Does the Plot Thicken? Condition
    o Outlier Condition
  • -  For the standard deviation of the residuals to summarize the scatter, all the residuals should share the same spread
    Reality Check: Is the Regression Reasonable?
    • -  Direction right?
    • -  Size reasonable?
      Chapter 18 Sampling Distribution Models
      The Central Limit Theorem for Sample Proportions
  • -  True proportion: p = 0.45 (45% of all American adults believe in ghosts) (fig. 18.1)
  • -  2000 simulated independent samples of 808 adults (p=0.45); we don’t get the same
    proportion for each sample we draw
  • -  p = parameter of the model (the probability of a success)
  • -  ^p for the observed proportion in a sample
  • -  q for the probability of a failure (q = 1 - p) and ^q for its observed value
  • -  P = general probability
  • -  The histogram (Fig.18.1) is a simulation of what we’d get if we could see all the proportions
    from all possible samples; that distribution has a special name; it is called the sampling
    distribution of the proportions
  • -  A sampling distribution model for how a sample proportion varies from sample to sample
    allows us to quantify that variation and to talk about how likely it is that we’d observe a
    sample proportion in any particular interval
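  • -  To use a Normal model, we need to specify two parameters: its mean and standard
    deviation; the center is p, so we’ll put μ, the mean of the Normal, at p
  • -  Standard deviation of the proportion of successes, ^p: ^p is the number of successes divided by the number of trials, n, so the standard deviation of the count, $\sqrt{npq}$, is also divided by n:
    $\sigma(\hat{p}) = SD(\hat{p}) = \frac{\sqrt{npq}}{n} = \sqrt{\frac{pq}{n}}$
  • -  Model: $N\left(p, \sqrt{\frac{pq}{n}}\right)$
  • -  p = 0.45 -> $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.45 \times 0.55}{808}} = 0.0175$
  • -  Scale for the Normal model: centered at p = 0.45, with marks every 0.0175 (one SD) on either side
  • -  Normal model -> 68-95-99.7 Rule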
  • -  Since 2*1.75% = 3.5%, we see that the CBS poll estimating belief in ghosts at 48% is
    consistent with our guess of 45%
  • -  This is what we mean by sampling error – it’s not really an error at all, but just variability
    you’d expect to see from one sample to another -> sampling variability
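The ghost-belief simulation described above (2000 samples of 808 adults with p = 0.45) is easy to repeat. This is a rough sketch using Python's `random` module with an arbitrary fixed seed; it is not the textbook's own simulation, and the exact simulated values depend on the seed.

```python
import math
import random
import statistics

p, n, n_samples = 0.45, 808, 2000

# Model value: SD(p-hat) = sqrt(p * q / n)
sd_model = math.sqrt(p * (1 - p) / n)

rng = random.Random(18)  # arbitrary seed so the run is repeatable

def sample_proportion(rng, p, n):
    """Proportion of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n)) / n

proportions = [sample_proportion(rng, p, n) for _ in range(n_samples)]
sim_mean = statistics.mean(proportions)
sim_sd = statistics.stdev(proportions)

# Share of simulated proportions within 2 SDs of p (the 95% part of the rule).
within_2sd = sum(abs(ph - p) <= 2 * sd_model for ph in proportions) / n_samples

print(f"model SD = {sd_model:.4f}")
print(f"simulated mean = {sim_mean:.4f}, simulated SD = {sim_sd:.4f}")
print(f"fraction within 2 SDs of p = {within_2sd:.3f}")
```

The simulated proportions center near 0.45 with a spread close to 0.0175, and roughly 95% of them land within 3.5 percentage points of p, which is why a poll result of 48% is unsurprising.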
How good is the Normal model?
- The model becomes a better and better representation of the distribution of the sample proportions as the sample size gets bigger
Assumptions and Conditions
  • -  The Independence Assumption: the sampled values must be independent of each other
  • -  The Sample Size Assumption: the sample size, n, must be large enough
  • -  Randomization Condition: in an experiment, subjects should have been randomly assigned; in a survey, the sample should be a random sample
  • -  10% Condition: the sample size, n, must be no larger than 10% of the population
  • -  Success/Failure Condition: we expect at least 10 successes and at least 10 failures (np ≥ 10 and nq ≥ 10)

    A Sampling Distribution Model
    • -  Laplace: the larger the sample size, the better the model works
    • -  No longer is a proportion something we just compute for a set of data; we now see it as a
      random quantity that has a probability distribution, and thanks to Laplace we have a
      model for that distribution -> sampling distribution model for the proportion
    • -  They inform us about the amount of variation we should expect when we sample
    • -  They act as a bridge from the real world of data to the imaginary model of the statistic and
      enable us to say something about the population when all we have is data from the real
      world
    • -  Margin of error
      What about Quantitative Data?
      • -  Means also have a sampling distribution that we can model with a Normal model
      • -  Laplace’s theoretical result applies to means, too
        Simulating the Sampling Distribution of a Mean
        • -  Toss a fair die 10000 times (fig.18.5)
        • -  Toss a pair of dice and record the average of the two (fig.18.6) -> more likely to get an
          average near 3.5 -> triangular distribution
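[Figure: axis of the Normal model for ^p with p = 0.45 and σ = 0.0175: 0.3975 (-3σ), 0.4150 (-2σ), 0.4325 (-1σ), 0.4500 (p), 0.4675 (+1σ), 0.4850 (+2σ), 0.5025 (+3σ)]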



- Average 3 or 4 dice -> Law of Large Numbers: as the sample size (number of dice) gets larger, each sample average is more likely to be closer to the population mean; the distribution of the averages is also becoming bell-shaped and approaching the Normal model

The Central Limit Theorem: The Fundamental Theorem of Statistics
  • -  For the sampling distribution of proportions, we had to check a few conditions
  • -  For means, there are almost no conditions at all
  • -  The sampling distribution of any mean becomes more nearly Normal as the sample size
    grows; all we need is for the observations to be independent and collected with randomization; we don’t even care about the shape of the population distribution
  • -  This surprising fact is the result Laplace proved -> Central Limit Theorem (CLT)
  • -  Not only does the distribution of means of many random samples get closer and closer to a Normal model as the sample size grows, this is true regardless of the shape of the population distribution
  • -  Even skewed or bimodal population -> CLT: means of repeated random samples will tend to follow a Normal model as the sample size grows
  • -  Works better and faster the closer the population distribution is to a Normal model
  • -  Works better for larger samples
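The dice experiment from the previous section makes the theorem concrete. The sketch below averages 1, 2, 5 and 20 simulated dice; the seed, the number of repeats and the helper names are arbitrary choices, and the comparison value σ/√n uses the known standard deviation of a single fair die, √(35/12).

```python
import math
import random

rng = random.Random(2017)    # arbitrary seed so the run is repeatable
DIE_MEAN = 3.5
DIE_SD = math.sqrt(35 / 12)  # population SD of one fair six-sided die

def simulate_means(n_dice, n_repeats=10_000):
    """Average n_dice fair dice, n_repeats times; return the list of averages."""
    return [
        sum(rng.randint(1, 6) for _ in range(n_dice)) / n_dice
        for _ in range(n_repeats)
    ]

def mean_and_sd(values):
    """Sample mean and sample standard deviation of a list."""
    m = sum(values) / len(values)
    variance = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(variance)

results = {}
for n_dice in (1, 2, 5, 20):
    m, sd = mean_and_sd(simulate_means(n_dice))
    results[n_dice] = (m, sd)
    print(f"n = {n_dice:2d}: mean of averages = {m:.3f}, "
          f"SD of averages = {sd:.3f}, sigma/sqrt(n) = {DIE_SD / math.sqrt(n_dice):.3f}")
```

The averages stay centered at 3.5 while their spread shrinks in step with σ/√n, even though a single die has a flat, decidedly non-Normal distribution.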
    Assumptions and Conditions (for the CLT)
    • -  Independence & Sample Size Assumption
    • -  Randomization Condition
      10% Condition
      Large Enough Sample Condition

But Which Normal?
  • -  For proportions, the sampling distribution is centered at the population proportion
  • -  For means, it’s centered at the population mean
  • -  Means have smaller standard deviations than individuals
  • -  The standard deviation of $\bar{y}$ falls as the sample size grows
  • -  But it only goes down by the square root of the sample size: $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$
  • -  When we have categorical data, we calculate a sample proportion, $\hat{p}$; the sampling distribution of this random variable has a Normal model with a mean at the true proportion p and a standard deviation of $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$
  • -  When we have quantitative data, we calculate a sample mean, $\bar{y}$; the sampling distribution of this random variable has a Normal model with a mean at the true mean, μ, and a standard deviation of $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$
About variation

  • -  Means vary less than individual data values
  • -  Variability of sample means decreases as the sample size increases
  • -  If only we had a much larger sample, we could get the standard deviation of the sampling distribution really under control so that the sample mean could tell us still more about the unknown population mean
  • -  The square root limits how much we can make a sample tell about the population (law of diminishing returns)
    The Real World and the Model World
  • -  Real world -> distribution of the sample (histogram, bar chart, table)
  • -  Math world -> sampling distribution model of the statistic, a Normal model based on the CLT
  • -  Don’t think the CLT says that the data are Normally distributed as long as the sample is large
    enough
  • -  The CLT doesn’t talk about the distribution of the data from the sample; it talks about the
    sample means and sample proportions of many different random samples drawn from the same population
    Sampling Distribution Models
    • -  Statistic itself is a random variable
    • -  Shows us the distribution of possible values that the statistic could have had
    • -  Sample-to-sample variability generates the sampling distribution
    • -  Sampling distributions arise because samples vary – each random sample will contain
      different cases and, so, a different value of the statistic
    • -  Although we can always simulate a sampling distribution, the CLT saves us the trouble for
      means and proportions
      Terms
    • -  Sampling distribution model – different random samples give different values for a statistic; the sampling distribution model shows the behavior of the statistic over all the possible samples for the same size n
    • -  Sampling variability – the variability we expect to see from one random sample to another
    • -  Sampling error – sampling variability
    • -  Sampling distribution model for a proportion – if assumptions of independence and random
      sampling are met, and we expect at least 10 successes and 10 failures, then the sampling distribution of a proportion is modeled by a Normal model with a mean equal to the true
      proportion value, p, and a standard deviation equal to $\sqrt{\frac{pq}{n}}$
    • -  Central Limit Theorem – CLT states that the sampling distribution model of the sample mean (and proportion) from a random sample is approximately Normal for large n, regardless of the distribution of the population, as long as the observations are independent
    • -  Sampling distribution model for a mean – if assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, μ, and a standard deviation equal to $\frac{\sigma}{\sqrt{n}}$
      Chapter 19 Confidence Intervals for Proportions (sea fans)

A Confidence Interval
  • -  We know it’s approximately Normal and that its mean is the proportion of all infected sea fans on the Las Redes Reef
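  • -  ^p = 51.9% (the sampling distribution of ^p is centered at p)
  • -  $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$
  • -  We don’t know p
  • -  Whenever we estimate the standard deviation of a sampling distribution, we call it a
    standard error
    $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
  • -  Sea fans: $SE(\hat{p}) = \sqrt{\frac{0.519 \times 0.481}{104}} \approx 0.049 = 4.9\%$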
  • -  Because it’s Normal, it says that about 68% of all samples of 104 sea fans will have ^p’s
    within 1 SE (0.049) of p; about 95% of all these samples will be within p ± 2 SEs -> if I reach out 2 SEs, or 2 * 0.049, away from ^p on both sides, I’m 95% sure that p will be within my grasp
  • -  “We are 95% confident that between 42.1% and 61.7% of Las Redes sea fans are infected.” -> statements like these are called confidence intervals
  • -  The interval calculated and interpreted here is sometimes called a one-proportion z-interval
    What Does “95% Confidence” Really Mean?
- Formally, what we mean is that “95% of samples of this size will produce confidence intervals that capture the true proportion.” This is correct, but a little long winded, so we sometimes say, “we are 95% confident that the true proportion lies in our interval.” Our uncertainty is about whether the particular sample we have at hand is one of the successful ones or one of the 5% that fail to produce an interval that captures the true value.
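The sea fan interval quoted earlier can be reproduced from ^p = 0.519 and n = 104. A minimal sketch, with a hypothetical function name; the default critical value of 1.96 is the more precise 95% value, while the notes’ worked example uses the rounded value 2.

```python
import math

def one_proportion_z_interval(p_hat, n, z_star=1.96):
    """Return (lower, upper, standard_error) for p_hat +/- z_star * SE(p_hat)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * se
    return p_hat - margin, p_hat + margin, se

# Sea fans: 51.9% of 104 sampled fans infected, reaching out 2 SEs as in the notes.
low, high, se = one_proportion_z_interval(0.519, 104, z_star=2)
print(f"SE = {se:.3f}, interval = ({low:.3f}, {high:.3f})")

# The same data with the more precise 95% critical value.
low_196, high_196, _ = one_proportion_z_interval(0.519, 104)
print(f"with z* = 1.96: ({low_196:.3f}, {high_196:.3f})")
```

With 2 SEs the interval runs from about 42.1% to 61.7%, matching the statement in the notes; using 1.96 narrows it very slightly.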
Margin of Error: Certainty vs. Precision
  • -  Our confidence interval had the form ^p ± 2 SE (^p)
  • -  The extent of the interval on either side of ^p is called the margin of error (ME)
    Critical Values
    • -  To change the confidence level, we’d need to change the number of SEs so that the size of the margin of error corresponds to the new level
    • -  This number of SE is called the critical value -> z* -> Table Z (Table D)
    • -  For a 95% confidence interval, you’ll find the precise critical value is z* = 1.96 -> 95% of a Normal model lies within 1.96
      standard deviations of the mean
      Assumptions and Conditions
- We can never be certain that an assumption is true, but we can decide intelligently whether it is reasonable
Independence Assumption
  • -  Whether you decide that the Independence Assumption is plausible depends on your knowledge of the situation
  • -  Randomization Condition
  • -  10% Condition
    Sample Size Assumption
  • -  Whether the sample is large enough to make the sampling model for the sample proportions approximately Normal
  • -  Success/Failure Condition: we must expect at least 10 successes and at least 10 failures
    Choosing Your Sample Size
  • -  Suppose a candidate is planning a poll and wants to estimate voter support within 3% with 95% confidence. How large a sample does she need?
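  • -  $ME = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$
  • -  $0.03 = 1.96 \sqrt{\frac{\hat{p}\hat{q}}{n}}$
  • -  For ^p we can guess a value – the worst case is 0.50, which makes ^p^q (and therefore n) largest
  • -  $0.03 = 1.96 \sqrt{\frac{0.5 \times 0.5}{n}}$
  • -  $\sqrt{n} = \frac{1.96 \sqrt{0.5 \times 0.5}}{0.03} \approx 32.67$
  • -  n ≈ 1067.1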
  • -  We need at least 1068 respondents to keep the margin of error as small as 3% with a
    confidence level of 95%
  • -  To cut the standard error (and thus the ME) in half, we must quadruple the sample size
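The sample size calculation can be wrapped in a small helper. The function name and defaults below are assumptions for illustration; the worst-case guess of 0.5 for the proportion follows the notes.

```python
import math

def required_sample_size(margin_of_error, z_star=1.96, p_guess=0.5):
    """Smallest n with z_star * sqrt(p_guess * (1 - p_guess) / n) <= margin_of_error."""
    n_exact = (z_star / margin_of_error) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n_exact)  # always round up to be safe

n_3_percent = required_sample_size(0.03)
n_1_5_percent = required_sample_size(0.015)
print(f"ME of 3%:   n = {n_3_percent}")
print(f"ME of 1.5%: n = {n_1_5_percent}")
```

Halving the margin of error from 3% to 1.5% raises the requirement from 1068 to 4269 respondents, which is the "quadruple the sample size" rule in action.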
    Terms
  • -  Standard error – when we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error
  • -  Confidence interval – a level C confidence interval for a model parameter is an interval of values usually of the form
    Estimate ± margin of error
    Found from data in such a way that C% of all random samples will yield intervals that capture the true parameter value

  • -  One-proportion z-interval – a confidence interval for the true value of a proportion. The confidence interval is
    ^p ± z*SE(^p)
    Where z* is a critical value from the Standard Normal model corresponding to the specified confidence level

  • -  Margin of error – in a confidence interval the extent of the interval on either side of the observed statistic value is called the margin of error. A margin of error is typically the product of a critical value from the sampling distribution and a standard error from the data. A small margin of error corresponds to a confidence interval that pins down the parameter precisely. A large margin of error corresponds to a confidence interval that gives relatively little
    information about the estimated parameter. For a proportion, $ME = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$
  • -  Critical value – the number of standard errors to move away from the mean of the sampling
    distribution to correspond to the specified level of confidence. The critical value, denoted z*, is usually found from a table or with technology
Chapter 20 Testing Hypotheses about Proportions
- Cracking ingots: in one plant only about 80% of the ingots have been free of cracks -> changes to reduce the cracking proportion -> since then, 400 ingots have been cast and only 17% of them have cracked
Is this natural sampling variability, or evidence to assure management that the true cracking rate is now really below 20%?

Test hypotheses about models
Hypotheses
  • -  Hypotheses are working models that we adopt temporarily
  • -  We assume that the changes have in fact made no difference and that the apparent improvement is just
    random fluctuation (sampling error) -> called the null hypothesis
  • -  The null hypothesis (H0) specifies a population model parameter of interest and proposes a
    value for that parameter
    H0: parameter = hypothesized value
    Ingots: H0: p = 0.20
  • -  The alternative hypothesis, which we denote HA, contains the values of the parameter that we consider plausible if we reject the null hypothesis
    Ingots: management is interested in reducing the cracking rate, so their alternative is HA: p < 0.20
  • -  400 new ingots have been cast -> success/failure condition satisfied and independent -> normal sampling distribution model
  • -  SD(^p) = √(p0q0 / n) = √(0.20 ∗ 0.80 / 400) = 0.02

  • -  With p (0.2) and SD(^p) (0.02) -> we can find out how likely it would be to see the observed value of ^p = 17%
    z = (0.17-0.20)/0.02 = -1.5
    How likely is it to observe a value at least 1.5 standard deviations below the mean of a Normal model? -> 0.067 (table A): the probability of observing a cracking rate of 17% or lower if the true rate were still 20%
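As a check on the ingot arithmetic, here is a minimal sketch that uses Python's standard library in place of Table A. The variable names are my own; the numbers are the ones in the notes.

```python
import math
from statistics import NormalDist

p0 = 0.20      # hypothesized cracking rate (H0: p = 0.20)
n = 400        # ingots cast since the changes
p_hat = 0.17   # observed cracking rate

# Under H0 the standard deviation uses the hypothesized proportion.
sd_p_hat = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd_p_hat

# One-sided P-value for HA: p < 0.20 is the lower-tail area below z.
p_value = NormalDist().cdf(z)

print(round(sd_p_hat, 4), round(z, 2), round(p_value, 3))
```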

A Trial as a Hypothesis Test
  • -  A jury evaluates the evidence in light of the presumption of innocence and judges whether the evidence against the defendant would be plausible if the defendant were in fact innocent
  • -  You must judge for yourself in each situation whether the probability of observing your data is small enough to constitute ‘reasonable doubt’
    P-Values
    • -  We want to find the probability of seeing data like these given that the null hypothesis is true -> P-value
    • -  When the P-value is high, we haven’t seen anything unlikely or surprising at all
  • -  When the P-value is low enough, it says that it’s very unlikely we’d observe data like these if our null hypothesis were true
  • -  When the P-value is not low enough, we fail to reject the null hypothesis
    What to Do with an ‘Innocent’ Defendant
    • -  If there is insufficient evidence to convict the defendant, the jury does not decide that H0 is true and declare the defendant innocent – juries can only fail to reject the null hypothesis and declare the defendant ‘not guilty’
    • -  And we never declare the null hypothesis to be true because we simply do not know whether it’s true or not
      The Reasoning of Hypothesis Testing
1. Hypotheses
  • -  To assess how unlikely our data may be, we need a null model
  • -  The null hypothesis specifies a particular parameter value to use in our model. In the usual
    shorthand, we write H0: parameter = hypothesized value. The alternative hypothesis, HA,
    contains the values of the parameter we consider plausible when we reject the null
2. Model
  • -  Specify the model you will use to test the null hypothesis and the parameter of interest
  • -  State assumptions and check any corresponding conditions
  • -  “Because the conditions are satisfied, I can model the sampling distribution of the proportion
    with a Normal model.”
  • -  “Because the conditions are not satisfied, I can’t proceed with the test.”
  • -  The test about proportions is called a one-proportion z-test
    o We test the hypothesis H0: p = p0 using the statistic z = (^p − p0) / SD(^p)
    o We use the hypothesized proportion to find the standard deviation, SD(^p) = √(p0q0 / n)
3. Mechanics
  • -  Actual calculation
  • -  Obtain a P-value – the probability that the observed statistic value (or an even more extreme value) occurs if the null model is
    correct
4. Conclusion
  • -  Statement about the null hypothesis – either that we reject it or that we fail to reject it
  • -  The size of the effect is always a concern when we test hypotheses – a good way to look at
    the effect size is to examine a confidence interval
    Alternative Alternatives
    • -  Old cracking rate: 20%
    • -  H0:p=0.20
    • -  Someone might be interested in any change in the cracking rate -> HA: p ≠ 0.20
    • -  An alternative hypothesis such as this is known as a two-sided alternative because we are
      equally interested in deviations on either side of the null hypothesis value. For two-sided alternatives, the P-value is the probability of deviating in either direction from the null hypothesis value
    • -  But management is only interested in lowering the cracking rate below 20% -> HA: p < 0.20
  • -  An alternative hypothesis that focuses on deviations from the null hypothesis value in only one direction is called a one-sided alternative
  • -  For a hypothesis test with a one-sided alternative, the P-value is the probability of deviating only in the direction of the alternative away from the null hypothesis value
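The difference between one-sided and two-sided P-values can be sketched as a small helper. The function name and the labels 'less', 'greater' and 'two-sided' are my own conventions, not the book's notation.

```python
from statistics import NormalDist


def p_value_from_z(z, alternative):
    """P-value for a z-statistic under a Standard Normal null model.

    alternative: 'less' (HA: p < p0), 'greater' (HA: p > p0)
    or 'two-sided' (HA: p != p0).
    """
    std_normal = NormalDist()
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "two-sided":
        # Deviations in either direction count, so double one tail.
        return 2 * std_normal.cdf(-abs(z))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# Ingot example: z = -1.5
one_sided = p_value_from_z(-1.5, "less")
two_sided = p_value_from_z(-1.5, "two-sided")
print(round(one_sided, 3), round(two_sided, 3))
```

For the same z, the two-sided P-value is twice the one-sided value in the direction of the observed deviation.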
    P-Values and Decisions: What to Tell About a Hypothesis Test
  • -  How small should the P-value be in order for you to reject the null hypothesis? -> highly context-dependent
  • -  Examples page 487
  • -  The conclusion about any null hypothesis should be accompanied by the P-value of the test
  • -  To complete the analysis, follow your test with a confidence interval for the parameter of
    interest, to report the size of the effect
    Terms
  • -  Null hypothesis – the claim being assessed in a hypothesis test is called the null hypothesis. Usually, the null hypothesis is a statement of “no change from the traditional value”, “no effect”, “no difference” or “no relationship”. For a claim to be a testable null hypothesis, it must specify a value for some population parameter that can form the basis for assuming a sampling distribution for a test statistic
  • -  Alternative hypothesis – the alternative hypothesis proposes what we should conclude if we find the null hypothesis to be unlikely
  • -  P-value – the probability of observing a value for a test statistic at least as far from the hypothesized value as the statistic value actually observed if the null hypothesis is true. A small P-value indicates either that the observation is improbable or that the probability calculation was based on incorrect assumptions. The assumed truth of the null hypothesis is the assumption under suspicion
  • -  One-proportion z-test – a test of the null hypothesis that the proportion of a single sample equals a specified value (H0: p = p0) by referring the statistic z = (^p − p0) / SD(^p) to a Standard Normal model
  • -  Effect size – the difference between the null hypothesis value and the true value of a model
    parameter
  • -  Two-sided alternative – an alternative hypothesis is two-sided (HA: p ≠ p0) when we are
    interested in deviations in either direction away from the hypothesized parameter value
  • -  One-sided alternative – an alternative hypothesis is one-sided (e.g. HA: p > p0 or HA: p < p0)
    when we are interested in deviations in only one direction away from the hypothesized parameter value
    Chapter 21 More About Tests and Intervals
    • -  Florida: no longer are riders 21 and older required to wear helmets
    • -  Police reports of motorcycle accidents: before the change in the helmet law, 60% of youths
      involved in a motorcycle accident had been wearing their helmets; in the three years following the law change, they observed 781 young riders who were involved in accidents, and of these, 50.7% (396) were wearing helmets; we consider these riders to be a representative sample of the larger population
Zero In on the Null
  • -  One good way to identify both the null and alternative hypotheses is to think about the Why of the situation
  • -  The null hypothesis for the Florida study could be that the true rate of helmet use remained the same among young riders after the law changed
  • -  It makes more sense to use what you want to show as the alternative
How to Think About P-Values
  • -  A P-value actually is a conditional probability. It tells us the probability of getting results at least as unusual as the observed statistic, given that the null hypothesis is true
  • -  The P-value is not the probability that the null hypothesis is true – it is a probability about the data
  • -  All we can say is that, given the null hypothesis, there is a 3% chance (P-value of 0.03) of observing a statistic value at least as extreme as the one we have actually observed
    What to do with a High P-value
    • -  E.g. a P-value of 0.793?
    • -  Big P-values just mean that what we’ve observed isn’t surprising
    • -  A big P-value doesn’t prove that the null hypothesis is true, but it certainly offers no evidence
      that it’s not true
    • -  When we see a large P-value, all we can say is that we ‘don’t reject the null hypothesis’
      Alpha Levels
      • -  Sometimes we have to decide whether or not to reject the null hypothesis
      • -  We can define ‘rare event’ arbitrarily by setting a threshold for our P-value. If our P-value
        falls below that point, we’ll reject the null hypothesis. We call such results statistically
        significant. The threshold is called an alpha level
      • -  Common alpha levels are 0.1, 0.05, 0.01 and 0.001
      • -  E.g. assessing safety of air bags -> low alpha level
      • -  E.g. if folks prefer their pizza with or without pepperoni -> alpha = 0.1
      • -  We often choose 0.05
      • -  Choose the alpha level before you look at the data
      • -  The alpha level is also called the significance level – when we reject the null hypothesis, we
        say that the test is ‘significant at that level’
      • -  E.g. we might say that we reject the null hypothesis ‘at the 5% level of significance’
      • -  If the P-value does not fall below alpha -> the data have failed to provide sufficient evidence
        to reject the null hypothesis.
      • -  If the P-value is too high -> “we fail to reject the null hypothesis” (-> there is insufficient
        evidence to conclude that the practitioners are performing better than they would if they were just guessing)
Significant vs. Important
  • -  Statistically significant -> P-value lower than our alpha level
  • -  Don’t be lulled into thinking that statistical significance carries with it any sense of practical
    importance or impact
    Confidence Intervals and Hypothesis Tests

    • -  For the motorcycle helmet example, a 95% confidence interval would give 0.507 ± 1.96 * 0.0179 = (0.472, 0.542) or 47.2% to 54.2% -> if the previous rate had been, say, 50%, it would lie in the interval -> we would not be able to reject that null hypothesis
    • -  In general, a confidence interval with a confidence level of C% corresponds to a two-sided hypothesis test with an alpha level of (100-C)% (e.g. 95% confidence interval -> two-sided hypothesis test at alpha 5%)
    • -  For a one-sided test with alpha 5%, the corresponding confidence interval has a confidence level of 90% – that’s 5% in each tail. In general, a confidence interval with a confidence level of C% corresponds to a one-sided hypothesis test with an alpha level of 1⁄2(100-C)%
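The helmet interval, and the link between intervals and two-sided tests, can be reproduced with a short sketch. Names are my own; the counts 396 and 781 come from the notes.

```python
import math
from statistics import NormalDist

helmeted = 396
n = 781

p_hat = helmeted / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
z_star = NormalDist().inv_cdf(0.975)   # 95% confidence, two-sided

lower = p_hat - z_star * se
upper = p_hat + z_star * se


def rejected_at_5_percent(p0):
    """A two-sided test at alpha = 5% rejects p0 when it lies outside the 95% interval."""
    return not (lower <= p0 <= upper)


print(round(lower, 3), round(upper, 3))
print(rejected_at_5_percent(0.50))   # inside the interval
print(rejected_at_5_percent(0.60))   # the pre-change rate quoted in the notes
```

A hypothesized rate of 50% falls inside the interval, while the 60% rate reported before the law change falls outside it.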
      A Confidence Interval for Small Samples
- When the Success/failure Condition fails, all is not lost – a simple adjustment to the calculation lets us make a 95% confidence interval anyway
  • -  Add four phony observations – two to the successes, two to the failures
  • -  Adjusted proportion: ~p = (y + 2) / (n + 4), where y is the number of successes, and, for convenience, we write ~n = n + 4
  • -  Adjusted interval: ~p ± z* √(~p(1 − ~p) / ~n)
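A minimal sketch of the plus-four adjustment follows. The function name is my own, and the sample of 3 successes in 12 trials is hypothetical, chosen only because it fails the Success/Failure Condition.

```python
import math
from statistics import NormalDist


def plus_four_interval(successes, n, confidence=0.95):
    """Adjusted interval: add two phony successes and two phony failures."""
    n_adj = n + 4
    p_adj = (successes + 2) / n_adj
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin


# Hypothetical small sample: 3 successes out of 12
low, high = plus_four_interval(3, 12)
print(round(low, 3), round(high, 3))
```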
  • -  This adjusted interval is called the Agresti-Coull interval or the ‘plus-four’ interval
Making Errors
  • -  The null hypothesis is true, but we mistakenly reject it (Type I error) – e.g. a healthy person is diagnosed with a disease (the null hypothesis is usually the assumption that a person is healthy)
  • -  The null hypothesis is false, but we fail to reject it (Type II error) – e.g. an infected person is diagnosed as disease free
  • -  Which of these errors is more serious, depends on the situation, the cost, and your point of view
  • -  Page 512
  • -  When you choose level alpha, you’re setting the probability of a Type I error to alpha
  • -  We assign the letter ß to the probability of a Type II error
  • -  We could reduce ß for all alternative parameter values by increasing alpha – but we’d make
    more Type I errors -> tension between Type I and Type II errors
  • -  The only way to reduce both types of error is to collect more evidence or, in statistical terms,
    to collect more data
    Power
  • -  The power of a test is the probability that it correctly rejects a false null
  • -  When the power is high, we can be confident that we’ve looked hard enough
  • -  We know that ß is the probability that a test fails to reject a false null hypothesis, so the
power of the test is the probability that it does reject: 1-ß
Effect Size
  • -  We call the distance between the null hypothesis value (for example), p0, and the truth, p, the effect size
  • -  Not knowing the true value, we estimate the effect size as the difference between the null and observed value
  • -  Small effects -> more Type II errors -> lower power
  • -  The power of a test depends on the size of the effect and the standard deviation
    A Picture Worth 1/P(z > 3.09) Words
    • -  The power of a test is the probability that it rejects a false null hypothesis. The upper figure shows the null hypothesis model. We’d reject the null in a one-sided test if we observed a value of ^p in the red region to the right of the critical value, p*.
      The lower figure shows the true model. If the true value of p is greater than p0, then we’re more likely to observe a value that exceeds the critical value and make the correct decision to reject the null hypothesis. The power of the test is the purple region on the right of the lower figure. Of course, even drawing samples whose observed proportions are distributed around p, we’ll sometimes get a value in the red region on the left and make a Type II error of failing to reject the null.
    • -  Power = 1-ß
    • -  Reducing alpha to lower the chance of committing a Type I error will move the critical value,
      p*, to the right (in this example). This will have the effect of increasing ß, the probability of a
      Type II error, and correspondingly reducing the power.
    • -  The larger the real difference between the hypothesized value, p0, and the true population
      value, p, the smaller the chance of making a Type II error and the greater the power of the test. If the two proportions are very far apart, the two models will barely overlap, and we will not be likely to make any Type II errors at all – but then, we are unlikely to really need a formal hypothesis-testing procedure to see such an obvious difference.
      Reducing Both Type I and Type II Errors
  • -  If we can make both curves narrower (fig. 21.4), then both the probability of Type I errors and the probability of Type II errors will decrease, and the power of the test will increase
  • -  The only way is to reduce the standard deviations by increasing the sample size (pictures of sampling distribution models!). The standard deviation of the sampling distribution model decreases only as the square root of the sample size, so to halve the standard deviations we must quadruple the sample size
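The relationships among alpha, ß, power and sample size can be explored numerically. The sketch below assumes a one-sided test of H0: p = p0 against HA: p > p0 with Normal sampling models, as in the picture described above. The function name and the example values (p0 = 0.5, true p = 0.6) are hypothetical.

```python
import math
from statistics import NormalDist


def power_one_sided(p0, p_true, n, alpha=0.05):
    """Power of the upper-tail one-proportion z-test when the truth is p_true."""
    std_normal = NormalDist()
    # Critical value p*: reject H0 when the observed proportion exceeds it.
    sd_null = math.sqrt(p0 * (1 - p0) / n)
    p_crit = p0 + std_normal.inv_cdf(1 - alpha) * sd_null
    # Under the true model, beta is the chance of landing below p*.
    sd_true = math.sqrt(p_true * (1 - p_true) / n)
    beta = std_normal.cdf((p_crit - p_true) / sd_true)
    return 1 - beta


power_n100 = power_one_sided(0.5, 0.6, n=100)
power_n400 = power_one_sided(0.5, 0.6, n=400)
power_small_alpha = power_one_sided(0.5, 0.6, n=100, alpha=0.01)
print(round(power_n100, 2), round(power_n400, 2), round(power_small_alpha, 2))
```

In this hypothetical setting, quadrupling n raises the power substantially, while lowering alpha at a fixed n lowers it.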
    Terms
  • -  Alpha level – the threshold P-value that determines when we reject a null hypothesis. If we observe a statistic whose P-value based on the null hypothesis is less than alpha, we reject that null hypothesis
  • -  Statistically significant – when the P-value falls below the alpha level, we say that the test is “statistically significant” at that alpha level
  • -  Significance level – the alpha level is also called the significance level, most often in a phrase such as a conclusion that a particular test is ‘significant at the 5% significance level’
  • -  Type I error – the error of rejecting a null hypothesis when in fact it is true (also called a false positive). The probability of a Type I error is alpha
  • -  Type II error – the error of failing to reject a null hypothesis when in fact it is false (false negative). The probability of a Type II error is commonly denoted beta and depends on the effect size
  • -  Power – the probability that a hypothesis test will correctly reject a false null hypothesis is the power of the test. To find power, we must specify a particular alternative parameter value as the true value. For any specific value in the alternative, the power is 1-ß
  • -  Effect size – the difference between the null hypothesis value and true value of a model parameter is called the effect size
    Chapter 22 Comparing Two Proportions
  • -  Male drivers wear seat belts less often than women do
  • -  Men’s belt-wearing jumped more than 16 percentage points when they had a female
    passenger
  • -  Female drivers wore belts more than 70% of the time, regardless of the sex of their
    passengers
  • -  Of 4208 male drivers with female passengers, 2777 (66%) were belted
  • -  Among 2763 male drivers with male passengers only, 1363 (49.3%) wore seat belts
  • -  Shift in men’s risk-taking behavior when women are present?
  • -  What would we estimate the true size of that gap to be?
    Another Ruler
    • -  Difference in the sample: 16.7%
    • -  True difference?
    • -  Difference between the two proportions and its standard deviation?
      Pythagorean Theorem of Statistics (chapter 16):
The variance of the sum or difference of two independent random variables is the sum of their variances
Var(X – Y) = Var(X) + Var(Y), so
SD(X – Y) = √(SD²(X) + SD²(Y)) = √(Var(X) + Var(Y))
- Only applies when X and Y are independent
The Standard Deviation of the Difference between Two Proportions
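- The standard deviations of the sample proportions are SD(^p1) = √(p1q1/n1) and SD(^p2) = √(p2q2/n2), so the variance of the difference in the proportions is
Var(^p1 − ^p2) = (√(p1q1/n1))² + (√(p2q2/n2))² = p1q1/n1 + p2q2/n2
- The standard deviation is the square root of that variance: SD(^p1 − ^p2) = √(p1q1/n1 + p2q2/n2)
- With the sample proportions in place of the unknown true proportions, the standard error is SE(^p1 − ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
- Example page 527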
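To make the standard error concrete, the sketch below applies it to the seat-belt counts from the start of the chapter and builds a 95% interval of the usual form, estimate ± z* × SE. Variable names are my own.

```python
import math
from statistics import NormalDist

# Male drivers with female passengers / with male passengers only
belted_f, n_f = 2777, 4208
belted_m, n_m = 1363, 2763

p_f = belted_f / n_f
p_m = belted_m / n_m
diff = p_f - p_m

# Variances of independent sample proportions add.
se_diff = math.sqrt(p_f * (1 - p_f) / n_f + p_m * (1 - p_m) / n_m)

z_star = NormalDist().inv_cdf(0.975)   # 95% confidence
ci_low = diff - z_star * se_diff
ci_high = diff + z_star * se_diff

print(round(diff, 3), round(se_diff, 4))
print(round(ci_low, 3), round(ci_high, 3))
```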
Assumptions and Conditions
  • -  Independence Assumption: within each group, the data should be based on results for independent individuals
    Randomization condition
    10% condition

  • -  Independent Groups Assumption: the two groups we’re comparing must also be independent of each other
    Sample Size Assumption
  • -  Success/failure condition: both groups are big enough that at least 10 successes and at least
    10 failures have been observed in each
The Sampling Distribution
A two-proportion z-interval:
Confidence interval: (^p1 − ^p2) ± z* × SE(^p1 − ^p2)
where we find the standard error of the difference as
SE(^p1 − ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
The critical value z* depends on the particular confidence level, C, that we specify
Example page 529
Will I Snore When I’m 64?
Of the 995 respondents, 37% of adults reported that they snored at least a few nights a week during the past year
  • -  Split into two age categories, 26% of the 184 people under 30 snored, compared with 39% of the 811 in the older group
  • -  Is this difference of 13% real or due only to natural fluctuations in the sample we’ve chosen?
  • -  Null hypothesis? -> we hypothesize that there is no difference in the proportions
    H0: p1-p2 = 0
    Everyone into the Pool
    • -  SE(^p1 − ^p2) = √(^p1^q1/n1 + ^p2^q2/n2)
    • -  But to do a hypothesis test, we assume that the null hypothesis is true (proportions are equal) -> so there should be just a single value of ^p in the SE formula (and, of course, ^q is just 1-^p)
    • -  Snoring example: overall we saw 48+318 = 366 snorers out of a total of 184+811 = 995 adults who responded to this question -> 0.3678
  • -  Combining the counts like this to get an overall proportion is called pooling
  • -  Pooled proportion (for success): ^ppooled = (Success1 + Success2) / (n1 + n2), where Success1 is the number of successes in group 1 (Success1 = n1 ∗ ^p1)
  • -  We then put this pooled value into the formula, substituting it for both sample proportions in
    the standard error formula:
    SEpooled(^p1 − ^p2) = √(^ppooled^qpooled/n1 + ^ppooled^qpooled/n2)
    = √(0.3678 ∗ 0.6322/184 + 0.3678 ∗ 0.6322/811) = 0.039
Improving the Success/Failure Condition
  • -  We should not refuse to test the effectiveness just because it failed the success/failure condition
  • -  For that reason, in a two-proportion z-test, the proper success/failure test uses the expected frequencies, which we can find from the pooled proportion
  • -  Only 1 case of HPV was diagnosed among 7897 women who received the vaccine, compared to 91 cases diagnosed among 7899 who received a placebo
  • -  ^ppooled = (1 + 91) / (7897 + 7899) = 0.0058
    n1^ppooled = 7899(0.0058) = 46
    n2^ppooled = 7897(0.0058) = 46
    Compared to What?
  • -  We’ll reject our null hypothesis if we see a large enough difference in the two proportions
  • -  Large? We just compare it to its standard deviation (standard error, pooled)
  • -  Since the sampling distribution is Normal, we can divide the observed difference by its
    standard error to get a z-score -> tells us how many SE the observed difference is away from
    0
  • -  Then we can use the 68-95-99.7 Rule
  • -  Result: two proportion z-test
  • -  z = ((^p1 − ^p2) − 0) / SEpooled(^p1 − ^p2)
  • -  When the conditions are met and the null hypothesis is true, this statistic follows the standard Normal model, so we can use that model to obtain a P-value
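Putting the pieces together for the snoring data, here is a hedged sketch of a two-proportion z-test with a pooled standard error. The function name and return layout are my own, and the two-sided P-value is one reasonable choice for H0: p1 − p2 = 0.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test of H0: p1 - p2 = 0 (two-sided)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    q_pooled = 1 - p_pooled

    # Success/failure check on the expected counts under pooling.
    expected = [n1 * p_pooled, n1 * q_pooled, n2 * p_pooled, n2 * q_pooled]
    conditions_ok = min(expected) >= 10

    se_pooled = math.sqrt(p_pooled * q_pooled / n1 + p_pooled * q_pooled / n2)
    z = (p1_hat - p2_hat) / se_pooled
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, se_pooled, p_value, conditions_ok


# Under 30: 48 of 184 snored; 30 and over: 318 of 811 snored
z, se_pooled, p_value, conditions_ok = two_proportion_z_test(48, 184, 318, 811)
print(round(z, 2), round(se_pooled, 3), conditions_ok)
```

The observed difference is more than three standard errors from 0, so the P-value is very small.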
Chapter 23 Inferences About Means
Getting Started
  • -  Motor vehicle crashes resulted in 119 deaths each day
  • -  Speeding is a contributing factor in 31% of all fatal accidents
  • -  Triphammer Road – exceeding 30 miles per hour?
  • -  Interested both in estimating the true mean speed and in testing whether it exceeds the
    posted speed limit
  • -  Quantitative data usually report a value for each individual -> three rules of data analysis:
    plot the data
  • -  Quantitative data -> means and standard deviations; inferences -> sampling distributions
  • -  Confidence intervals, then we add and subtract a margin of error; for proportions: ^p ± ME
  • -  Margin of error –> ^p ± z*SE(^p)

  • -  CLT: SD(ȳ) = σ/√n (example page 552)
  • -  If we don’t know σ, we estimate the population parameter σ with s, the sample standard deviation based on the data; the resulting standard error is SE(ȳ) = s/√n
  • -  Gosset: we need not only to allow for the extra variation with larger margins of error and P-values, but we even need a new sampling distribution model; in fact we need a whole family of models, depending on the sample size, n; these models are unimodal, symmetric, bell-shaped models, but the smaller our sample, the more we must stretch out the tails
    Gosset’s t
    • -  With s/√n, an estimate of the standard deviation, the shape of the sampling model changes -> t-distribution (Student’s t)
    • -  Gosset’s model is always bell-shaped, but the details change with different sample sizes
    • -  So the Student’s t-models form a whole family of related distributions that depend on a
      parameter known as degrees of freedom (df; written as a subscript, tdf)
    A Confidence Interval for Means
  • -  To make confidence intervals or test hypotheses for means: df = n-1
    A Practical Sampling Distribution Model for Means
  • -  t = (ȳ − μ) / SE(ȳ)
  • -  df = n-1
  • -  SE(ȳ) = s/√n
    One-Sample t-Interval for the mean
    • -  ȳ ± t*n-1 × SE(ȳ)
    • -  Critical value depends on the particular confidence level, C, and the number of degrees of
      freedom, n-1
    • -  Example page 554
  • -  Figure 23.2: the t-model (solid curve) on 2 degrees of freedom has fatter tails than the Normal model (dashed curve); so the 68-95-99.7 Rule doesn’t work for t-models with only a few degrees of freedom
  • -  Student’s t-models are unimodal, symmetric, and bell-shaped, just like the Normal
  • -  But t-models with only a few degrees of freedom have much fatter tails than the Normal
  • -  As the degrees of freedom increase, the t-models look more and more like the Normal
  • -  If you know σ, use z; whenever you use s to estimate σ, use t
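A one-sample t-interval can be sketched with the standard library alone if the critical value is taken from Table T. The ten speeds below are hypothetical, not the book's data, and t* = 2.262 is the 95% two-sided table value for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical speeds (mph) for 10 cars
speeds = [29, 34, 34, 28, 30, 29, 38, 31, 29, 34]

n = len(speeds)
df = n - 1
y_bar = statistics.mean(speeds)
s = statistics.stdev(speeds)        # sample standard deviation (n - 1 divisor)
se_mean = s / math.sqrt(n)

T_STAR_95_DF9 = 2.262               # from a t-table: 95% confidence, df = 9

margin = T_STAR_95_DF9 * se_mean
t_lower = y_bar - margin
t_upper = y_bar + margin
print(df, round(y_bar, 2), round(se_mean, 3))
print(round(t_lower, 2), round(t_upper, 2))
```

With a different sample size the hard-coded critical value no longer applies. Look up the row for the new df instead.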
    Assumptions and Conditions (Student’s t-models)
    • -  Independence Assumption – the data values should be independent
      o Randomization condition – the data arise from a random sample or suitably
      randomized experiment
      o 10% Condition - the sample is no more than 10% of the population
    • -  Normal Population Assumption – Student’s t-models won’t work for data that are badly skewed
      o Nearly Normal Condition – the data come from a distribution that is unimodal and symmetric (make a histogram or probability plot); how closely this must hold depends on the sample size:
      •  n<15 or so – the data should follow a Normal model pretty closely
      •  n between 15 and 40 or so – data unimodal and reasonably symmetric
      •  larger than 40 or 50 – the t-methods are safe to use unless the data are
        extremely skewed
    • -  Table T: to find a critical value, locate the row of the table corresponding to the degrees of freedom and the column corresponding to the probability you want
      More Cautions About Interpreting Confidence Intervals
      - “90% of intervals that could be found in this way would cover the true value” or “I am 90% confident that the true mean speed is between 29.5 and 32.5 mph”
      Make A Picture ...
      - Make a histogram of the data and verify that its distribution is unimodal and symmetric and that it has no outliers
      - Make a Normal probability plot to see that it’s reasonably straight
      A Test for the Mean
      • -  Hypothesis test called the one-sample t-test for the mean (example: true mean speed in fact greater than the 30 mph speed limit?)
      • -  The assumptions and conditions for the one-sample t-test for the mean are the same as for the one-sample t-interval
        We test the hypothesis H0: μ = μ0 using the statistic tn-1 = (ȳ − μ0) / SE(ȳ)
      • -  Example page 560
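The one-sample t-test can be sketched the same way. The speeds are hypothetical again. Because the standard library has no t-distribution, this sketch compares the statistic with a table critical value (1.833 for a one-sided test at alpha = 0.05 with 9 df) instead of computing a P-value.

```python
import math
import statistics

# Hypothetical speeds (mph); H0: mu = 30 vs HA: mu > 30
speeds = [29, 34, 34, 28, 30, 29, 38, 31, 29, 34]
mu_0 = 30

n = len(speeds)
se_mean = statistics.stdev(speeds) / math.sqrt(n)
t_stat = (statistics.mean(speeds) - mu_0) / se_mean

T_CRIT_ONE_SIDED_05_DF9 = 1.833     # from a t-table: one tail 0.05, df = 9
reject_null = t_stat > T_CRIT_ONE_SIDED_05_DF9

print(round(t_stat, 3), reject_null)
```

For these made-up data the statistic falls short of the critical value, so we would fail to reject the null hypothesis.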
Finding t-Values by Hand
  • -  Table T: the tables run down the page for as many degrees of freedom as can fit; as the degrees of freedom increase, the t-model gets closer and closer to the Normal, so the tables give a final row with the critical value from the Normal model and label it ∞ df
  • -  If you cannot find a row for the df you need, just use the next smaller df in the table
    Significance and Importance
  • -  Statistically significant does not mean actually important or meaningful
  • -  It is always a good idea when we test a hypothesis to also check the confidence interval and
    think about the likely values for the mean
    Intervals and Tests
  • -  The confidence interval contains all the null hypothesis values we can’t reject with these data
  • -  More precisely, a level C confidence interval contains all of the plausible null hypothesis
    values that would not be rejected by a two-sided hypothesis test at alpha level 1-C; so a 95%
    confidence interval matches a 1-0.95 = 0.05 level two-sided test for these data
  • -  Confidence intervals are naturally two-sided, so they match exactly with two-sided
    hypothesis tests; when the hypothesis is one-sided, the corresponding alpha level is (1-C)/2
    Sample Size
  • -  If we need great precision, however, we’ll want a smaller ME -> larger sample size
  • -  We can solve this equation for n: ME = t*n-1 × s/√n
  • -  Without knowing n, we don’t know the degrees of freedom and we can’t find the critical value, t*n-1 -> use the corresponding z* value
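Solving ME = z* × s/√n for n gives n = (z* × s / ME)². A small sketch follows. The function name is my own, and the values s = 3.2 mph and ME = 1 mph are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(s, margin, confidence=0.95):
    """Approximate n for a target margin of error, using z* in place of t*."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star * s / margin) ** 2)


# Hypothetical planning values: s = 3.2 mph, desired ME = 1 mph
n_for_mean = sample_size_for_mean(3.2, 1.0)
print(n_for_mean)   # 40
```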
*The Sign Test – Back to Yes and No
  • -  Yes (1) and no (0)
  • -  Null hypothesis says that the median is 30; if that null hypothesis were true, we’d expect the
    proportion of cars driving faster than 30 mph to be 0.50; on the other hand, if the median
    speed were greater than 30 mph, we’d expect to see more cars driving faster than 30
  • -  If we test a median by counting the number of values above and below that value, it’s called
    a sign test – the sign test is a distribution free method (example page 567)
  • -  Simpler, fewer assumptions
  • -  Works even when the data have outliers or a skewed distribution
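A sign test reduces to counting. The sketch below is an assumption-laden illustration: it uses hypothetical speeds, drops values equal to the hypothesized median, and computes an exact binomial tail probability, which is one common way to run the test and not necessarily the book's.

```python
from math import comb

# Hypothetical speeds (mph); H0: median = 30 vs HA: median > 30
speeds = [29, 34, 34, 28, 30, 29, 38, 31, 29, 34]
median_0 = 30

above = sum(1 for v in speeds if v > median_0)
below = sum(1 for v in speeds if v < median_0)
n_used = above + below              # values equal to 30 are dropped

# Under H0 each remaining value is above the median with probability 0.5,
# so the P-value is the upper binomial tail P(X >= above).
sign_p_value = sum(comb(n_used, k) for k in range(above, n_used + 1)) / 2 ** n_used

print(above, below, sign_p_value)
```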
    Chapter 24 Comparing Means
    • -  Generic or brand-name batteries?
    • -  Difference in mean lifetimes?
      Plot the Data
    • -  Boxplots of the data for two groups, placed side by side
    • -  Figure 24.1 -> difference large enough? Random fluctuation? -> statistical inference
    Comparing Two Means
  • -  Difference between the mean battery lifetimes of the two brands, μ1 − μ2
  • -  Confidence interval, standard deviation, sampling model
  • -  For independent random variables, the variance of their difference is the sum of their
    individual variances, Var (Y-X)=Var(Y)+Var(X)
  • -  SD(ȳ1 − ȳ2) = √(σ1²/n1 + σ2²/n2)
  • -  SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)
  • -  The confidence interval we build is called a two-sample t-interval (for the difference in means). The corresponding hypothesis test is called a two-sample t-test.
    (ȳ1 − ȳ2) ± ME
    where ME = t* × SE(ȳ1 − ȳ2)
Assumptions and Conditions
  • -  Independence Assumption – the data in each group must be drawn independently and at random from a homogeneous population, or generated by a randomized comparative experiment
    o Randomization Condition
    o 10% Condition
  • -  Normal Population Assumption
    o Nearly Normal Condition – we must check this for both groups; a violation by either one violates the condition
    •  n<15 – you should not use these methods if the histogram or Normal probability plot shows severe skewness
    •  n’s closer to 40 – mildly skewed histogram is OK
    •  n>40 – the CLT applies
  • -  Independent Groups Assumption – to use the two-sample t methods, the two groups we are
    comparing must be independent of each other
    Two-Sample t-interval for the difference between means
  • -  Confidence interval: (ȳ1 − ȳ2) ± t*df × SE(ȳ1 − ȳ2)
  • -  Standard error of the difference of the means: SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)
A Test for the Difference between Two Means
  • -  Two-sample t-test for the difference between means
  • -  Hypothesized difference Δ0 = 0
  • -  We then compare the difference in the means with the standard error of that difference
  • -  Example page 588/589
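The mechanics of the two-sample t-test can be sketched as follows. The two samples are hypothetical. The degrees-of-freedom line uses the Welch-Satterthwaite approximation, a standard formula that these notes do not spell out, so treat it as an assumption of the sketch.

```python
import math
import statistics

# Hypothetical battery lifetimes in minutes
generic = [180, 195, 200, 172, 188, 191]
brand = [205, 198, 210, 215, 202, 208]

n1, n2 = len(brand), len(generic)
var1 = statistics.variance(brand)      # sample variances (n - 1 divisor)
var2 = statistics.variance(generic)

diff_means = statistics.mean(brand) - statistics.mean(generic)
se_diff_means = math.sqrt(var1 / n1 + var2 / n2)

# Test statistic for H0: mu1 - mu2 = 0 (hypothesized difference of 0)
t_two_sample = (diff_means - 0) / se_diff_means

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)

print(round(diff_means, 2), round(se_diff_means, 2))
print(round(t_two_sample, 2), round(welch_df, 1))
```

The resulting df is usually not a whole number. With a table, round it down to the next smaller row.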
Back into the Pool
- For means, there is also a pooled t-test (but knowing that two means are equal doesn’t say anything about whether their variances are equal)

  • -  If we were willing to assume that their variances are equal, we could pool the data from two groups to estimate the common variance; we’d estimate this pooled variance from the data, so we’d still use a Student’s t-model -> pooled t-test (for the difference between means)
  • -  But equal variances are difficult to assume; therefore, use a two-sample t-test instead
    The Pooled t-test
    • -  Equal variance assumption – the variances of the two populations from which the samples have been drawn are equal: σ1² = σ2²
    • -  Similar Spread Condition – looking at the boxplots to check that the spreads are not wildly different
  • -  s²pooled = ((n1 − 1)s1² + (n2 − 1)s2²) / ((n1 − 1) + (n2 − 1))
  • -  SEpooled(ȳ1 − ȳ2) = √(s²pooled/n1 + s²pooled/n2) = spooled √(1/n1 + 1/n2)
  • -  df = n1+n2-2
  • -  substitute the pooled-t estimate of the standard error and its degrees of freedom into the
    steps of the confidence interval or hypothesis test, and you’ll be using the pooled-t method
    Tukey’s Quick Test
  • -  7, 10 and 13
  • -  Basis for the test: boxplots don’t overlap
  • -  To use Tukey’s test, one group must have the highest value and the other, the lowest. We
    just count how many values in the high group are higher than all the values of the lower group. Add to this the number of values in the low group that are lower than all the values of the higher group (count ties as 1⁄2)
  • -  Now if this total is 7 or more, we can reject the null hypothesis of equal means at alpha = 0.05
  • -  The “critical values” of 10 and 13 give us alpha’s of 0.01 and 0.001
  • -  Only assumption: two samples are independent
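The counting rule is easy to mechanize. In this sketch the function name and data are hypothetical, and a value that exactly equals the other group's extreme is treated as a tie worth 1⁄2, which is my reading of the rule above.

```python
def tukey_quick_count(group_a, group_b):
    """Count for Tukey's quick test; raises if the test does not apply."""
    if max(group_a) > max(group_b) and min(group_b) < min(group_a):
        high, low = group_a, group_b
    elif max(group_b) > max(group_a) and min(group_a) < min(group_b):
        high, low = group_b, group_a
    else:
        raise ValueError("one group must hold the highest value and the other the lowest")

    top_of_low = max(low)
    bottom_of_high = min(high)
    count = 0.0
    for value in high:      # high-group values above every low-group value
        if value > top_of_low:
            count += 1
        elif value == top_of_low:
            count += 0.5
    for value in low:       # low-group values below every high-group value
        if value < bottom_of_high:
            count += 1
        elif value == bottom_of_high:
            count += 0.5
    return count


# Hypothetical battery lifetimes in minutes
generic = [180, 195, 200, 172, 188, 191]
brand = [205, 198, 210, 215, 202, 208]

quick_count = tukey_quick_count(generic, brand)
print(quick_count)            # 10.0
print(quick_count >= 7)       # enough to reject equal means at alpha = 0.05
```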
Rank Sum Test
- See lecture slides!

Chapter 25 Paired Samples and Blocks
  • -  Speed-skating races are run in pairs
  • -  Some fans thought there might have been an advantage to starting on the outside
  • -  The data for the races, run two at a time, are not independent
    Paired Data
    • -  Data such as these are called paired
    • -  We can focus on the difference in times for each racing pair
    • -  When pairs arise from an experiment, the pairing is a type of blocking
    • -  When they arise from an observational study, it is a form of matching
  • -  There is no test to determine whether the data are paired – you must determine that from understanding how they were collected and what they mean
  • -  Pairwise differences -> because it is the difference we care about, we’ll treat them as if they were the data, ignoring the original two columns. Now that we have only one column of values to consider, we can use a simple one-sample t-test. Mechanically, a paired t-test is just a one-sample t-test for the means of these pairwise differences (the sample size is the number of pairs)
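A paired t-test is mechanically a one-sample t-test on the differences, which the following sketch makes explicit. The lane times are invented for illustration and `paired_t` is a hypothetical helper name.

```python
import math

def paired_t(first, second):
    """One-sample t statistic on the pairwise differences (second - first)."""
    diffs = [s - f for f, s in zip(first, second)]
    n = len(diffs)  # sample size is the number of pairs
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    se_d = sd_d / math.sqrt(n)
    return mean_d / se_d, n - 1

# Made-up paired measurements (one pair per race)
inner_lane = [10, 12, 9, 11, 13]
outer_lane = [12, 15, 10, 14, 14]

t_stat, df = paired_t(inner_lane, outer_lane)
print(f"t = {t_stat:.3f} with {df} df")
```

Only the column of differences enters the calculation; the two original columns are never compared as independent groups.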
    Assumptions and Conditions
    • -  Paired Data Assumption – the data must be paired (two-sample t methods aren’t valid without independent groups, and paired groups aren’t independent)
    • -  Independence Assumption – the differences must be independent of each other
      o Randomization Condition – focus our attention on where the randomness should be
      o 10% Condition – doesn’t apply to randomized experiments, where no sampling takes place
    • -  Normal Population Assumption – the population of differences follows a Normal model
      o Nearly Normal Condition – can be checked with a histogram or Normal probability plot of the differences - but not of the individual groups
    • -  Example paired t-test page 615
      Confidence Intervals for Matched Pairs
      • -  Married couples, husbands tend to be slightly older than wives
      • -  Data paired, couples at random
      • -  Interested in the mean age difference within couples
      • -  Confidence interval for the true mean difference in ages?
      • -  Example page 618
      Blocking
  • -  A paired design is an example of blocking
  • -  The fact of the pairing determines how many degrees of freedom are available
  • -  Matching pairs generally removes so much extra variation that it more than compensates for
    having only half the degrees of freedom
*The Sign Test Again
  • -  We record a 0 for every paired difference that’s negative and a 1 for each positive difference, ignoring pairs for which the difference is exactly 0
  • -  We test the associated proportion p=0.5 using a z-test
  • -  As with other distribution-free tests, the advantage of the sign test for matched pairs is that
    we don’t require the Nearly Normal Condition for the paired difference
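The sign test described above reduces to a z-test of p = 0.5 on the proportion of positive differences. The sketch below assumes made-up differences and a hypothetical helper `sign_test_z`; the Normal approximation is only reasonable when there are enough non-zero pairs.

```python
import math

def sign_test_z(diffs):
    """z statistic for H0: p = 0.5, ignoring differences that are exactly 0."""
    signs = [1 if d > 0 else 0 for d in diffs if d != 0]
    n = len(signs)
    p_hat = sum(signs) / n
    z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)
    return z, n

# Made-up paired differences for illustration
differences = [2, 3, -1, 4, 0, 5, 1, -2, 3, 2, 6, 1, 0, 2, 4, 3, -1, 2]

z, n_used = sign_test_z(differences)
print(f"z = {z:.2f} based on {n_used} non-zero differences")
```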
    Chapter 26 Comparing Counts
    • -  Zodiac signs of 256 heads of the largest 400 companies
    • -  Are successful people more likely to be born under some signs than others?
Goodness-of-Fit
  • -  Uniformly distributed? -> 1/12 of them under each sign? (256/12 -> 21.3 births per sign)
  • -  How closely do the observed numbers of births per sign fit this simple “null” model?
  • -  A hypothesis test to address this question is called a test of “goodness-of-fit” – it involves
    testing a hypothesis
  • -  Confidence interval doesn’t make sense
  • -  We need a test that includes all 12 hypothesized proportions
    Assumptions and Conditions
    • -  Counted Data Condition – the data must be counts for the categories of a categorical variable
    • -  Independence Assumption – the counts in the cells should be independent of each other
      o Randomization Condition – the individuals who have been counted should be a random sample from the population of interest
    • -  Sample Size Assumption
      o Expected Cell Frequency Condition – we should expect to see at least 5 individuals in each cell
      Calculations
    • -  Difference between these observed and expected counts, denoted (Obs-Exp)
    • -  We divide each squared difference by the expected count for that cell
    • -  The test statistic, called the chi-square statistic, is found by adding up the squared deviations between the observed and expected counts, each divided by its expected count
    • -  $\chi^2 = \sum_{\text{all cells}} \frac{(Obs - Exp)^2}{Exp}$
    • -  The number of degrees of freedom for a goodness-of-fit test is n-1
    • -  n is not the sample size, but instead is the number of categories (12 signs -> 11 df)
      Chi-Square P-Values
    • -  Chi-square statistic is used only for testing hypotheses
    • -  If the observed counts don’t match the expected, the statistic will be large
    • -  This chi-square test is always one-sided
    • -  If the calculated statistic value is large enough, we’ll reject the null hypothesis
    • -  To read the χ2 table (Table X), just find the row for the correct number of degrees of freedom
      and read across to find where your calculated χ2 value falls
    • -  There is no direction to the rejection of the null model; all we know is that it doesn’t fit
    • -  Example page 637
      The Chi-Square Calculation
      1. Find the expected values
      2. Compute the residuals
      3. Square the residuals
      4. Compute the components
      5. Find the sum of the components
      6. Find the degrees of freedom
      7. Test the hypothesis
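The seven steps above can be followed literally in code. This sketch uses four made-up categories rather than the twelve zodiac signs, a hypothetical helper `chi_square_gof`, and the standard table value 7.815 for alpha = 0.05 with 3 degrees of freedom.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic, df, and per-cell components."""
    residuals = [o - e for o, e in zip(observed, expected)]            # step 2
    components = [r ** 2 / e for r, e in zip(residuals, expected)]     # steps 3-4
    return sum(components), len(observed) - 1, components              # steps 5-6

# Made-up counts for illustration; null model is a uniform distribution
observed = [18, 22, 30, 10]
total = sum(observed)
expected = [total / len(observed)] * len(observed)                     # step 1

# Expected Cell Frequency Condition: at least 5 expected in each cell
assert all(e >= 5 for e in expected)

chi2, df, components = chi_square_gof(observed, expected)
CRITICAL_05_DF3 = 7.815  # chi-square table value, alpha = 0.05, 3 df
reject_null = chi2 > CRITICAL_05_DF3                                   # step 7
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject_null}")
```

Note that df is the number of categories minus 1, not the sample size minus 1.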

Table columns: Sign | Observed | Expected | Residual (Obs - Exp) | (Obs - Exp)2 | Component = (Obs - Exp)2 / Exp
But I Believe the Model...
  • -  The hypothesis-testing procedure allows us only to reject the null or fail to reject it
  • -  If you choose uniform as the null hypothesis, you can only fail to reject it; you can never prove that the model is true
    Comparing Observed Distributions
    • -  Example page 641 – whether the plans of students are the same at different colleges
    • -  Two-way table – each cell of the table shows how many students from a particular college
      made a certain choice
    • -  We want to test whether the student’s choices are the same across all four colleges; the z-
      test for two proportions generalizes to a chi-square test of homogeneity
    • -  Here we are asking whether choices are the same among different groups, so we find the
      expected counts for each category directly from the data
    • -  Homogeneity means that things are the same -> we ask whether the post-graduation choices
      made by students are the same for these four colleges
    • -  The homogeneity test comes with a built-in null hypothesis: we hypothesize that the
      distribution does not change from group to group
      Assumptions and Conditions
      • -  Counted Data Condition – the data must be counts
      • -  As long as we don’t want to generalize, we don’t have to check the Randomization Condition
        or the 10% Condition
      • -  Expected Cell Frequency Condition – expected count in each cell must be at least 5
        Calculations
      • -  The expected counts are those proportions applied to the number of students in each graduating class -> fill in expected values for each cell -> check condition -> calculate component for each cell -> sum all components across all cells
      • -  Degrees of freedom = (R-1)(C-1)
      • -  Example page 643
        Examining the Residuals
      • -  Whenever we reject the null hypothesis, it is a good idea to examine residuals
      • -  We need to know each residual’s standard deviation
      • -  To standardize a cell’s residual, we just divide by the square root of its expected value: $c = \frac{Obs - Exp}{\sqrt{Exp}}$
      • -  Notice that these standardized residuals are just the square roots of the components we calculated for each, and their sign indicates whether we observed more cases than we expected, or fewer
- Now that we have subtracted the mean (zero) and divided by their standard deviations, these are z-scores (null hypothesis true? -> CLT and 68-95-99.7 Rule)
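For a two-way table, the expected counts, the chi-square statistic, the (R-1)(C-1) degrees of freedom and the standardized residuals can all be computed together. The following is a minimal sketch with a made-up 2 x 3 table; `chi_square_two_way` is a hypothetical helper name.

```python
import math

def chi_square_two_way(table):
    """Chi-square statistic, df, expected counts and standardized residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Expected count = row total * column total / grand total
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    # Standardized residual c = (Obs - Exp) / sqrt(Exp)
    std_resid = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(table, expected)]
    # The components are the squares of the standardized residuals
    chi2 = sum(c ** 2 for row in std_resid for c in row)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, expected, std_resid

# Made-up counts: 2 groups (rows) by 3 choices (columns)
counts = [[20, 30, 50],
          [30, 30, 40]]

chi2, df, expected, std_resid = chi_square_two_way(counts)
print(f"chi-square = {chi2:.3f}, df = {df}")
for row in std_resid:
    print([round(c, 2) for c in row])
```

The sign of each standardized residual shows whether a cell held more or fewer cases than expected; the arithmetic is the same for homogeneity and independence tests, only the interpretation differs.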
Independence
  • -  Example: whether the risk of hepatitis C was related to whether people had tattoos and to where they got their tattoos (two-way table)
  • -  These data differ from the kinds of data we’ve considered before in this chapter because they categorize subjects from a single group on two categorical variables rather than on only one
  • -  Contingency tables categorize counts on two (or more) variables so that we can see whether the distribution of counts on one variable is contingent on the other
  • -  Independence means that the probability that a randomly selected patient has hepatitis C should not change when we learn the patient’s tattoo status -> if Hepatitis Status is independent of tattoos, we’d expect the proportion of people testing positive for hepatitis to be the same for the three levels of Tattoo Status -> a chi-square test for independence: Are the variables independent?
    Assumptions and Conditions
  • -  We still need counts and enough data so that the expected values are at least 5 in each cell
  • -  In case of independence we want to generalize -> check if it is a representative random
    sample from, and fewer than 10% of, that population
  • -  Example page 648
    Examine the Residuals
  • -  We should examine the residuals because we have rejected the null hypothesis
  • -  Standardize each residual -> sum of their squares = chi-square value
  • -  Figure 26.6 (standardized residuals): a large and positive value (tattoos obtained in a tattoo
    parlor who have hepatitis C) indicates there are more people in that cell than the null hypothesis of independence would predict / a negative value says that there are fewer people in this cell than independence would expect
    Chi-Square and Causation
  • -  Just as correlation between quantitative variables does not demonstrate causation, a failure of independence between two categorical variables does not show a cause-and-effect relationship between them, nor should we say that one variable depends on the other
  • -  Lurking variables can be responsible for the observed lack of independence
    Chapter 27 Inferences for Regression
    • -  %Body Fat plotted against Waist size for a sample of 250 males of various ages (fig. 27.1)
    • -  Equation of the least squares line: $\widehat{\%\text{Body Fat}} = -42.7 + 1.7\,\text{Waist}$
      -> on average, %Body Fat is greater by 1.7 percent for each additional inch around the waist
      The Population and the Sample
  • -  Regression -> straight line; but: not all men who have 38-inch waists have the same %Body Fat (the distribution of 38-inch men is unimodal and symmetric -> fig. 27.2/27.3)
  • -  We want a model -> therefore, an idealized regression line – the model assumes that the means of the distribution of %Body Fat for each Waist size fall along the line, even though the individuals are scattered around it
  • -  $\mu_y = \beta_0 + \beta_1 x$ (model = intercept + slope)
  • -  Model makes errors (ε) – some individuals lie above and some below the line
  • -  $y = \beta_0 + \beta_1 x + \varepsilon$
  • -  We estimate the β’s by finding a regression line, $\hat{y} = b_0 + b_1 x$; the residuals, $e = y - \hat{y}$, are the
    sample-based versions of the errors, ε
    Assumptions and Conditions
  • -  Linearity Assumption
    o Straight Enough Condition – scatterplot looks straight (check by looking at a scatterplot of the residuals against x or against the predicted values)
    o Quantitative Data Condition
  • -  Independence Assumption – the errors in the true underlying regression model must be mutually independent
    o Randomization Condition
  • -  Equal Variance Assumption – the variability of y should be about the same for all values of x
    o Does the Plot Thicken? Condition – check that the spread around the line is nearly constant
  • -  Normal Population Assumption – the errors around the idealized regression line at each value of x follow a Normal model
    o Nearly Normal Condition – at each value of x there is a distribution of y-values that follows a Normal model, and each of these Normal models is centered on the line and has the same standard deviation
    o Outlier Condition
Which Comes First: The Conditions or the Residuals?

  1. Make a scatterplot of the data to check the Straight Enough Condition.
  2. If the data are straight enough, fit a regression and find the residuals, e, and predicted values, $\hat{y}$.
  3. Make a scatterplot of the residuals against x or against the predicted values. This plot should have no pattern. Check in particular for any bend (which would suggest that the data weren’t all that straight after all), for any thickening (or thinning), and, of course, for any outliers.
  4. If the data are measured over time, plot the residuals against time to check for evidence of patterns that might suggest they are not independent.
  5. If the scatterplots look OK, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition.
  6. If all the conditions seem to be reasonably satisfied, go ahead with inference.
Intuition About Regression Inference
  • -  The sample-to-sample variation is what generates the sampling distribution for the coefficients
  • -  3 aspects of the scatterplot affect the standard error of the regression slope:
    o Spread around the line – less scatter around the line means the slope will be more consistent from sample to sample. The spread around the line is measured with the residual standard deviation, $s_e$. You can always find $s_e$ in the regression output, often just labeled s.
      $s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$
      The less scatter around the line, the smaller the residual standard deviation and the stronger the relationship between x and y
    o Spread of the x’s – if $s_x$, the standard deviation of x, is large, it provides a more stable regression
    o Sample size – having a larger sample size, n, gives more consistent estimates from sample to sample
Standard Error for the Slope
  • -  $SE(b_1) = \frac{s_e}{\sqrt{n-1}\,s_x}$
  • -  When we standardize the slopes by subtracting the model mean and dividing by their standard error, we get a Student’s t-model, this time with n-2 degrees of freedom
  • -  $\frac{b_1 - \beta_1}{SE(b_1)} \sim t_{n-2}$
What About the Intercept?
  • -  $\frac{b_0 - \beta_0}{SE(b_0)} \sim t_{n-2}$
Regression Inference
  • -  We can test a hypothesis about it and make confidence intervals
  • -  Usual null hypothesis about the slope is that it’s equal to 0 (would say that y doesn’t tend to change linearly when x changes = no linear association)
  • -  To test $H_0: \beta_1 = 0$, we find $t_{n-2} = \frac{b_1 - 0}{SE(b_1)}$
  • -  A 95% confidence interval for $\beta_1$ is: $b_1 \pm t^*_{n-2} \times SE(b_1)$
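The quantities used for slope inference (the residual standard deviation, the standard error of the slope, the t-ratio and the confidence interval) can be computed directly from their definitions. This sketch uses five made-up data points and a hypothetical helper `slope_inference`; the critical value 3.182 is the standard t-table entry for 95% confidence with 3 degrees of freedom.

```python
import math

def slope_inference(x, y):
    """Least squares fit plus the pieces needed for inference on the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual SD
    s_x = math.sqrt(sxx / (n - 1))                             # SD of the x's
    se_b1 = s_e / (math.sqrt(n - 1) * s_x)
    return {"b0": b0, "b1": b1, "s_e": s_e, "se_b1": se_b1,
            "t": (b1 - 0) / se_b1, "df": n - 2}

# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = slope_inference(x, y)
T_STAR = 3.182  # t-table value, 95% confidence, n - 2 = 3 df
ci = (fit["b1"] - T_STAR * fit["se_b1"], fit["b1"] + T_STAR * fit["se_b1"])
print(f"b1 = {fit['b1']:.3f}, SE(b1) = {fit['se_b1']:.4f}, t = {fit['t']:.2f}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Less scatter (smaller s_e), more spread in x (larger s_x) and a larger n all shrink SE(b1), matching the three aspects listed earlier.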
Another Example
  • -  Contest in which participants try to guess the exact minute that a wooden tripod placed on the frozen Tanana River will fall through the breaking ice
  • -  We cannot use regression to tell the causes of any change – but we can estimate the rate of change (if any) and use it to make better predictions
  • -  Example page 686-689
Standard Errors for Predicted Values
  • -  A confidence interval can tell us how precise that prediction will be
  • -  We can predict the mean %Body Fat for all men whose Waist size is 38 inches with a lot more precision than we can predict the %Body Fat of a particular individual whose Waist size happens to be 38 inches
  • -  We are predicting the value for a new individual, one that was not part of the original data set -> “x sub new” ($x_\nu$)
  • -  Regression equation predicts %Body Fat as $\hat{y}_\nu = b_0 + b_1 x_\nu$
  • -  Now that we have the predicted value, we construct both intervals around this same
    number; both intervals take the form: $\hat{y}_\nu \pm t^*_{n-2} \times SE$ (t* is the same for both)
  • -  Easier to predict a data point near the middle of the data set than far from the center
  • -  $SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}$ (for an individual; for the mean at $x_\nu$, the final $s_e^2$ term is dropped)
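To see why the interval for an individual is wider than the interval for the mean, and why both widen away from the center of the data, the two standard errors can be computed side by side. The data are made up and `prediction_standard_errors` is a hypothetical helper; multiplying each SE by the t* value for n-2 degrees of freedom would give the interval half-widths.

```python
import math

def prediction_standard_errors(x, y, x_new):
    """Predicted value at x_new with the SE for the mean and for an individual."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    s_e_sq = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b1_sq = s_e_sq / sxx  # SE(b1)^2, since (n - 1) * s_x^2 equals sxx
    y_hat_new = b0 + b1 * x_new
    shared = se_b1_sq * (x_new - x_bar) ** 2 + s_e_sq / n
    se_mean = math.sqrt(shared)            # for the mean response at x_new
    se_indiv = math.sqrt(shared + s_e_sq)  # for one new individual at x_new
    return y_hat_new, se_mean, se_indiv

# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

center = prediction_standard_errors(x, y, x_new=3)  # at the mean of x
edge = prediction_standard_errors(x, y, x_new=5)    # far from the center
print("at x = 3:", [round(v, 4) for v in center])
print("at x = 5:", [round(v, 4) for v in edge])
```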
Confidence Intervals for Predicted Values
  • -  Example all men and individual page 690
  • -  The narrower interval is a confidence interval for the predicted mean value at $x_\nu$, and the
    wider interval is a prediction interval for an individual with that x-value
Logistic Regression
  • -  Researchers investigating factors for increased risk of diabetes examined data on 768 adult women of Pima Indian heritage (BMI = weight/height2)
  • -  From the boxplots, we see that the group with diabetes has a higher mean BMI
  • -  BMI as the response and Diabetes as the predictor displayed – but researchers are interested in
    predicting the increased risk of Diabetes due to increased BMI
  • -  Fig. 27.13 dichotomous variable
  • -  Fig. 27.14 treating like quantitative data -> regression line
  • -  Setting all negative probabilities to 0 and all probabilities greater than 1 to 1
  • -  Fig. 27.16 smooth curve models
  • -  There are many curves in mathematics with shapes like this that we might use for our model.
    One of the most common is the logistic curve -> logistic regression
  • -  $\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 x$
  • -  When p is a probability, p/(1-p) is the odds in favor of a success
    When the probability of success, p, = 1/3, we’d get the ratio (1/3)/(2/3) = 1/2
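The link between a probability, its odds and the logistic curve can be checked numerically. In this sketch the coefficients `B0` and `B1` are hypothetical values chosen for illustration (they are not the fitted values from the diabetes study), and the function names are my own.

```python
import math

def odds(p):
    """Odds in favor of success for a probability p."""
    return p / (1 - p)

def log_odds(p):
    """The left-hand side of the logistic regression equation."""
    return math.log(odds(p))

def logistic(x, b0, b1):
    """Predicted probability from the logistic curve; always between 0 and 1."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(odds(1 / 3))  # the 1/2 from the example above, up to floating-point rounding

# Hypothetical coefficients, for illustration only
B0, B1 = -4.0, 0.1
p_hat = logistic(30, B0, B1)
print(f"predicted probability at x = 30: {p_hat:.4f}")
print(f"log-odds recovered from p_hat:   {log_odds(p_hat):.4f}")
```

Because the curve stays strictly between 0 and 1, there is no need to clip negative probabilities to 0 or large ones to 1 as with the straight-line fit.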
    Chapter 30 Multiple Regression
    Just Do It
    • -  A regression with two or more predictor variables is called a multiple regression
    • -  For simple regression, we found the Least Squares solution, the one whose coefficients made
      the sum of the squared residuals as small as possible. For multiple regression, we’ll do the
      same thing but this time with more coefficients
    • -  R2 gives the fraction of the variability of %Body Fat accounted for by the multiple regression model
  • -  Degrees of freedom is the number of observations minus 1 for each coefficient estimated
  • -  $\widehat{\%\text{Body Fat}} = -3.10 + 1.77\,\text{Waist} - 0.60\,\text{Height}$
  • -  residuals = %Body Fat - $\widehat{\%\text{Body Fat}}$
    So, What’s New?
    • -  The meaning of the coefficients in the regression model has changed in a subtle but important way
    • -  Multiple regression is an extraordinarily versatile calculation, underlying many widely used Statistics methods
    • -  Offers first glimpse into statistical models that use more than two quantitative variables
    What Multiple Regression Coefficients Mean
  • -  Fig. 30.1 scatterplot of %Body Fat against Height -> little relationship between these variables
  • -  The multiple regression coefficient of Height takes account of the other predictor, Waist size, in the regression model
  • -  Only looking at all men whose waist size is about 37 inches -> negative relationship between Height and %Body Fat because taller men probably have less body fat than shorter men who have the same waist size
  • -  For men with that waist size, an extra inch of height is associated with a decrease of about 0.60% in body fat
  • -  Looking on all waist sizes at the same time? -> plotting the residuals of %Body Fat after a regression on Waist size against the residuals of Height after regressing it on Waist size (“partial regression plot”) -> showing the relationship of %Body Fat to Height after removing the linear effects of Waist size
  • -  A partial regression plot for a particular predictor has a slope that is the same as the multiple regression coefficient for that predictor. It also has the same residuals as the full multiple regression, so you can spot any outliers or influential points and tell whether they’ve affected the estimation of this particular coefficient
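The claim that the partial regression plot has the same slope as the multiple regression coefficient can be verified numerically. This is a sketch under stated assumptions: the six data rows are made up, the helper names are hypothetical, and the two-predictor fit solves the centered normal equations directly rather than using a statistics package.

```python
def simple_fit(x, y):
    """Simple least squares fit; returns intercept, slope and residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def two_predictor_fit(x1, x2, y):
    """Least squares coefficients for y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Made-up data for illustration only
waist = [32, 34, 36, 38, 40, 42]
height = [68, 72, 69, 74, 70, 73]
body_fat = [12.0, 13.5, 19.0, 19.5, 26.0, 27.5]

_, b_waist, b_height = two_predictor_fit(waist, height, body_fat)

# Partial regression: remove the linear effect of waist from both variables
_, _, fat_resid = simple_fit(waist, body_fat)
_, _, height_resid = simple_fit(waist, height)
_, partial_slope, _ = simple_fit(height_resid, fat_resid)

print(f"multiple regression coefficient of height: {b_height:.4f}")
print(f"slope of the partial regression plot:      {partial_slope:.4f}")
```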
    The Multiple Regression Model
  • -  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
Assumptions and Conditions
  • -  Linearity Assumption
    o Straight Enough Condition for each of the predictors
  • -  Independence Assumption
    o Randomization Condition
  • -  Equal Variance Assumption – the variability of the errors should be the same for all values of each predictor
    o Does the Plot Thicken? Condition – scatterplots of the regression residuals against each x or against the predicted values, $\hat{y}$, offer a visual check
  • -  Normality Assumption – errors around the idealized regression model at any specified values of the x-variables follow a Normal model
    o Nearly Normal Condition
  • -  Summary of checking conditions
    o Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable
    o If the scatterplots are straight enough, fit a multiple regression model to the data
    o Find the residuals and predicted values
    o Make a scatterplot of the residuals against the predicted values. This plot should look patternless. Check in particular for any bend and for any thickening
    o Suitable randomization used? Representative of some identifiable population? Check whether the data are not independent by plotting the residuals against time to look for patterns
    o Interpretation and prediction
    o If you wish to test hypotheses about the coefficients or about the overall regression, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition
Multiple Regression Inference I: I Thought I Saw an ANOVA Table...
  • -  Is this multiple regression model any good at all?
  • -  If all the coefficients (except the intercept) were zero, we’d have $\hat{y} = b_0 + 0x_1 + ... + 0x_k$, and we’d just set $b_0 = \bar{y}$
  • -  $H_0: \beta_1 = \beta_2 = ... = \beta_k = 0$
  • -  We can test this hypothesis with a statistic that is labeled with the letter F – bigger F-values mean smaller P-values
Multiple Regression Inference II: Testing the Coefficients
  • -  Only if we reject this null hypothesis can we move on to check the test statistics for the individual coefficients
  • -  For each coefficient, we test $H_0: \beta_j = 0$ against the (two-sided) alternative that it isn’t zero; the regression table gives a standard error for each coefficient and the ratio of the estimated coefficient to its standard error
  • -  If the assumptions and conditions are met, these ratios follow a Student’s t-distribution: $t_{n-k-1} = \frac{b_j - 0}{SE(b_j)}$
  • -  The degrees of freedom is the number of data values minus the number of predictors, minus 1 for the intercept (n-k-1)
  • -  CI in the usual way (estimate ± margin of error); margin of error is just the product of the
    standard error and a critical value -> CI for $\beta_j$: $b_j \pm t^*_{n-k-1} \times SE(b_j)$
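The t-ratio and confidence interval for a single coefficient only need the estimate, its standard error and the degrees of freedom. In the sketch below the coefficient -0.60 and n = 250 come from the notes above, but the standard error 0.11 and the critical value 1.97 (an approximate 95% table value for 247 df) are assumptions for illustration, and `coefficient_inference` is a hypothetical helper.

```python
def coefficient_inference(b_j, se_j, n, k, t_star):
    """t-ratio, degrees of freedom and confidence interval for one coefficient."""
    df = n - k - 1                 # data values minus predictors minus intercept
    t_ratio = (b_j - 0) / se_j     # tests H0: beta_j = 0
    margin = t_star * se_j         # margin of error
    return t_ratio, df, (b_j - margin, b_j + margin)

# SE and t* are assumed values, not taken from the textbook output
t_ratio, df, ci = coefficient_inference(b_j=-0.60, se_j=0.11, n=250, k=2, t_star=1.97)
print(f"t = {t_ratio:.2f} on {df} df, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```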
    How’s That, Again?
  • -  $y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon$
  • -  Wrong conclusion: that each $\beta_j$ tells us the effect of its associated predictor, $x_j$, on the
    response variable, y
    Another Example: Modeling Infant Mortality

  • -  Variables available: child deaths, percent of teens who drop out of high school, percent of low-birth-weight babies, teen births, and teen deaths by accident, homicide, and suicide
  • -  All variables have no outliers and Nearly Normal distributions
  • -  One useful way to check many of our conditions is with a scatterplot matrix (fig. 30.6) -> array
    of scatterplots set up so that the plots in each row have the same variable on their y-axis and
    those in each column have the same variable on their x-axis
  • -  On the diagonal, rather than plotting a variable against itself, you’ll usually find either a
    Normal probability plot or a histogram of the variable to help us assess the Nearly Normal
    Condition
  • -  Example page 797
    Comparing Multiple Regression Models
    • -  How do we know that some other choice of predictors might not provide a better model?
    • -  Many people look at the R2 value, and certainly we are not likely to be happy with a model
      that accounts for only a small fraction of the variability of y
    • -  Keep in mind that the meaning of a regression coefficient depends on all the other predictors
      in the model, so it is best to keep the number of predictors as small as possible
    • -  Predictors that are easy to understand are usually better choices than obscure variables
      Adjusted R2
      • -  The adjusted R2 statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R2 can’t go down and will most likely go up
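      • -  We can write a formula for R2 using the sums of squares in the ANOVA table portion of the
        regression output table:
        $R^2 = \frac{SS_{Regression}}{SS_{Regression} + SS_{Residual}} = 1 - \frac{SS_{Residual}}{SS_{Total}}$
      • -  Adjusted R2 simply substitutes the corresponding Mean Squares for the SS’s: $R^2_{adj} = 1 - \frac{MS_{Residual}}{MS_{Total}}$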
      • -  Because the Mean Squares are Sums of Squares divided by degrees of freedom, they are
        adjusted for the number of predictors in the model
      • -  As a result, the adjusted R2 value won’t necessarily increase when a new predictor is added
      • -  It no longer tells the fraction of variability accounted for by the model 
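The difference between R2 and adjusted R2 is easiest to see with numbers. The sums of squares below are made up for illustration: adding a third predictor lowers the residual sum of squares slightly, so R2 rises, but the lost degree of freedom makes adjusted R2 fall. The function names are hypothetical.

```python
def r_squared(ss_residual, ss_total):
    """Fraction of the variability accounted for by the model."""
    return 1 - ss_residual / ss_total

def adjusted_r_squared(ss_residual, ss_total, n, k):
    """Same form as R^2, but with Mean Squares (SS divided by their df)."""
    ms_residual = ss_residual / (n - k - 1)
    ms_total = ss_total / (n - 1)
    return 1 - ms_residual / ms_total

N, SS_TOTAL = 50, 1000.0  # made-up values

# Model A: 2 predictors; Model B: 3 predictors with a barely smaller residual SS
r2_a, adj_a = r_squared(300.0, SS_TOTAL), adjusted_r_squared(300.0, SS_TOTAL, N, 2)
r2_b, adj_b = r_squared(299.0, SS_TOTAL), adjusted_r_squared(299.0, SS_TOTAL, N, 3)

print(f"Model A: R2 = {r2_a:.4f}, adjusted R2 = {adj_a:.4f}")
print(f"Model B: R2 = {r2_b:.4f}, adjusted R2 = {adj_b:.4f}")
```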












Sunday, January 15, 2017

 EVERYTHING YOU NEED TO KNOW ABOUT POKEMON



POKEMON is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures. The franchise copyright is shared by all three companies, but Nintendo is the sole owner of the trademark. The franchise was created by Satoshi Tajiri in 1995, and is centered on fictional creatures called "Pokémon", which humans, known as Pokémon Trainers, catch and train to battle each other for sport.
The franchise began as a pair of video games for the original Game Boy that were developed by Game Freak and published by Nintendo. It now spans video games, trading card games, animated television shows and movies, comic books, and toys. Pokémon is the second best-selling video game franchise, behind only Nintendo's Mario franchise, and one of the highest-grossing media franchises of all time.
Cumulative sales of the video games have reached more than 280 million copies. As of May 2016, the Pokémon franchise has grossed revenues of ¥4.8 trillion worldwide.
The franchise celebrated its tenth anniversary in 2006. 2016 marks the 20th anniversary of the release of the original games, with the company celebrating by airing an ad during Super Bowl 50, issuing re-releases of Pokémon Red, Blue, and Yellow and completely redesigning the way the games are played. The mobile augmented reality game Pokémon Go was released in July 2016. The first seventh-generation games Pokémon Sun and Moon were released worldwide on November 18, 2016. A live-action film adaptation based on Great Detective Pikachu is planned to start production in 2017.
Name
The name Pokémon is the romanized contraction of the Japanese brand Pocket Monsters. The term Pokémon, in addition to referring to the Pokémon franchise itself, also collectively refers to the 802 known fictional species that have made appearances in Pokémon media as of the release of the seventh generation titles Pokémon Sun and Moon. "Pokémon" is identical in both the singular and plural, as is each individual species name; it is grammatically correct to say "one Pokémon" and "many Pokémon", as well as "one Pikachu" and "many Pikachu".
Concept
Tajiri first thought of Pokémon, albeit with a different concept, around 1989 or 1990, when the Game Boy was first released. The concept of the Pokémon universe, in both the video games and the general fictional world of Pokémon, stems from the hobby of insect collecting, a popular pastime which Pokémon executive director Satoshi Tajiri enjoyed as a child. Players are designated as Pokémon Trainers and have two general goals: complete the Pokédex by collecting all of the available Pokémon species found in the fictional region where a game takes place, and train a team of powerful Pokémon from those they have caught to compete against teams owned by other Trainers so they may eventually win the fictional Pokémon League and become the regional Champion. These themes of collecting, training, and battling are present in almost every version of the Pokémon franchise, including the video games, the anime and manga series, and the Pokémon Trading Card Game.
In most incarnations of the fictional Pokémon universe, a Trainer who encounters a wild Pokémon is able to capture that Pokémon by throwing a specially designed, mass-producible spherical tool called a Poké Ball at it. If the Pokémon is unable to escape the confines of the Poké Ball, it is officially considered to be under the ownership of that Trainer. Afterwards, it will obey whatever its new master commands, unless the Trainer demonstrates such a lack of experience that the Pokémon would rather act on its own accord. Trainers can send out any of their Pokémon to wage non-lethal battles against other Pokémon; if the opposing Pokémon is wild, the Trainer can capture that Pokémon with a Poké Ball, increasing his or her collection of creatures. Pokémon already owned by other Trainers cannot be captured, except under special circumstances in certain games. If a Pokémon fully defeats an opponent in battle so that the opponent is knocked out, the winning Pokémon gains experience points and may level up. When leveling up, the Pokémon's statistics  of battling aptitude increase, such as Attack and Speed. From time to time the Pokémon may also learn new moves, which are techniques used in battle. In addition, many species of Pokémon can undergo a form of metamorphosis and transform into a similar but stronger species of Pokémon, a process called evolution.
In the main series, each game's single-player mode requires the Trainer to raise a team of Pokémon to defeat many non-player character (NPC) Trainers and their Pokémon. Each game lays out a somewhat linear path through a specific region of the Pokémon world for the Trainer to journey through, completing events and battling opponents along the way. Each game features eight especially powerful Trainers, referred to as Gym Leaders, that the Trainer must defeat in order to progress. As a reward, the Trainer receives a Gym Badge, and once all eight badges are collected, that Trainer is eligible to challenge the region's Pokémon League, where four immensely talented trainers (the Elite Four) challenge the Trainer to four Pokémon battles in succession. If the Trainer can overcome this gauntlet, he or she must then challenge the Regional Champion, the master Trainer who had previously defeated the Elite Four. Any Trainer who wins this last battle becomes the new champion. In Sun and Moon, however, the Gym Leaders are not present, and are instead replaced with "Trial Captains", NPCs who each give the Trainer a challenge to complete so as to earn a special item. Once the player completes all of these on an island, the Trainer must take on the Island Kahuna, the strongest Trainer on the island. Once the player beats all the Kahunas, they must travel to the recently built Pokémon League, where they must re-defeat the Kahunas, who now form the Elite Four, and then defend their title against challengers.
Video games
Generations  
The original Pokémon games were role-playing games (RPGs) with an element of strategy, and were created by Satoshi Tajiri for the Game Boy. These RPGs, and their sequels, remakes, and English language translations, are still considered the "main" Pokémon games, and the games which most fans of the series are referring to when they use the term "Pokémon games". All of the licensed Pokémon properties overseen by The Pokémon Company International are divided roughly by generation. These generations are roughly chronological divisions by release; every several years, when an official sequel in the main RPG series is released that features new Pokémon, characters, and gameplay concepts, that sequel is considered the start of a new generation of the franchise. The main games and their spin-offs, the anime, manga, and trading card game are all updated with the new Pokémon properties each time a new generation begins. The franchise began the seventh generation on November 18, 2016.
Generation 1  
The Pokémon franchise started off in its first generation with its initial release of Pocket Monsters Aka and Midori (Red and Green) for the Game Boy in Japan on February 27, 1996. When these games proved extremely popular, an enhanced Ao (Blue) version was released sometime after, and the Ao version was reprogrammed as Pokémon Red and Blue for international release. The games launched in the United States on September 30, 1998. The original Aka and Midori versions were never released outside Japan. Afterwards, a further enhanced version titled Pokémon Yellow: Special Pikachu Edition was released to partially take advantage of the color palette of the Game Boy Color, as well as to feature more elements from the popular Pokémon anime. This first generation of games introduced the original 151 species of Pokémon, in National Pokédex order, encompassing all Pokémon from Bulbasaur to Mew. It also introduced the basic game concepts of capturing, training, battling, and trading Pokémon with both computer and human players. These versions of the games take place within the fictional Kanto region, inspired by the real world Kantō region of Japan, though the name "Kanto" was not used until the second generation.
Generation 2  
The second generation of Pokémon began in 1999 with the release of Pokémon Gold and Silver for Game Boy Color. Like the previous generation, an enhanced version titled Pokémon Crystal was later released. The second generation introduced 100 new species of Pokémon, starting with Chikorita and ending with Celebi. It totaled 251 Pokémon to collect, train, and battle, set in Johto, inspired by Japan's Kansai region. The Pokémon mini is a handheld game console released in November 2001 in North America, December 2001 in Japan, and 2002 in Europe.
Generation 3  
Pokémon entered its third generation with the 2002 release of Pokémon Ruby and Sapphire for Game Boy Advance and continued with the Game Boy Advance remakes of Pokémon Red and Blue, titled Pokémon FireRed and LeafGreen, and an enhanced version of Pokémon Ruby and Sapphire titled Pokémon Emerald. The third generation introduced 135 new Pokémon, starting with Treecko and ending with Deoxys, for a total of 386 species. It is set in Hoenn, inspired by Japan's Kyushu region. However, this generation also garnered some criticism for leaving out several gameplay features, including the day-and-night system introduced in the previous generation. It was also the first installment that encouraged the player to collect merely a selected assortment of the total number of Pokémon rather than every existing species: only 202 of the 386 species are catchable in the Ruby and Sapphire versions.
Generation 4  
In 2006, Japan began the fourth generation of the franchise with the release of Pokémon Diamond and Pearl for Nintendo DS. The fourth generation introduced another 107 new species of Pokémon, starting with Turtwig and ending with Arceus, bringing the total of Pokémon species to 493. The Nintendo DS "touch screen" allows new features to the game such as cooking poffins with the stylus and using the "Pokétch". New gameplay concepts include a restructured move-classification system, online multiplayer trading and battling via Nintendo Wi-Fi Connection, the return and expansion of the second generation's day-and-night system, the expansion of the third generation's Pokémon Contests into "Super Contests", and the new region of Sinnoh. This region was inspired by Japan's Hokkaido region and part of Russia's Sakhalin, and has an underground component for multiplayer gameplay in addition to the main overworld. Pokémon Platinum, the enhanced version of Diamond and Pearl—much like Pokémon Yellow, Crystal, and Emerald—was released in September 2008 in Japan, March 2009 in North America, and May 2009 in Australia and Europe. Spin-off titles in the fourth generation include the Pokémon Stadium follow-up Pokémon Battle Revolution for Wii, which has Wi-Fi connectivity as well. Nintendo announced in May 2009 that enhanced remakes of Pokémon Gold and Silver, entitled Pokémon HeartGold and SoulSilver, would be released for the Nintendo DS system. HeartGold and SoulSilver are set in the Johto region and were released in September 2009 in Japan.
Generation 5  
The fifth generation of Pokémon began on September 18, 2010, with the release of Pokémon Black and White in Japan for Nintendo DS. The games were originally announced by the Pokémon Company on January 29, 2010, with a tentative release later that year. The final release date of September 18 was announced on June 27, 2010. This version is set in the Unova region, inspired by New York City, and utilizes the Nintendo DS's 3-D rendering capabilities to a greater extent than Platinum, HeartGold, and SoulSilver, as shown in game footage of the player walking through the metropolis. A total of 156 new Pokémon were introduced, starting with Victini and ending with Genesect, bringing the franchise's total to 649. This is currently the only time that the number of Pokémon introduced has surpassed the number introduced in the first generation. It also deployed new game mechanics such as the wireless interactivity features and the ability to upload game data to the Internet and to the player's own computer. Pokémon Black and White was released in Europe on March 4, 2011, in North America on March 6, 2011, and in Australia on March 10, 2011. On June 23, 2012, Nintendo released Pokémon Black 2 and Pokémon White 2 in Japan for Nintendo DS, with early October releases in North America and Europe. Black 2 and White 2 are sequels to Black and White, with several events in the second games referencing events in the first; they also allow players to link their previous Black or White with their Black 2 or White 2, introducing several events based on how they played their previous game.
Generation 6  
Officially announced on January 8, 2013, and released simultaneously worldwide on October 12, 2013, Pokémon X and Y for the Nintendo 3DS are part of the sixth generation of games. Introducing the France-inspired Kalos region, these are the first Pokémon games rendered in 3D, and the first released worldwide together.
A total of 72 new Pokémon were introduced, starting with Chespin and ending with Volcanion, bringing the franchise's total to 721. This is the fewest new Pokémon in a single generation so far; however, the new Mega Evolution feature was added to the games to balance out the lack of new characters. Another addition was the Fairy type, the first new type since Dark and Steel in the second generation. On May 7, 2014, Nintendo announced remakes of the third generation games Pokémon Ruby and Sapphire, titled Pokémon Omega Ruby and Alpha Sapphire, which were released in Japan, North America, Australia, and South Korea on November 21, 2014, and in Europe on November 28, 2014.
Generation 7  
Officially announced on February 26, 2016, Pokémon Sun and Moon for the Nintendo 3DS are part of the seventh generation of games and of the celebrations for the 20th anniversary of the franchise, introducing the Hawaii-inspired Alola region. Both games were released worldwide on November 18, 2016 in nine languages: Japanese, English, French, Italian, German, Spanish, Korean, and, for the first time, Chinese. A total of 81 new Pokémon were introduced, bringing the total to 802. Though no new Mega Evolutions were added, a new type of form was added for specific Pokémon, changing their types and move sets. A new type of move was added as well, called the Z-move. Usable by any Pokémon, Z-moves are extremely powerful and as such can only be used once per battle.
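The running totals quoted in the generation summaries above can be checked with a little arithmetic. The following is only a minimal sketch: the per-generation counts are taken from the text, while the names NEW_SPECIES and running_totals are made up for this illustration.

```python
from itertools import accumulate

# New species introduced in generations 1 through 7, as quoted in the text above.
NEW_SPECIES = [151, 100, 135, 107, 156, 72, 81]


def running_totals(new_species):
    """Return the cumulative species count after each generation."""
    return list(accumulate(new_species))


if __name__ == "__main__":
    totals = running_totals(NEW_SPECIES)
    for gen, (new, total) in enumerate(zip(NEW_SPECIES, totals), start=1):
        print(f"Generation {gen}: {new} new species, {total} in total")
```

The cumulative sums reproduce the totals given in the text (251, 386, 493, 649, 721, and 802), and the list also shows that the fifth generation is the only one to exceed the first generation's 151 new species, while the sixth adds the fewest.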
Game mechanics  
The main staple of the Pokémon video game series is the catching and battling of Pokémon. Beginning with a starter Pokémon, the player can catch wild Pokémon by weakening them and catching them with Poké Balls. Alternatively, they can choose to defeat them in battle in order to gain experience for their Pokémon, raising their levels and teaching them new moves. Certain Pokémon can evolve into more powerful forms by raising their levels or using certain items. Throughout the game, players will have to battle other trainers in order to progress, with the main goal to defeat various Gym Leaders and earn the right to become a tournament champion. Subsequent games in the series have introduced various side games and side quests, including the Battle Frontiers that display unique battle types and the Pokémon Contests where visual appearance is put on display.
Starter Pokémon  
One of the consistent aspects of the Pokémon games, from Pokémon Red and Blue on the Game Boy to the Nintendo 3DS games Pokémon Sun and Moon, is the choice of one of three different Pokémon at the start of the player's adventures; these three are often labeled "starter Pokémon". Players can choose a Grass-type, a Fire-type, or a Water-type. For example, in Pokémon Red and Blue, the player has the choice of starting with Bulbasaur, Charmander, or Squirtle. The exception to this rule is Pokémon Yellow, where players are given a Pikachu, an Electric-type mouse Pokémon, famous for being the mascot of the Pokémon media franchise; in this game, however, the three starter Pokémon from Red and Blue can be obtained during the quest by a single player, something that is not possible in any other installment of the franchise. Another consistent aspect is that the player's rival will always choose as his or her starter Pokémon the one that has a type advantage over the player's Pokémon. For instance, if the player picks a Grass-type Pokémon, the rival will always pick the Fire-type starter. An exception to this is again Pokémon Yellow, in which the rival picks an Eevee, but whether this Eevee evolves into Jolteon, Vaporeon, or Flareon is decided by when the player wins and loses to the rival through the journey. Pokémon Sun and Moon are also an exception, where the rival will pick the starter that is weak to the player's, with the remaining starter used elsewhere. The GameCube games Pokémon Colosseum and Pokémon XD: Gale of Darkness also contain an exception; whereas in most games the player's initial Pokémon starts at Level 5, in these two games the player's initial Pokémon starts at Levels 10 and 25, respectively. In Colosseum the player's starter Pokémon are Espeon and Umbreon, while in Gale of Darkness the player's starter is Eevee.
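The rival's choice described above follows a simple rule over the three starter types. The sketch below illustrates that rule only; it is not how the games implement it. The BEATS table encodes the usual Fire over Grass, Grass over Water, Water over Fire relationship, and the function name rival_starter and its weak_rival flag are hypothetical names chosen for this example. Pokémon Yellow, where the rival receives an Eevee, is not covered.

```python
# Type advantage among the three starter types: each key beats its value.
BEATS = {"Fire": "Grass", "Grass": "Water", "Water": "Fire"}


def rival_starter(player_type, weak_rival=False):
    """Return the starter type the rival picks for a given player choice.

    weak_rival=False models the usual rule: the rival takes the type that
    has an advantage over the player's starter.
    weak_rival=True models the Sun and Moon exception: the rival takes the
    type that is weak to the player's starter.
    """
    if player_type not in BEATS:
        raise ValueError(f"not a starter type: {player_type!r}")
    if weak_rival:
        # The player's type beats this one.
        return BEATS[player_type]
    # Find the type that beats the player's type.
    return next(t for t, beaten in BEATS.items() if beaten == player_type)


if __name__ == "__main__":
    for choice in ("Grass", "Fire", "Water"):
        print(choice, "->", rival_starter(choice), "/",
              rival_starter(choice, weak_rival=True))
```

In both cases exactly one of the three starters is left over, which matches the text's remark that in Sun and Moon the remaining starter is used elsewhere.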
Pokédex  
The Pokédex is a fictional electronic device featured in the Pokémon video game and anime series. In the games, whenever a Pokémon is first captured, its data will be added to a player's Pokédex, but in the anime or manga, the Pokédex is a comprehensive electronic reference encyclopedia, usually referred to in order to deliver exposition. "Pokédex" is also used to refer to a list of Pokémon, usually a list of Pokémon by number. In the video games, a Pokémon Trainer is issued a blank device at the start of the journey. A trainer must then attempt to fill the Pokédex by encountering and at least briefly obtaining each of the different species of Pokémon. A player will receive the name and image of a Pokémon after encountering one that was not previously in the Pokédex, typically after battling said Pokémon either in the wild or in a trainer battle. In Pokémon Red and Blue, some Pokémon's data is added to the Pokédex simply by viewing the Pokémon, such as in the zoo outside of the Safari Zone. Also, certain NPC characters may add to the Pokédex by explaining what a Pokémon looks like during conversation. More detailed information is available after the player obtains a member of the species, either through capturing the Pokémon in the wild, evolving a previously captured Pokémon, hatching a Pokémon egg, or through a trade with another trainer. This information includes height, weight, species type, and a short description of the Pokémon. Later versions of the Pokédex have more detailed information, like the size of a certain Pokémon compared to the player character, or Pokémon being sorted by their habitat. The most current forms of Pokédex are capable of containing information on all Pokémon currently known. The GameCube games, Pokémon Colosseum and Pokémon XD: Gale of Darkness, have a Pokémon Digital Assistant, which is similar to the Pokédex, but also tells what types are effective against a Pokémon and gives a description of its abilities.
In other media
Anime series  
The Pokémon anime series and films are a meta-series of adventures separate from the canon that most of the Pokémon video games follow. The anime follows the quest of the main character, Ash Ketchum, a Pokémon Master in training, as he and a small group of friends travel around the fictitious world of Pokémon along with their Pokémon partners. The original series, titled Pocket Monsters, or simply Pokémon in Western countries, begins with Ash's first day as a Pokémon trainer. His first Pokémon is a Pikachu, differing from the games, where only Bulbasaur, Charmander, or Squirtle could be chosen. The series follows the storyline of the original games, Pokémon Red and Blue, in the region of Kanto. Accompanying Ash on his journeys are Brock, the Pewter City Gym Leader, and Misty, the youngest of the Gym Leader sisters from Cerulean City. Pokémon: Adventures in the Orange Islands follows Ash's adventures in the Orange Islands, a place unique to the anime, and replaces Brock with Tracey Sketchit, an artist and "Pokémon watcher". The next series, based on the second generation of games, includes Pokémon: Johto Journeys, Pokémon: Johto League Champions, and Pokémon: Master Quest, following the original trio of Ash, Brock, and Misty in the western Johto region.
The saga continues in Pokémon: Advanced, based on the third generation games. Ash and company travel to Hoenn, a southern region in the Pokémon World. Ash takes on the role of a teacher and mentor for a novice Pokémon trainer named May. Her brother Max accompanies them, and though he isn't a trainer, he knows large amounts of handy information. Brock soon catches up with Ash, but Misty has returned to Cerulean City to tend to her duties as a gym leader. The Advanced series concludes with the Battle Frontier saga, based on the Emerald version and including aspects of FireRed and LeafGreen. It ended with Max leaving to pick his starter Pokémon and May going to the Grand Festival in Johto.
In the Diamond and Pearl series, based on the fourth generation games, Ash, Brock, and a new companion, an aspiring Pokémon coordinator named Dawn, travel through the region of Sinnoh. At the end of the series, Ash and Brock return to Kanto where Brock begins to follow his newfound dream of becoming a Pokémon doctor himself.
Pocket Monsters: Best Wishes!, based on the fifth generation games, features Ash and Pikachu traveling through the region of Unova alongside two new companions, Iris and Cilan, who part ways with them after returning to Kanto.
The currently airing series is based on the sixth generation games and follows Ash and Pikachu's journey through the region of Kalos, accompanied by Ash's childhood friend Serena and the siblings Clemont and Bonnie.
In addition to the TV series, nineteen Pokémon films have been made, with the pair of films, Pokémon the Movie: Black—Victini and Reshiram and White—Victini and Zekrom considered together as one. Collectible bonuses, such as promotional trading cards, have been available with some of the films. Various children's books, collectively known as Pokémon Junior, are also based on the anime.
Films  
Given release years are the original Japanese release years.
1. Pokémon: The First Movie—Mewtwo Strikes Back
2. Pokémon: The Movie 2000—The Power of One
3. Pokémon 3: The Movie—Spell of the Unown
4. Pokémon 4Ever—Celebi: Voice of the Forest
5. Pokémon Heroes
6. Pokémon: Jirachi Wish Maker
7. Pokémon: Destiny Deoxys
8. Pokémon: Lucario and the Mystery of Mew
9. Pokémon Ranger and the Temple of the Sea
10. Pokémon: The Rise of Darkrai
11. Pokémon: Giratina and the Sky Warrior
12. Pokémon: Arceus and the Jewel of Life
13. Pokémon: Zoroark: Master of Illusions
14. Pokémon the Movie: Black—Victini and Reshiram & Pokémon the Movie: White—Victini and Zekrom
15. Pokémon the Movie: Kyurem vs. the Sword of Justice
16. Pokémon the Movie: Genesect and the Legend Awakened
17. Pokémon the Movie: Diancie and the Cocoon of Destruction
18. Pokémon the Movie: Hoopa and the Clash of Ages
19. Pokémon the Movie: Volcanion and the Mechanical Marvel
The Hollywood Reporter announced that Warner Bros. Pictures, Sony Pictures and Legendary Pictures were in negotiations for a live action Pokémon movie. Deadline reported that Legendary was closing a deal for the film after Pokémon Go's success and would make a live action Detective Pikachu movie. Nicole Perlman and Alex Hirsch are penning the script. Dean Israelite, Robert Rodriguez, Tim Miller, Mark A.Z. Dippé, Shane Acker and Chris Wedge were being considered as potential directors. Toho will distribute the film in Japan, while Universal Pictures will distribute it outside Japan. On November 30, 2016, Deadline revealed that Legendary Entertainment had chosen Rob Letterman to direct the film. It was announced that the film will be titled Pokémon's Detective Pikachu.
Soundtracks  
Pokémon CDs have been released in North America, most of them in conjunction with the theatrical releases of the first three Pokémon films. These releases were commonplace until late 2001. On March 27, 2007, a tenth anniversary CD was released containing 18 tracks from the English dub; this was the first English-language release in over five years. Soundtracks of the Pokémon feature films have been released in Japan each year in conjunction with the theatrical releases.
Pokémon Trading Card Game  
The Pokémon Trading Card Game is a collectible card game with a goal similar to a Pokémon battle in the video game series. Players use Pokémon cards, with individual strengths and weaknesses, in an attempt to defeat their opponent by "knocking out" his or her Pokémon cards. The game was first published in North America by Wizards of the Coast in 1999. However, with the release of the Pokémon Ruby and Sapphire Game Boy Advance video games, The Pokémon Company took back the card game from Wizards of the Coast and started publishing the cards itself.
Manga  
There are various Pokémon manga series, four of which were released in English by Viz Media, and seven of which were released in English by Chuang Yi. The manga series vary from game-based series to ones based on the anime and the TCG. Original stories have also been published. As there are several series created by different authors, most Pokémon manga series differ greatly from each other and from other media, such as the anime. Pokémon Pocket Monsters and Pokémon Adventures are the only two manga that have run continuously since the first generation.
Manga released in English
The Electric Tale of Pikachu, a shōnen manga created by Toshihiro Ono. It was divided into four tankōbon, each given a separate title in the North American and English Singapore versions: The Electric Tale of Pikachu, Pikachu Shocks Back, Electric Pikachu Boogaloo, and Surf's Up, Pikachu. The series is based loosely on the anime.
Pokémon Adventures by Hidenori Kusaka, Mato, and Satoshi Yamamoto, the most popular Pokémon manga based on the video games. The series centers on the Pokémon Trainers who are called "Pokédex holders".
Magical Pokémon Journey, a shōjo manga
Pikachu Meets the Press
Ash & Pikachu
Pokémon Gold & Silver
Pokémon Ruby-Sapphire and Pokémon Pocket Monsters
Pokémon: Jirachi Wish Maker
Pokémon: Destiny Deoxys
Pokémon: Lucario and the Mystery of Mew
Pokémon Ranger and the Temple of the Sea
Pokémon Diamond and Pearl Adventure!
Pokémon Adventures: Diamond and Pearl / Platinum
Pokémon: The Rise of Darkrai
Pokémon: Giratina and the Sky Warrior
Pokémon: Arceus and the Jewel of Life
Pokémon: Zoroark: Master of Illusions
Pokémon The Movie: White: Victini and Zekrom
Pokémon Black and White
Manga not released in English
Pokémon Pocket Monsters by Kosaku Anakubo, the first Pokémon manga. It is chiefly a gag manga series starring a Pokémon Trainer named Red, his rude Clefairy, and Pikachu.
Pokémon Card ni Natta Wake by Kagemaru Himeno, an artist for the TCG. There are six volumes and each includes a special promotional card. The stories tell the tales of the art behind some of Himeno's cards.
Pokémon Get da ze! by Miho Asada
Pocket Monsters Chamo-Chamo ★ Pretty ♪ by Yumi Tsukirino, who also made Magical Pokémon Journey.
Pokémon Card Master
Pocket Monsters Emerald Chōsen!! Battle Frontier by Ihara Shigekatsu
Pocket Monsters Zensho by Satomi Nakamura
Monopoly  
A Monopoly board game was released in August 2014.
Criticism and controversy
Morality and religious beliefs  
Pokémon has been criticized by some Christians over perceived occult and violent themes and the concept of "Pokémon evolution", which they feel goes against the Biblical creation account in Genesis. However, Sat2000, a satellite television station based in Vatican City, has countered that the Pokémon Trading Card Game and video games are "full of inventive imagination" and have no "harmful moral side effects". In the United Kingdom, the "Christian Power Cards" game was introduced in 1999 by David Tate who stated, "Some people aren't happy with Pokémon and want an alternative, others just want Christian games." The game was similar to the Pokémon TCG but used Biblical figures.
In 1999, Nintendo stopped manufacturing the Japanese version of the "Koga's Ninja Trick" trading card because it depicted a manji, a traditionally Buddhist symbol with no negative connotations. The Jewish civil rights group Anti-Defamation League complained because the symbol is the reverse of a swastika, which is considered offensive to Jewish people. The cards were intended for sale in Japan only, but the popularity of Pokémon led to importation into the United States with approval from Nintendo. The Anti-Defamation League understood that the symbol at issue was not intended to offend and acknowledged the sensitivity that Nintendo showed by removing the product.
In 1999, two nine-year-old boys from Merrick, New York sued Nintendo because they claimed the Pokémon Trading Card Game caused their problematic gambling.
In 2001, Saudi Arabia banned Pokémon games and cards, alleging that the franchise promoted Zionism by displaying the Star of David in the trading cards as well as other religious symbols such as crosses they associated with Christianity and triangles they associated with Freemasonry; the games also involved gambling, which is in violation of Muslim doctrine.
Pokémon has also been accused of promoting materialism.
Animal cruelty  
In 2012, PETA publicly criticized the concept of Pokémon as supporting cruelty to animals. PETA compared the game's concept, of capturing animals and forcing them to fight, to cockfights, dog fighting rings and circuses, all events frequently criticized for cruelty to animals. PETA released a game spoofing Pokémon where the Pokémon battle their trainers to win their freedom. PETA reaffirmed their objections in 2016 with the release of Pokémon Go, promoting the hashtag #GottaFreeThemAll.
Health  
On December 16, 1997, more than 635 Japanese children were admitted to hospitals with epileptic seizures. It was determined the seizures were caused by watching an episode of Pokémon, "Dennō Senshi Porygon"; as a result, this episode has not been aired since. In this particular episode, there were bright explosions with rapidly alternating blue and red color patterns. It was determined in subsequent research that these strobing light effects cause some individuals to have epileptic seizures, even if the person had no previous history of epilepsy. This incident is a common focus of Pokémon-related parodies in other media, and was lampooned by the Simpsons episode "Thirty Minutes over Tokyo" and the South Park episode "Chinpokomon", among others.
Monster in My Pocket  
In March 2000, Morrison Entertainment Group, a small toy developer based at Manhattan Beach, California, sued Nintendo over claims that Pokémon infringed on its own Monster in My Pocket characters. A judge ruled there was no infringement, so Morrison appealed the ruling. On February 4, 2003, the U.S. Court of Appeals for the Ninth Circuit affirmed the decision by the District Court to dismiss the suit.
Pokémon Go  
Within its first two days of release, Pokémon Go raised safety concerns among players. Multiple people also suffered minor injuries from falling while playing the game due to being distracted.
Multiple police departments in various countries have issued warnings, some tongue-in-cheek, regarding inattentive driving, trespassing, and being targeted by criminals due to being unaware of one's surroundings. People have suffered various injuries from accidents related to the game, and Bosnian players have been warned to stay out of minefields left over from the 1990s Bosnian War. On July 20, 2016, it was reported that an 18-year-old boy in Chiquimula, Guatemala was shot and killed while playing the game in the late evening hours. This was the first reported death in connection with the app. The boy's 17-year-old cousin, who was accompanying the victim, was shot in the foot. Police speculated that the shooters used the game's GPS capability to find the two.
Cultural influence
Pokémon, being a globally popular franchise, has left a significant mark on today's pop culture. The Pokémon characters themselves have become pop culture icons; examples include two different Pikachu balloons in the Macy's Thanksgiving Day Parade, Pokémon Jets operated by All Nippon Airways, thousands of merchandise items, and a traveling theme park that was in Nagoya, Japan in 2005 and in Taipei in 2006. Pokémon also appeared on the cover of the U.S. magazine Time in 1999. The Comedy Central show Drawn Together has a character named Ling-Ling who is a direct parody of Pikachu. Several other shows such as ReBoot, The Simpsons, South Park, The Grim Adventures of Billy & Mandy, Robot Chicken, All Grown Up!, and Johnny Test have made references and spoofs of Pokémon, among other series. Pokémon was also featured on VH1's I Love the '90s: Part Deux. A live action show called Pokémon Live! toured the United States in late 2000. It was based on the popular Pokémon anime, but had some continuity errors relating to it. Jim Butcher cites Pokémon as one of the inspirations for the Codex Alera series of novels.
In November 2001, Nintendo opened a store called the Pokémon Center in New York, in New York's Rockefeller Center, modeled after the two other Pokémon Center stores in Tokyo and Osaka and named after a staple of the videogame series; Pokémon Centers are fictional buildings where Trainers take their injured Pokémon to be healed after combat. The store sold Pokémon merchandise on a total of two floors, with items ranging from collectible shirts to stuffed Pokémon plushies. The store also featured a Pokémon Distributing Machine in which players would place their game to receive an egg of a Pokémon that was being given out at that time. The store also had tables that were open for players of the Pokémon Trading Card Game to duel each other or an employee. The store was closed and replaced by the Nintendo World Store on May 14, 2005. Three Pokémon Center kiosks were put in malls in Washington, with one in Tacoma and one in Seattle currently remaining. The Pokémon Center online store was relaunched on August 6, 2014.
Joseph Jay Tobin theorizes that the success of the franchise was mainly due to the long list of names that could be learned by children and repeated in their peer groups. The rich fictional universe provided a lot of opportunities for discussion and demonstration of knowledge in front of their peers. In the French version Nintendo took care to translate the names of the creatures so that they reflected the French culture and language. In all cases the names of the creatures were linked to their characteristics, which converged with the children's belief that names have symbolic power. Children could pick their favourite Pokémon and affirm their individuality while at the same time affirming their conformance to the values of the group, and they could distinguish themselves from other kids by asserting what they liked and what they didn't like from every chapter. Pokémon gained popularity because it provided a sense of identity to a wide variety of children, and lost it quickly when many of those children found that the identity groups were too big and searched for identities that would distinguish them into smaller groups.
Pokémon's history has been marked at times by rivalry with the Digimon media franchise that debuted at a similar time. Described as "the other 'mon" by IGN's Juan Castro, Digimon has not enjoyed Pokémon's level of international popularity or success, but has maintained a dedicated fanbase. IGN's Lucas M. Thomas stated that Pokémon is Digimon's "constant competition and comparison", attributing the former's relative success to the simplicity of its evolution mechanic as opposed to Digivolution. The two have been noted for conceptual and stylistic similarities by sources such as GameZone. A debate among fans exists over which of the two franchises came first. In actuality, the first Pokémon media, Pokémon Red and Green, were released initially on February 27, 1996, whereas the Digimon virtual pet was released on June 26, 1997.
Fan community  
While Pokémon's target demographic is young boys, early purchasers of the latest games, Pokémon Omega Ruby and Alpha Sapphire, were in their 20s. Many fans are adults who originally played the games as children and later returned to the series.
Bulbapedia, a wiki-based encyclopedia associated with longtime fan site Bulbagarden, is the "Internet's most detailed Pokémon database project".
A significant community around the Pokémon video games' metagame has existed for a long time, analyzing the best ways to use each Pokémon to their full potential in competitive battles. The most prolific competitive community is Smogon University, which has created a widely accepted tier-based battle system.
In early 2014, an anonymous video streamer on Twitch launched Twitch Plays Pokémon, an experiment trying to crowdsource playing subsequent Pokémon games starting with Pokémon Red.
A challenge called the Nuzlocke Challenge was created in order for older players of the series to enjoy Pokémon again, but with a twist. When a Pokémon faints it is considered "dead" and must be released or stored in the PC permanently. If the player blacks out or whites out, the game is considered over and the player must restart. The original idea consisted of only 2 to 3 rules, which the community has since built upon. Many fan-made Pokémon games contain a game mode similar to the Nuzlocke Challenge, such as Pokémon Uranium.
An online Pokémon game was created called Pokémon Showdown, in which players create a team and battle against other players around the world.
See also
List of Pokémon
List of Pokémon chapters
List of Pokémon characters
List of Pokémon episodes
List of Pokémon video games
Pokémon episodes removed from rotation
References
Tobin, Joseph, ed. Pikachu's Global Adventure: The Rise and Fall of Pokémon. Duke University Press. ISBN 0-8223-3287-6.


Bibliography:
Wikipedia
@baygross















































Tuesday, January 10, 2017

Summary ECOLOGY BIOB50 - MILLON and Group Sparrow Hawk- UTSC Uoft scarborough

The research study conducted by Millon and group was a long-term study focusing on the growth rate of the Eurasian sparrowhawk and the composition of its prey community. The main purpose of this study was to determine whether there are changes in the composition of the prey community, by analyzing temporal trends within the prey community of the sparrowhawk, and how these changes resonate into the dynamics of the sparrowhawk population (Millon 2009). They conducted this study in a vast area in Denmark and visited the breeding areas one to five times annually during the breeding season (Millon 2009). They then identified the breeding females and aged them using the features of their feathers. The researchers also obtained mean temperatures from meteorological stations and the average temperature of the three coldest months in winter and early spring. These periods were assumed to be the most critical for sparrowhawks (Millon 2009). The diet of the predator was then analyzed during the breeding season; it consisted mainly of blackbird and skylark. During the years of study, as the abundance of these two main prey species increased, the diet diversity of the sparrowhawk decreased (Millon 2009). When the abundance of these two populations decreased during the winter, the sparrowhawk population also started to decline. The analysis shows that both the winter climate and prey availability contribute to the growth and decline of the sparrowhawk population. Prey vulnerability to predation differs according to coloration, behavior, and parasite load of prey, and can also be altered by changes in habitat structure due to farming practice (Millon 2009). The changes in the composition of this avian prey community show the effect of variation in prey abundance on predators (Millon 2009).
During the mid-80s, when the populations of blackbird and skylark declined drastically during the harsh winter, the sparrowhawk population also started to decline. Harsh winter climates affected both the predator and prey species, making hunting, food gathering and the maintenance of physiological processes very challenging. The data observed during the study also explain how predator and prey abundance depend on each other (Millon 2009). In 1987 the skylark population hit rock bottom while the blackbird population persisted, which caused a severe decline in the sparrowhawk population for a brief period before it recovered. The use of organochlorine pesticide also led to a decline in the sparrowhawk population. “The number of sparrowhawks in the study area fell from 51 to 38 pairs between 1978 – 82” (Millon 2009). This pesticide killed adult sparrowhawks directly before breeding. The combined effect of predation along with the harsh environmental stressors had been predicted to drive the skylarks to small numbers (Millon 2009). Intra-guild predation by the northern goshawk also brought down the growth rate, as northern goshawks regularly prey on other raptors, including sparrowhawks. Finally, the study revealed that