diff --git a/.gitignore b/.gitignore index 974dc1146..bd8ec03d7 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,4 @@ .Rproj.user .RData +*.db diff --git a/02_RProgramming/DataTypes/index.Rmd b/02_RProgramming/DataTypes/index.Rmd index 65eb1ce54..694d14494 100644 --- a/02_RProgramming/DataTypes/index.Rmd +++ b/02_RProgramming/DataTypes/index.Rmd @@ -200,7 +200,9 @@ NAs introduced by coercion > as.logical(x) [1] NA NA NA > as.complex(x) -[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i +[1] NA NA NA NA +Warning message: +NAs introduced by coercion ``` --- @@ -472,4 +474,4 @@ Data Types - data frames -- names \ No newline at end of file +- names diff --git a/02_RProgramming/DataTypes/index.html b/02_RProgramming/DataTypes/index.html index 9b50617cb..60f66d06a 100644 --- a/02_RProgramming/DataTypes/index.html +++ b/02_RProgramming/DataTypes/index.html @@ -263,7 +263,9 @@

Explicit Coercion

> as.logical(x) [1] NA NA NA > as.complex(x) -[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i +[1] NA NA NA NA +Warning message: +NAs introduced by coercion @@ -636,4 +638,4 @@

Summary

- \ No newline at end of file + diff --git a/02_RProgramming/DataTypes/index.md b/02_RProgramming/DataTypes/index.md index ccd9ff364..694d14494 100644 --- a/02_RProgramming/DataTypes/index.md +++ b/02_RProgramming/DataTypes/index.md @@ -200,7 +200,9 @@ NAs introduced by coercion > as.logical(x) [1] NA NA NA > as.complex(x) -[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i +[1] NA NA NA NA +Warning message: +NAs introduced by coercion ``` --- diff --git a/02_RProgramming/assets/img/Thumbs.db b/02_RProgramming/assets/img/Thumbs.db new file mode 100644 index 000000000..cdea17aff Binary files /dev/null and b/02_RProgramming/assets/img/Thumbs.db differ diff --git a/04_ExploratoryAnalysis/assets/img/Thumbs.db b/04_ExploratoryAnalysis/assets/img/Thumbs.db new file mode 100644 index 000000000..966dbaaf7 Binary files /dev/null and b/04_ExploratoryAnalysis/assets/img/Thumbs.db differ diff --git a/06_StatisticalInference/01_01_Introduction/index.pdf b/06_StatisticalInference/01_01_Introduction/index.pdf deleted file mode 100644 index 70d9be1bc..000000000 Binary files a/06_StatisticalInference/01_01_Introduction/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/01_02_Probability/index.pdf b/06_StatisticalInference/01_02_Probability/index.pdf deleted file mode 100644 index b431ce394..000000000 Binary files a/06_StatisticalInference/01_02_Probability/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/01_03_Expectations/index.pdf b/06_StatisticalInference/01_03_Expectations/index.pdf deleted file mode 100644 index c9c43b5a3..000000000 Binary files a/06_StatisticalInference/01_03_Expectations/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/01_04_Independence/index.pdf b/06_StatisticalInference/01_04_Independence/index.pdf deleted file mode 100644 index ba92e4d8e..000000000 Binary files a/06_StatisticalInference/01_04_Independence/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/01_05_ConditionalProbability/index.pdf 
b/06_StatisticalInference/01_05_ConditionalProbability/index.pdf deleted file mode 100644 index 7cbac08c4..000000000 Binary files a/06_StatisticalInference/01_05_ConditionalProbability/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/01_Introduction/fig/fmri-salmon.jpg b/06_StatisticalInference/01_Introduction/fig/fmri-salmon.jpg new file mode 100644 index 000000000..41bb6154b Binary files /dev/null and b/06_StatisticalInference/01_Introduction/fig/fmri-salmon.jpg differ diff --git a/06_StatisticalInference/01_Introduction/index.Rmd b/06_StatisticalInference/01_Introduction/index.Rmd new file mode 100644 index 000000000..dde2d8720 --- /dev/null +++ b/06_StatisticalInference/01_Introduction/index.Rmd @@ -0,0 +1,161 @@ +--- +title : Introduction to statistical inference +subtitle : Statistical inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Statistical inference defined + +Statistical inference is the process of drawing formal conclusions from +data. + +In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for. + +--- + +## Motivating example: who's going to win the election? + +In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear: the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate. + +We cannot poll everyone.
Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win? + +--- + +## Motivating example: is hormone replacement therapy effective? + +A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a prespecified statistical protocol, the study was stopped early due to an excess number of negative events.** + +Here there are two inferential problems. + +1. Is HRT effective? +2. How long should we continue the trial in the presence of contrary +evidence? + +See the WHI writing group paper, JAMA 2002, Vol 288:321-333, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts. + +--- + +## Motivating example +### Brain activation + +![fMRI salmon study](fig/fmri-salmon.jpg 'fMRI salmon study') + +http://www.wired.com/2009/09/fmrisalmon/ + + +--- + +## Summary + +- These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population. +- Paramount among our concerns are: + - Is the sample representative of the population that we'd like to draw inferences about? + - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? + - Is there systematic bias created by missing data or the design or conduct of the study? + - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex unknown processes. + - Are we trying to estimate an underlying mechanistic model of phenomena under study?
+- Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data. + +--- +## Example goals of inference + +1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will + vote for a candidate). +2. Determine whether a population quantity + is a benchmark value ("is the treatment effective?"). +3. Infer a mechanistic relationship when quantities are measured with + noise ("What is the slope for Hooke's law?") +4. Determine the impact of a policy ("If we reduce pollution levels, + will asthma rates decline?") +5. Talk about the probability that something occurs. + +--- +## Example tools of the trade + +1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest +2. Random sampling: concerned with obtaining data that is representative +of the population of interest +3. Sampling models: concerned with creating a model for the sampling +process, the most common being the so-called "iid" model. +4. Hypothesis testing: concerned with decision making in the presence of uncertainty +5. Confidence intervals: concerned with quantifying uncertainty in +estimation +6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated. +7. Study design: the process of designing an experiment to minimize biases and variability. +8. Nonparametric bootstrapping: the process of using the data to, + with minimal probability model assumptions, create inferences. +9. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences. + +--- +## Different thinking about probability leads to different styles of inference + +We won't spend too much time talking about this, but there are several different +styles of inference. Two broad categories that get discussed a lot are: + +1.
Frequency probability: the long run proportion of + times an event occurs in independent, identically distributed + repetitions. +2. Frequency inference: uses frequency interpretations of probabilities +to control error rates. Answers questions like "What should I decide +given my data, controlling the long run proportion of mistakes I make at +a tolerable level?" +3. Bayesian probability: the probability calculus of beliefs, given that beliefs follow certain rules. +4. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?" + +Data scientists tend to fall within shades of gray of these and various other schools of inference. + +--- +## In this class + +* In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences. +* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping. +* As probability modeling will be our starting point, we first build +up basic probability. + +--- +## Where to learn more on the topics not covered + +1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys. +2. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials. +3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many). +4. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data. +5. Study design: consider looking in the subject matter area that + you are interested in; some examples with rich histories in design: + 1.
The epidemiological literature is very focused on using study design to investigate public health. + 2. The classical development of study design in agriculture broadly covers design and design principles. + 3. The industrial quality control literature covers design thoroughly. + diff --git a/06_StatisticalInference/01_Introduction/index.html b/06_StatisticalInference/01_Introduction/index.html new file mode 100644 index 000000000..2772c7646 --- /dev/null +++ b/06_StatisticalInference/01_Introduction/index.html @@ -0,0 +1,361 @@ + + + + Introduction to statistical inference + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Introduction to statistical inference

+

Statistical inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Statistical inference defined

+
+
+

Statistical inference is the process of drawing formal conclusions from +data.

+ +

In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for.

+ +
+ +
+ + +
+

Motivating example: who's going to win the election?

+
+
+

In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear: the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate.

+ +

We cannot poll everyone. Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win?

+ +
+ +
+ + +
+

Motivating example: is hormone replacement therapy effective?

+
+
+

A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. Based on a prespecified statistical protocol, the study was stopped early due to an excess number of negative events.

+ +

Here there are two inferential problems.

+ +
    +
  1. Is HRT effective?
  2. +
  3. How long should we continue the trial in the presence of contrary +evidence?
  4. +
+ +

See the WHI writing group paper, JAMA 2002, Vol 288:321-333, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.

+ +
+ +
+ + +
+

Motivating example

+
+ + +
+ + +
+

Summary

+
+
+
    +
  • These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population.
  • +
  • Paramount among our concerns are: + +
      +
    • Is the sample representative of the population that we'd like to draw inferences about?
    • +
    • Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
    • +
    • Is there systematic bias created by missing data or the design or conduct of the study?
    • +
    • What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex unknown processes.
    • +
    • Are we trying to estimate an underlying mechanistic model of phenomena under study?
    • +
  • +
  • Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data.
  • +
+ +
+ +
+ + +
+

Example goals of inference

+
+
+
    +
  1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will +vote for a candidate).
  2. +
  3. Determine whether a population quantity +is a benchmark value ("is the treatment effective?").
  4. +
  5. Infer a mechanistic relationship when quantities are measured with +noise ("What is the slope for Hooke's law?")
  6. +
  7. Determine the impact of a policy ("If we reduce pollution levels, +will asthma rates decline?")
  8. +
  9. Talk about the probability that something occurs.
  10. +
+ +
+ +
+ + +
+

Example tools of the trade

+
+
+
    +
  1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
  2. +
  3. Random sampling: concerned with obtaining data that is representative +of the population of interest
  4. +
  5. Sampling models: concerned with creating a model for the sampling +process, the most common being the so-called "iid" model.
  6. +
  7. Hypothesis testing: concerned with decision making in the presence of uncertainty
  8. +
  9. Confidence intervals: concerned with quantifying uncertainty in +estimation
  10. +
  11. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated.
  12. +
  13. Study design: the process of designing an experiment to minimize biases and variability.
  14. +
  15. Nonparametric bootstrapping: the process of using the data to, +with minimal probability model assumptions, create inferences.
  16. +
  17. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences.
  18. +
+ +
+ +
+ + +
+

Different thinking about probability leads to different styles of inference

+
+
+

We won't spend too much time talking about this, but there are several different +styles of inference. Two broad categories that get discussed a lot are:

+ +
    +
  1. Frequency probability: the long run proportion of +times an event occurs in independent, identically distributed +repetitions.
  2. +
  3. Frequency inference: uses frequency interpretations of probabilities +to control error rates. Answers questions like "What should I decide +given my data, controlling the long run proportion of mistakes I make at +a tolerable level?"
  4. +
  5. Bayesian probability: the probability calculus of beliefs, given that beliefs follow certain rules.
  6. +
  7. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?"
  8. +
+ +

Data scientists tend to fall within shades of gray of these and various other schools of inference.

+ +
+ +
+ + +
+

In this class

+
+
+
    +
  • In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences.
  • +
  • Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping.
  • +
  • As probability modeling will be our starting point, we first build +up basic probability.
  • +
+ +
+ +
+ + +
+

Where to learn more on the topics not covered

+
+
+
    +
  1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys.
  2. +
  3. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials.
  4. +
  3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
  6. +
  7. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data.
  8. +
  9. Study design: consider looking in the subject matter area that +you are interested in; some examples with rich histories in design: + +
      +
    1. The epidemiological literature is very focused on using study design to investigate public health.
    2. +
    3. The classical development of study design in agriculture broadly covers design and design principles.
    4. +
    5. The industrial quality control literature covers design thoroughly.
    6. +
  10. +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_Introduction/index.md b/06_StatisticalInference/01_Introduction/index.md new file mode 100644 index 000000000..dde2d8720 --- /dev/null +++ b/06_StatisticalInference/01_Introduction/index.md @@ -0,0 +1,161 @@ +--- +title : Introduction to statistical inference +subtitle : Statistical inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Statistical inference defined + +Statistical inference is the process of drawing formal conclusions from +data. + +In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for. + +--- + +## Motivating example: who's going to win the election? + +In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear: the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate. + +We cannot poll everyone. Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win? + +--- + +## Motivating example: is hormone replacement therapy effective?
+ +A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a prespecified statistical protocol, the study was stopped early due to an excess number of negative events.** + +Here there are two inferential problems. + +1. Is HRT effective? +2. How long should we continue the trial in the presence of contrary +evidence? + +See the WHI writing group paper, JAMA 2002, Vol 288:321-333, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts. + +--- + +## Motivating example +### Brain activation + +![fMRI salmon study](fig/fmri-salmon.jpg 'fMRI salmon study') + +http://www.wired.com/2009/09/fmrisalmon/ + + +--- + +## Summary + +- These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population. +- Paramount among our concerns are: + - Is the sample representative of the population that we'd like to draw inferences about? + - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? + - Is there systematic bias created by missing data or the design or conduct of the study? + - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex unknown processes. + - Are we trying to estimate an underlying mechanistic model of phenomena under study? +- Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data. + +--- +## Example goals of inference + +1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will + vote for a candidate). +2.
Determine whether a population quantity + is a benchmark value ("is the treatment effective?"). +3. Infer a mechanistic relationship when quantities are measured with + noise ("What is the slope for Hooke's law?") +4. Determine the impact of a policy ("If we reduce pollution levels, + will asthma rates decline?") +5. Talk about the probability that something occurs. + +--- +## Example tools of the trade + +1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest +2. Random sampling: concerned with obtaining data that is representative +of the population of interest +3. Sampling models: concerned with creating a model for the sampling +process, the most common being the so-called "iid" model. +4. Hypothesis testing: concerned with decision making in the presence of uncertainty +5. Confidence intervals: concerned with quantifying uncertainty in +estimation +6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated. +7. Study design: the process of designing an experiment to minimize biases and variability. +8. Nonparametric bootstrapping: the process of using the data to, + with minimal probability model assumptions, create inferences. +9. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences. + +--- +## Different thinking about probability leads to different styles of inference + +We won't spend too much time talking about this, but there are several different +styles of inference. Two broad categories that get discussed a lot are: + +1. Frequency probability: the long run proportion of + times an event occurs in independent, identically distributed + repetitions. +2. Frequency inference: uses frequency interpretations of probabilities +to control error rates.
Answers questions like "What should I decide +given my data, controlling the long run proportion of mistakes I make at +a tolerable level?" +3. Bayesian probability: the probability calculus of beliefs, given that beliefs follow certain rules. +4. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?" + +Data scientists tend to fall within shades of gray of these and various other schools of inference. + +--- +## In this class + +* In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences. +* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping. +* As probability modeling will be our starting point, we first build +up basic probability. + +--- +## Where to learn more on the topics not covered + +1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys. +2. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials. +3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many). +4. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data. +5. Study design: consider looking in the subject matter area that + you are interested in; some examples with rich histories in design: + 1. The epidemiological literature is very focused on using study design to investigate public health. + 2. The classical development of study design in agriculture broadly covers design and design principles. + 3.
The industrial quality control literature covers design thoroughly. + diff --git a/06_StatisticalInference/01_Introduction/index.pdf b/06_StatisticalInference/01_Introduction/index.pdf new file mode 100644 index 000000000..ba632a641 Binary files /dev/null and b/06_StatisticalInference/01_Introduction/index.pdf differ diff --git a/06_StatisticalInference/02_01_CommonDistributions/index.pdf b/06_StatisticalInference/02_01_CommonDistributions/index.pdf deleted file mode 100644 index 633936fe6..000000000 Binary files a/06_StatisticalInference/02_01_CommonDistributions/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/02_02_Asymptopia/index.pdf b/06_StatisticalInference/02_02_Asymptopia/index.pdf deleted file mode 100644 index 7fcb65b90..000000000 Binary files a/06_StatisticalInference/02_02_Asymptopia/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/02_03_tCIs/index.pdf b/06_StatisticalInference/02_03_tCIs/index.pdf deleted file mode 100644 index 855c3502b..000000000 Binary files a/06_StatisticalInference/02_03_tCIs/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/02_04_Likeklihood/index.pdf b/06_StatisticalInference/02_04_Likeklihood/index.pdf deleted file mode 100644 index 85787788c..000000000 Binary files a/06_StatisticalInference/02_04_Likeklihood/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/02_05_Bayes/index.pdf b/06_StatisticalInference/02_05_Bayes/index.pdf deleted file mode 100644 index c8043e4a1..000000000 Binary files a/06_StatisticalInference/02_05_Bayes/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..b4fb0bd35 Binary files /dev/null and b/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-2.png 
b/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..ff974fda8 Binary files /dev/null and b/06_StatisticalInference/02_Probability/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-1.png b/06_StatisticalInference/02_Probability/figure/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-1.png rename to 06_StatisticalInference/02_Probability/figure/unnamed-chunk-1.png diff --git a/06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-2.png b/06_StatisticalInference/02_Probability/figure/unnamed-chunk-2.png similarity index 100% rename from 06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-2.png rename to 06_StatisticalInference/02_Probability/figure/unnamed-chunk-2.png diff --git a/06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-4.png b/06_StatisticalInference/02_Probability/figure/unnamed-chunk-4.png similarity index 100% rename from 06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-4.png rename to 06_StatisticalInference/02_Probability/figure/unnamed-chunk-4.png diff --git a/06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-6.png b/06_StatisticalInference/02_Probability/figure/unnamed-chunk-6.png similarity index 100% rename from 06_StatisticalInference/01_02_Probability/figure/unnamed-chunk-6.png rename to 06_StatisticalInference/02_Probability/figure/unnamed-chunk-6.png diff --git a/06_StatisticalInference/02_Probability/index.Rmd b/06_StatisticalInference/02_Probability/index.Rmd new file mode 100644 index 000000000..9f81bb399 --- /dev/null +++ b/06_StatisticalInference/02_Probability/index.Rmd @@ -0,0 +1,586 @@ +<<<<<<< HEAD:06_StatisticalInference/01_02_Probability/index.Rmd +--- +title : Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School 
of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Notation + +- The **sample space**, $\Omega$, is the collection of possible outcomes of an experiment + - Example: die roll $\Omega = \{1,2,3,4,5,6\}$ +- An **event**, say $E$, is a subset of $\Omega$ + - Example: die roll is even $E = \{2,4,6\}$ +- An **elementary** or **simple** event is a particular result + of an experiment + - Example: die roll is a four, $\omega = 4$ +- $\emptyset$ is called the **null event** or the **empty set** + +--- + +## Interpretation of set operations + +Normal set operations have particular interpretations in this setting + +1. $\omega \in E$ implies that $E$ occurs when $\omega$ occurs +2. $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs +3. $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$ +4. $E \cap F$ implies the event that both $E$ and $F$ occur +5. $E \cup F$ implies the event that at least one of $E$ or $F$ occur +6. $E \cap F=\emptyset$ means that $E$ and $F$ are **mutually exclusive**, or cannot both occur +7. $E^c$ or $\bar E$ is the event that $E$ does not occur + +--- + +## Probability + +A **probability measure**, $P$, is a function from the collection of possible events so that the following hold + +1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$ +2. $P(\Omega) = 1$ +3. If $E_1$ and $E_2$ are mutually exclusive events + $P(E_1 \cup E_2) = P(E_1) + P(E_2)$. + +Part 3 of the definition implies **finite additivity** + +$$ +P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) +$$ +where the $\{A_i\}$ are mutually exclusive. (Note a more general version of +additivity is used in advanced classes.) 
+ + +--- + + +## Example consequences + +- $P(\emptyset) = 0$ +- $P(E) = 1 - P(E^c)$ +- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ +- if $A \subset B$ then $P(A) \leq P(B)$ +- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$ +- $P(A \cap B^c) = P(A) - P(A \cap B)$ +- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$ +- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$ + +--- + +## Example + +The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one of these sorts of sleep problems? + +--- + +## Example continued + +Answer: No, the events are not mutually exclusive. To elaborate let: + +$$ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +$$ + +Then + +$$ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +$$ +Likely, some fraction of the population has both. + +--- + +## Random variables + +- A **random variable** is a numerical outcome of an experiment. +- The random variables that we study will come in two varieties, + **discrete** or **continuous**. +- Discrete random variables are random variables that take on only a +countable number of possibilities. + * $P(X = k)$ +- Continuous random variables can take any value on the real line or some subset of the real line.
+ * $P(X \in A)$ + +--- + +## Examples of variables that can be thought of as random variables + +- The $(0-1)$ outcome of the flip of a coin +- The outcome from the roll of a die +- The BMI of a subject four years after a baseline measurement +- The hypertension status of a subject randomly drawn from a population + +--- + +## PMF + +A probability mass function evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf, a function $p$ must satisfy + + 1. $p(x) \geq 0$ for all $x$ + 2. $\sum_{x} p(x) = 1$ + +The sum is taken over all of the possible values for $x$. + +--- + +## Example + +Let $X$ be the result of a coin flip where $X=0$ represents +tails and $X = 1$ represents heads. +$$ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ +Suppose that we do not know whether or not the coin is fair; let +$\theta$ be the probability of a head expressed as a proportion +(between 0 and 1). +$$ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ + +--- + +## PDF + +A probability density function (pdf) is a function associated with +a continuous random variable + + *Areas under pdfs correspond to probabilities for that random variable* + +To be a valid pdf, a function $f$ must satisfy + +1. $f(x) \geq 0$ for all $x$ + +2. The area under $f(x)$ is one. + +--- +## Example + +Suppose that the proportion of help calls that get addressed in +a random day by a help line is given by +$$ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 1 > x > 0 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +$$ + +Is this a mathematically valid density? + +--- + +```{r, fig.height = 5, fig.width = 5, echo = TRUE, fig.align='center'} +x <- c(-0.5, 0, 1, 1, 1.5); y <- c( 0, 0, 2, 0, 0) +plot(x, y, lwd = 3, frame = FALSE, type = "l") +``` + +--- + +## Example continued + +What is the probability that 75% or fewer of calls get addressed?
+ +```{r, fig.height = 5, fig.width = 5, echo = FALSE, fig.align='center'} +plot(x, y, lwd = 3, frame = FALSE, type = "l") +polygon(c(0, .75, .75, 0), c(0, 0, 1.5, 0), lwd = 3, col = "lightblue") +``` + +--- +```{r} +1.5 * .75 / 2 +pbeta(.75, 2, 1) +``` +--- + +## CDF and survival function + +- The **cumulative distribution function** (CDF) of a random variable $X$ is defined as the function +$$ +F(x) = P(X \leq x) +$$ +- This definition applies regardless of whether $X$ is discrete or continuous. +- The **survival function** of a random variable $X$ is defined as +$$ +S(x) = P(X > x) +$$ +- Notice that $S(x) = 1 - F(x)$ +- For continuous random variables, the PDF is the derivative of the CDF + +--- + +## Example + +What are the survival function and CDF from the density considered before? + +For $1 \geq x \geq 0$ +$$ +F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 +$$ + +$$ +S(x) = 1 - x^2 +$$ + +```{r} +pbeta(c(0.4, 0.5, 0.6), 2, 1) +``` + +--- + +## Quantiles + +- The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that +$$ +F(x_\alpha) = \alpha +$$ +- A **percentile** is simply a quantile with $\alpha$ expressed as a percent +- The **median** is the $50^{th}$ percentile + +--- +## Example +- We want to solve $0.5 = F(x) = x^2$ +- Resulting in the solution +```{r, echo = TRUE} +sqrt(0.5) +``` +- Therefore, on a random day the median proportion of calls answered is about `r sqrt(0.5)`. +- R can approximate quantiles for you for common distributions + +```{r} +qbeta(0.5, 2, 1) +``` + +--- + +## Summary + +- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" +- What we're referring to are **population quantities**. Therefore, the median being + discussed is the **population median**. +- A probability model connects the data to the population using assumptions.
+- Therefore the median we're discussing is the **estimand**; the sample median will be the **estimator** +======= +--- +title : Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Probability + +- In these slides we will cover the basics of probability at a low enough level +to have a basic understanding for the rest of the series +- For a more complete treatment see the class Mathematical Biostatistics Boot Camp 1 + - Youtube: www.youtube.com/playlist?list=PLpl-gQkQivXhk6qSyiNj51qamjAtZISJ- + - Coursera: www.coursera.org/course/biostats + - Git: http://github.com/bcaffo/Caffo-Coursera + + +--- + +## Probability + +Given a random experiment (say rolling a die), a probability measure is a population quantity +that summarizes the randomness. + +Specifically, probability takes a possible outcome from the experiment and: + +- assigns it a number between 0 and 1 +- so that the probability that something occurs is 1 (the die must be rolled) +and +- so that the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) +is the sum of their respective probabilities. + + +The Russian mathematician Kolmogorov formalized these rules.
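A minimal R sketch makes these rules concrete for the die roll (the fair-die probabilities below are an illustrative assumption, not part of the formalism):

```r
p <- rep(1/6, 6)        # assign each outcome a number between 0 and 1
all(p >= 0 & p <= 1)    # every assignment lies in [0, 1]
sum(p)                  # the die must be rolled: probabilities total 1
# mutually exclusive events add: P(odd) + P(even) covers everything
sum(p[c(1, 3, 5)]) + sum(p[c(2, 4, 6)])
```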
+ +--- + + +## Rules probability must follow + +- The probability that nothing occurs is 0 +- The probability that something occurs is 1 +- The probability of something is 1 minus the probability that the opposite occurs +- The probability of at least one of + two (or more) things that can not simultaneously occur (mutually exclusive) + is the sum of their + respective probabilities +- If an event A implies the occurrence of event B, then the probability of A +occurring is less than the probability that B occurs +- For any two events the probability that at least one occurs is the sum of their + probabilities minus their intersection. + +--- + +## Example + +The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? + +--- + +## Example continued + +Answer: No, the events can simultaneously occur and so +are not mutually exclusive. To elaborate let: + +--- +## If you want to see the mathematics + +$$ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +$$ + +Then + +$$ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +$$ +Likely, some fraction of the population has both. + +--- +## Going further + +Probability calculus is useful for understanding the rules that probabilities +must follow. + +However, we need ways to model and think about probabilities for +numeric outcomes of experiments (broadly defined). + +Densities and mass functions for random variables are the best starting point for this. 
+ +Remember, everything we're talking about up to this point is a population quantity, +not a statement about what occurs in the data. +- Where we're going with this: use the data to estimate properties of the population. + +--- +## Random variables + +- A **random variable** is a numerical outcome of an experiment. +- The random variables that we study will come in two varieties, + **discrete** or **continuous**. +- Discrete random variables are random variables that take on only a +countable number of possibilities and we talk about the probability that they +take specific values +- Continuous random variables can conceptually take any value on the real line or some subset of the real line and we talk about the probability that they lie within +some range + +--- + +## Examples of variables that can be thought of as random variables + +Experiments that we use for intuition and building context +- The $(0-1)$ outcome of the flip of a coin +- The outcome from the roll of a die + +Specific instances of treating variables as if random +- The web site traffic on a given day +- The BMI of a subject four years after a baseline measurement +- The hypertension status of a subject randomly drawn from a population +- The number of people who click on an ad +- Intelligence quotients for a sample of children + +--- + +## PMF + +A probability mass function evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf, a function $p$ must satisfy + + 1. It must always be larger than or equal to 0. + 2. The probabilities of all the possible values that the random variable can take have to add up to one. + +--- + +## Example + +Let $X$ be the result of a coin flip where $X=0$ represents +tails and $X = 1$ represents heads. +$$ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ +Suppose that we do not know whether or not the coin is fair; let +$\theta$ be the probability of a head expressed as a proportion +(between 0 and 1).
+$$ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ + +--- + +## PDF + +A probability density function (pdf) is a function associated with +a continuous random variable + + *Areas under pdfs correspond to probabilities for that random variable* + +To be a valid pdf, a function must satisfy + +1. It must be larger than or equal to zero everywhere. + +2. The total area under it must be one. + +--- +## Example + +Suppose that the proportion of help calls that get addressed in +a random day by a help line is given by +$$ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 0 < x < 1 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +$$ + +Is this a mathematically valid density? + +--- + +```{r, fig.height = 5, fig.width = 5, echo = TRUE, fig.align='center'} +x <- c(-0.5, 0, 1, 1, 1.5); y <- c( 0, 0, 2, 0, 0) +plot(x, y, lwd = 3, frame = FALSE, type = "l") +``` + +--- + +## Example continued + +What is the probability that 75% or fewer of calls get addressed? + +```{r, fig.height = 5, fig.width = 5, echo = FALSE, fig.align='center'} +plot(x, y, lwd = 3, frame = FALSE, type = "l") +polygon(c(0, .75, .75, 0), c(0, 0, 1.5, 0), lwd = 3, col = "lightblue") +``` + +--- +```{r} +1.5 * .75 / 2 +pbeta(.75, 2, 1) +``` +--- + +## CDF and survival function +### Certain areas are so useful, we give them names + +- The **cumulative distribution function** (CDF) of a random variable, $X$, returns the probability that the random variable is less than or equal to the value $x$ +$$ +F(x) = P(X \leq x) +$$ +(This definition applies regardless of whether $X$ is discrete or continuous.) +- The **survival function** of a random variable $X$ is defined as the probability +that the random variable is greater than the value $x$ +$$ +S(x) = P(X > x) +$$ +- Notice that $S(x) = 1 - F(x)$ + +--- + +## Example + +What are the survival function and CDF from the density considered before?
+ +For $1 \geq x \geq 0$ +$$ +F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 +$$ + +$$ +S(x) = 1 - x^2 +$$ + +```{r} +pbeta(c(0.4, 0.5, 0.6), 2, 1) +``` + +--- + +## Quantiles + +You've heard of sample quantiles. If you scored at the 95th percentile on an exam, you know +that 95% of people scored worse than you and 5% scored better. +These are sample quantities. Here we define their population analogs. + + +--- +## Definition + +The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that +$$ +F(x_\alpha) = \alpha +$$ +- A **percentile** is simply a quantile with $\alpha$ expressed as a percent +- The **median** is the $50^{th}$ percentile + +--- +## For example + +The $95^{th}$ percentile of a distribution is the point so that: +- the probability that a random variable drawn from the population is less than it is 95% +- the probability that a random variable drawn from the population is more than it is 5% + +--- +## Example +What is the median of the distribution that we were working with before? +- We want to solve $0.5 = F(x) = x^2$ +- Resulting in the solution +```{r, echo = TRUE} +sqrt(0.5) +``` +- Therefore, on a random day the median proportion of calls answered is about `r sqrt(0.5)`. + +--- +## Example continued +R can approximate quantiles for you for common distributions + +```{r} +qbeta(0.5, 2, 1) +``` + +--- + +## Summary + +- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" +- What we're referring to are **population quantities**. Therefore, the median being + discussed is the **population median**. +- A probability model connects the data to the population using assumptions.
+- Therefore the median we're discussing is the **estimand**; the sample median will be the **estimator** + + + +>>>>>>> devel:06_StatisticalInference/02_Probability/index.Rmd diff --git a/06_StatisticalInference/02_Probability/index.html b/06_StatisticalInference/02_Probability/index.html new file mode 100644 index 000000000..7c3a70fe4 --- /dev/null +++ b/06_StatisticalInference/02_Probability/index.html @@ -0,0 +1,693 @@
diff --git a/06_StatisticalInference/02_Probability/index.md b/06_StatisticalInference/02_Probability/index.md new file mode 100644 index 000000000..aa9afeb16 --- /dev/null +++ b/06_StatisticalInference/02_Probability/index.md @@ -0,0 +1,341 @@ +--- +title : Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Probability + +- In these slides we will cover the basics of probability at a low enough level +to have a basic understanding for the rest of the series +- For a more complete treatment see the class Mathematical Biostatistics Boot Camp 1 + - Youtube: www.youtube.com/playlist?list=PLpl-gQkQivXhk6qSyiNj51qamjAtZISJ- + - Coursera: www.coursera.org/course/biostats + - Git: http://github.com/bcaffo/Caffo-Coursera + + +--- + +## Probability + +Given a random experiment (say rolling a die), a probability measure is a population quantity +that summarizes the randomness. + +Specifically, probability takes a possible outcome from the experiment and: + +- assigns it a number between 0 and 1 +- so that the probability that something occurs is 1 (the die must be rolled) +and +- so that the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) +is the sum of their respective probabilities. + + +The Russian mathematician Kolmogorov formalized these rules.
+ +--- + + +## Rules probability must follow + +- The probability that nothing occurs is 0 +- The probability that something occurs is 1 +- The probability of something is 1 minus the probability that the opposite occurs +- The probability of at least one of + two (or more) things that can not simultaneously occur (mutually exclusive) + is the sum of their + respective probabilities +- If an event A implies the occurrence of event B, then the probability of A +occurring is less than the probability that B occurs +- For any two events the probability that at least one occurs is the sum of their + probabilities minus their intersection. + +--- + +## Example + +The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? + +--- + +## Example continued + +Answer: No, the events can simultaneously occur and so +are not mutually exclusive. To elaborate let: + +--- +## If you want to see the mathematics + +$$ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +$$ + +Then + +$$ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +$$ +Likely, some fraction of the population has both. + +--- +## Going further + +Probability calculus is useful for understanding the rules that probabilities +must follow. + +However, we need ways to model and think about probabilities for +numeric outcomes of experiments (broadly defined). + +Densities and mass functions for random variables are the best starting point for this. 
+ +Remember, everything we're talking about up to this point is a population quantity, +not a statement about what occurs in the data. +- Where we're going with this: use the data to estimate properties of the population. + +--- +## Random variables + +- A **random variable** is a numerical outcome of an experiment. +- The random variables that we study will come in two varieties, + **discrete** or **continuous**. +- Discrete random variables are random variables that take on only a +countable number of possibilities and we talk about the probability that they +take specific values +- Continuous random variables can conceptually take any value on the real line or some subset of the real line and we talk about the probability that they lie within +some range + +--- + +## Examples of variables that can be thought of as random variables + +Experiments that we use for intuition and building context +- The $(0-1)$ outcome of the flip of a coin +- The outcome from the roll of a die + +Specific instances of treating variables as if random +- The web site traffic on a given day +- The BMI of a subject four years after a baseline measurement +- The hypertension status of a subject randomly drawn from a population +- The number of people who click on an ad +- Intelligence quotients for a sample of children + +--- + +## PMF + +A probability mass function evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf, a function $p$ must satisfy + + 1. It must always be larger than or equal to 0. + 2. The probabilities of all the possible values that the random variable can take have to add up to one. + +--- + +## Example + +Let $X$ be the result of a coin flip where $X=0$ represents +tails and $X = 1$ represents heads. +$$ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ +Suppose that we do not know whether or not the coin is fair; let +$\theta$ be the probability of a head expressed as a proportion +(between 0 and 1).
+$$ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ + +--- + +## PDF + +A probability density function (pdf) is a function associated with +a continuous random variable + + *Areas under pdfs correspond to probabilities for that random variable* + +To be a valid pdf, a function must satisfy + +1. It must be larger than or equal to zero everywhere. + +2. The total area under it must be one. + +--- +## Example + +Suppose that the proportion of help calls that get addressed in +a random day by a help line is given by +$$ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 0 < x < 1 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +$$ + +Is this a mathematically valid density? + +--- + + +```r +x <- c(-0.5, 0, 1, 1, 1.5) +y <- c(0, 0, 2, 0, 0) +plot(x, y, lwd = 3, frame = FALSE, type = "l") +``` + +plot of chunk unnamed-chunk-1 + + +--- + +## Example continued + +What is the probability that 75% or fewer of calls get addressed? + +plot of chunk unnamed-chunk-2 + + +--- + +```r +1.5 * 0.75/2 +``` + +``` +## [1] 0.5625 +``` + +```r +pbeta(0.75, 2, 1) +``` + +``` +## [1] 0.5625 +``` + +--- + +## CDF and survival function +### Certain areas are so useful, we give them names + +- The **cumulative distribution function** (CDF) of a random variable, $X$, returns the probability that the random variable is less than or equal to the value $x$ +$$ +F(x) = P(X \leq x) +$$ +(This definition applies regardless of whether $X$ is discrete or continuous.) +- The **survival function** of a random variable $X$ is defined as the probability +that the random variable is greater than the value $x$ +$$ +S(x) = P(X > x) +$$ +- Notice that $S(x) = 1 - F(x)$ + +--- + +## Example + +What are the survival function and CDF from the density considered before?
+ +For $1 \geq x \geq 0$ +$$ +F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 +$$ + +$$ +S(x) = 1 - x^2 +$$ + + +```r +pbeta(c(0.4, 0.5, 0.6), 2, 1) +``` + +``` +## [1] 0.16 0.25 0.36 +``` + + +--- + +## Quantiles + +You've heard of sample quantiles. If you scored at the 95th percentile on an exam, you know +that 95% of people scored worse than you and 5% scored better. +These are sample quantities. Here we define their population analogs. + + +--- +## Definition + +The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that +$$ +F(x_\alpha) = \alpha +$$ +- A **percentile** is simply a quantile with $\alpha$ expressed as a percent +- The **median** is the $50^{th}$ percentile + +--- +## For example + +The $95^{th}$ percentile of a distribution is the point so that: +- the probability that a random variable drawn from the population is less than it is 95% +- the probability that a random variable drawn from the population is more than it is 5% + +--- +## Example +What is the median of the distribution that we were working with before? +- We want to solve $0.5 = F(x) = x^2$ +- Resulting in the solution + +```r +sqrt(0.5) +``` + +``` +## [1] 0.7071 +``` + +- Therefore, on a random day the median proportion of calls answered is about 0.7071. + +--- +## Example continued +R can approximate quantiles for you for common distributions + + +```r +qbeta(0.5, 2, 1) +``` + +``` +## [1] 0.7071 +``` + + +--- + +## Summary + +- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" +- What we're referring to are **population quantities**. Therefore, the median being + discussed is the **population median**. +- A probability model connects the data to the population using assumptions.
+- Therefore, the median we're discussing is the **estimand**; the sample median will be the **estimator**
+
+
 diff --git a/06_StatisticalInference/02_Probability/index.pdf b/06_StatisticalInference/02_Probability/index.pdf new file mode 100644 index 000000000..105568760 Binary files /dev/null and b/06_StatisticalInference/02_Probability/index.pdf differ diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf b/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf deleted file mode 100644 index a15898d7a..000000000 Binary files a/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/03_02_HypothesisTesting/index.pdf b/06_StatisticalInference/03_02_HypothesisTesting/index.pdf deleted file mode 100644 index 9e5f2ae42..000000000 Binary files a/06_StatisticalInference/03_02_HypothesisTesting/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/03_03_pValues/index.pdf b/06_StatisticalInference/03_03_pValues/index.pdf deleted file mode 100644 index 85d8bd9d1..000000000 Binary files a/06_StatisticalInference/03_03_pValues/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/03_04_Power/index.pdf b/06_StatisticalInference/03_04_Power/index.pdf deleted file mode 100644 index a5daf8abd..000000000 Binary files a/06_StatisticalInference/03_04_Power/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.pdf b/06_StatisticalInference/03_05_MultipleTesting/index.pdf deleted file mode 100644 index 190c24c34..000000000 Binary files a/06_StatisticalInference/03_05_MultipleTesting/index.pdf and /dev/null differ diff --git a/06_StatisticalInference/03_06_resampledInference/index.pdf b/06_StatisticalInference/03_06_resampledInference/index.pdf deleted file mode 100644 index df08fe3df..000000000 Binary files a/06_StatisticalInference/03_06_resampledInference/index.pdf and /dev/null differ diff --git
a/06_StatisticalInference/03_ConditionalProbability/index.Rmd b/06_StatisticalInference/03_ConditionalProbability/index.Rmd new file mode 100644 index 000000000..c33fdafa5 --- /dev/null +++ b/06_StatisticalInference/03_ConditionalProbability/index.Rmd @@ -0,0 +1,221 @@
+---
+title : Conditional Probability
+subtitle : Statistical Inference
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+  lib: ../../librariesNew
+  assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+
+## Conditional probability, motivation
+
+- The probability of getting a one when rolling a (standard) die
+  is usually assumed to be one sixth
+- Suppose you were given the extra information that the die roll
+  was an odd number (hence 1, 3 or 5)
+- *conditional on this new information*, the probability of a
+  one is now one third
+
+---
+
+## Conditional probability, definition
+
+- Let $B$ be an event so that $P(B) > 0$
+- Then the conditional probability of an event $A$ given that $B$ has occurred is
+  $$
+  P(A ~|~ B) = \frac{P(A \cap B)}{P(B)}
+  $$
+- Notice that if $A$ and $B$ are independent (defined later in the lecture), then
+  $$
+  P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A)
+  $$
+
+---
+
+## Example
+
+- Consider our die roll example
+- $B = \{1, 3, 5\}$
+- $A = \{1\}$
+$$
+  \begin{eqnarray*}
+P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\
+  & = & \frac{P(A \cap B)}{P(B)} \\ \\
+  & = & \frac{P(A)}{P(B)} \\ \\
+  & = & \frac{1/6}{3/6} = \frac{1}{3}
+  \end{eqnarray*}
+$$
+
+
+
+---
+
+## Bayes' rule
+Bayes' rule allows us to reverse the conditioning set provided
+that we know some marginal probabilities
+$$
+P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~
B^c)P(B^c)}.
+$$
+
+
+---
+
+## Diagnostic tests
+
+- Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative respectively
+- Let $D$ and $D^c$ be the event that the subject of the test has or does not have the disease respectively
+- The **sensitivity** is the probability that the test is positive given that the subject actually has the disease, $P(+ ~|~ D)$
+- The **specificity** is the probability that the test is negative given that the subject does not have the disease, $P(- ~|~ D^c)$
+
+---
+
+## More definitions
+
+- The **positive predictive value** is the probability that the subject has the disease given that the test is positive, $P(D ~|~ +)$
+- The **negative predictive value** is the probability that the subject does not have the disease given that the test is negative, $P(D^c ~|~ -)$
+- The **prevalence of the disease** is the marginal probability of disease, $P(D)$
+
+---
+
+## More definitions
+
+- The **diagnostic likelihood ratio of a positive test**, labeled $DLR_+$, is $P(+ ~|~ D) / P(+ ~|~ D^c)$, which is the $$sensitivity / (1 - specificity)$$
+- The **diagnostic likelihood ratio of a negative test**, labeled $DLR_-$, is $P(- ~|~ D) / P(- ~|~ D^c)$, which is the $$(1 - sensitivity) / specificity$$
+
+---
+
+## Example
+
+- A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
+- Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
+- Mathematically, we want $P(D ~|~ +)$ given the sensitivity, $P(+ ~|~ D) = .997$, the specificity, $P(- ~|~ D^c) = .985$, and the prevalence $P(D) = .001$
+
+---
+
+## Using Bayes' formula
+
+$$
+\begin{eqnarray*}
+  P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\
+ & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\
+ & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\
+ & = & .062
+\end{eqnarray*}
+$$
+
+- In this population a positive test result only suggests a 6% probability that the subject has the disease
+- (The positive predictive value is 6% for this test)
+
+---
+
+## More on this example
+
+- The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
+- Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner
+- Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
+
+---
+
+## Likelihood ratios
+
+- Using Bayes' rule, we have
+  $$
+  P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}
+  $$
+  and
+  $$
+  P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}.
+  $$
+
+---
+
+## Likelihood ratios
+
+- Therefore
+$$
+\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)}
+$$
+i.e.
+$$
+\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D
+$$
+- Similarly, $DLR_-$ relates the decrease in the odds of the
+  disease after a negative test result to the odds of disease prior to
+  the test.
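+
+---
+
+## Bayes' formula in R
+
+The positive predictive value and $DLR_+$ computations above can be sketched in R. (This block is an illustration added in this revision; the helper name `ppv` and its arguments are not part of the original slides.)
+
+```r
+# Positive predictive value from sensitivity, specificity and prevalence,
+# via Bayes' rule: P(D | +) = se * p / (se * p + (1 - sp) * (1 - p))
+ppv <- function(sens, spec, prev) {
+  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
+}
+ppv(.997, .985, .001)  # approximately 0.062, matching the slide
+
+# Diagnostic likelihood ratio of a positive test: sensitivity / (1 - specificity)
+.997 / (1 - .985)      # approximately 66
+```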
+
+---
+
+## HIV example revisited
+
+- Suppose a subject has a positive HIV test
+- $DLR_+ = .997 / (1 - .985) \approx 66$
+- The result of the positive test is that the odds of disease are now 66 times the pretest odds
+- Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
+
+---
+
+## HIV example revisited
+
+- Suppose that a subject has a negative test result
+- $DLR_- = (1 - .997) / .985 \approx .003$
+- Therefore, the post-test odds of disease are now $.3\%$ of the pretest odds given the negative test.
+- Or, the hypothesis of disease is supported $.003$ times that of the hypothesis of absence of disease given the negative test result
+
+---
+
+## Independence
+
+- Two events $A$ and $B$ are **independent** if $$P(A \cap B) = P(A)P(B)$$
+- Equivalently if $P(A ~|~ B) = P(A)$
+- Two random variables, $X$ and $Y$ are independent if for any two sets $A$ and $B$ $$P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)$$
+- If $A$ is independent of $B$ then
+  - $A^c$ is independent of $B$
+  - $A$ is independent of $B^c$
+  - $A^c$ is independent of $B^c$
+
+
+---
+
+## Example
+
+- What is the probability of getting two consecutive heads?
+- $A = \{\mbox{Head on flip 1}\}$ ~ $P(A) = .5$
+- $B = \{\mbox{Head on flip 2}\}$ ~ $P(B) = .5$
+- $A \cap B = \{\mbox{Head on flips 1 and 2}\}$
+- $P(A \cap B) = P(A)P(B) = .5 \times .5 = .25$
+
+---
+
+## Example
+
+- Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
+- Based on an estimated prevalence of sudden infant death syndrome of $1$ out of $8,543$, the physician testified that the probability of a mother having two children with SIDS was $\left(\frac{1}{8,543}\right)^2$
+- The mother on trial was convicted of murder
+
+---
+
+## Example: continued
+
+- Relevant to this discussion, the principal mistake was to *assume* that the events of having SIDS within a family are independent
+- That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1)P(A_2)$
+- Biological processes that have a believed genetic or familial environmental component, of course, tend to be dependent within families
+- (There are many other statistical points of discussion for this case.)
+
+
+---
+## IID random variables
+
+- Random variables are said to be iid if they are independent and identically distributed
+  - Independent: statistically unrelated to one another
+  - Identically distributed: all having been drawn from the same population distribution
+- iid random variables are the default model for random samples
+- Many of the important theories of statistics are founded on assuming that variables are iid
+- Assuming a random sample and iid will be the default starting point of inference for this class
+
diff --git a/06_StatisticalInference/03_ConditionalProbability/index.html b/06_StatisticalInference/03_ConditionalProbability/index.html new file mode 100644 index 000000000..524f79bf5 --- /dev/null +++ b/06_StatisticalInference/03_ConditionalProbability/index.html @@ -0,0 +1,534 @@ + + + + Conditional Probability + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Conditional Probability

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Conditional probability, motivation

+
+
+
    +
  • The probability of getting a one when rolling a (standard) die +is usually assumed to be one sixth
  • +
  • Suppose you were given the extra information that the die roll +was an odd number (hence 1, 3 or 5)
  • +
  • conditional on this new information, the probability of a +one is now one third
  • +
+ +
+ +
+ + +
+

Conditional probability, definition

+
+
+
    +
  • Let \(B\) be an event so that \(P(B) > 0\)
  • +
  • Then the conditional probability of an event \(A\) given that \(B\) has occurred is +\[ +P(A ~|~ B) = \frac{P(A \cap B)}{P(B)} +\]
  • +
  • Notice that if \(A\) and \(B\) are independent (defined later in the lecture), then +\[ +P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A) +\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Consider our die roll example
  • +
  • \(B = \{1, 3, 5\}\)
  • +
  • \(A = \{1\}\) +\[ +\begin{eqnarray*} +P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\ +& = & \frac{P(A \cap B)}{P(B)} \\ \\ +& = & \frac{P(A)}{P(B)} \\ \\ +& = & \frac{1/6}{3/6} = \frac{1}{3} +\end{eqnarray*} +\]
  • +
+ +
+ +
+ + +
+

Bayes' rule

+
+
+

Bayes' rule allows us to reverse the conditioning set provided +that we know some marginal probabilities +\[ +P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}. +\]

+ +
+ +
+ + +
+

Diagnostic tests

+
+
+
    +
  • Let \(+\) and \(-\) be the events that the result of a diagnostic test is positive or negative respectively
  • +
  • Let \(D\) and \(D^c\) be the event that the subject of the test has or does not have the disease respectively
  • +
  • The sensitivity is the probability that the test is positive given that the subject actually has the disease, \(P(+ ~|~ D)\)
  • +
  • The specificity is the probability that the test is negative given that the subject does not have the disease, \(P(- ~|~ D^c)\)
  • +
+ +
+ +
+ + +
+

More definitions

+
+
+
    +
  • The positive predictive value is the probability that the subject has the disease given that the test is positive, \(P(D ~|~ +)\)
  • +
  • The negative predictive value is the probability that the subject does not have the disease given that the test is negative, \(P(D^c ~|~ -)\)
  • +
  • The prevalence of the disease is the marginal probability of disease, \(P(D)\)
  • +
+ +
+ +
+ + +
+

More definitions

+
+
+
    +
  • The diagnostic likelihood ratio of a positive test, labeled \(DLR_+\), is \(P(+ ~|~ D) / P(+ ~|~ D^c)\), which is the \[sensitivity / (1 - specificity)\]
  • +
  • The diagnostic likelihood ratio of a negative test, labeled \(DLR_-\), is \(P(- ~|~ D) / P(- ~|~ D^c)\), which is the \[(1 - sensitivity) / specificity\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
  • +
  • Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
  • +
  • Mathematically, we want \(P(D ~|~ +)\) given the sensitivity, \(P(+ ~|~ D) = .997\), the specificity, \(P(- ~|~ D^c) =.985\), and the prevalence \(P(D) = .001\)
  • +
+ +
+ +
+ + +
+

Using Bayes' formula

+
+
+

\[ +\begin{eqnarray*} + P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\ + & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\ + & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\ + & = & .062 +\end{eqnarray*} +\]

+ +
    +
  • In this population a positive test result only suggests a 6% probability that the subject has the disease
  • +
  • (The positive predictive value is 6% for this test)
  • +
+ +
+ +
+ + +
+

More on this example

+
+
+
    +
  • The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
  • +
  • Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner
  • +
  • Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
  • +
+ +
+ +
+ + +
+

Likelihood ratios

+
+
+
    +
  • Using Bayes' rule, we have +\[ +P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)} +\] +and +\[ +P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}. +\]
  • +
+ +
+ +
+ + +
+

Likelihood ratios

+
+
+
    +
  • Therefore +\[ +\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)} +\] +i.e. +\[ +\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D +\]
  • +
  • Similarly, \(DLR_-\) relates the decrease in the odds of the +disease after a negative test result to the odds of disease prior to +the test.
  • +
+ +
+ +
+ + +
+

HIV example revisited

+
+
+
    +
  • Suppose a subject has a positive HIV test
  • +
  • \(DLR_+ = .997 / (1 - .985) \approx 66\)
  • +
  • The result of the positive test is that the odds of disease are now 66 times the pretest odds
  • +
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
  • +
+ +
+ +
+ + +
+

HIV example revisited

+
+
+
    +
  • Suppose that a subject has a negative test result
  • +
  • \(DLR_- = (1 - .997) / .985 \approx .003\)
  • +
  • Therefore, the post-test odds of disease are now \(.3\%\) of the pretest odds given the negative test.
  • +
  • Or, the hypothesis of disease is supported \(.003\) times that of the hypothesis of absence of disease given the negative test result
  • +
+ +
+ +
+ + +
+

Independence

+
+
+
    +
  • Two events \(A\) and \(B\) are independent if \[P(A \cap B) = P(A)P(B)\]
  • +
  • Equivalently if \(P(A ~|~ B) = P(A)\)
  • +
  • Two random variables, \(X\) and \(Y\) are independent if for any two sets \(A\) and \(B\) \[P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)\]
  • +
  • If \(A\) is independent of \(B\) then + +
      +
    • \(A^c\) is independent of \(B\)
    • +
    • \(A\) is independent of \(B^c\)
    • +
    • \(A^c\) is independent of \(B^c\)
    • +
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • What is the probability of getting two consecutive heads?
  • +
  • \(A = \{\mbox{Head on flip 1}\}\) ~ \(P(A) = .5\)
  • +
  • \(B = \{\mbox{Head on flip 2}\}\) ~ \(P(B) = .5\)
  • +
  • \(A \cap B = \{\mbox{Head on flips 1 and 2}\}\)
  • +
  • \(P(A \cap B) = P(A)P(B) = .5 \times .5 = .25\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
  • +
  • Based on an estimated prevalence of sudden infant death syndrome of \(1\) out of \(8,543\), the physician testified that the probability of a mother having two children with SIDS was \(\left(\frac{1}{8,543}\right)^2\)
  • +
  • The mother on trial was convicted of murder
  • +
+ +
+ +
+ + +
+

Example: continued

+
+
+
    +
  • Relevant to this discussion, the principal mistake was to assume that the events of having SIDS within a family are independent
  • +
  • That is, \(P(A_1 \cap A_2)\) is not necessarily equal to \(P(A_1)P(A_2)\)
  • +
  • Biological processes that have a believed genetic or familial environmental component, of course, tend to be dependent within families
  • +
  • (There are many other statistical points of discussion for this case.)
  • +
+ +
+ +
+ + +
+

IID random variables

+
+
+
    +
  • Random variables are said to be iid if they are independent and identically distributed + +
      +
    • Independent: statistically unrelated to one another
    • +
    • Identically distributed: all having been drawn from the same population distribution
    • +
  • +
  • iid random variables are the default model for random samples
  • +
  • Many of the important theories of statistics are founded on assuming that variables are iid
  • +
  • Assuming a random sample and iid will be the default starting point of inference for this class
  • +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_ConditionalProbability/index.md b/06_StatisticalInference/03_ConditionalProbability/index.md new file mode 100644 index 000000000..c33fdafa5 --- /dev/null +++ b/06_StatisticalInference/03_ConditionalProbability/index.md @@ -0,0 +1,221 @@
+---
+title : Conditional Probability
+subtitle : Statistical Inference
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+  lib: ../../librariesNew
+  assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+
+## Conditional probability, motivation
+
+- The probability of getting a one when rolling a (standard) die
+  is usually assumed to be one sixth
+- Suppose you were given the extra information that the die roll
+  was an odd number (hence 1, 3 or 5)
+- *conditional on this new information*, the probability of a
+  one is now one third
+
+---
+
+## Conditional probability, definition
+
+- Let $B$ be an event so that $P(B) > 0$
+- Then the conditional probability of an event $A$ given that $B$ has occurred is
+  $$
+  P(A ~|~ B) = \frac{P(A \cap B)}{P(B)}
+  $$
+- Notice that if $A$ and $B$ are independent (defined later in the lecture), then
+  $$
+  P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A)
+  $$
+
+---
+
+## Example
+
+- Consider our die roll example
+- $B = \{1, 3, 5\}$
+- $A = \{1\}$
+$$
+  \begin{eqnarray*}
+P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\
+  & = & \frac{P(A \cap B)}{P(B)} \\ \\
+  & = & \frac{P(A)}{P(B)} \\ \\
+  & = & \frac{1/6}{3/6} = \frac{1}{3}
+  \end{eqnarray*}
+$$
+
+
+
+---
+
+## Bayes' rule
+Bayes' rule allows us to reverse the conditioning set provided
+that we know some marginal probabilities
+$$
+P(B
~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}.
+$$
+
+
+---
+
+## Diagnostic tests
+
+- Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative respectively
+- Let $D$ and $D^c$ be the event that the subject of the test has or does not have the disease respectively
+- The **sensitivity** is the probability that the test is positive given that the subject actually has the disease, $P(+ ~|~ D)$
+- The **specificity** is the probability that the test is negative given that the subject does not have the disease, $P(- ~|~ D^c)$
+
+---
+
+## More definitions
+
+- The **positive predictive value** is the probability that the subject has the disease given that the test is positive, $P(D ~|~ +)$
+- The **negative predictive value** is the probability that the subject does not have the disease given that the test is negative, $P(D^c ~|~ -)$
+- The **prevalence of the disease** is the marginal probability of disease, $P(D)$
+
+---
+
+## More definitions
+
+- The **diagnostic likelihood ratio of a positive test**, labeled $DLR_+$, is $P(+ ~|~ D) / P(+ ~|~ D^c)$, which is the $$sensitivity / (1 - specificity)$$
+- The **diagnostic likelihood ratio of a negative test**, labeled $DLR_-$, is $P(- ~|~ D) / P(- ~|~ D^c)$, which is the $$(1 - sensitivity) / specificity$$
+
+---
+
+## Example
+
+- A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
+- Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
+- Mathematically, we want $P(D ~|~ +)$ given the sensitivity, $P(+ ~|~ D) = .997$, the specificity, $P(- ~|~ D^c) = .985$, and the prevalence $P(D) = .001$
+
+---
+
+## Using Bayes' formula
+
+$$
+\begin{eqnarray*}
+  P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\
+ & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\
+ & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\
+ & = & .062
+\end{eqnarray*}
+$$
+
+- In this population a positive test result only suggests a 6% probability that the subject has the disease
+- (The positive predictive value is 6% for this test)
+
+---
+
+## More on this example
+
+- The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
+- Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner
+- Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
+
+---
+
+## Likelihood ratios
+
+- Using Bayes' rule, we have
+  $$
+  P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}
+  $$
+  and
+  $$
+  P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}.
+  $$
+
+---
+
+## Likelihood ratios
+
+- Therefore
+$$
+\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)}
+$$
+i.e.
+$$
+\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D
+$$
+- Similarly, $DLR_-$ relates the decrease in the odds of the
+  disease after a negative test result to the odds of disease prior to
+  the test.
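+
+---
+
+## Checking the PPV by simulation
+
+The Bayes computation above can also be checked by Monte Carlo. (This sketch is an illustration added in this revision, not part of the original slides; the seed and sample size are arbitrary.)
+
+```r
+set.seed(42)
+n <- 1e6
+disease <- rbinom(n, 1, .001)           # prevalence of .001
+test <- ifelse(disease == 1,
+               rbinom(n, 1, .997),      # sensitivity: P(+ | D)
+               rbinom(n, 1, 1 - .985))  # 1 - specificity: P(+ | D^c)
+mean(disease[test == 1])                # close to the .062 from Bayes' formula
+```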
+
+---
+
+## HIV example revisited
+
+- Suppose a subject has a positive HIV test
+- $DLR_+ = .997 / (1 - .985) \approx 66$
+- The result of the positive test is that the odds of disease are now 66 times the pretest odds
+- Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
+
+---
+
+## HIV example revisited
+
+- Suppose that a subject has a negative test result
+- $DLR_- = (1 - .997) / .985 \approx .003$
+- Therefore, the post-test odds of disease are now $.3\%$ of the pretest odds given the negative test.
+- Or, the hypothesis of disease is supported $.003$ times that of the hypothesis of absence of disease given the negative test result
+
+---
+
+## Independence
+
+- Two events $A$ and $B$ are **independent** if $$P(A \cap B) = P(A)P(B)$$
+- Equivalently if $P(A ~|~ B) = P(A)$
+- Two random variables, $X$ and $Y$ are independent if for any two sets $A$ and $B$ $$P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)$$
+- If $A$ is independent of $B$ then
+  - $A^c$ is independent of $B$
+  - $A$ is independent of $B^c$
+  - $A^c$ is independent of $B^c$
+
+
+---
+
+## Example
+
+- What is the probability of getting two consecutive heads?
+- $A = \{\mbox{Head on flip 1}\}$ ~ $P(A) = .5$
+- $B = \{\mbox{Head on flip 2}\}$ ~ $P(B) = .5$
+- $A \cap B = \{\mbox{Head on flips 1 and 2}\}$
+- $P(A \cap B) = P(A)P(B) = .5 \times .5 = .25$
+
+---
+
+## Example
+
+- Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
+- Based on an estimated prevalence of sudden infant death syndrome of $1$ out of $8,543$, the physician testified that the probability of a mother having two children with SIDS was $\left(\frac{1}{8,543}\right)^2$
+- The mother on trial was convicted of murder
+
+---
+
+## Example: continued
+
+- Relevant to this discussion, the principal mistake was to *assume* that the events of having SIDS within a family are independent
+- That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1)P(A_2)$
+- Biological processes that have a believed genetic or familial environmental component, of course, tend to be dependent within families
+- (There are many other statistical points of discussion for this case.)
+
+
+---
+## IID random variables
+
+- Random variables are said to be iid if they are independent and identically distributed
+  - Independent: statistically unrelated to one another
+  - Identically distributed: all having been drawn from the same population distribution
+- iid random variables are the default model for random samples
+- Many of the important theories of statistics are founded on assuming that variables are iid
+- Assuming a random sample and iid will be the default starting point of inference for this class
+
diff --git a/06_StatisticalInference/03_ConditionalProbability/index.pdf b/06_StatisticalInference/03_ConditionalProbability/index.pdf new file mode 100644 index 000000000..b91f495a9 Binary files /dev/null and b/06_StatisticalInference/03_ConditionalProbability/index.pdf differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/galton.png b/06_StatisticalInference/04_Expectations/assets/fig/galton.png new file mode 100644 index 000000000..19abb675a Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/galton.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/lsm.png b/06_StatisticalInference/04_Expectations/assets/fig/lsm.png new file mode 100644 index 000000000..9a33fef15 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/lsm.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..c8c6209b8 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-11.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-11.png new file mode 100644 index 000000000..bdfdd22df Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-11.png
differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-12.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-12.png new file mode 100644 index 000000000..67d844343 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-12.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..480b45d70 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..904f824bf Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-4.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-4.png new file mode 100644 index 000000000..6cb7c0dcc Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-4.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-5.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-5.png new file mode 100644 index 000000000..32e3f0c9b Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-5.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-6.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-6.png new file mode 100644 index 000000000..f574e8155 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-6.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-7.png 
b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-7.png new file mode 100644 index 000000000..7c2834b0d Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-7.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-8.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-8.png new file mode 100644 index 000000000..60e61f6e8 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-8.png differ diff --git a/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-9.png b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-9.png new file mode 100644 index 000000000..0dd06658f Binary files /dev/null and b/06_StatisticalInference/04_Expectations/assets/fig/unnamed-chunk-9.png differ diff --git a/06_StatisticalInference/01_03_Expectations/figure/galton.png b/06_StatisticalInference/04_Expectations/figure/galton.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/galton.png rename to 06_StatisticalInference/04_Expectations/figure/galton.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/lsm.png b/06_StatisticalInference/04_Expectations/figure/lsm.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/lsm.png rename to 06_StatisticalInference/04_Expectations/figure/lsm.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-1.png b/06_StatisticalInference/04_Expectations/figure/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-1.png rename to 06_StatisticalInference/04_Expectations/figure/unnamed-chunk-1.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-2.png b/06_StatisticalInference/04_Expectations/figure/unnamed-chunk-2.png similarity index 100% rename from 
06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-2.png rename to 06_StatisticalInference/04_Expectations/figure/unnamed-chunk-2.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-3.png b/06_StatisticalInference/04_Expectations/figure/unnamed-chunk-3.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-3.png rename to 06_StatisticalInference/04_Expectations/figure/unnamed-chunk-3.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-31.png b/06_StatisticalInference/04_Expectations/figure/unnamed-chunk-31.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-31.png rename to 06_StatisticalInference/04_Expectations/figure/unnamed-chunk-31.png diff --git a/06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-32.png b/06_StatisticalInference/04_Expectations/figure/unnamed-chunk-32.png similarity index 100% rename from 06_StatisticalInference/01_03_Expectations/figure/unnamed-chunk-32.png rename to 06_StatisticalInference/04_Expectations/figure/unnamed-chunk-32.png diff --git a/06_StatisticalInference/04_Expectations/index.Rmd b/06_StatisticalInference/04_Expectations/index.Rmd new file mode 100644 index 000000000..40a2f598b --- /dev/null +++ b/06_StatisticalInference/04_Expectations/index.Rmd @@ -0,0 +1,226 @@ +--- +title : Expected values +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Expected values +- Expected values are useful for characterizing a distribution +- The
mean is a characterization of its center +- The variance and standard deviation are characterizations of +how spread out it is +- Our sample expected values (the sample mean and variance) will +estimate the population versions + + +--- +## The population mean +- The **expected value** or **mean** of a random variable is the center of its distribution +- For discrete random variable $X$ with PMF $p(x)$, it is defined as follows + $$ + E[X] = \sum_x xp(x). + $$ + where the sum is taken over the possible values of $x$ +- $E[X]$ represents the center of mass of a collection of locations and weights, $\{x, p(x)\}$ + +--- +## The sample mean +- The sample mean estimates this population mean +- The center of mass of the data is the empirical mean +$$ +\bar X = \sum_{i=1}^n x_i p(x_i) +$$ +where $p(x_i) = 1/n$ + +--- + +## Example +### Find the center of mass of the bars +```{r galton, fig.height=6,fig.width=12, fig.align='center', echo = FALSE, message =FALSE, warning=FALSE} +library(UsingR); data(galton); library(ggplot2) +library(reshape2) +longGalton <- melt(galton, measure.vars = c("child", "parent")) +g <- ggplot(longGalton, aes(x = value)) + geom_histogram(aes(y = ..density.., fill = variable), binwidth=1, colour = "black") + geom_density(size = 2) +g <- g + facet_grid(. 
~ variable) +g +``` + +--- +## Using manipulate +``` +library(manipulate) +myHist <- function(mu){ + g <- ggplot(galton, aes(x = child)) + g <- g + geom_histogram(fill = "salmon", + binwidth=1, aes(y = ..density..), colour = "black") + g <- g + geom_density(size = 2) + g <- g + geom_vline(xintercept = mu, size = 2) + mse <- round(mean((galton$child - mu)^2), 3) + g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse)) + g +} +manipulate(myHist(mu), mu = slider(62, 74, step = 0.5)) +``` + +--- +## The center of mass is the empirical mean +```{r lsm, dependson="galton",fig.height=7,fig.width=7, fig.align='center', echo = FALSE} + g <- ggplot(galton, aes(x = child)) + g <- g + geom_histogram(fill = "salmon", + binwidth=1, aes(y = ..density..), colour = "black") + g <- g + geom_density(size = 2) + g <- g + geom_vline(xintercept = mean(galton$child), size = 2) + g +``` + + +--- +## Example of a population mean + +- Suppose a coin is flipped and $X$ is declared $0$ or $1$ corresponding to a head or a tail, respectively +- What is the expected value of $X$? + $$ + E[X] = .5 \times 0 + .5 \times 1 = .5 + $$ +- Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be $.5$ + +```{r, echo = FALSE, fig.height=4, fig.width = 6, fig.align='center'} +ggplot(data.frame(x = factor(0 : 1), y = c(.5, .5)), aes(x = x, y = y)) + geom_bar(stat = "identity", colour = 'black', fill = "lightblue") +``` + +--- +## What about a biased coin? + +- Suppose that a random variable, $X$, is so that +$P(X=1) = p$ and $P(X=0) = (1 - p)$ +- (This is a biased coin when $p\neq 0.5$) +- What is its expected value? +$$ +E[X] = 0 * (1 - p) + 1 * p = p +$$ + +--- + +## Example + +- Suppose that a die is rolled and $X$ is the number face up +- What is the expected value of $X$? 
+ $$ + E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 + $$ +- Again, the geometric argument makes this answer obvious without calculation. + +```{r, fig.align='center', echo=FALSE, fig.height=4, fig.width=10} +ggplot(data.frame(x = factor(1 : 6), y = rep(1/6, 6)), aes(x = x, y = y)) + geom_bar(stat = "identity", colour = 'black', fill = "lightblue") +``` + +--- + +## Continuous random variables + +- For a continuous random variable, $X$, with density, $f$, the expected value is again exactly the center of mass of the density + + +--- + +## Example + +- Consider a density where $f(x) = 1$ for $x$ between zero and one +- (Is this a valid density?) +- Suppose that $X$ follows this density; what is its expected value? +```{r, fig.height=6, fig.width=6, echo=FALSE, fig.align='center'} +g <- ggplot(data.frame(x = c(-0.25, 0, 0, 1, 1, 1.25), + y = c(0, 0, 1, 1, 0, 0)), + aes(x = x, y = y)) +g <- g + geom_line(size = 2, colour = "black") +g <- g + labs(title = "Uniform density") +g + +``` + +--- + +## Facts about expected values + +- Recall that expected values are properties of distributions +- Note the average of random variables is itself a random variable +and its associated distribution has an expected value +- The center of this distribution is the same as that of the original distribution +- Therefore, the expected value of the **sample mean** is the population mean that it's trying to estimate +- When the expected value of an estimator is what it's trying to estimate, we say that the estimator is **unbiased** +- Let's try a simulation experiment + +--- +## Simulation experiment +Simulating normals with mean 0 and variance 1 versus averages +of 10 normals from the same population + +```{r, fig.height=6, fig.width=6, fig.align='center', echo = FALSE} +library(ggplot2) +nosim <- 10000; n <- 10 +dat <- data.frame( + x = c(rnorm(nosim), apply(matrix(rnorm(nosim *
n), nosim), 1, mean)), + what = factor(rep(c("Obs", "Mean"), c(nosim, nosim))) + ) +ggplot(dat, aes(x = x, fill = what)) + geom_density(size = 2, alpha = .2); + +``` + +--- +## Averages of x die rolls + +```{r, fig.align='center',fig.height=5, fig.width=10, echo = FALSE, warning=FALSE, error=FALSE, message=FALSE} +dat <- data.frame( + x = c(sample(1 : 6, nosim, replace = TRUE), + apply(matrix(sample(1 : 6, nosim * 2, replace = TRUE), + nosim), 1, mean), + apply(matrix(sample(1 : 6, nosim * 3, replace = TRUE), + nosim), 1, mean), + apply(matrix(sample(1 : 6, nosim * 4, replace = TRUE), + nosim), 1, mean) + ), + size = factor(rep(1 : 4, rep(nosim, 4)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.25, colour = "black") +g + facet_grid(. ~ size) +``` + + +--- +## Averages of x coin flips +```{r, fig.align='center',fig.height=5, fig.width=10, echo = FALSE, warning=FALSE, error=FALSE, message=FALSE} +dat <- data.frame( + x = c(sample(0 : 1, nosim, replace = TRUE), + apply(matrix(sample(0 : 1, nosim * 10, replace = TRUE), + nosim), 1, mean), + apply(matrix(sample(0 : 1, nosim * 20, replace = TRUE), + nosim), 1, mean), + apply(matrix(sample(0 : 1, nosim * 30, replace = TRUE), + nosim), 1, mean) + ), + size = factor(rep(c(1, 10, 20, 30), rep(nosim, 4)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = 1 / 12, colour = "black"); +g + facet_grid(. 
~ size) +``` + +--- +## Sumarizing what we know +- Expected values are properties of distributions +- The population mean is the center of mass of population +- The sample mean is the center of mass of the observed data +- The sample mean is an estimate of the population mean +- The sample mean is unbiased + - The population mean of its distribution is the mean that it's + trying to estimate +- The more data that goes into the sample mean, the more +concentrated its density / mass function is around the population mean diff --git a/06_StatisticalInference/04_Expectations/index.html b/06_StatisticalInference/04_Expectations/index.html new file mode 100644 index 000000000..174080640 --- /dev/null +++ b/06_StatisticalInference/04_Expectations/index.html @@ -0,0 +1,446 @@ + + + + Expected values + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Expected values

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Expected values

+
+
+
    +
  • Expected values are useful for characterizing a distribution
  • +
  • The mean is a characterization of its center
  • +
  • The variance and standard deviation are characterizations of +how spread out it is
  • +
  • Our sample expected values (the sample mean and variance) will +estimate the population versions
  • +
+ +
+ +
+ + +
+

The population mean

+
+
+
    +
  • The expected value or mean of a random variable is the center of its distribution
  • +
  • For discrete random variable \(X\) with PMF \(p(x)\), it is defined as follows +\[ +E[X] = \sum_x xp(x). +\] +where the sum is taken over the possible values of \(x\)
  • +
  • \(E[X]\) represents the center of mass of a collection of locations and weights, \(\{x, p(x)\}\)
  • +
+ +
+ +
+ + +
+

The sample mean

+
+
+
    +
  • The sample mean estimates this population mean
  • +
  • The center of mass of the data is the empirical mean +\[ +\bar X = \sum_{i=1}^n x_i p(x_i) +\] +where \(p(x_i) = 1/n\)
  • +
+ +
+ +
+ + +
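The weighted-sum view of the sample mean can be checked directly in base R; a minimal sketch with made-up data (not part of the slides):

```r
# Sample mean as a center of mass: each observation gets weight 1/n
x <- c(61, 64, 66, 69, 70)            # made-up heights, for illustration only
w <- rep(1 / length(x), length(x))    # equal weights p(x_i) = 1/n
sum(x * w)                            # weighted sum: 66
mean(x)                               # the built-in sample mean agrees: 66
```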
+

Example

+
+
+

Find the center of mass of the bars

+ +

plot of chunk galton

+ +
+ +
+ + +
+

Using manipulate

+
+
+
library(manipulate)
+myHist <- function(mu){
+    g <- ggplot(galton, aes(x = child))
+    g <- g + geom_histogram(fill = "salmon", 
+      binwidth=1, aes(y = ..density..), colour = "black")
+    g <- g + geom_density(size = 2)
+    g <- g + geom_vline(xintercept = mu, size = 2)
+    mse <- round(mean((galton$child - mu)^2), 3)  
+    g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse))
+    g
+}
+manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
+
+ +
+ +
+ + +
+

The center of mass is the empirical mean

+
+
+

plot of chunk lsm

+ +
+ +
+ + +
+

Example of a population mean

+
+
+
    +
  • Suppose a coin is flipped and \(X\) is declared \(0\) or \(1\) corresponding to a head or a tail, respectively
  • +
  • What is the expected value of \(X\)? +\[ +E[X] = .5 \times 0 + .5 \times 1 = .5 +\]
  • +
  • Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be \(.5\)
  • +
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
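The center-of-mass calculation for the fair coin is a one-liner in R; a small sketch (not slide code):

```r
# E[X] for a fair coin: weights 1/2 at the locations 0 and 1
x <- c(0, 1)
p <- c(0.5, 0.5)
sum(x * p)   # 0.5, the center of mass of two equal weights
```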
+

What about a biased coin?

+
+
+
    +
  • Suppose that a random variable, \(X\), is such that +\(P(X=1) = p\) and \(P(X=0) = (1 - p)\)
  • +
  • (This is a biased coin when \(p\neq 0.5\))
  • +
  • What is its expected value? +\[ +E[X] = 0 * (1 - p) + 1 * p = p +\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose that a die is rolled and \(X\) is the number face up
  • +
  • What is the expected value of \(X\)? +\[ +E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + +3 \times \frac{1}{6} + 4 \times \frac{1}{6} + +5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 +\]
  • +
  • Again, the geometric argument makes this answer obvious without calculation.
  • +
+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Continuous random variables

+
+
+
    +
  • For a continuous random variable, \(X\), with density, \(f\), the expected value is again exactly the center of mass of the density
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Consider a density where \(f(x) = 1\) for \(x\) between zero and one
  • +
  • (Is this a valid density?)
  • +
  • Suppose that \(X\) follows this density; what is its expected value?
    +plot of chunk unnamed-chunk-3
  • +
+ +
+ +
+ + +
+

Facts about expected values

+
+
+
    +
  • Recall that expected values are properties of distributions
  • +
  • Note the average of random variables is itself a random variable +and its associated distribution has an expected value
  • +
  • The center of this distribution is the same as that of the original distribution
  • +
  • Therefore, the expected value of the sample mean is the population mean that it's trying to estimate
  • +
  • When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased
  • +
  • Let's try a simulation experiment
  • +
+ +
+ +
+ + +
+

Simulation experiment

+
+
+

Simulating normals with mean 0 and variance 1 versus averages +of 10 normals from the same population

+ +

plot of chunk unnamed-chunk-4

+ +
+ +
+ + +
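The simulation above can be sketched in a few lines of base R (assumed code with an assumed seed, not the slide's hidden chunk); both distributions are centered at the population mean, but the averages are much less spread out:

```r
# Unbiasedness of the sample mean, by simulation
set.seed(42)                # assumed seed, for reproducibility
nosim <- 10000; n <- 10
obs   <- rnorm(nosim)                                     # single standard normals
means <- apply(matrix(rnorm(nosim * n), nosim), 1, mean)  # averages of 10 each
c(mean(obs), mean(means))   # both close to the population mean, 0
c(sd(obs), sd(means))       # the averages are far more concentrated
```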
+

Averages of x die rolls

+
+
+

plot of chunk unnamed-chunk-5

+ +
+ +
+ + +
+

Averages of x coin flips

+
+
+

plot of chunk unnamed-chunk-6

+ +
+ +
+ + +
+

Summarizing what we know

+
+
+
    +
  • Expected values are properties of distributions
  • +
  • The population mean is the center of mass of the population
  • +
  • The sample mean is the center of mass of the observed data
  • +
  • The sample mean is an estimate of the population mean
  • +
  • The sample mean is unbiased + +
      +
    • The population mean of its distribution is the mean that it's +trying to estimate
    • +
  • +
  • The more data that goes into the sample mean, the more +concentrated its density / mass function is around the population mean
  • +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/04_Expectations/index.md b/06_StatisticalInference/04_Expectations/index.md new file mode 100644 index 000000000..89fea21a5 --- /dev/null +++ b/06_StatisticalInference/04_Expectations/index.md @@ -0,0 +1,165 @@ +--- +title : Expected values +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Expected values +- Expected values are useful for characterizing a distribution +- The mean is a characterization of its center +- The variance and standard deviation are characterizations of +how spread out it is +- Our sample expected values (the sample mean and variance) will +estimate the population versions + + +--- +## The population mean +- The **expected value** or **mean** of a random variable is the center of its distribution +- For discrete random variable $X$ with PMF $p(x)$, it is defined as follows + $$ + E[X] = \sum_x xp(x).
+ $$ + where the sum is taken over the possible values of $x$ +- $E[X]$ represents the center of mass of a collection of locations and weights, $\{x, p(x)\}$ + +--- +## The sample mean +- The sample mean estimates this population mean +- The center of mass of the data is the empirical mean +$$ +\bar X = \sum_{i=1}^n x_i p(x_i) +$$ +where $p(x_i) = 1/n$ + +--- + +## Example +### Find the center of mass of the bars +plot of chunk galton + +--- +## Using manipulate +``` +library(manipulate) +myHist <- function(mu){ + g <- ggplot(galton, aes(x = child)) + g <- g + geom_histogram(fill = "salmon", + binwidth=1, aes(y = ..density..), colour = "black") + g <- g + geom_density(size = 2) + g <- g + geom_vline(xintercept = mu, size = 2) + mse <- round(mean((galton$child - mu)^2), 3) + g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse)) + g +} +manipulate(myHist(mu), mu = slider(62, 74, step = 0.5)) +``` + +--- +## The center of mass is the empirical mean +plot of chunk lsm + + +--- +## Example of a population mean + +- Suppose a coin is flipped and $X$ is declared $0$ or $1$ corresponding to a head or a tail, respectively +- What is the expected value of $X$? + $$ + E[X] = .5 \times 0 + .5 \times 1 = .5 + $$ +- Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be $.5$ + +plot of chunk unnamed-chunk-1 + +--- +## What about a biased coin? + +- Suppose that a random variable, $X$, is so that +$P(X=1) = p$ and $P(X=0) = (1 - p)$ +- (This is a biased coin when $p\neq 0.5$) +- What is its expected value? +$$ +E[X] = 0 * (1 - p) + 1 * p = p +$$ + +--- + +## Example + +- Suppose that a die is rolled and $X$ is the number face up +- What is the expected value of $X$? 
+ $$ + E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 + $$ +- Again, the geometric argument makes this answer obvious without calculation. + +plot of chunk unnamed-chunk-2 + +--- + +## Continuous random variables + +- For a continuous random variable, $X$, with density, $f$, the expected value is again exactly the center of mass of the density + + +--- + +## Example + +- Consider a density where $f(x) = 1$ for $x$ between zero and one +- (Is this a valid density?) +- Suppose that $X$ follows this density; what is its expected value? +plot of chunk unnamed-chunk-3 + +--- + +## Facts about expected values + +- Recall that expected values are properties of distributions +- Note the average of random variables is itself a random variable +and its associated distribution has an expected value +- The center of this distribution is the same as that of the original distribution +- Therefore, the expected value of the **sample mean** is the population mean that it's trying to estimate +- When the expected value of an estimator is what it's trying to estimate, we say that the estimator is **unbiased** +- Let's try a simulation experiment + +--- +## Simulation experiment +Simulating normals with mean 0 and variance 1 versus averages +of 10 normals from the same population + +plot of chunk unnamed-chunk-4 + +--- +## Averages of x die rolls + +plot of chunk unnamed-chunk-5 + + +--- +## Averages of x coin flips +plot of chunk unnamed-chunk-6 + +--- +## Summarizing what we know +- Expected values are properties of distributions +- The population mean is the center of mass of the population +- The sample mean is the center of mass of the observed data +- The sample mean is an estimate of the population mean +- The sample mean is unbiased + - The population mean of its distribution is the mean that it's + trying to estimate +- The more data that goes into the sample mean, the more
+concentrated its density / mass function is around the population mean diff --git a/06_StatisticalInference/04_Expectations/index.pdf b/06_StatisticalInference/04_Expectations/index.pdf new file mode 100644 index 000000000..bade44b67 Binary files /dev/null and b/06_StatisticalInference/04_Expectations/index.pdf differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..7c2834b0d Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-10.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-10.png new file mode 100644 index 000000000..f904f389c Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-10.png differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-11.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-11.png new file mode 100644 index 000000000..f904f389c Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-11.png differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..348f1f6ff Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..f22b9b90d Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-9.png b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-9.png new file mode 100644 
index 000000000..43f87d1dd Binary files /dev/null and b/06_StatisticalInference/05_Variance/assets/fig/unnamed-chunk-9.png differ diff --git a/06_StatisticalInference/05_Variance/index.Rmd b/06_StatisticalInference/05_Variance/index.Rmd new file mode 100644 index 000000000..97678f49a --- /dev/null +++ b/06_StatisticalInference/05_Variance/index.Rmd @@ -0,0 +1,239 @@ +--- +title : The variance +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## The variance + +- The variance of a random variable is a measure of *spread* +- If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as + +$$ +Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2 +$$ + +- The expected (squared) distance from the mean +- Densities with a higher variance are more spread out than densities with a lower variance +- The square root of the variance is called the **standard deviation** +- The standard deviation has the same units as $X$ + +--- + +## Example + +- What's the variance from the result of a toss of a die? + + - $E[X] = 3.5$ + - $E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17$ + +- $Var(X) = E[X^2] - E[X]^2 \approx 2.92$ + +--- + +## Example + +- What's the variance from the result of the toss of a coin with probability of heads (1) of $p$? 
+ + - $E[X] = 0 \times (1 - p) + 1 \times p = p$ + - $E[X^2] = E[X] = p$ + +$$Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)$$ + + +--- +## Distributions with increasing variance +```{r, echo = FALSE, fig.height = 6, fig.width = 8, fig.align='center'} +library(ggplot2) +xvals <- seq(-10, 10, by = .01) +dat <- data.frame( + y = c( + dnorm(xvals, mean = 0, sd = 1), + dnorm(xvals, mean = 0, sd = 2), + dnorm(xvals, mean = 0, sd = 3), + dnorm(xvals, mean = 0, sd = 4) + ), + x = rep(xvals, 4), + factor = factor(rep(1 : 4, rep(length(xvals), 4))) +) +ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2) +``` + +--- +## The sample variance +- The sample variance is +$$ +S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} +$$ +(almost, but not quite, the average squared deviation from +the sample mean) +- It is also a random variable + - It has an associated population distribution + - Its expected value is the population variance + - Its distribution gets more concentrated around the population variance with more data +- Its square root is the sample standard deviation + + +--- +## Simulation experiment +### Simulating from a population with variance 1 + +```{r, fig.height=6, fig.width=6, fig.align='center', echo = FALSE} +library(ggplot2) +nosim <- 10000; +dat <- data.frame( + x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var), + apply(matrix(rnorm(nosim * 20), nosim), 1, var), + apply(matrix(rnorm(nosim * 30), nosim), 1, var)), + n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim))) + ) +ggplot(dat, aes(x = x, fill = n)) + geom_density(size = 2, alpha = .2) + geom_vline(xintercept = 1, size = 2) + +``` + +--- +## Variances of x die rolls +```{r, fig.align='center',fig.height=5, fig.width=10, echo = FALSE, warning=FALSE, error=FALSE, message=FALSE} +dat <- data.frame( + x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE), + nosim), 1, var), + apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE), + nosim), 1, var), + apply(matrix(sample(1 : 6,
nosim * 30, replace = TRUE), + nosim), 1, var) + ), + size = factor(rep(c(10, 20, 30), rep(nosim, 3)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.3, colour = "black") +g <- g + geom_vline(xintercept = 2.92, size = 2) +g + facet_grid(. ~ size) +``` + + +--- + +## Recall the mean +- Recall that the average of a random sample from a population +is itself a random variable +- We know that this distribution is centered around the population +mean, $E[\bar X] = \mu$ +- We also know what its variance is $Var(\bar X) = \sigma^2 / n$ +- This is very useful, since we don't have repeat sample means +to get its variance; now we know how it relates to +the population variance +- We call the standard deviation of a statistic a standard error + +--- +## To summarize +- The sample variance, $S^2$, estimates the population variance, $\sigma^2$ +- The distribution of the sample variance is centered around $\sigma^2$ +- The variance of the sample mean is $\sigma^2 / n$ + - Its logical estimate is $s^2 / n$ + - The logical estimate of the standard error is $s / \sqrt{n}$ +- $s$, the standard deviation, talks about how variable the population is +- $s/\sqrt{n}$, the standard error, talks about how variable averages of random samples of size $n$ from the population are + +--- +## Simulation example +Standard normals have variance 1; means of $n$ standard normals +have standard deviation $1/\sqrt{n}$ + +```{r} +nosim <- 1000 +n <- 10 +sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean)) +1 / sqrt(n) +``` + + +--- +## Simulation example +Standard uniforms have variance $1/12$; means of +random samples of $n$ uniforms have sd $1/\sqrt{12 \times n}$ + + +```{r} +nosim <- 1000 +n <- 10 +sd(apply(matrix(runif(nosim * n), nosim), 1, mean)) +1 / sqrt(12 * n) +``` + + +--- +## Simulation example +Poisson(4) have variance $4$; means of +random samples of $n$ Poisson(4) have sd $2/\sqrt{n}$ + + +```{r} +nosim <- 1000 +n <- 10 +sd(apply(matrix(rpois(nosim *
4), nosim), 1, mean)) +2 / sqrt(n) +``` + + +--- +## Simulation example +Fair coin flips have variance $0.25$; means of +random samples of $n$ coin flips have sd $1 / (2 \sqrt{n})$ + + +```{r} +nosim <- 1000 +n <- 10 +sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), + nosim), 1, mean)) +1 / (2 * sqrt(n)) +``` + +--- +## Data example +```{r} +library(UsingR); data(father.son); +x <- father.son$sheight +n<-length(x) +``` + +--- +## Plot of the son's heights +```{r, fig.height=6, fig.width=6, echo=FALSE, fig.align='center'} +g <- ggplot(data = father.son, aes(x = sheight)) +g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black") +g <- g + geom_density(size = 2, colour = "black") +g +``` + +--- +## Let's interpret these numbers +```{r} +round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)),2) +``` + +```{r, echo = FALSE, fig.height=4, fig.width=4,fig.align='center'} +g +``` + + +--- +## Summarizing what we know about variances +- The sample variance estimates the population variance +- The distribution of the sample variance is centered at +what its estimating +- It gets more concentrated around the population variance with larger sample sizes +- The variance of the sample mean is the population variance +divided by $n$ + - The square root is the standard error +- It turns out that we can say a lot about the distribution of +averages from random samples, +even though we only get one to look at in a given data set diff --git a/06_StatisticalInference/05_Variance/index.html b/06_StatisticalInference/05_Variance/index.html new file mode 100644 index 000000000..d002219ee --- /dev/null +++ b/06_StatisticalInference/05_Variance/index.html @@ -0,0 +1,521 @@ + + + + The variance + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

The variance

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

The variance

+
+
+
    +
  • The variance of a random variable is a measure of spread
  • +
  • If \(X\) is a random variable with mean \(\mu\), the variance of \(X\) is defined as
  • +
+ +

\[ +Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2 +\]

+ +
    +
  • The expected (squared) distance from the mean
  • +
  • Densities with a higher variance are more spread out than densities with a lower variance
  • +
  • The square root of the variance is called the standard deviation
  • +
  • The standard deviation has the same units as \(X\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • What's the variance from the result of a toss of a die?

    + +
      +
    • \(E[X] = 3.5\)
    • +
    • \(E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17\)
    • +
  • +
  • \(Var(X) = E[X^2] - E[X]^2 \approx 2.92\)

  • +
+ +
+ +
+ + +
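The die-roll variance can be verified numerically in base R; a small sketch (not slide code):

```r
# Var(X) = E[X^2] - E[X]^2 for a fair die
x <- 1:6
p <- rep(1/6, 6)
EX  <- sum(x * p)      # 3.5
EX2 <- sum(x^2 * p)    # 91/6, about 15.17
EX2 - EX^2             # 35/12, about 2.92
```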
+

Example

+
+
+
    +
  • What's the variance from the result of the toss of a coin with probability of heads (1) of \(p\)?

    + +
      +
    • \(E[X] = 0 \times (1 - p) + 1 \times p = p\)
    • +
    • \(E[X^2] = E[X] = p\)
    • +
  • +
+ +

\[Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)\]

+ +
+ +
+ + +
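The p(1 - p) formula is easy to tabulate; a sketch using a hypothetical helper function (not from the slides):

```r
# Variance of a 0/1 coin with success probability p: since X^2 = X,
# E[X^2] = E[X] = p, so Var(X) = p - p^2 = p(1 - p)
coin_var <- function(p) p * (1 - p)   # hypothetical helper, for illustration
coin_var(0.5)   # 0.25, the maximum spread (fair coin)
coin_var(0.1)   # 0.09, a heavily biased coin varies less
```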
+

Distributions with increasing variance

+
+
+

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

The sample variance

+
+
+
    +
  • The sample variance is +\[ +S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} +\] +(almost, but not quite, the average squared deviation from +the sample mean)
  • +
  • It is also a random variable + +
      +
    • It has an associated population distribution
    • +
    • Its expected value is the population variance
    • +
    • Its distribution gets more concentrated around the population variance with more data
    • +
  • +
  • Its square root is the sample standard deviation
  • +
+ +
+ +
+ + +
+

Simulation experiment

+
+
+

Simulating from a population with variance 1

+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Variances of x die rolls

+
+
+

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Recall the mean

+
+
+
    +
  • Recall that the average of a random sample from a population +is itself a random variable
  • +
  • We know that this distribution is centered around the population +mean, \(E[\bar X] = \mu\)
  • +
  • We also know what its variance is \(Var(\bar X) = \sigma^2 / n\)
  • +
  • This is very useful, since we don't have repeat sample means +to get its variance; now we know how it relates to +the population variance
  • +
  • We call the standard deviation of a statistic a standard error
  • +

To summarize

  • The sample variance, \(S^2\), estimates the population variance, \(\sigma^2\)
  • The distribution of the sample variance is centered around \(\sigma^2\)
  • The variance of the sample mean is \(\sigma^2 / n\)
    • Its logical estimate is \(S^2 / n\)
    • The logical estimate of the standard error is \(S / \sqrt{n}\)
  • \(S\), the standard deviation, talks about how variable the population is
  • \(S/\sqrt{n}\), the standard error, talks about how variable averages of random samples of size \(n\) from the population are

Simulation example


Standard normals have variance 1; means of \(n\) standard normals have standard deviation \(1/\sqrt{n}\)

nosim <- 1000
n <- 10
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))

## [1] 0.3156

1 / sqrt(n)

## [1] 0.3162

Simulation example


Standard uniforms have variance \(1/12\); means of random samples of \(n\) uniforms have sd \(1/\sqrt{12 \times n}\)

nosim <- 1000
n <- 10
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))

## [1] 0.09017

1 / sqrt(12 * n)

## [1] 0.09129

Simulation example


Poisson(4) random variables have variance \(4\); means of random samples of \(n\) Poisson(4) variables have sd \(2/\sqrt{n}\)

nosim <- 1000
n <- 10
sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean))

## [1] 0.6219

2 / sqrt(n)

## [1] 0.6325

Simulation example


Fair coin flips have variance \(0.25\); means of random samples of \(n\) coin flips have sd \(1 / (2 \sqrt{n})\)

nosim <- 1000
n <- 10
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE),
                nosim), 1, mean))

## [1] 0.1587

1 / (2 * sqrt(n))

## [1] 0.1581

Data example

library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)

Plot of the son's heights


plot of chunk unnamed-chunk-9


Let's interpret these numbers

round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)), 2)

## [1] 7.92 0.01 2.81 0.09

plot of chunk unnamed-chunk-11


Summarizing what we know about variances

  • The sample variance estimates the population variance
  • The distribution of the sample variance is centered at what it's estimating
  • It gets more concentrated around the population variance with larger sample sizes
  • The variance of the sample mean is the population variance divided by \(n\)
    • The square root is the standard error
  • It turns out that we can say a lot about the distribution of averages from random samples, even though we only get one to look at in a given data set
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/05_Variance/index.md b/06_StatisticalInference/05_Variance/index.md new file mode 100644 index 000000000..ac2361fe3 --- /dev/null +++ b/06_StatisticalInference/05_Variance/index.md @@ -0,0 +1,248 @@ +--- +title : The variance +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## The variance + +- The variance of a random variable is a measure of *spread* +- If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as + +$$ +Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2 +$$ + +- The expected (squared) distance from the mean +- Densities with a higher variance are more spread out than densities with a lower variance +- The square root of the variance is called the **standard deviation** +- The standard deviation has the same units as $X$ + +--- + +## Example + +- What's the variance from the result of a toss of a die? + + - $E[X] = 3.5$ + - $E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17$ + +- $Var(X) = E[X^2] - E[X]^2 \approx 2.92$ + +--- + +## Example + +- What's the variance from the result of the toss of a coin with probability of heads (1) of $p$? 
+ + - $E[X] = 0 \times (1 - p) + 1 \times p = p$ + - $E[X^2] = E[X] = p$ + +$$Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)$$ + + +--- +## Distributions with increasing variance +plot of chunk unnamed-chunk-1 + +--- +## The sample variance +- The sample variance is +$$ +S^2 = \frac{\sum_{i=1} (X_i - \bar X)^2}{n-1} +$$ +(almost, but not quite, the average squared deviation from +the sample mean) +- It is also a random variable + - It has an associate population distribution + - Its expected value is the population variance + - Its distribution gets more concentrated around the population variance with mroe data +- Its square root is the sample standard deviation + + +--- +## Simulation experiment +### Simulating from a population with variance 1 + +plot of chunk unnamed-chunk-2 + +--- +## Variances of x die rolls +plot of chunk unnamed-chunk-3 + + +--- + +## Recall the mean +- Recall that the average of random sample from a population +is itself a random variable +- We know that this distribution is centered around the population +mean, $E[\bar X] = \mu$ +- We also know what its variance is $Var(\bar X) = \sigma^2 / n$ +- This is very useful, since we don't have repeat sample means +to get its variance; now we know how it relates to +the population variance +- We call the standard deviation of a statistic a standard error + +--- +## To summarize +- The sample variance, $S^2$, estimates the population variance, $\sigma^2$ +- The distribution of the sample variance is centered around $\sigma^2$ +- The variance of the sample mean is $\sigma^2 / n$ + - Its logical estimate is $s^2 / n$ + - The logical estimate of the standard error is $S / \sqrt{n}$ +- $S$, the standard deviation, talks about how variable the population is +- $S/\sqrt{n}$, the standard error, talks about how variable averages of random samples of size $n$ from the population are + +--- +## Simulation example +Standard normals have variance 1; means of $n$ standard normals +have standard deviation 
$1/\sqrt{n}$ + + +```r +nosim <- 1000 +n <- 10 +sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean)) +``` + +``` +## [1] 0.3156 +``` + +```r +1 / sqrt(n) +``` + +``` +## [1] 0.3162 +``` + + +--- +## Simulation example +Standard uniforms have variance $1/12$; means of +random samples of $n$ uniforms have sd $1/\sqrt{12 \times n}$ + + + +```r +nosim <- 1000 +n <- 10 +sd(apply(matrix(runif(nosim * n), nosim), 1, mean)) +``` + +``` +## [1] 0.09017 +``` + +```r +1 / sqrt(12 * n) +``` + +``` +## [1] 0.09129 +``` + + +--- +## Simulation example +Poisson(4) have variance $4$; means of +random samples of $n$ Poisson(4) have sd $2/\sqrt{n}$ + + + +```r +nosim <- 1000 +n <- 10 +sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean)) +``` + +``` +## [1] 0.6219 +``` + +```r +2 / sqrt(n) +``` + +``` +## [1] 0.6325 +``` + + +--- +## Simulation example +Fair coin flips have variance $0.25$; means of +random samples of $n$ coin flips have sd $1 / (2 \sqrt{n})$ + + + +```r +nosim <- 1000 +n <- 10 +sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), + nosim), 1, mean)) +``` + +``` +## [1] 0.1587 +``` + +```r +1 / (2 * sqrt(n)) +``` + +``` +## [1] 0.1581 +``` + +--- +## Data example + +```r +library(UsingR); data(father.son); +x <- father.son$sheight +n<-length(x) +``` + +--- +## Plot of the son's heights +plot of chunk unnamed-chunk-9 + +--- +## Let's interpret these numbers + +```r +round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)),2) +``` + +``` +## [1] 7.92 0.01 2.81 0.09 +``` + +plot of chunk unnamed-chunk-11 + + +--- +## Summarizing what we know about variances +- The sample variance estimates the population variance +- The distribution of the sample variance is centered at +what its estimating +- It gets more concentrated around the population variance with larger sample sizes +- The variance of the sample mean is the population variance +divided by $n$ + - The square root is the standard error +- It turns out that we can say a lot about the distribution of +averages from 
random samples, +even though we only get one to look at in a given data set diff --git a/06_StatisticalInference/05_Variance/index.pdf b/06_StatisticalInference/05_Variance/index.pdf new file mode 100644 index 000000000..9fdc1ed32 Binary files /dev/null and b/06_StatisticalInference/05_Variance/index.pdf differ diff --git a/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..0822baa05 Binary files /dev/null and b/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..0822baa05 Binary files /dev/null and b/06_StatisticalInference/06_CommonDistros/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-1.png b/06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-1.png rename to 06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-1.png diff --git a/06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-3.png b/06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-3.png similarity index 100% rename from 06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-3.png rename to 06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-3.png diff --git a/06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-4.png b/06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-4.png similarity index 100% rename from 06_StatisticalInference/02_01_CommonDistributions/fig/unnamed-chunk-4.png rename to 06_StatisticalInference/06_CommonDistros/fig/unnamed-chunk-4.png diff --git 
a/06_StatisticalInference/02_01_CommonDistributions/figure/unnamed-chunk-1.png b/06_StatisticalInference/06_CommonDistros/figure/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/02_01_CommonDistributions/figure/unnamed-chunk-1.png rename to 06_StatisticalInference/06_CommonDistros/figure/unnamed-chunk-1.png diff --git a/06_StatisticalInference/06_CommonDistros/index.Rmd b/06_StatisticalInference/06_CommonDistros/index.Rmd new file mode 100644 index 000000000..3f38a1f54 --- /dev/null +++ b/06_StatisticalInference/06_CommonDistros/index.Rmd @@ -0,0 +1,261 @@ +--- +title : Some Common Distributions +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + + +## The Bernoulli distribution + +- The **Bernoulli distribution** arises as the result of a binary outcome +- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively +- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ +- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ +- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" + + +--- + +## Binomial trials + +- The *binomial random variables* are obtained as the sum of iid Bernoulli trials +- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable +- The binomial mass function is +$$ +P(X = x) = +\left( +\begin{array}{c} + n \\ x +\end{array} +\right) +p^x(1 - p)^{n-x} +$$ +for $x=0,\ldots,n$ + 
+--- + +## Choose + +- Recall that the notation + $$\left( + \begin{array}{c} + n \\ x + \end{array} + \right) = \frac{n!}{x!(n-x)!} + $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ + without replacement disregarding the order of the items + +$$\left( + \begin{array}{c} + n \\ 0 + \end{array} + \right) = +\left( + \begin{array}{c} + n \\ n + \end{array} + \right) = 1 + $$ + +--- + +## Example + +- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? +$$\left( +\begin{array}{c} + 8 \\ 7 +\end{array} +\right) .5^{7}(1-.5)^{1} ++ +\left( +\begin{array}{c} + 8 \\ 8 +\end{array} +\right) .5^{8}(1-.5)^{0} \approx 0.04 +$$ +```{r} +choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 +pbinom(6, size = 8, prob = .5, lower.tail = FALSE) +``` + + +--- + +## The normal distribution + +- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is + $$ + (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} + $$ + If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ +- We write $X\sim \mbox{N}(\mu, \sigma^2)$ +- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** +- Standard normal RVs are often labeled $Z$ + +--- +## The standard normal distribution with reference lines +```{r, fig.height=6, fig.width=6, fig.align='center', echo = FALSE} +x <- seq(-3, 3, length = 1000) +library(ggplot2) +g <- ggplot(data.frame(x = x, y = dnorm(x)), + aes(x = x, y = y)) + geom_line(size = 2) +g <- g + geom_vline(xintercept = -3 : 3, size = 2) +g +``` + +--- + +## Facts about the normal density + +If $X \sim \mbox{N}(\mu,\sigma^2)$ then +$$Z = \frac{X -\mu}{\sigma} \sim N(0, 1)$$ + + +If $Z$ is standard normal $$X = \mu + \sigma Z \sim 
\mbox{N}(\mu, \sigma^2)$$ + +--- + +## More facts about the normal density + +1. Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively +2. $-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively +3. By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively + +--- + +## Question + +- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? + - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` +- Or, because you have the standard normal quantiles memorized +and you know that 1.645 is the 95th percentile you know that the answer has to be +$$\mu + \sigma 1.645$$ +- (In general $\mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile) + +--- + +## Question + +- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is larger than $x$? + +--- +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What's the probability of getting +more than 1,160 clicks in a day? + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What's the probability of getting +more than 1,160 clicks in a day? + +It's not very likely, 1,160 is `r (1160 - 1020) / 50` standard +deviations from the mean +```{r} +pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE) +pnorm(2.8, lower.tail = FALSE) +``` + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. 
What number of daily ad clicks would represent +the one where 75% of days have fewer clicks (assuming +days are independent and identically distributed)? + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What number of daily ad clicks would represent +the one where 75% of days have fewer clicks (assuming +days are independent and identically distributed)? + +```{r} +qnorm(0.75, mean = 1020, sd = 50) +``` + +--- +## The Poisson distribution +* Used to model counts +* The Poisson mass function is +$$ +P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} +$$ +for $x=0,1,\ldots$ +* The mean of this distribution is $\lambda$ +* The variance of this distribution is $\lambda$ +* Notice that $x$ ranges from $0$ to $\infty$ + +--- +## Some uses for the Poisson distribution +* Modeling count data +* Modeling event-time or survival data +* Modeling contingency tables +* Approximating binomials when $n$ is large and $p$ is small + +--- +## Rates and Poisson random variables +* Poisson random variables are used to model rates +* $X \sim Poisson(\lambda t)$ where + * $\lambda = E[X / t]$ is the expected count per unit of time + * $t$ is the total monitoring time + +--- +## Example +The number of people that show up at a bus stop is Poisson with +a mean of $2.5$ per hour. + +If watching the bus stop for 4 hours, what is the probability that $3$ +or fewer people show up for the whole time? + +```{r} +ppois(3, lambda = 2.5 * 4) +``` + +--- +## Poisson approximation to the binomial +* When $n$ is large and $p$ is small the Poisson distribution + is an accurate approximation to the binomial distribution +* Notation + * $X \sim \mbox{Binomial}(n, p)$ + * $\lambda = n p$ + * $n$ gets large + * $p$ gets small + + +--- +## Example, Poisson approximation to the binomial + +We flip a coin with success probablity $0.01$ five hundred times. 
+ +What's the probability of 2 or fewer successes? + +```{r} +pbinom(2, size = 500, prob = .01) +ppois(2, lambda=500 * .01) +``` + diff --git a/06_StatisticalInference/06_CommonDistros/index.html b/06_StatisticalInference/06_CommonDistros/index.html new file mode 100644 index 000000000..0835dd5a9 --- /dev/null +++ b/06_StatisticalInference/06_CommonDistros/index.html @@ -0,0 +1,608 @@ + + + + Some Common Distributions + + + + + + + + + + + + + + + + + + + + + + + + + + +

Some Common Distributions


Statistical Inference


Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health


The Bernoulli distribution

  • The Bernoulli distribution arises as the result of a binary outcome
  • Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) \(p\) and \(1-p\) respectively
  • The PMF for a Bernoulli random variable \(X\) is \[P(X = x) = p^x (1 - p)^{1 - x}\]
  • The mean of a Bernoulli random variable is \(p\) and the variance is \(p(1 - p)\)
  • If we let \(X\) be a Bernoulli random variable, it is typical to call \(X=1\) a "success" and \(X=0\) a "failure"
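Since a Bernoulli variable is a Binomial(1, \(p\)) variable, the PMF above can be checked against `dbinom` (a sketch; `p = 0.3` is an arbitrary choice):

```r
# Bernoulli PMF p^x (1 - p)^(1 - x) versus dbinom with size = 1
p <- 0.3
x <- c(0, 1)
p^x * (1 - p)^(1 - x)          # 0.7 0.3
dbinom(x, size = 1, prob = p)  # same values
```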

Binomial trials

  • The binomial random variables are obtained as the sum of iid Bernoulli trials
  • Specifically, let \(X_1,\ldots,X_n\) be iid Bernoulli\((p)\); then \(X = \sum_{i=1}^n X_i\) is a binomial random variable
  • The binomial mass function is \[ P(X = x) = \left( \begin{array}{c} n \\ x \end{array} \right) p^x(1 - p)^{n-x} \] for \(x=0,\ldots,n\)
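The mass function written out above agrees with R's `dbinom`, and the probabilities sum to 1 (a quick sanity sketch, with \(n = 8\), \(p = 0.5\) as in the example below):

```r
# Binomial mass function, written out versus dbinom
n <- 8; p <- 0.5
x <- 0:n
manual <- choose(n, x) * p^x * (1 - p)^(n - x)
all.equal(manual, dbinom(x, size = n, prob = p))  # TRUE
sum(manual)                                       # probabilities sum to 1
```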

Choose

  • Recall that the notation \[\left( \begin{array}{c} n \\ x \end{array} \right) = \frac{n!}{x!(n-x)!} \] (read "\(n\) choose \(x\)") counts the number of ways of selecting \(x\) items out of \(n\) without replacement, disregarding the order of the items

\[\left( \begin{array}{c} n \\ 0 \end{array} \right) = \left( \begin{array}{c} n \\ n \end{array} \right) = 1 \]


Example

  • Suppose a friend has \(8\) children (oh my!), \(7\) of which are girls and none are twins
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births? \[\left( \begin{array}{c} 8 \\ 7 \end{array} \right) .5^{7}(1-.5)^{1} + \left( \begin{array}{c} 8 \\ 8 \end{array} \right) .5^{8}(1-.5)^{0} \approx 0.04 \]
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8

## [1] 0.03516

pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)

## [1] 0.03516

The normal distribution

  • A random variable is said to follow a normal or Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) if the associated density is \[ (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} \] If \(X\) is a RV with this density then \(E[X] = \mu\) and \(Var(X) = \sigma^2\)
  • We write \(X\sim \mbox{N}(\mu, \sigma^2)\)
  • When \(\mu = 0\) and \(\sigma = 1\) the resulting distribution is called the standard normal distribution
  • Standard normal RVs are often labeled \(Z\)

The standard normal distribution with reference lines


plot of chunk unnamed-chunk-2


Facts about the normal density

If \(X \sim \mbox{N}(\mu,\sigma^2)\) then \[Z = \frac{X -\mu}{\sigma} \sim N(0, 1)\]

If \(Z\) is standard normal \[X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)\]
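Standardizing means any normal probability can be computed from the standard normal; a sketch using the ad-click numbers that appear in the examples later in this deck:

```r
# P(X <= x) for X ~ N(mu, sigma^2) equals P(Z <= (x - mu) / sigma)
mu <- 1020; sigma <- 50; x <- 1160
pnorm(x, mean = mu, sd = sigma)
pnorm((x - mu) / sigma)  # same value: (1160 - 1020) / 50 = 2.8 sds
```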


More facts about the normal density

  1. Approximately \(68\%\), \(95\%\) and \(99\%\) of the normal density lies within \(1\), \(2\) and \(3\) standard deviations from the mean, respectively
  2. \(-1.28\), \(-1.645\), \(-1.96\) and \(-2.33\) are the \(10^{th}\), \(5^{th}\), \(2.5^{th}\) and \(1^{st}\) percentiles of the standard normal distribution, respectively
  3. By symmetry, \(1.28\), \(1.645\), \(1.96\) and \(2.33\) are the \(90^{th}\), \(95^{th}\), \(97.5^{th}\) and \(99^{th}\) percentiles of the standard normal distribution, respectively
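The memorized percentiles listed above can be recovered with `qnorm` (a quick sketch):

```r
# Lower-tail standard normal quantiles
round(qnorm(c(0.10, 0.05, 0.025, 0.01)), 2)  # -1.28 -1.64 -1.96 -2.33
# Upper-tail quantiles, by symmetry
round(qnorm(c(0.90, 0.95, 0.975, 0.99)), 2)  #  1.28  1.64  1.96  2.33
```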

Question

  • What is the \(95^{th}\) percentile of a \(N(\mu, \sigma^2)\) distribution?
    • Quick answer in R: qnorm(.95, mean = mu, sd = sd)
  • Or, because you have the standard normal quantiles memorized and you know that 1.645 is the 95th percentile, you know that the answer has to be \[\mu + 1.645 \sigma\]
  • (In general \(\mu + \sigma z_0\) where \(z_0\) is the appropriate standard normal quantile)

Question

  • What is the probability that a \(\mbox{N}(\mu,\sigma^2)\) RV is larger than \(x\)?

Example


Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What's the probability of getting more than 1,160 clicks in a day?


Example


Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What's the probability of getting more than 1,160 clicks in a day?

It's not very likely: 1,160 is 2.8 standard deviations from the mean

pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)

## [1] 0.002555

pnorm(2.8, lower.tail = FALSE)

## [1] 0.002555

Example


Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming days are independent and identically distributed)?


Example


Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming days are independent and identically distributed)?

qnorm(0.75, mean = 1020, sd = 50)

## [1] 1054

The Poisson distribution

  • Used to model counts
  • The Poisson mass function is \[ P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \] for \(x=0,1,\ldots\)
  • The mean of this distribution is \(\lambda\)
  • The variance of this distribution is \(\lambda\)
  • Notice that \(x\) ranges from \(0\) to \(\infty\)
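The mean-equals-variance property is easy to see by simulation (a sketch; \(\lambda = 4\) matches the Poisson(4) simulation earlier in the deck):

```r
# For Poisson(lambda), both the mean and the variance equal lambda
set.seed(3)
lambda <- 4
draws <- rpois(100000, lambda)
mean(draws)  # close to 4
var(draws)   # also close to 4
```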

Some uses for the Poisson distribution

  • Modeling count data
  • Modeling event-time or survival data
  • Modeling contingency tables
  • Approximating binomials when \(n\) is large and \(p\) is small

Rates and Poisson random variables

  • Poisson random variables are used to model rates
  • \(X \sim Poisson(\lambda t)\) where
    • \(\lambda = E[X / t]\) is the expected count per unit of time
    • \(t\) is the total monitoring time

Example


The number of people that show up at a bus stop is Poisson with a mean of \(2.5\) per hour.

If watching the bus stop for 4 hours, what is the probability that \(3\) or fewer people show up for the whole time?

ppois(3, lambda = 2.5 * 4)

## [1] 0.01034

Poisson approximation to the binomial

  • When \(n\) is large and \(p\) is small the Poisson distribution is an accurate approximation to the binomial distribution
  • Notation
    • \(X \sim \mbox{Binomial}(n, p)\)
    • \(\lambda = n p\)
    • \(n\) gets large
    • \(p\) gets small

Example, Poisson approximation to the binomial


We flip a coin with success probability \(0.01\) five hundred times.


What's the probability of 2 or fewer successes?

pbinom(2, size = 500, prob = 0.01)

## [1] 0.1234

ppois(2, lambda = 500 * 0.01)

## [1] 0.1247
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/06_CommonDistros/index.md b/06_StatisticalInference/06_CommonDistros/index.md new file mode 100644 index 000000000..744cdd43f --- /dev/null +++ b/06_StatisticalInference/06_CommonDistros/index.md @@ -0,0 +1,306 @@ +--- +title : Some Common Distributions +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + + +## The Bernoulli distribution + +- The **Bernoulli distribution** arises as the result of a binary outcome +- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively +- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ +- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ +- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" + + +--- + +## Binomial trials + +- The *binomial random variables* are obtained as the sum of iid Bernoulli trials +- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable +- The binomial mass function is +$$ +P(X = x) = +\left( +\begin{array}{c} + n \\ x +\end{array} +\right) +p^x(1 - p)^{n-x} +$$ +for $x=0,\ldots,n$ + +--- + +## Choose + +- Recall that the notation + $$\left( + \begin{array}{c} + n \\ x + \end{array} + \right) = \frac{n!}{x!(n-x)!} + $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ + without replacement disregarding the order of the items + 
+$$\left( + \begin{array}{c} + n \\ 0 + \end{array} + \right) = +\left( + \begin{array}{c} + n \\ n + \end{array} + \right) = 1 + $$ + +--- + +## Example + +- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? +$$\left( +\begin{array}{c} + 8 \\ 7 +\end{array} +\right) .5^{7}(1-.5)^{1} ++ +\left( +\begin{array}{c} + 8 \\ 8 +\end{array} +\right) .5^{8}(1-.5)^{0} \approx 0.04 +$$ + +```r +choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8 +``` + +``` +## [1] 0.03516 +``` + +```r +pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE) +``` + +``` +## [1] 0.03516 +``` + + + +--- + +## The normal distribution + +- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is + $$ + (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} + $$ + If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ +- We write $X\sim \mbox{N}(\mu, \sigma^2)$ +- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** +- Standard normal RVs are often labeled $Z$ + +--- +## The standard normal distribution with reference lines +plot of chunk unnamed-chunk-2 + + +--- + +## Facts about the normal density + +If $X \sim \mbox{N}(\mu,\sigma^2)$ then +$$Z = \frac{X -\mu}{\sigma} \sim N(0, 1)$$ + + +If $Z$ is standard normal $$X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)$$ + +--- + +## More facts about the normal density + +1. Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively +2. $-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively +3. 
By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively + +--- + +## Question + +- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? + - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` +- Or, because you have the standard normal quantiles memorized +and you know that 1.645 is the 95th percentile you know that the answer has to be +$$\mu + \sigma 1.645$$ +- (In general $\mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile) + +--- + +## Question + +- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is larger than $x$? + +--- +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What's the probability of getting +more than 1,160 clicks in a day? + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What's the probability of getting +more than 1,160 clicks in a day? + +It's not very likely, 1,160 is 2.8 standard +deviations from the mean + +```r +pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE) +``` + +``` +## [1] 0.002555 +``` + +```r +pnorm(2.8, lower.tail = FALSE) +``` + +``` +## [1] 0.002555 +``` + + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. What number of daily ad clicks would represent +the one where 75% of days have fewer clicks (assuming +days are independent and identically distributed)? + +--- + +## Example + +Assume that the number of daily ad clicks for a company +is (approximately) normally distributed with a mean of 1020 and a standard +deviation of 50. 
What number of daily ad clicks would represent +the point where 75% of days have fewer clicks (assuming +days are independent and identically distributed)? + + +```r +qnorm(0.75, mean = 1020, sd = 50) +``` + +``` +## [1] 1054 +``` + + +--- +## The Poisson distribution +* Used to model counts +* The Poisson mass function is +$$ +P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} +$$ +for $x=0,1,\ldots$ +* The mean of this distribution is $\lambda$ +* The variance of this distribution is $\lambda$ +* Notice that $x$ ranges from $0$ to $\infty$ + +--- +## Some uses for the Poisson distribution +* Modeling count data +* Modeling event-time or survival data +* Modeling contingency tables +* Approximating binomials when $n$ is large and $p$ is small + +--- +## Rates and Poisson random variables +* Poisson random variables are used to model rates +* $X \sim Poisson(\lambda t)$ where + * $\lambda = E[X / t]$ is the expected count per unit of time + * $t$ is the total monitoring time + +--- +## Example +The number of people that show up at a bus stop is Poisson with +a mean of $2.5$ per hour. + +If we watch the bus stop for 4 hours, what is the probability that $3$ +or fewer people show up for the whole time? + + +```r +ppois(3, lambda = 2.5 * 4) +``` + +``` +## [1] 0.01034 +``` + + +--- +## Poisson approximation to the binomial +* When $n$ is large and $p$ is small the Poisson distribution + is an accurate approximation to the binomial distribution +* Notation + * $X \sim \mbox{Binomial}(n, p)$ + * $\lambda = n p$ + * $n$ gets large + * $p$ gets small + + +--- +## Example, Poisson approximation to the binomial + +We flip a coin with success probability $0.01$ five hundred times. + +What's the probability of 2 or fewer successes? 
+ + +```r +pbinom(2, size = 500, prob = 0.01) +``` + +``` +## [1] 0.1234 +``` + +```r +ppois(2, lambda = 500 * 0.01) +``` + +``` +## [1] 0.1247 +``` + + diff --git a/06_StatisticalInference/06_CommonDistros/index.pdf b/06_StatisticalInference/06_CommonDistros/index.pdf new file mode 100644 index 000000000..1d98f72f9 Binary files /dev/null and b/06_StatisticalInference/06_CommonDistros/index.pdf differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..433cd180b Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-10.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-10.png new file mode 100644 index 000000000..3c28ac41f Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-10.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-11.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-11.png new file mode 100644 index 000000000..207257860 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-11.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-12.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-12.png new file mode 100644 index 000000000..b2a86f37c Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-12.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-13.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-13.png new file mode 100644 index 000000000..a104e90bd Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-13.png differ diff --git 
a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-14.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-14.png new file mode 100644 index 000000000..4b2ba14ef Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-14.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-16.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-16.png new file mode 100644 index 000000000..f122776c5 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-16.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-17.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-17.png new file mode 100644 index 000000000..949dc88db Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-17.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-18.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-18.png new file mode 100644 index 000000000..858b8550b Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-18.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..79975fffe Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..417ca9044 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-4.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-4.png new file 
mode 100644 index 000000000..79799d5f9 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-4.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-5.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-5.png new file mode 100644 index 000000000..1a423b2f0 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-5.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-6.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-6.png new file mode 100644 index 000000000..eb8177f97 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-6.png differ diff --git a/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-9.png b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-9.png new file mode 100644 index 000000000..aadb6890e Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/assets/fig/unnamed-chunk-9.png differ diff --git a/06_StatisticalInference/07_Asymptopia/fig/Thumbs.db b/06_StatisticalInference/07_Asymptopia/fig/Thumbs.db new file mode 100644 index 000000000..961350486 Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/fig/Thumbs.db differ diff --git a/06_StatisticalInference/07_Asymptopia/fig/quincunx.png b/06_StatisticalInference/07_Asymptopia/fig/quincunx.png new file mode 100644 index 000000000..2d77ba0cb Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/fig/quincunx.png differ diff --git a/06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-1.png b/06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-1.png rename to 06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-1.png diff --git a/06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-2.png 
b/06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-2.png similarity index 100% rename from 06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-2.png rename to 06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-2.png diff --git a/06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-3.png b/06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-3.png similarity index 100% rename from 06_StatisticalInference/02_02_Asymptopia/fig/unnamed-chunk-3.png rename to 06_StatisticalInference/07_Asymptopia/fig/unnamed-chunk-3.png diff --git a/06_StatisticalInference/07_Asymptopia/index.Rmd b/06_StatisticalInference/07_Asymptopia/index.Rmd new file mode 100644 index 000000000..56527da6c --- /dev/null +++ b/06_StatisticalInference/07_Asymptopia/index.Rmd @@ -0,0 +1,405 @@ +--- +title : A trip to Asymptopia +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Asymptotics +* Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number) +* (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.) 
+* Asymptotics are incredibly useful for simple statistical inference and approximations +* (Not covered in this class) Asymptotics often lead to nice understanding of procedures +* Asymptotics generally give no assurances about finite sample performance +* Asymptotics form the basis for frequency interpretation of probabilities + (the long run proportion of times an event occurs) + + +--- + +## Limits of random variables + +- Fortunately, for the sample mean there's a set of powerful results +- These results allow us to talk about the large sample distribution +of sample means of a collection of $iid$ observations +- The first of these results we intuitively know + - It says that the average limits to what it's estimating, the population mean + - It's called the Law of Large Numbers + - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads) + - As we flip a fair coin over and over, it eventually converges to the + true probability of a head + The LLN forms the basis of frequency style thinking + + +--- +## Law of large numbers in action +```{r, fig.height=5, fig.width=5} +n <- 10000; means <- cumsum(rnorm(n)) / (1 : n); library(ggplot2) +g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y)) +g <- g + geom_hline(yintercept = 0) + geom_line(size = 2) +g <- g + labs(x = "Number of obs", y = "Cumulative mean") +g +``` + + +--- +## Law of large numbers in action, coin flip +```{r, fig.height=5, fig.width=5} +means <- cumsum(sample(0 : 1, n , replace = TRUE)) / (1 : n) +g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y)) +g <- g + geom_hline(yintercept = 0.5) + geom_line(size = 2) +g <- g + labs(x = "Number of obs", y = "Cumulative mean") +g +``` + + + +--- +## Discussion +- An estimator is **consistent** if it converges to what you want to estimate + - The LLN says that the sample mean of iid sample is + consistent for the population mean + - Typically, good estimators are consistent; it's 
not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer +- The sample variance and the sample standard deviation +of iid random variables are consistent as well + +--- + +## The Central Limit Theorem + +- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics +- For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases +- The CLT applies in an endless variety of settings +- The result is that +$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= +\frac{\sqrt n (\bar X_n - \mu)}{\sigma} += \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}$$ has a distribution like that of a standard normal for large $n$. +- (Replacing the standard error by its estimated value doesn't change the CLT) +- The useful way to think about the CLT is that +$\bar X_n$ is approximately +$N(\mu, \sigma^2 / n)$ + + + +--- + +## Example + +- Simulate a standard normal random variable by rolling $n$ (six sided) dice +- Let $X_i$ be the outcome for die $i$ +- Then note that $\mu = E[X_i] = 3.5$ +- $Var(X_i) = 2.92$ +- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ +- Let's roll $n$ dice, take their mean, subtract off 3.5, +and divide by $1.71 / \sqrt{n}$ and repeat this over and over + + +--- +## Result of our die rolling experiment + +```{r, echo = FALSE, fig.width=9, fig.height = 6, fig.align='center'} +nosim <- 1000 +cfunc <- function(x, n) sqrt(n) * (mean(x) - 3.5) / 1.71 +dat <- data.frame( + x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE), + nosim), 1, cfunc, 10), + apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE), + nosim), 1, cfunc, 20), + apply(matrix(sample(1 : 6, nosim * 30, replace = TRUE), + nosim), 1, cfunc, 30) + ), + size = factor(rep(c(10, 20, 30), rep(nosim, 3)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.3, 
colour = "black", aes(y = ..density..)) +g <- g + stat_function(fun = dnorm, size = 2) +g + facet_grid(. ~ size) +``` + + +--- +## Coin CLT + +- Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin +- The sample proportion, say $\hat p$, is the average of the coin flips +- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ +- Standard error of the mean is $\sqrt{p(1-p)/n}$ +- Then +$$ + \frac{\hat p - p}{\sqrt{p(1-p)/n}} +$$ +will be approximately normally distributed +- Let's flip a coin $n$ times, take the sample proportion +of heads, subtract off .5 and multiply the result by +$2 \sqrt{n}$ (divide by $1/(2 \sqrt{n})$) + +--- +## Simulation results +```{r, echo = FALSE, fig.width=9, fig.height = 6, fig.align='center'} +nosim <- 1000 +cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5) +dat <- data.frame( + x = c(apply(matrix(sample(0:1, nosim * 10, replace = TRUE), + nosim), 1, cfunc, 10), + apply(matrix(sample(0:1, nosim * 20, replace = TRUE), + nosim), 1, cfunc, 20), + apply(matrix(sample(0:1, nosim * 30, replace = TRUE), + nosim), 1, cfunc, 30) + ), + size = factor(rep(c(10, 20, 30), rep(nosim, 3)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..)) +g <- g + stat_function(fun = dnorm, size = 2) +g + facet_grid(. 
~ size) +``` + +--- +## Simulation results, $p = 0.9$ +```{r, echo = FALSE, fig.width=9, fig.height = 6, fig.align='center'} +nosim <- 1000 +cfunc <- function(x, n) sqrt(n) * (mean(x) - 0.9) / sqrt(.1 * .9) +dat <- data.frame( + x = c(apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 10, replace = TRUE), + nosim), 1, cfunc, 10), + apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 20, replace = TRUE), + nosim), 1, cfunc, 20), + apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 30, replace = TRUE), + nosim), 1, cfunc, 30) + ), + size = factor(rep(c(10, 20, 30), rep(nosim, 3)))) +g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..)) +g <- g + stat_function(fun = dnorm, size = 2) +g + facet_grid(. ~ size) +``` + +--- +## Galton's quincunx + +http://en.wikipedia.org/wiki/Bean_machine#mediaviewer/File:Quincunx_(Galton_Box)_-_Galton_1889_diagram.png + + + +--- + +## Confidence intervals + +- According to the CLT, the sample mean, $\bar X$, +is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$ +- $\mu + 2 \sigma /\sqrt{n}$ is pretty far out in the tail +(only 2.5% of a normal being larger than 2 sds in the tail) +- Similarly, $\mu - 2 \sigma /\sqrt{n}$ is pretty far in the left tail (only 2.5% chance of a normal being smaller than 2 sds in the tail) +- So the probability $\bar X$ is bigger than $\mu + 2 \sigma / \sqrt{n}$ +or smaller than $\mu - 2 \sigma / \sqrt{n}$ is 5% + - Or equivalently, the probability of being between these limits is 95% +- The quantity $\bar X \pm 2 \sigma /\sqrt{n}$ is called +a 95% interval for $\mu$ +- The 95% refers to the fact that if one were to repeatedly +get samples of size $n$, about 95% of the intervals obtained +would contain $\mu$ +- The 97.5th quantile is 1.96 (so I rounded to 2 above) +- 90% interval you want (100 - 90) / 2 = 5% in each tail + - So you want the 95th percentile (1.645) + + +--- +## Give a confidence interval for the average height of sons +in 
Galton's data +```{r} +library(UsingR);data(father.son); x <- father.son$sheight +(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12 +``` + +--- + +## Sample proportions + +- In the event that each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ +- The interval takes the form +$$ + \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +$$ +- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ +- For 95% intervals +$$\hat p \pm \frac{1}{\sqrt{n}}$$ +is a quick CI estimate for $p$ + +--- +## Example +* Your campaign advisor told you that in a random sample of 100 likely voters, + 56 intend to vote for you. + * Can you relax? Do you have this race in the bag? + * Without access to a computer or calculator, how precise is this estimate? +* `1/sqrt(100)=0.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` + * Not enough for you to relax, better go do more campaigning! +* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. +```{r} +round(1 / sqrt(10 ^ (1 : 6)), 3) +``` + + + +--- +## Binomial interval + +```{r} +.56 + c(-1, 1) * qnorm(.975) * sqrt(.56 * .44 / 100) +binom.test(56, 100)$conf.int +``` + +--- + +## Simulation + +```{r} +n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000 +coverage <- sapply(pvals, function(p){ + phats <- rbinom(nosim, prob = p, size = n) / n + ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n) + ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n) + mean(ll < p & ul > p) +}) + +``` + + +--- +## Plot of the results (not so good) +```{r, echo=FALSE, fig.align='center', fig.height=6, fig.width=6} +ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(.75, 1.0) +``` + +--- +## What's happening? 
+- $n$ isn't large enough +for many of the values of $p$ +- Quick fix, form the interval with +$$ +\frac{X + 2}{n + 4} +$$ +- (Add two successes and failures, Agresti/Coull interval) + +--- +## Simulation +First let's show that coverage gets better with $n$ + +```{r} +n <- 100; pvals <- seq(.1, .9, by = .05); nosim <- 1000 +coverage2 <- sapply(pvals, function(p){ + phats <- rbinom(nosim, prob = p, size = n) / n + ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n) + ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n) + mean(ll < p & ul > p) +}) + +``` + +--- +## Plot of coverage for $n=100$ +```{r, fig.align='center', fig.height=6, fig.width=6, echo=FALSE} +ggplot(data.frame(pvals, coverage2), aes(x = pvals, y = coverage2)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ ylim(.75, 1.0) +``` + +--- +## Simulation +Now let's look at $n=20$ but adding 2 successes and failures +```{r} +n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000 +coverage <- sapply(pvals, function(p){ + phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4) + ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n) + ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n) + mean(ll < p & ul > p) +}) +``` + + +--- +## Adding 2 successes and 2 failures +(It's a little conservative) +```{r, fig.align='center', fig.height=6, fig.width=6, echo=FALSE} +ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ ylim(.75, 1.0) +``` + +--- + +## Poisson interval +* A nuclear pump failed 5 times out of 94.32 days; give a 95% confidence interval for the failure rate per day. +* $X \sim Poisson(\lambda t)$. 
+* Estimate $\hat \lambda = X/t$ +* $Var(\hat \lambda) = \lambda / t$ +* $\hat \lambda / t$ is our variance estimate + +--- +## R code +```{r} +x <- 5; t <- 94.32; lambda <- x / t +round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3) +poisson.test(x, T = 94.32)$conf +``` + + +--- +## Simulating the Poisson coverage rate +Let's see how this interval performs for lambda +values near what we're estimating +```{r} +lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000 +t <- 100 +coverage <- sapply(lambdavals, function(lambda){ + lhats <- rpois(nosim, lambda = lambda * t) / t + ll <- lhats - qnorm(.975) * sqrt(lhats / t) + ul <- lhats + qnorm(.975) * sqrt(lhats / t) + mean(ll < lambda & ul > lambda) +}) +``` + + + +--- +## Coverage +(Gets really bad for small values of lambda) +```{r, fig.align='center', fig.height=6, fig.width=6, echo=FALSE} +ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ylim(0, 1.0) +``` + + + +--- +## What if we increase t to 1000? 
+```{r, fig.align='center', fig.height=6, fig.width=6, echo=FALSE} +lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000 +t <- 1000 +coverage <- sapply(lambdavals, function(lambda){ + lhats <- rpois(nosim, lambda = lambda * t) / t + ll <- lhats - qnorm(.975) * sqrt(lhats / t) + ul <- lhats + qnorm(.975) * sqrt(lhats / t) + mean(ll < lambda & ul > lambda) +}) +ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(0, 1.0) +``` + + +--- +## Summary +- The LLN states that averages of iid samples +converge to the population means that they are estimating +- The CLT states that averages are approximately normal, with +distributions + - centered at the population mean + - with standard deviation equal to the standard error of the mean + - CLT gives no guarantee that $n$ is large enough +- Taking the mean and adding and subtracting the relevant +normal quantile times the SE yields a confidence interval for the mean + - Adding and subtracting 2 SEs works for 95% intervals +- Confidence intervals get wider as the coverage increases +(why?) +- Confidence intervals get narrower with less variability or +larger sample sizes +- The Poisson and binomial case have exact intervals that +don't require the CLT + - But a quick fix for small sample size binomial calculations is to add 2 successes and failures diff --git a/06_StatisticalInference/07_Asymptopia/index.html b/06_StatisticalInference/07_Asymptopia/index.html new file mode 100644 index 000000000..72b17e765 --- /dev/null +++ b/06_StatisticalInference/07_Asymptopia/index.html @@ -0,0 +1,850 @@ + + + + A trip to Asymptopia + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

A trip to Asymptopia

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Asymptotics

+
+
+
    +
  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
  • +
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
  • +
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • +
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • +
  • Asymptotics generally give no assurances about finite sample performance
  • +
  • Asymptotics form the basis for frequency interpretation of probabilities +(the long run proportion of times an event occurs)
  • +
+ +
+ +
+ + +
+

Limits of random variables

+
+
+
    +
  • Fortunately, for the sample mean there's a set of powerful results
  • +
  • These results allow us to talk about the large sample distribution +of sample means of a collection of \(iid\) observations
  • +
  • The first of these results we intuitively know + +
      +
    • It says that the average limits to what it's estimating, the population mean
    • +
    • It's called the Law of Large Numbers
    • +
    • Example \(\bar X_n\) could be the average of the result of \(n\) coin flips (i.e. the sample proportion of heads)
    • +
    • As we flip a fair coin over and over, it eventually converges to the +true probability of a head +The LLN forms the basis of frequency style thinking
    • +
  • +
+ +
+ +
+ + +
+

Law of large numbers in action

+
+
+
n <- 10000
+means <- cumsum(rnorm(n))/(1:n)
+library(ggplot2)
+g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
+g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
+g <- g + labs(x = "Number of obs", y = "Cumulative mean")
+g
+
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Law of large numbers in action, coin flip

+
+
+
means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)
+g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
+g <- g + geom_hline(yintercept = 0.5) + geom_line(size = 2)
+g <- g + labs(x = "Number of obs", y = "Cumulative mean")
+g
+
+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Discussion

+
+
+
    +
  • An estimator is consistent if it converges to what you want to estimate + +
      +
    • The LLN says that the sample mean of iid sample is +consistent for the population mean
    • +
    • Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer
    • +
  • +
  • The sample variance and the sample standard deviation +of iid random variables are consistent as well
  • +
+ +
+ +
+ + +
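The consistency claims above can be checked with a quick simulation; this is a minimal sketch (the N(0, 4) population, sample sizes, and seed are illustrative choices, not from the slides):

```r
# LLN / consistency sketch: the sample mean and sample variance of
# iid N(0, sd = 2) draws should settle toward mu = 0 and sigma^2 = 4
set.seed(42)
for (n in c(100, 10000, 1000000)) {
    x <- rnorm(n, mean = 0, sd = 2)
    cat(n, ": mean =", mean(x), " var =", var(x), "\n")
}
```

Both columns should visibly tighten around the population values as n grows.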
+

The Central Limit Theorem

+
+
+
    +
  • The Central Limit Theorem (CLT) is one of the most important theorems in statistics
  • +
  • For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases
  • +
  • The CLT applies in an endless variety of settings
  • +
  • The result is that +\[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= +\frac{\sqrt n (\bar X_n - \mu)}{\sigma} += \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}\] has a distribution like that of a standard normal for large \(n\).
  • +
  • (Replacing the standard error by its estimated value doesn't change the CLT)
  • +
  • The useful way to think about the CLT is that +\(\bar X_n\) is approximately +\(N(\mu, \sigma^2 / n)\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Simulate a standard normal random variable by rolling \(n\) (six sided) dice
  • +
  • Let \(X_i\) be the outcome for die \(i\)
  • +
  • Then note that \(\mu = E[X_i] = 3.5\)
  • +
  • \(Var(X_i) = 2.92\)
  • +
  • SE \(\sqrt{2.92 / n} = 1.71 / \sqrt{n}\)
  • +
  • Let's roll \(n\) dice, take their mean, subtract off 3.5, +and divide by \(1.71 / \sqrt{n}\) and repeat this over and over
  • +
+ +
+ +
+ + +
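The recipe on the slide above can be written out directly; a minimal sketch for a single choice of n (the seed and n = 10 are illustrative):

```r
# Roll n dice nosim times and standardize each mean as described above:
# sqrt(n) * (sample mean - 3.5) / 1.71
set.seed(1)
nosim <- 1000; n <- 10
rolls <- matrix(sample(1:6, nosim * n, replace = TRUE), nosim)
z <- apply(rolls, 1, function(x) sqrt(n) * (mean(x) - 3.5) / 1.71)
c(mean(z), sd(z))  # near 0 and 1 if the CLT approximation is working
```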
+

Result of our die rolling experiment

+
+
+

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Coin CLT

+
+
+
    +
  • Let \(X_i\) be the \(0\) or \(1\) result of the \(i^{th}\) flip of a possibly unfair coin + +
      +
    • The sample proportion, say \(\hat p\), is the average of the coin flips
    • +
    • \(E[X_i] = p\) and \(Var(X_i) = p(1-p)\)
    • +
    • Standard error of the mean is \(\sqrt{p(1-p)/n}\)
    • +
    • Then +\[ +\frac{\hat p - p}{\sqrt{p(1-p)/n}} +\] +will be approximately normally distributed
    • +
    • Let's flip a coin \(n\) times, take the sample proportion +of heads, subtract off .5 and multiply the result by +\(2 \sqrt{n}\) (divide by \(1/(2 \sqrt{n})\))
    • +
  • +
+ +
+ +
+ + +
+

Simulation results

+
+
+

plot of chunk unnamed-chunk-4

+ +
+ +
+ + +
+

Simulation results, \(p = 0.9\)

+
+
+

plot of chunk unnamed-chunk-5

+ +
+ +
+ + +
+

Galton's quincunx

+
+ + +
+ + +
+

Confidence intervals

+
+
+
    +
  • According to the CLT, the sample mean, \(\bar X\), +is approximately normal with mean \(\mu\) and sd \(\sigma / \sqrt{n}\)
  • +
  • \(\mu + 2 \sigma /\sqrt{n}\) is pretty far out in the tail +(only 2.5% of a normal being larger than 2 sds in the tail)
  • +
  • Similarly, \(\mu - 2 \sigma /\sqrt{n}\) is pretty far in the left tail (only 2.5% chance of a normal being smaller than 2 sds in the tail)
  • +
  • So the probability \(\bar X\) is bigger than \(\mu + 2 \sigma / \sqrt{n}\) +or smaller than \(\mu - 2 \sigma / \sqrt{n}\) is 5% + +
      +
    • Or equivalently, the probability of being between these limits is 95%
    • +
  • +
  • The quantity \(\bar X \pm 2 \sigma /\sqrt{n}\) is called +a 95% interval for \(\mu\)
  • +
  • The 95% refers to the fact that if one were to repeatedly +get samples of size \(n\), about 95% of the intervals obtained +would contain \(\mu\)
  • +
  • The 97.5th quantile is 1.96 (so I rounded to 2 above)
  • +
  • 90% interval you want (100 - 90) / 2 = 5% in each tail + +
      +
    • So you want the 95th percentile (1.645)
    • +
  • +
+ +
+ +
+ + +
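The standard normal quantiles quoted above are easy to verify in R:

```r
# 97.5th percentile of the standard normal, rounded to 2 in the interval above
qnorm(0.975)  # roughly 1.96
# 95th percentile, the relevant quantile for 90% intervals
qnorm(0.95)   # roughly 1.645
```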
+

Give a confidence interval for the average height of sons

+
+
+

in Galton's data

+ +
library(UsingR)
+data(father.son)
+x <- father.son$sheight
+(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12
+
+ +
## [1] 5.710 5.738
+
+ +
+ +
+ + +
+

Sample proportions

+
+
+
    +
  • In the event that each \(X_i\) is \(0\) or \(1\) with common success probability \(p\) then \(\sigma^2 = p(1 - p)\)
  • +
  • The interval takes the form +\[ +\hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +\]
  • +
  • Replacing \(p\) by \(\hat p\) in the standard error results in what is called a Wald confidence interval for \(p\)
  • +
  • For 95% intervals +\[\hat p \pm \frac{1}{\sqrt{n}}\] +is a quick CI estimate for \(p\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Your campaign advisor told you that in a random sample of 100 likely voters, +56 intend to vote for you. + +
      +
    • Can you relax? Do you have this race in the bag?
    • +
    • Without access to a computer or calculator, how precise is this estimate?
    • +
  • +
  • 1/sqrt(100)=0.1 so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66) + +
      +
    • Not enough for you to relax, better go do more campaigning!
    • +
  • +
  • Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
  • +
+ +
round(1/sqrt(10^(1:6)), 3)
+
+ +
## [1] 0.316 0.100 0.032 0.010 0.003 0.001
+
+ +
+ +
+ + +
+

Binomial interval

+
+
+
0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)
+
+ +
## [1] 0.4627 0.6573
+
+ +
binom.test(56, 100)$conf.int
+
+ +
## [1] 0.4572 0.6592
+## attr(,"conf.level")
+## [1] 0.95
+
+ +
+ +
+ + +
+

Simulation

+
+
+
n <- 20
+pvals <- seq(0.1, 0.9, by = 0.05)
+nosim <- 1000
+coverage <- sapply(pvals, function(p) {
+    phats <- rbinom(nosim, prob = p, size = n)/n
+    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    mean(ll < p & ul > p)
+})
+
+ +
+ +
+ + +
+

Plot of the results (not so good)

+
+
+

plot of chunk unnamed-chunk-10

+ +
+ +
+ + +
+

What's happening?

+
+
+
    +
  • \(n\) isn't large enough for the CLT to be applicable +for many of the values of \(p\)
  • +
  • Quick fix, form the interval with +\[ +\frac{X + 2}{n + 4} +\]
  • +
  • (Add two successes and failures, Agresti/Coull interval)
  • +
+ +
+ +
+ + +
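Applied to the earlier 56-out-of-100 polling example, the quick fix above gives the following; this is a sketch, and using n + 4 in the standard error is one common form of the Agresti/Coull interval (the slides' simulation divides by n instead):

```r
# Add 2 successes and 2 failures, then form a Wald-style interval
x <- 56; n <- 100
ptilde <- (x + 2) / (n + 4)
ptilde + c(-1, 1) * qnorm(0.975) * sqrt(ptilde * (1 - ptilde) / (n + 4))
```

The adjusted estimate shrinks slightly toward 1/2 relative to 0.56.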
+

Simulation

+
+
+

First let's show that coverage gets better with \(n\)

+ +
n <- 100
+pvals <- seq(0.1, 0.9, by = 0.05)
+nosim <- 1000
+coverage2 <- sapply(pvals, function(p) {
+    phats <- rbinom(nosim, prob = p, size = n)/n
+    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    mean(ll < p & ul > p)
+})
+
+ +
+ +
+ + +
+

Plot of coverage for \(n=100\)

+
+
+

plot of chunk unnamed-chunk-12

+ +
+ +
+ + +
+

Simulation

+
+
+

Now let's look at \(n=20\) but adding 2 successes and failures

+ +
n <- 20
+pvals <- seq(0.1, 0.9, by = 0.05)
+nosim <- 1000
+coverage <- sapply(pvals, function(p) {
+    phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4)
+    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
+    mean(ll < p & ul > p)
+})
+
+ +
+ +
+ + +
+

Adding 2 successes and 2 failures

+
+
+

(It's a little conservative) +plot of chunk unnamed-chunk-14

+ +
+ +
+ + +
+

Poisson interval

+
+
+
    +
  • A nuclear pump failed 5 times in 94.32 days of operation; give a 95% confidence interval for the failure rate per day.
  • +
  • \(X \sim Poisson(\lambda t)\).
  • +
  • Estimate \(\hat \lambda = X/t\)
  • +
  • \(Var(\hat \lambda) = \lambda / t\)
  • +
  • \(\hat \lambda / t\) is our variance estimate
  • +
+ +
+ +
+ + +
+

R code

+
+
+
x <- 5
+t <- 94.32
+lambda <- x/t
+round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
+
+ +
## [1] 0.007 0.099
+
+ +
poisson.test(x, T = 94.32)$conf
+
+ +
## [1] 0.01721 0.12371
+## attr(,"conf.level")
+## [1] 0.95
+
+ +
+ +
+ + +
+

Simulating the Poisson coverage rate

+
+
+

Let's see how this interval performs for lambda +values near what we're estimating

+ +
lambdavals <- seq(0.005, 0.1, by = 0.01)
+nosim <- 1000
+t <- 100
+coverage <- sapply(lambdavals, function(lambda) {
+    lhats <- rpois(nosim, lambda = lambda * t)/t
+    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
+    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
+    mean(ll < lambda & ul > lambda)
+})
+
+ +
+ +
+ + +
+

Coverage

+
+
+

(Gets really bad for small values of lambda) +plot of chunk unnamed-chunk-17

+ +
+ +
+ + +
+

What if we increase t to 1000?

+
+
+

plot of chunk unnamed-chunk-18

+ +
+ +
+ + +
+
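The plot for larger t can be regenerated by rerunning the earlier coverage simulation with only the monitoring time changed (a sketch of what presumably produced the figure; same code as before, t increased to 1000):

```r
# Same simulation as for t = 100, but with monitoring time t = 1000, so
# lambda * t is larger and the normal approximation improves
lambdavals <- seq(0.005, 0.1, by = 0.01)
nosim <- 1000
t <- 1000
coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(nosim, lambda = lambda * t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})
```

With t = 1000 even the smallest lambda on the grid gives an expected count of 5, and coverage improves across the grid, matching the plot.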

Summary

+
+
+
    +
  • The LLN states that averages of iid samples +converge to the population means that they are estimating
  • +
  • The CLT states that averages are approximately normal, with +distributions + +
      +
    • centered at the population mean
    • +
    • with standard deviation equal to the standard error of the mean
    • +
    • The CLT gives no guarantee that \(n\) is large enough
    • +
  • +
  • Taking the mean and adding and subtracting the relevant +normal quantile times the SE yields a confidence interval for the mean + +
      +
    • Adding and subtracting 2 SEs works for 95% intervals
    • +
  • +
  • Confidence intervals get wider as the coverage increases +(why?)
  • +
  • Confidence intervals get narrower with less variability or +larger sample sizes
  • +
  • The Poisson and binomial case have exact intervals that +don't require the CLT + +
      +
    • But a quick fix for small-sample binomial calculations is to add 2 successes and 2 failures
    • +
  • +
+ +
+ +
+ + +
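Why intervals widen as coverage increases can be seen directly from the normal quantile in the margin of error; a quick sketch (not from the slides):

```r
# Margin of error is (normal quantile) x SE; higher coverage means a
# larger quantile and hence a wider interval
coverages <- c(0.90, 0.95, 0.99)
quantiles <- qnorm(1 - (1 - coverages)/2)
round(quantiles, 3)
## [1] 1.645 1.960 2.576
```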
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/07_Asymptopia/index.md b/06_StatisticalInference/07_Asymptopia/index.md new file mode 100644 index 000000000..eccfcd58c --- /dev/null +++ b/06_StatisticalInference/07_Asymptopia/index.md @@ -0,0 +1,419 @@ +--- +title : A trip to Asymptopia +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Asymptotics +* Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number) +* (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.) 
+* Asymptotics are incredibly useful for simple statistical inference and approximations +* (Not covered in this class) Asymptotics often lead to nice understanding of procedures +* Asymptotics generally give no assurances about finite sample performance +* Asymptotics form the basis for frequency interpretation of probabilities + (the long run proportion of times an event occurs) + + +--- + +## Limits of random variables + +- Fortunately, for the sample mean there's a set of powerful results +- These results allow us to talk about the large sample distribution +of sample means of a collection of $iid$ observations +- The first of these results we inuitively know + - It says that the average limits to what its estimating, the population mean + - It's called the Law of Large Numbers + - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads) + - As we flip a fair coin over and over, it evetually converges to the + true probability of a head + The LLN forms the basis of frequency style thinking + + +--- +## Law of large numbers in action + +```r +n <- 10000 +means <- cumsum(rnorm(n))/(1:n) +library(ggplot2) +g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y)) +g <- g + geom_hline(yintercept = 0) + geom_line(size = 2) +g <- g + labs(x = "Number of obs", y = "Cumulative mean") +g +``` + +![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) + + + +--- +## Law of large numbers in action, coin flip + +```r +means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n) +g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y)) +g <- g + geom_hline(yintercept = 0.5) + geom_line(size = 2) +g <- g + labs(x = "Number of obs", y = "Cumulative mean") +g +``` + +![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) + + + + +--- +## Discussion +- An estimator is **consistent** if it converges to what you want to estimate + - The LLN says that the sample mean of iid sample is + consistent for the 
population mean + - Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer +- The sample variance and the sample standard deviation +of iid random variables are consistent as well + +--- + +## The Central Limit Theorem + +- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics +- For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases +- The CLT applies in an endless variety of settings +- The result is that +$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= +\frac{\sqrt n (\bar X_n - \mu)}{\sigma} += \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}$$ has a distribution like that of a standard normal for large $n$. +- (Replacing the standard error by its estimated value doesn't change the CLT) +- The useful way to think about the CLT is that +$\bar X_n$ is approximately +$N(\mu, \sigma^2 / n)$ + + + +--- + +## Example + +- Simulate a standard normal random variable by rolling $n$ (six sided) +- Let $X_i$ be the outcome for die $i$ +- Then note that $\mu = E[X_i] = 3.5$ +- $Var(X_i) = 2.92$ +- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ +- Lets roll $n$ dice, take their mean, subtract off 3.5, +and divide by $1.71 / \sqrt{n}$ and repeat this over and over + + +--- +## Result of our die rolling experiment + +plot of chunk unnamed-chunk-3 + + + +--- +## Coin CLT + + - Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin +- The sample proportion, say $\hat p$, is the average of the coin flips +- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ +- Standard error of the mean is $\sqrt{p(1-p)/n}$ +- Then +$$ + \frac{\hat p - p}{\sqrt{p(1-p)/n}} +$$ +will be approximately normally distributed +- Let's flip a coin $n$ times, take the sample proportion +of heads, subtract off .5 and 
multiply the result by +$2 \sqrt{n}$ (divide by $1/(2 \sqrt{n})$) + +--- +## Simulation results +plot of chunk unnamed-chunk-4 + + +--- +## Simulation results, $p = 0.9$ +plot of chunk unnamed-chunk-5 + + +--- +## Galton's quincunx + +http://en.wikipedia.org/wiki/Bean_machine#mediaviewer/File:Quincunx_(Galton_Box)_-_Galton_1889_diagram.png + + + +--- + +## Confidence intervals + +- According to the CLT, the sample mean, $\bar X$, +is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$ +- $\mu + 2 \sigma /\sqrt{n}$ is pretty far out in the tail +(only 2.5% of a normal being larger than 2 sds in the tail) +- Similarly, $\mu - 2 \sigma /\sqrt{n}$ is pretty far in the left tail (only 2.5% chance of a normal being smaller than 2 sds in the tail) +- So the probability $\bar X$ is bigger than $\mu + 2 \sigma / \sqrt{n}$ +or smaller than $\mu - 2 \sigma / \sqrt{n}$ is 5% + - Or equivalently, the probability of being between these limits is 95% +- The quantity $\bar X \pm 2 \sigma /\sqrt{n}$ is called +a 95% interval for $\mu$ +- The 95% refers to the fact that if one were to repeatly +get samples of size $n$, about 95% of the intervals obtained +would contain $\mu$ +- The 97.5th quantile is 1.96 (so I rounded to 2 above) +- 90% interval you want (100 - 90) / 2 = 5% in each tail + - So you want the 95th percentile (1.645) + + +--- +## Give a confidence interval for the average height of sons +in Galton's data + +```r +library(UsingR) +data(father.son) +x <- father.son$sheight +(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12 +``` + +``` +## [1] 5.710 5.738 +``` + + +--- + +## Sample proportions + +- In the event that each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ +- The interval takes the form +$$ + \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +$$ +- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ +- For 95% intervals +$$\hat p \pm 
\frac{1}{\sqrt{n}}$$ +is a quick CI estimate for $p$ + +--- +## Example +* Your campaign advisor told you that in a random sample of 100 likely voters, + 56 intent to vote for you. + * Can you relax? Do you have this race in the bag? + * Without access to a computer or calculator, how precise is this estimate? +* `1/sqrt(100)=0.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` + * Not enough for you to relax, better go do more campaigning! +* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. + +```r +round(1/sqrt(10^(1:6)), 3) +``` + +``` +## [1] 0.316 0.100 0.032 0.010 0.003 0.001 +``` + + + + +--- +## Binomial interval + + +```r +0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44/100) +``` + +``` +## [1] 0.4627 0.6573 +``` + +```r +binom.test(56, 100)$conf.int +``` + +``` +## [1] 0.4572 0.6592 +## attr(,"conf.level") +## [1] 0.95 +``` + + +--- + +## Simulation + + +```r +n <- 20 +pvals <- seq(0.1, 0.9, by = 0.05) +nosim <- 1000 +coverage <- sapply(pvals, function(p) { + phats <- rbinom(nosim, prob = p, size = n)/n + ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) + ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) + mean(ll < p & ul > p) +}) +``` + + + +--- +## Plot of the results (not so good) +plot of chunk unnamed-chunk-10 + + +--- +## What's happening? 
+- $n$ isn't large enough for the CLT to be applicable +for many of the values of $p$ +- Quick fix, form the interval with +$$ +\frac{X + 2}{n + 4} +$$ +- (Add two successes and failures, Agresti/Coull interval) + +--- +## Simulation +First let's show that coverage gets better with $n$ + + +```r +n <- 100 +pvals <- seq(0.1, 0.9, by = 0.05) +nosim <- 1000 +coverage2 <- sapply(pvals, function(p) { + phats <- rbinom(nosim, prob = p, size = n)/n + ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) + ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) + mean(ll < p & ul > p) +}) +``` + + +--- +## Plot of coverage for $n=100$ +plot of chunk unnamed-chunk-12 + + +--- +## Simulation +Now let's look at $n=20$ but adding 2 successes and failures + +```r +n <- 20 +pvals <- seq(0.1, 0.9, by = 0.05) +nosim <- 1000 +coverage <- sapply(pvals, function(p) { + phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4) + ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) + ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) + mean(ll < p & ul > p) +}) +``` + + + +--- +## Adding 2 successes and 2 failures +(It's a little conservative) +plot of chunk unnamed-chunk-14 + + +--- + +## Poisson interval +* A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day? +* $X \sim Poisson(\lambda t)$. 
+* Estimate $\hat \lambda = X/t$ +* $Var(\hat \lambda) = \lambda / t$ +* $\hat \lambda / t$ is our variance estimate + +--- +## R code + +```r +x <- 5 +t <- 94.32 +lambda <- x/t +round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3) +``` + +``` +## [1] 0.007 0.099 +``` + +```r +poisson.test(x, T = 94.32)$conf +``` + +``` +## [1] 0.01721 0.12371 +## attr(,"conf.level") +## [1] 0.95 +``` + + + +--- +## Simulating the Poisson coverage rate +Let's see how this interval performs for lambda +values near what we're estimating + +```r +lambdavals <- seq(0.005, 0.1, by = 0.01) +nosim <- 1000 +t <- 100 +coverage <- sapply(lambdavals, function(lambda) { + lhats <- rpois(nosim, lambda = lambda * t)/t + ll <- lhats - qnorm(0.975) * sqrt(lhats/t) + ul <- lhats + qnorm(0.975) * sqrt(lhats/t) + mean(ll < lambda & ul > lambda) +}) +``` + + + + +--- +## Covarage +(Gets really bad for small values of lambda) +plot of chunk unnamed-chunk-17 + + + + +--- +## What if we increase t to 1000? +plot of chunk unnamed-chunk-18 + + + +--- +## Summary +- The LLN states that averages of iid samples +converge to the population means that they are estimating +- The CLT states that averages are approximately normal, with +distributions + - centered at the population mean + - with standard deviation equal to the standard error of the mean + - CLT gives no guarantee that $n$ is large enough +- Taking the mean and adding and subtracting the relevant +normal quantile times the SE yields a confidence interval for the mean + - Adding and subtracting 2 SEs works for 95% intervals +- Confidence intervals get wider as the coverage increases +(why?) 
+- Confidence intervals get narrower with less variability or +larger sample sizes +- The Poisson and binomial case have exact intervals that +don't require the CLT + - But a quick fix for small sample size binomial calculations is to add 2 successes and failures diff --git a/06_StatisticalInference/07_Asymptopia/index.pdf b/06_StatisticalInference/07_Asymptopia/index.pdf new file mode 100644 index 000000000..79cf80d5c Binary files /dev/null and b/06_StatisticalInference/07_Asymptopia/index.pdf differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-10.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-10.png new file mode 100644 index 000000000..83ff203af Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-10.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-11.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-11.png new file mode 100644 index 000000000..42169d11d Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-11.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-12.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-12.png new file mode 100644 index 000000000..a6b1571c6 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-12.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-13.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-13.png new file mode 100644 index 000000000..42169d11d Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-13.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..83ff203af Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-2.png differ diff --git 
a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..c9cf93fe1 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-4.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-4.png new file mode 100644 index 000000000..83ff203af Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-4.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-5.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-5.png new file mode 100644 index 000000000..1793327e3 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-5.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-6.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-6.png new file mode 100644 index 000000000..1793327e3 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-6.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-7.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-7.png new file mode 100644 index 000000000..1793327e3 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-7.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-8.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-8.png new file mode 100644 index 000000000..83ff203af Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-8.png differ diff --git a/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-9.png b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-9.png new file mode 100644 index 000000000..a6b1571c6 Binary files /dev/null and b/06_StatisticalInference/08_tCIs/assets/fig/unnamed-chunk-9.png 
differ diff --git a/06_StatisticalInference/08_tCIs/index.Rmd b/06_StatisticalInference/08_tCIs/index.Rmd new file mode 100644 index 000000000..44d768a00 --- /dev/null +++ b/06_StatisticalInference/08_tCIs/index.Rmd @@ -0,0 +1,292 @@ +--- +title : T Confidence Intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## T Confidence intervals + +- In the previous, we discussed creating a confidence interval using the CLT + - They took the form $Est \pm ZQ \times SE_{Est}$ +- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution and $t$ confidence intervals + - They are of the form $Est \pm TQ \times SE_{Est}$ +- These are some of the handiest of intervals +- If you want a rule between whether to use a $t$ interval +or normal interval, just always use the $t$ interval +- We'll cover the one and two group versions + +--- + +## Gosset's $t$ distribution + +- Invented by William Gosset (under the pseudonym "Student") in 1908 +- Has thicker tails than the normal +- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger +- It assumes that the underlying data are iid +Gaussian with the result that +$$ +\frac{\bar X - \mu}{S/\sqrt{n}} +$$ +follows Gosset's $t$ distribution with $n-1$ degrees of freedom +- (If we replaced $s$ by $\sigma$ the statistic would be exactly standard normal) +- Interval is $\bar X \pm t_{n-1} S/\sqrt{n}$ where $t_{n-1}$ +is the relevant quantile + +--- +## Code for manipulate +```{r, echo=TRUE,eval=FALSE} +library(ggplot2); library(manipulate) +k <- 1000 
+xvals <- seq(-5, 5, length = k) +myplot <- function(df){ + d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), + x = xvals, + dist = factor(rep(c("Normal", "T"), c(k,k)))) + g <- ggplot(d, aes(x = x, y = y)) + g <- g + geom_line(size = 2, aes(colour = dist)) + g +} +manipulate(myplot(mu), mu = slider(1, 20, step = 1)) +``` + +--- +## Easier to see +```{r, eval = FALSE, echo = TRUE} +pvals <- seq(.5, .99, by = .01) +myplot2 <- function(df){ + d <- data.frame(n= qnorm(pvals),t=qt(pvals, df), + p = pvals) + g <- ggplot(d, aes(x= n, y = t)) + g <- g + geom_abline(size = 2, col = "lightblue") + g <- g + geom_line(size = 2, col = "black") + g <- g + geom_vline(xintercept = qnorm(0.975)) + g <- g + geom_hline(yintercept = qt(0.975, df)) + g +} +manipulate(myplot2(df), df = slider(1, 20, step = 1)) +``` + +--- + +## Note's about the $t$ interval + +- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption +- It works well whenever the distribution of the data is roughly symmetric and mound shaped +- Paired observations are often analyzed using the $t$ interval by taking differences +- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded +- For skewed distributions, the spirit of the $t$ interval assumptions are violated + - Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean + - In this case, consider taking logs or using a different summary like the median +- For highly discrete data, like binary, other intervals are available + +--- + +## Sleep data + +In R typing `data(sleep)` brings up the sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired. 
+ +--- +## The data +```{r} +data(sleep) +head(sleep) +``` + +--- +## Plotting the data +```{r, echo = FALSE, fig.width=6, fig.height=6, fig.align='center'} +library(ggplot2) +g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID))) +g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5) +g +``` + +--- +## Results +```{r, echo=TRUE} +g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20] +difference <- g2 - g1 +mn <- mean(difference); s <- sd(difference); n <- 10 +``` +```{r, echo=TRUE,eval=FALSE} +mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n) +t.test(difference) +t.test(g2, g1, paired = TRUE) +t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep) +``` + +--- +## The results +(After a little formatting) +```{r, echo = FALSE} +rbind( +mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n), +as.vector(t.test(difference)$conf.int), +as.vector(t.test(g2, g1, paired = TRUE)$conf.int), +as.vector(t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)$conf.int) +) +``` + +--- + +## Independent group $t$ confidence intervals + +- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo +- We cannot use the paired t test because the groups are independent and may have different sample sizes +- We now present methods for comparing independent groups + +--- +## Confidence interval + +- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is +$$ + \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} +$$ +- The pooled variance estimator is $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ +- Remember this interval is assuming a constant variance across the two groups +- If there is some doubt, assume a different variance per group, which we will discuss later + +--- + +## Example +### Based on Rosner, 
Fundamentals of Biostatistics +(Really a very good reference book) + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- Pooled variance estimate +```{r} +sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) +132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 +``` + + +--- +## Mistakenly treating the sleep data as grouped +```{r} +n1 <- length(g1); n2 <- length(g2) +sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) +md <- mean(g2) - mean(g1) +semd <- sp * sqrt(1 / n1 + 1/n2) +rbind( +md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd, +t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf, +t.test(g2, g1, paired = TRUE)$conf +) +``` + +--- +## Grouped versus independent +```{r, echo = FALSE, fig.width=6, fig.height=6, fig.align='center'} +library(ggplot2) +g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID))) +g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5) +g +``` + +--- + +## `ChickWeight` data in R +```{r} +library(datasets); data(ChickWeight); library(reshape2) +##define weight gain or loss +wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") +names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "") +library(dplyr) +wideCW <- mutate(wideCW, + gain = time21 - time0 +) + +``` + +--- +## Plotting the raw data + +```{r, echo =FALSE, fig.align='center', fig.width=12, fig.height=6} +g <- ggplot(ChickWeight, aes(x = Time, y = weight, + colour = Diet, group = Chick)) +g <- g + geom_line() +g <- g + stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +g <- g + facet_grid(. 
~ Diet) +g +``` + + + +--- +## Weight gain by diet +```{r, echo=FALSE, fig.align='center', fig.width=6, fig.height=6, warning=FALSE} +g <- ggplot(wideCW, aes(x = factor(Diet), y = gain, fill = factor(Diet))) +g <- g + geom_violin(col = "black", size = 2) +g + +``` + +--- +## Let's do a t interval +```{r} +wideCW14 <- subset(wideCW, Diet %in% c(1, 4)) +rbind( +t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf, +t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf +) +``` + + +--- + +## Unequal variances + +- Under unequal variances +$$ +\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2} +$$ +where $t_{df}$ is calculated with degrees of freedom +$$ +df= \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} + {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} +$$ +will be approximately a 95% interval +- This works really well + - So when in doubt, just assume unequal variances + +--- + +## Example + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- $df=15.04$, $t_{15.04, .975} = 2.13$ +- Interval +$$ +132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} += [-8.91, 19.75] +$$ +- In R, `t.test(..., var.equal = FALSE)` + +--- +## Comparing other kinds of data +* For binomial data, there's lots of ways to compare two groups + * Relative risk, risk difference, odds ratio. + * Chi-squared tests, normal approximations, exact tests. +* For count data, there's also Chi-squared tests and exact tests. +* We'll leave the discussions for comparing groups of data for binary + and count data until covering glms in the regression class. +* In addition, Mathematical Biostatistics Boot Camp 2 covers many special + cases relevant to biostatistics. 
+ diff --git a/06_StatisticalInference/08_tCIs/index.html b/06_StatisticalInference/08_tCIs/index.html new file mode 100644 index 000000000..910843b4e --- /dev/null +++ b/06_StatisticalInference/08_tCIs/index.html @@ -0,0 +1,668 @@ + + + + T Confidence Intervals + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

T Confidence Intervals

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

T Confidence intervals

+
+
+
    +
  • In the previous lecture, we discussed creating a confidence interval using the CLT + +
      +
    • They took the form \(Est \pm ZQ \times SE_{Est}\)
    • +
  • +
  • In this lecture, we discuss some methods for small samples, notably Gosset's \(t\) distribution and \(t\) confidence intervals + +
      +
    • They are of the form \(Est \pm TQ \times SE_{Est}\)
    • +
  • +
  • These are some of the handiest of intervals
  • +
  • If you want a rule for whether to use a \(t\) interval +or a normal interval, just always use the \(t\) interval
  • +
  • We'll cover the one and two group versions
  • +
+ +
+ +
+ + +
+

Gosset's \(t\) distribution

+
+
+
    +
  • Invented by William Gosset (under the pseudonym "Student") in 1908
  • +
  • Has thicker tails than the normal
  • +
  • Is indexed by degrees of freedom; gets more like a standard normal as the df gets larger
  • +
  • It assumes that the underlying data are iid +Gaussian with the result that +\[ +\frac{\bar X - \mu}{S/\sqrt{n}} +\] +follows Gosset's \(t\) distribution with \(n-1\) degrees of freedom
  • +
  • (If we replaced \(S\) by \(\sigma\) the statistic would be exactly standard normal)
  • +
  • Interval is \(\bar X \pm t_{n-1} S/\sqrt{n}\) where \(t_{n-1}\) +is the relevant quantile
  • +
+ +
+ +
+ + +
+

Code for manipulate

+
+
+
library(ggplot2)
+library(manipulate)
+k <- 1000
+xvals <- seq(-5, 5, length = k)
+myplot <- function(df) {
+    d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), x = xvals, dist = factor(rep(c("Normal", 
+        "T"), c(k, k))))
+    g <- ggplot(d, aes(x = x, y = y))
+    g <- g + geom_line(size = 2, aes(colour = dist))
+    g
+}
+manipulate(myplot(mu), mu = slider(1, 20, step = 1))
+
+ +
+ +
+ + +
+

Easier to see

+
+
+
pvals <- seq(0.5, 0.99, by = 0.01)
+myplot2 <- function(df) {
+    d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals)
+    g <- ggplot(d, aes(x = n, y = t))
+    g <- g + geom_abline(size = 2, col = "lightblue")
+    g <- g + geom_line(size = 2, col = "black")
+    g <- g + geom_vline(xintercept = qnorm(0.975))
+    g <- g + geom_hline(yintercept = qt(0.975, df))
+    g
+}
+manipulate(myplot2(df), df = slider(1, 20, step = 1))
+
+ +
+ +
+ + +
+

Notes about the \(t\) interval

+
+
+
    +
  • The \(t\) interval technically assumes that the data are iid normal, though it is robust to this assumption
  • +
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • +
  • Paired observations are often analyzed using the \(t\) interval by taking differences
  • +
  • For large degrees of freedom, \(t\) quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded
  • +
  • For skewed distributions, the spirit of the \(t\) interval assumptions are violated + +
      +
    • Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
    • +
    • In this case, consider taking logs or using a different summary like the median
    • +
  • +
  • For highly discrete data, like binary, other intervals are available
  • +
+ +
+ +
+ + +
+

Sleep data

+
+
+

In R, typing data(sleep) brings up the sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours of sleep for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired.


The data

data(sleep)
head(sleep)

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

Plotting the data

## Warning: package 'ggplot2' was built under R version 3.1.1

plot of chunk unnamed-chunk-4


Results

g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference)
s <- sd(difference)
n <- 10

mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)
t.test(difference)
t.test(g2, g1, paired = TRUE)
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)

The results


(After a little formatting)

##        [,1] [,2]
## [1,] 0.7001 2.46
## [2,] 0.7001 2.46
## [3,] 0.7001 2.46
## [4,] 0.7001 2.46

Independent group \(t\) confidence intervals

  • Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo
  • We cannot use the paired t test because the groups are independent and may have different sample sizes
  • We now present methods for comparing independent groups

Confidence interval

  • Therefore a \((1 - \alpha)\times 100\%\) confidence interval for \(\mu_y - \mu_x\) is
\[
\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2} S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}
\]
  • The pooled variance estimator is \[S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\]
  • Remember this interval is assuming a constant variance across the two groups
  • If there is some doubt, assume a different variance per group, which we will discuss later
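The two formulas above can be wrapped in a small helper that builds the pooled interval straight from summary statistics (a sketch; `pooledInterval` is our own name, not lecture code):

```r
# (1 - alpha) pooled-variance t interval for mu_y - mu_x from summary stats
pooledInterval <- function(mx, my, sx, sy, nx, ny, alpha = 0.05) {
    sp <- sqrt(((nx - 1) * sx^2 + (ny - 1) * sy^2) / (nx + ny - 2))
    my - mx + c(-1, 1) * qt(1 - alpha/2, nx + ny - 2) * sp * sqrt(1/nx + 1/ny)
}
# Oral contraceptive example below: y = OC users, x = controls
pooledInterval(mx = 127.44, my = 132.86, sx = 18.23, sy = 15.34, nx = 21, ny = 8)
```

This reproduces the interval of roughly (-9.52, 20.36) computed on the following slide.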

Example


Based on Rosner, Fundamentals of Biostatistics


(Really a very good reference book)

  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • Pooled variance estimate
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2)/(8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5

## [1] -9.521 20.361

Mistakenly treating the sleep data as grouped

n1 <- length(g1)
n2 <- length(g2)
sp <- sqrt(((n1 - 1) * sd(g1)^2 + (n2 - 1) * sd(g2)^2)/(n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1/n1 + 1/n2)
rbind(md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd, t.test(g2, g1, paired = FALSE, 
    var.equal = TRUE)$conf, t.test(g2, g1, paired = TRUE)$conf)

##         [,1]  [,2]
## [1,] -0.2039 3.364
## [2,] -0.2039 3.364
## [3,]  0.7001 2.460

Grouped versus independent


plot of chunk unnamed-chunk-10


ChickWeight data in R

library(datasets)
data(ChickWeight)
library(reshape2)
## define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1:2)] <- paste("time", names(wideCW)[-(1:2)], sep = "")
library(dplyr)

## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wideCW <- mutate(wideCW, gain = time21 - time0)

Plotting the raw data


plot of chunk unnamed-chunk-12


Weight gain by diet


plot of chunk unnamed-chunk-13


Let's do a t interval

wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
rbind(t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf, 
    t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf)

##        [,1]   [,2]
## [1,] -108.1 -14.81
## [2,] -104.7 -18.30

Unequal variances

  • Under unequal variances
\[
\bar Y - \bar X \pm t_{df} \times \left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}
\]
where \(t_{df}\) is calculated with degrees of freedom
\[
df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2}
{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) +
 \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}
\]
will be approximately a 95% interval
  • This works really well
      • So when in doubt, just assume unequal variances

Example

  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • \(df = 15.04\), \(t_{15.04, .975} = 2.13\)
  • Interval
\[
132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2}
= [-8.91, 19.75]
\]
  • In R, t.test(..., var.equal = FALSE)
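The \(df\) and interval quoted above can be verified numerically from the summary statistics (a quick sketch, not lecture code):

```r
# Welch degrees of freedom and interval for the SBP example
sx <- 15.34; nx <- 8    # oral contraceptive users
sy <- 18.23; ny <- 21   # controls
df <- (sx^2/nx + sy^2/ny)^2 /
    ((sx^2/nx)^2/(nx - 1) + (sy^2/ny)^2/(ny - 1))
df                                                      # about 15.04
132.86 - 127.44 + c(-1, 1) * qt(0.975, df) * sqrt(sx^2/nx + sy^2/ny)
```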

Comparing other kinds of data

  • For binomial data, there's lots of ways to compare two groups
      • Relative risk, risk difference, odds ratio.
      • Chi-squared tests, normal approximations, exact tests.
  • For count data, there's also Chi-squared tests and exact tests.
  • We'll leave the discussions for comparing groups of data for binary and count data until covering glms in the regression class.
  • In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases relevant to biostatistics.
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/08_tCIs/index.md b/06_StatisticalInference/08_tCIs/index.md new file mode 100644 index 000000000..e140c1b27 --- /dev/null +++ b/06_StatisticalInference/08_tCIs/index.md @@ -0,0 +1,345 @@ +--- +title : T Confidence Intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## T Confidence intervals + +- In the previous, we discussed creating a confidence interval using the CLT + - They took the form $Est \pm ZQ \times SE_{Est}$ +- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution and $t$ confidence intervals + - They are of the form $Est \pm TQ \times SE_{Est}$ +- These are some of the handiest of intervals +- If you want a rule between whether to use a $t$ interval +or normal interval, just always use the $t$ interval +- We'll cover the one and two group versions + +--- + +## Gosset's $t$ distribution + +- Invented by William Gosset (under the pseudonym "Student") in 1908 +- Has thicker tails than the normal +- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger +- It assumes that the underlying data are iid +Gaussian with the result that +$$ +\frac{\bar X - \mu}{S/\sqrt{n}} +$$ +follows Gosset's $t$ distribution with $n-1$ degrees of freedom +- (If we replaced $s$ by $\sigma$ the statistic would be exactly standard normal) +- Interval is $\bar X \pm t_{n-1} S/\sqrt{n}$ where $t_{n-1}$ +is the relevant quantile + +--- +## Code for manipulate + +```r +library(ggplot2) 
+library(manipulate) +k <- 1000 +xvals <- seq(-5, 5, length = k) +myplot <- function(df) { + d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), x = xvals, dist = factor(rep(c("Normal", + "T"), c(k, k)))) + g <- ggplot(d, aes(x = x, y = y)) + g <- g + geom_line(size = 2, aes(colour = dist)) + g +} +manipulate(myplot(mu), mu = slider(1, 20, step = 1)) +``` + + +--- +## Easier to see + +```r +pvals <- seq(0.5, 0.99, by = 0.01) +myplot2 <- function(df) { + d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals) + g <- ggplot(d, aes(x = n, y = t)) + g <- g + geom_abline(size = 2, col = "lightblue") + g <- g + geom_line(size = 2, col = "black") + g <- g + geom_vline(xintercept = qnorm(0.975)) + g <- g + geom_hline(yintercept = qt(0.975, df)) + g +} +manipulate(myplot2(df), df = slider(1, 20, step = 1)) +``` + + +--- + +## Note's about the $t$ interval + +- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption +- It works well whenever the distribution of the data is roughly symmetric and mound shaped +- Paired observations are often analyzed using the $t$ interval by taking differences +- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded +- For skewed distributions, the spirit of the $t$ interval assumptions are violated + - Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean + - In this case, consider taking logs or using a different summary like the median +- For highly discrete data, like binary, other intervals are available + +--- + +## Sleep data + +In R typing `data(sleep)` brings up the sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired. 
+ +--- +## The data + +```r +data(sleep) +head(sleep) +``` + +``` +## extra group ID +## 1 0.7 1 1 +## 2 -1.6 1 2 +## 3 -0.2 1 3 +## 4 -1.2 1 4 +## 5 -0.1 1 5 +## 6 3.4 1 6 +``` + + +--- +## Plotting the data + +``` +## Warning: package 'ggplot2' was built under R version 3.1.1 +``` + +plot of chunk unnamed-chunk-4 + + +--- +## Results + +```r +g1 <- sleep$extra[1:10] +g2 <- sleep$extra[11:20] +difference <- g2 - g1 +mn <- mean(difference) +s <- sd(difference) +n <- 10 +``` + + +```r +mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n) +t.test(difference) +t.test(g2, g1, paired = TRUE) +t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep) +``` + + +--- +## The results +(After a little formatting) + +``` +## [,1] [,2] +## [1,] 0.7001 2.46 +## [2,] 0.7001 2.46 +## [3,] 0.7001 2.46 +## [4,] 0.7001 2.46 +``` + + +--- + +## Independent group $t$ confidence intervals + +- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo +- We cannot use the paired t test because the groups are independent and may have different sample sizes +- We now present methods for comparing independent groups + +--- +## Confidence interval + +- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is +$$ + \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} +$$ +- The pooled variance estimator is $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ +- Remember this interval is assuming a constant variance across the two groups +- If there is some doubt, assume a different variance per group, which we will discuss later + +--- + +## Example +### Based on Rosner, Fundamentals of Biostatistics +(Really a very good reference book) + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} 
= 18.23$ mmHg +- Pooled variance estimate + +```r +sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2)/(8 + 21 - 2)) +132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5 +``` + +``` +## [1] -9.521 20.361 +``` + + + +--- +## Mistakenly treating the sleep data as grouped + +```r +n1 <- length(g1) +n2 <- length(g2) +sp <- sqrt(((n1 - 1) * sd(x1)^2 + (n2 - 1) * sd(x2)^2)/(n1 + n2 - 2)) +``` + +``` +## Error: object 'x1' not found +``` + +```r +md <- mean(g2) - mean(g1) +semd <- sp * sqrt(1/n1 + 1/n2) +rbind(md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd, t.test(g2, g1, paired = FALSE, + var.equal = TRUE)$conf, t.test(g2, g1, paired = TRUE)$conf) +``` + +``` +## [,1] [,2] +## [1,] -14.8873 18.047 +## [2,] -0.2039 3.364 +## [3,] 0.7001 2.460 +``` + + +--- +## Grouped versus independent +plot of chunk unnamed-chunk-10 + + +--- + +## `ChickWeight` data in R + +```r +library(datasets) +data(ChickWeight) +library(reshape2) +## define weight gain or loss +wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") +names(wideCW)[-(1:2)] <- paste("time", names(wideCW)[-(1:2)], sep = "") +library(dplyr) +``` + +``` +## +## Attaching package: 'dplyr' +## +## The following objects are masked from 'package:stats': +## +## filter, lag +## +## The following objects are masked from 'package:base': +## +## intersect, setdiff, setequal, union +``` + +```r +wideCW <- mutate(wideCW, gain = time21 - time0) +``` + + +--- +## Plotting the raw data + +plot of chunk unnamed-chunk-12 + + + + +--- +## Weight gain by diet +plot of chunk unnamed-chunk-13 + + +--- +## Let's do a t interval + +```r +wideCW14 <- subset(wideCW, Diet %in% c(1, 4)) +rbind(t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf, + t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf) +``` + +``` +## [,1] [,2] +## [1,] -108.1 -14.81 +## [2,] -104.7 -18.30 +``` + + + +--- + +## Unequal variances + +- Under unequal variances +$$ +\bar Y - \bar X \pm t_{df} \times 
\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2} +$$ +where $t_{df}$ is calculated with degrees of freedom +$$ +df= \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} + {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} +$$ +will be approximately a 95% interval +- This works really well + - So when in doubt, just assume unequal variances + +--- + +## Example + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- $df=15.04$, $t_{15.04, .975} = 2.13$ +- Interval +$$ +132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} += [-8.91, 19.75] +$$ +- In R, `t.test(..., var.equal = FALSE)` + +--- +## Comparing other kinds of data +* For binomial data, there's lots of ways to compare two groups + * Relative risk, risk difference, odds ratio. + * Chi-squared tests, normal approximations, exact tests. +* For count data, there's also Chi-squared tests and exact tests. +* We'll leave the discussions for comparing groups of data for binary + and count data until covering glms in the regression class. +* In addition, Mathematical Biostatistics Boot Camp 2 covers many special + cases relevant to biostatistics. 
+ diff --git a/06_StatisticalInference/08_tCIs/index.pdf b/06_StatisticalInference/08_tCIs/index.pdf new file mode 100644 index 000000000..9c12c073a Binary files /dev/null and b/06_StatisticalInference/08_tCIs/index.pdf differ diff --git a/06_StatisticalInference/09_HT/index.Rmd b/06_StatisticalInference/09_HT/index.Rmd new file mode 100644 index 000000000..40140aa48 --- /dev/null +++ b/06_StatisticalInference/09_HT/index.Rmd @@ -0,0 +1,241 @@ +--- +title : Hypothesis testing +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Hypothesis testing +* Hypothesis testing is concerned with making decisions using data +* A null hypothesis is specified that represents the status quo, + usually labeled $H_0$ +* The null hypothesis is assumed true and statistical evidence is required + to reject it in favor of a research or alternative hypothesis + +--- +## Example +* A respiratory disturbance index of more than $30$ events / hour, say, is + considered evidence of severe sleep disordered breathing (SDB). +* Suppose that in a sample of $100$ overweight subjects with other + risk factors for sleep disordered breathing at a sleep clinic, the + mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. +* We might want to test the hypothesis that + * $H_0 : \mu = 30$ + * $H_a : \mu > 30$ + * where $\mu$ is the population mean RDI. 
+ +--- +## Hypothesis testing +* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ +* Note that there are four possible outcomes of our statistical decision process + +Truth | Decide | Result | +---|---|---| +$H_0$ | $H_0$ | Correctly accept null | +$H_0$ | $H_a$ | Type I error | +$H_a$ | $H_a$ | Correctly reject null | +$H_a$ | $H_0$ | Type II error | + +--- +## Discussion +* Consider a court of law; the null hypothesis is that the + defendant is innocent +* We require a standard on the available evidence to reject the null hypothesis (convict) +* If we set a low standard, then we would increase the + percentage of innocent people convicted (type I errors); however we + would also increase the percentage of guilty people convicted + (correctly rejecting the null) +* If we set a high standard, then we increase the the + percentage of innocent people let free (correctly accepting the + null) while we would also increase the percentage of guilty people + let free (type II errors) + +--- +## Example +* Consider our sleep example again +* A reasonable strategy would reject the null hypothesis if + $\bar X$ was larger than some constant, say $C$ +* Typically, $C$ is chosen so that the probability of a Type I + error, $\alpha$, is $.05$ (or some other relevant constant) +* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct + +--- +## Example continued +- Standard error of the mean $10 / \sqrt{100} = 1$ +- Under $H_0$ $\bar X \sim N(30, 1)$ +- We want to chose $C$ so that the $P(\bar X > C; H_0)$ is +5% +- The 95th percentile of a normal distribution is 1.645 +standard deviations from the mean +- If $C = 30 + 1 \times 1.645 = 31.645$ + - Then the probability that a $N(30, 1)$ is larger + than it is 5% + - So the rule "Reject $H_0$ when $\bar X \geq 31.645$" + has the property that the probability of rejection + is 5% when $H_0$ is true (for the $\mu_0$, $\sigma$ + and $n$ given) + 
+ +--- +## Discussion +* In general we don't convert $C$ back to the original scale +* We would just reject because the Z-score; which is how many + standard errors the sample mean is above the hypothesized mean + $$ + \frac{32 - 30}{10 / \sqrt{100}} = 2 + $$ + is greater than $1.645$ +* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ + +--- +## General rules +* The $Z$ test for $H_0:\mu = \mu_0$ versus + * $H_1: \mu < \mu_0$ + * $H_2: \mu \neq \mu_0$ + * $H_3: \mu > \mu_0$ +* Test statistic $ TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $ +* Reject the null hypothesis when + * $TS \leq Z_{\alpha} = -Z_{1 - \alpha}$ + * $|TS| \geq Z_{1 - \alpha / 2}$ + * $TS \geq Z_{1 - \alpha}$ + +--- +## Notes +* We have fixed $\alpha$ to be low, so if we reject $H_0$ (either + our model is wrong) or there is a low probability that we have made + an error +* We have not fixed the probability of a type II error, $\beta$; + therefore we tend to say ``Fail to reject $H_0$'' rather than + accepting $H_0$ +* Statistical significance is no the same as scientific + significance +* The region of TS values for which you reject $H_0$ is called the + rejection region + +--- +## More notes +* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough + for it to apply +* If $n$ is small, then a Gossett's $T$ test is performed exactly in the same way, + with the normal quantiles replaced by the appropriate Student's $T$ quantiles and + $n-1$ df +* The probability of rejecting the null hypothesis when it is false is called *power* +* Power is a used a lot to calculate sample sizes for experiments + +--- +## Example reconsidered +- Consider our example again. 
Suppose that $n= 16$ (rather than +$100$) +- The statistic +$$ +\frac{\bar X - 30}{s / \sqrt{16}} +$$ +follows a $T$ distribution with 15 df under $H_0$ +- Under $H_0$, the probability that it is larger +that the 95th percentile of the $T$ distribution is 5% +- The 95th percentile of the T distribution with 15 +df is `r qt(.95, 15)` (obtained via `qt(.95, 15)`) +- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8 $ +- We now fail to reject. + +--- +## Two sided tests +* Suppose that we would reject the null hypothesis if in fact the mean was too large or too small +* That is, we want to test the alternative $H_a : \mu \neq 30$ +* We will reject if the test statistic, $0.8$, is either too large or too small +* Then we want the probability of rejecting under the +null to be 5%, split equally as 2.5% in the upper +tail and 2.5% in the lower tail +* Thus we reject if our test statistic is larger +than `qt(.975, 15)` or smaller than `qt(.025, 15)` + * This is the same as saying: reject if the + absolute value of our statistic is larger than + `qt(0.975, 15)` = `r qt(0.975, 15)` + * So we fail to reject the two sided test as well + * (If you fail to reject the one sided test, you + know that you will fail to reject the two sided) + +--- +## T test in R +```{r, echo=TRUE, comment=">", results='markup'} +library(UsingR); data(father.son) +t.test(father.son$sheight - father.son$fheight) +``` + +--- +## Connections with confidence intervals +* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ +* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ +* The same works in reverse; if a $(1-\alpha)100\%$ interval + contains $\mu_0$, then we *fail to* reject $H_0$ + +--- +## Two group intervals +- First, now you know how to do two group T tests +since we already covered indepedent group T intervals +- Rejection rules are the same +- Test $H_0 : \mu_1 = \mu_2$ +- Let's 
just go through an example + +--- +## `chickWeight` data +Recall that we reformatted this data +```{r, echo=TRUE,results='hide'} +library(datasets); data(ChickWeight); library(reshape2) +##define weight gain or loss +wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") +names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "") +library(dplyr) +wideCW <- mutate(wideCW, + gain = time21 - time0 +) +``` + +--- +### Unequal variance T test comparing diets 1 and 4 +```{r,echo=TRUE, comment="> ", results='markup'} +wideCW14 <- subset(wideCW, Diet %in% c(1, 4)) +t.test(gain ~ Diet, paired = FALSE, + var.equal = TRUE, data = wideCW14) +``` + + + +--- +## Exact binomial test +- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* +- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ + - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? + +Rejection region | Type I error rate | +---|---| +[0 : 8] | `r pbinom(-1, size = 8, p = .5, lower.tail = FALSE)` +[1 : 8] | `r pbinom( 0, size = 8, p = .5, lower.tail = FALSE)` +[2 : 8] | `r pbinom( 1, size = 8, p = .5, lower.tail = FALSE)` +[3 : 8] | `r pbinom( 2, size = 8, p = .5, lower.tail = FALSE)` +[4 : 8] | `r pbinom( 3, size = 8, p = .5, lower.tail = FALSE)` +[5 : 8] | `r pbinom( 4, size = 8, p = .5, lower.tail = FALSE)` +[6 : 8] | `r pbinom( 5, size = 8, p = .5, lower.tail = FALSE)` +[7 : 8] | `r pbinom( 6, size = 8, p = .5, lower.tail = FALSE)` +[8 : 8] | `r pbinom( 7, size = 8, p = .5, lower.tail = FALSE)` + +--- +## Notes +* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + * The closest is the rejection region [7 : 8] + * Any alpha level lower than `r 1 / 2 ^8` is not attainable. +* For larger sample sizes, we could do a normal approximation, but you already knew this. +* Two sided test isn't obvious. 
+ * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) +* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. + + diff --git a/06_StatisticalInference/09_HT/index.html b/06_StatisticalInference/09_HT/index.html new file mode 100644 index 000000000..1422d9b3c --- /dev/null +++ b/06_StatisticalInference/09_HT/index.html @@ -0,0 +1,682 @@ + + + + Hypothesis testing + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis testing

Statistical Inference

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

Hypothesis testing

  • Hypothesis testing is concerned with making decisions using data
  • A null hypothesis is specified that represents the status quo, usually labeled \(H_0\)
  • The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis

Example

  • A respiratory disturbance index of more than \(30\) events / hour, say, is considered evidence of severe sleep disordered breathing (SDB).
  • Suppose that in a sample of \(100\) overweight subjects with other risk factors for sleep disordered breathing at a sleep clinic, the mean RDI was \(32\) events / hour with a standard deviation of \(10\) events / hour.
  • We might want to test the hypothesis that
      • \(H_0 : \mu = 30\)
      • \(H_a : \mu > 30\)
      • where \(\mu\) is the population mean RDI.

Hypothesis testing

  • The alternative hypotheses are typically of the form \(<\), \(>\) or \(\neq\)
  • Note that there are four possible outcomes of our statistical decision process

Truth   | Decide  | Result
\(H_0\) | \(H_0\) | Correctly accept null
\(H_0\) | \(H_a\) | Type I error
\(H_a\) | \(H_a\) | Correctly reject null
\(H_a\) | \(H_0\) | Type II error

Discussion

  • Consider a court of law; the null hypothesis is that the defendant is innocent
  • We require a standard on the available evidence to reject the null hypothesis (convict)
  • If we set a low standard, then we would increase the percentage of innocent people convicted (type I errors); however we would also increase the percentage of guilty people convicted (correctly rejecting the null)
  • If we set a high standard, then we increase the percentage of innocent people let free (correctly accepting the null) while we would also increase the percentage of guilty people let free (type II errors)

Example

  • Consider our sleep example again
  • A reasonable strategy would reject the null hypothesis if \(\bar X\) was larger than some constant, say \(C\)
  • Typically, \(C\) is chosen so that the probability of a Type I error, \(\alpha\), is \(.05\) (or some other relevant constant)
  • \(\alpha\) = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct

Example continued

  • Standard error of the mean \(10 / \sqrt{100} = 1\)
  • Under \(H_0\), \(\bar X \sim N(30, 1)\)
  • We want to choose \(C\) so that \(P(\bar X > C; H_0)\) is 5%
  • The 95th percentile of a normal distribution is 1.645 standard deviations from the mean
  • If \(C = 30 + 1 \times 1.645 = 31.645\)
      • Then the probability that a \(N(30, 1)\) is larger than it is 5%
      • So the rule "Reject \(H_0\) when \(\bar X \geq 31.645\)" has the property that the probability of rejection is 5% when \(H_0\) is true (for the \(\mu_0\), \(\sigma\) and \(n\) given)
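As a sanity check (not in the original slides), the cutoff is just a normal quantile:

```r
# Under H0, Xbar ~ N(30, 1); the one-sided 5% cutoff C is the 95th percentile
C <- qnorm(0.95, mean = 30, sd = 1)
C
```

This returns 31.645, matching the hand calculation above.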

Discussion

  • In general we don't convert \(C\) back to the original scale
  • We would just reject because the Z-score, which is how many standard errors the sample mean is above the hypothesized mean,
\[
\frac{32 - 30}{10 / \sqrt{100}} = 2
\]
is greater than \(1.645\)
  • Or, whenever \(\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}\)

General rules

  • The \(Z\) test for \(H_0:\mu = \mu_0\) versus
      • \(H_1: \mu < \mu_0\)
      • \(H_2: \mu \neq \mu_0\)
      • \(H_3: \mu > \mu_0\)
  • Test statistic \(TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}\)
  • Reject the null hypothesis when
      • \(H_1\): \(TS \leq Z_{\alpha} = -Z_{1 - \alpha}\)
      • \(H_2\): \(|TS| \geq Z_{1 - \alpha / 2}\)
      • \(H_3\): \(TS \geq Z_{1 - \alpha}\)
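Applied to the RDI example, the one-sided rule looks like this (our own sketch, not lecture code):

```r
# Z test of H0: mu = 30 versus H3: mu > 30 at alpha = 0.05
mu0 <- 30; xbar <- 32; s <- 10; n <- 100
TS <- (xbar - mu0) / (s / sqrt(n))      # test statistic: 2
TS >= qnorm(1 - 0.05)                   # TRUE, so reject H0
```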

Notes

  • We have fixed \(\alpha\) to be low, so if we reject \(H_0\), either our model is wrong or there is a low probability that we have made an error
  • We have not fixed the probability of a type II error, \(\beta\); therefore we tend to say "Fail to reject \(H_0\)" rather than accepting \(H_0\)
  • Statistical significance is not the same as scientific significance
  • The region of TS values for which you reject \(H_0\) is called the rejection region

More notes

  • The \(Z\) test requires the assumptions of the CLT and for \(n\) to be large enough for it to apply
  • If \(n\) is small, then Gosset's \(T\) test is performed in exactly the same way, with the normal quantiles replaced by the appropriate Student's \(T\) quantiles and \(n-1\) df
  • The probability of rejecting the null hypothesis when it is false is called power
  • Power is used a lot to calculate sample sizes for experiments

Example reconsidered

  • Consider our example again. Suppose that \(n = 16\) (rather than \(100\))
  • The statistic
\[
\frac{\bar X - 30}{s / \sqrt{16}}
\]
follows a \(T\) distribution with 15 df under \(H_0\)
  • Under \(H_0\), the probability that it is larger than the 95th percentile of the \(T\) distribution is 5%
  • The 95th percentile of the T distribution with 15 df is 1.7531 (obtained via qt(.95, 15))
  • So our test statistic is now \(\sqrt{16}(32 - 30) / 10 = 0.8\)
  • We now fail to reject.

Two sided tests

  • Suppose that we would reject the null hypothesis if in fact the mean was too large or too small
  • That is, we want to test the alternative \(H_a : \mu \neq 30\)
  • We will reject if the test statistic, \(0.8\), is either too large or too small
  • Then we want the probability of rejecting under the null to be 5%, split equally as 2.5% in the upper tail and 2.5% in the lower tail
  • Thus we reject if our test statistic is larger than qt(.975, 15) or smaller than qt(.025, 15)
      • This is the same as saying: reject if the absolute value of our statistic is larger than qt(0.975, 15) = 2.1314
      • So we fail to reject the two sided test as well
      • (If you fail to reject the one sided test, you know that you will fail to reject the two sided)

T test in R

library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight)

> 
> 	One Sample t-test
> 
> data:  father.son$sheight - father.son$fheight
> t = 11.79, df = 1077, p-value < 2.2e-16
> alternative hypothesis: true mean is not equal to 0
> 95 percent confidence interval:
>  0.831 1.163
> sample estimates:
> mean of x 
>     0.997

Connections with confidence intervals

+
+
+
    +
  • Consider testing \(H_0: \mu = \mu_0\) versus \(H_a: \mu \neq \mu_0\)
  • +
  • Take the set of all values \(\mu_0\) for which you fail to reject \(H_0\); this set is a \((1-\alpha)100\%\) confidence interval for \(\mu\)
  • +
  • The same works in reverse; if a \((1-\alpha)100\%\) interval +contains \(\mu_0\), then we fail to reject \(H_0\)
  • +
+ +
+ +
+ + +
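A small R sketch of this inversion, using the earlier example's summaries (\(\bar X = 32\), \(s = 10\), \(n = 16\)):

```r
# Invert the two sided T test: the 95% interval is the set of mu0 we fail to reject
n <- 16; xbar <- 32; s <- 10; alpha <- 0.05
ci <- xbar + c(-1, 1) * qt(1 - alpha / 2, n - 1) * s / sqrt(n)
round(ci, 2)  # 26.67 37.33: contains 30, so we fail to reject H0: mu = 30
```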
+

Two group intervals

+
+
+
    +
  • You now know how to do two group T tests, +since we already covered independent group T intervals
  • +
  • Rejection rules are the same
  • +
  • Test \(H_0 : \mu_1 = \mu_2\)
  • +
  • Let's just go through an example
  • +
+ +
+ +
+ + +
+

ChickWeight data

+
+
+

Recall that we reformatted this data

+ +
library(datasets); data(ChickWeight); library(reshape2)
+##define weight gain or loss
+wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
+names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
+library(dplyr)
+wideCW <- mutate(wideCW,
+  gain = time21 - time0
+)
+
+ +
+ +
+ + +
+

Equal variance T test comparing diets 1 and 4

+
+
+
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
+t.test(gain ~ Diet, paired = FALSE, 
+       var.equal = TRUE, data = wideCW14)
+
+ +
>  
+>   Two Sample t-test
+>  
+>  data:  gain by Diet
+>  t = -2.725, df = 23, p-value = 0.01207
+>  alternative hypothesis: true difference in means is not equal to 0
+>  95 percent confidence interval:
+>   -108.15  -14.81
+>  sample estimates:
+>  mean in group 1 mean in group 4 
+>            136.2           197.7
+
+ +
+ +
+ + +
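Note that the code above sets var.equal = TRUE, a pooled equal variance test (hence the integer df = 23). Welch's unequal variance test sets var.equal = FALSE instead. A toy sketch on synthetic data (seed and sample sizes are illustrative, not the chick weights) contrasting the two:

```r
# Pooled vs Welch T tests on synthetic samples with unequal variances
set.seed(1)
x <- rnorm(16)                  # group 1: 16 observations
y <- rnorm(9, mean = 1, sd = 3) # group 2: 9 observations, larger spread
t.test(x, y, var.equal = TRUE)$parameter   # pooled df = 16 + 9 - 2 = 23
t.test(x, y, var.equal = FALSE)$parameter  # fractional Welch df, at most 23
```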
+

Exact binomial test

+
+
+
    +
  • Recall this problem: Suppose a friend has \(8\) children, \(7\) of whom are girls and none are twins
  • +
  • Perform the relevant hypothesis test: \(H_0 : p = 0.5\) versus \(H_a : p > 0.5\) + +
      +
    • What is the relevant rejection region so that the probability of rejecting is (less than) 5%?
    • +
  • +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rejection regionType I error rate
[0 : 8]1
[1 : 8]0.9961
[2 : 8]0.9648
[3 : 8]0.8555
[4 : 8]0.6367
[5 : 8]0.3633
[6 : 8]0.1445
[7 : 8]0.0352
[8 : 8]0.0039
+ +
+ +
+ + +
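The table can be reproduced in R with pbinom (a sketch, consistent with the values shown):

```r
# Type I error rate of rejection region [j : 8] under H0: p = 0.5, for j = 0, ..., 8
rates <- pbinom(0:8 - 1, size = 8, prob = 0.5, lower.tail = FALSE)  # P(X >= j)
round(rates, 4)  # 1.0000 0.9961 0.9648 0.8555 0.6367 0.3633 0.1445 0.0352 0.0039
```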
+

Notes

+
+
+
    +
  • It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + +
      +
    • The closest is the rejection region [7 : 8]
    • +
    • Any alpha level lower than 0.0039 is not attainable.
    • +
  • +
  • For larger sample sizes, we could do a normal approximation, but you already knew this.
  • +
  • The two sided test isn't obvious. + +
      +
    • Given a way to do two sided tests, we could take the set of values of \(p_0\) for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW)
    • +
  • +
  • For these problems, people typically report a P-value (next lecture) rather than computing the rejection region.
  • +
+ +
+ +
+ + +
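R's built-in binom.test performs this exact test directly; its one sided P-value equals the [7 : 8] region's error rate from the table, and its confidence interval is the Clopper/Pearson interval mentioned above:

```r
# Exact one sided binomial test: 7 girls out of 8 births under H0: p = 0.5
binom.test(7, 8, p = 0.5, alternative = "greater")$p.value  # 0.03516
```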
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/09_HT/index.md b/06_StatisticalInference/09_HT/index.md new file mode 100644 index 000000000..f357d7a14 --- /dev/null +++ b/06_StatisticalInference/09_HT/index.md @@ -0,0 +1,272 @@ +--- +title : Hypothesis testing +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Hypothesis testing +* Hypothesis testing is concerned with making decisions using data +* A null hypothesis is specified that represents the status quo, + usually labeled $H_0$ +* The null hypothesis is assumed true and statistical evidence is required + to reject it in favor of a research or alternative hypothesis + +--- +## Example +* A respiratory disturbance index of more than $30$ events / hour, say, is + considered evidence of severe sleep disordered breathing (SDB). +* Suppose that in a sample of $100$ overweight subjects with other + risk factors for sleep disordered breathing at a sleep clinic, the + mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. +* We might want to test the hypothesis that + * $H_0 : \mu = 30$ + * $H_a : \mu > 30$ + * where $\mu$ is the population mean RDI. 
+ +--- +## Hypothesis testing +* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ +* Note that there are four possible outcomes of our statistical decision process + +Truth | Decide | Result | +---|---|---| +$H_0$ | $H_0$ | Correctly accept null | +$H_0$ | $H_a$ | Type I error | +$H_a$ | $H_a$ | Correctly reject null | +$H_a$ | $H_0$ | Type II error | + +--- +## Discussion +* Consider a court of law; the null hypothesis is that the + defendant is innocent +* We require a standard on the available evidence to reject the null hypothesis (convict) +* If we set a low standard, then we would increase the + percentage of innocent people convicted (type I errors); however we + would also increase the percentage of guilty people convicted + (correctly rejecting the null) +* If we set a high standard, then we increase the the + percentage of innocent people let free (correctly accepting the + null) while we would also increase the percentage of guilty people + let free (type II errors) + +--- +## Example +* Consider our sleep example again +* A reasonable strategy would reject the null hypothesis if + $\bar X$ was larger than some constant, say $C$ +* Typically, $C$ is chosen so that the probability of a Type I + error, $\alpha$, is $.05$ (or some other relevant constant) +* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct + +--- +## Example continued +- Standard error of the mean $10 / \sqrt{100} = 1$ +- Under $H_0$ $\bar X \sim N(30, 1)$ +- We want to chose $C$ so that the $P(\bar X > C; H_0)$ is +5% +- The 95th percentile of a normal distribution is 1.645 +standard deviations from the mean +- If $C = 30 + 1 \times 1.645 = 31.645$ + - Then the probability that a $N(30, 1)$ is larger + than it is 5% + - So the rule "Reject $H_0$ when $\bar X \geq 31.645$" + has the property that the probability of rejection + is 5% when $H_0$ is true (for the $\mu_0$, $\sigma$ + and $n$ given) + 
+ +--- +## Discussion +* In general we don't convert $C$ back to the original scale +* We would just reject because the Z-score; which is how many + standard errors the sample mean is above the hypothesized mean + $$ + \frac{32 - 30}{10 / \sqrt{100}} = 2 + $$ + is greater than $1.645$ +* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ + +--- +## General rules +* The $Z$ test for $H_0:\mu = \mu_0$ versus + * $H_1: \mu < \mu_0$ + * $H_2: \mu \neq \mu_0$ + * $H_3: \mu > \mu_0$ +* Test statistic $ TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $ +* Reject the null hypothesis when + * $TS \leq Z_{\alpha} = -Z_{1 - \alpha}$ + * $|TS| \geq Z_{1 - \alpha / 2}$ + * $TS \geq Z_{1 - \alpha}$ + +--- +## Notes +* We have fixed $\alpha$ to be low, so if we reject $H_0$ (either + our model is wrong) or there is a low probability that we have made + an error +* We have not fixed the probability of a type II error, $\beta$; + therefore we tend to say ``Fail to reject $H_0$'' rather than + accepting $H_0$ +* Statistical significance is no the same as scientific + significance +* The region of TS values for which you reject $H_0$ is called the + rejection region + +--- +## More notes +* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough + for it to apply +* If $n$ is small, then a Gossett's $T$ test is performed exactly in the same way, + with the normal quantiles replaced by the appropriate Student's $T$ quantiles and + $n-1$ df +* The probability of rejecting the null hypothesis when it is false is called *power* +* Power is a used a lot to calculate sample sizes for experiments + +--- +## Example reconsidered +- Consider our example again. 
Suppose that $n= 16$ (rather than +$100$) +- The statistic +$$ +\frac{\bar X - 30}{s / \sqrt{16}} +$$ +follows a $T$ distribution with 15 df under $H_0$ +- Under $H_0$, the probability that it is larger +that the 95th percentile of the $T$ distribution is 5% +- The 95th percentile of the T distribution with 15 +df is 1.7531 (obtained via `qt(.95, 15)`) +- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8 $ +- We now fail to reject. + +--- +## Two sided tests +* Suppose that we would reject the null hypothesis if in fact the mean was too large or too small +* That is, we want to test the alternative $H_a : \mu \neq 30$ +* We will reject if the test statistic, $0.8$, is either too large or too small +* Then we want the probability of rejecting under the +null to be 5%, split equally as 2.5% in the upper +tail and 2.5% in the lower tail +* Thus we reject if our test statistic is larger +than `qt(.975, 15)` or smaller than `qt(.025, 15)` + * This is the same as saying: reject if the + absolute value of our statistic is larger than + `qt(0.975, 15)` = 2.1314 + * So we fail to reject the two sided test as well + * (If you fail to reject the one sided test, you + know that you will fail to reject the two sided) + +--- +## T test in R + +```r +library(UsingR); data(father.son) +t.test(father.son$sheight - father.son$fheight) +``` + +``` +> +> One Sample t-test +> +> data: father.son$sheight - father.son$fheight +> t = 11.79, df = 1077, p-value < 2.2e-16 +> alternative hypothesis: true mean is not equal to 0 +> 95 percent confidence interval: +> 0.831 1.163 +> sample estimates: +> mean of x +> 0.997 +``` + +--- +## Connections with confidence intervals +* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ +* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ +* The same works in reverse; if a $(1-\alpha)100\%$ interval + contains $\mu_0$, then we *fail to* reject 
$H_0$ + +--- +## Two group intervals +- First, now you know how to do two group T tests +since we already covered indepedent group T intervals +- Rejection rules are the same +- Test $H_0 : \mu_1 = \mu_2$ +- Let's just go through an example + +--- +## `chickWeight` data +Recall that we reformatted this data + +```r +library(datasets); data(ChickWeight); library(reshape2) +##define weight gain or loss +wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") +names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "") +library(dplyr) +wideCW <- mutate(wideCW, + gain = time21 - time0 +) +``` + +--- +### Unequal variance T test comparing diets 1 and 4 + +```r +wideCW14 <- subset(wideCW, Diet %in% c(1, 4)) +t.test(gain ~ Diet, paired = FALSE, + var.equal = TRUE, data = wideCW14) +``` + +``` +> +> Two Sample t-test +> +> data: gain by Diet +> t = -2.725, df = 23, p-value = 0.01207 +> alternative hypothesis: true difference in means is not equal to 0 +> 95 percent confidence interval: +> -108.15 -14.81 +> sample estimates: +> mean in group 1 mean in group 4 +> 136.2 197.7 +``` + + + +--- +## Exact binomial test +- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* +- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ + - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? + +Rejection region | Type I error rate | +---|---| +[0 : 8] | 1 +[1 : 8] | 0.9961 +[2 : 8] | 0.9648 +[3 : 8] | 0.8555 +[4 : 8] | 0.6367 +[5 : 8] | 0.3633 +[6 : 8] | 0.1445 +[7 : 8] | 0.0352 +[8 : 8] | 0.0039 + +--- +## Notes +* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + * The closest is the rejection region [7 : 8] + * Any alpha level lower than 0.0039 is not attainable. +* For larger sample sizes, we could do a normal approximation, but you already knew this. +* Two sided test isn't obvious. 
+ * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) +* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. + + diff --git a/06_StatisticalInference/09_HT/index.pdf b/06_StatisticalInference/09_HT/index.pdf new file mode 100644 index 000000000..9ed5b7d41 Binary files /dev/null and b/06_StatisticalInference/09_HT/index.pdf differ diff --git a/06_StatisticalInference/03_02_HypothesisTesting/lecture1.tex b/06_StatisticalInference/09_HT/lecture1.tex similarity index 100% rename from 06_StatisticalInference/03_02_HypothesisTesting/lecture1.tex rename to 06_StatisticalInference/09_HT/lecture1.tex diff --git a/06_StatisticalInference/03_03_pValues/P-values.pdf b/06_StatisticalInference/10_pValues/P-values.pdf similarity index 100% rename from 06_StatisticalInference/03_03_pValues/P-values.pdf rename to 06_StatisticalInference/10_pValues/P-values.pdf diff --git a/06_StatisticalInference/03_03_pValues/data/quakesRaw.rda b/06_StatisticalInference/10_pValues/data/quakesRaw.rda similarity index 100% rename from 06_StatisticalInference/03_03_pValues/data/quakesRaw.rda rename to 06_StatisticalInference/10_pValues/data/quakesRaw.rda diff --git a/06_StatisticalInference/03_03_pValues/fig/galton.png b/06_StatisticalInference/10_pValues/fig/galton.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/galton.png rename to 06_StatisticalInference/10_pValues/fig/galton.png diff --git a/06_StatisticalInference/03_03_pValues/fig/loadGalton.png b/06_StatisticalInference/10_pValues/fig/loadGalton.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/loadGalton.png rename to 06_StatisticalInference/10_pValues/fig/loadGalton.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-1.png 
b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-1.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-1.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-10.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-10.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-10.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-10.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-101.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-101.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-101.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-101.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-102.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-102.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-102.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-102.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-11.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-11.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-11.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-11.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-12.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-12.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-12.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-12.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-13.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-13.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-13.png 
rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-13.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-14.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-14.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-14.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-14.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-15.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-15.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-15.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-15.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-16.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-16.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-16.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-16.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-17.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-17.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-17.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-17.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-18.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-18.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-18.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-18.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-19.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-19.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-19.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-19.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-2.png 
b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-2.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-2.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-2.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-20.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-20.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-20.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-20.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-21.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-21.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-21.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-21.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-22.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-22.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-22.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-22.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-23.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-23.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-23.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-23.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-24.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-24.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-24.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-24.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-3.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-3.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-3.png rename to 
06_StatisticalInference/10_pValues/fig/unnamed-chunk-3.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-4.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-4.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-4.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-4.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-5.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-5.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-5.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-5.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-6.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-6.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-6.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-6.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-7.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-7.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-7.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-7.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-8.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-8.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-8.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-8.png diff --git a/06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-9.png b/06_StatisticalInference/10_pValues/fig/unnamed-chunk-9.png similarity index 100% rename from 06_StatisticalInference/03_03_pValues/fig/unnamed-chunk-9.png rename to 06_StatisticalInference/10_pValues/fig/unnamed-chunk-9.png diff --git a/06_StatisticalInference/10_pValues/index.Rmd b/06_StatisticalInference/10_pValues/index.Rmd new file mode 100644 index 
000000000..36ce9f492 --- /dev/null +++ b/06_StatisticalInference/10_pValues/index.Rmd @@ -0,0 +1,93 @@ +--- +title : P-values +subtitle : Statistical inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## P-values + +* Most common measure of statistical significance +* Their ubiquity, along with concern over their interpretation and use + makes them controversial among statisticians + * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) + * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall + * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman + * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. +* Some positive comments + * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) + * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) + * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) + +--- + + +## What is a P-value? + +__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? + +__Approach__: + +1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) +2. Calculate the summary/statistic with the data we have (_test statistic_) +3. 
Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) + +--- +## P-values +* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than that obtained +* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false +* Suppos that you get a $T$ statistic of $2.5$ for 15 df testing $H_0:\mu = \mu_0$ +versus $H_a : \mu > \mu_0$. + * What's the probability of getting a $T$ statistic as large as $2.5$? +```{r} +pt(2.5, 15, lower.tail = FALSE) +``` +* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is `r pt(2.5, 15, lower.tail = FALSE)` + +--- +## The attained significance level +* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. +* Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$? +* The smallest value for alpha that you still reject the null hypothesis is called the *attained significance level* +* This is equivalent, but philosophically a little different from, the *P-value* + +--- +## Notes +* By reporting a P-value the reader can perform the hypothesis + test at whatever $\alpha$ level he or she choses +* If the P-value is less than $\alpha$ you reject the null hypothesis +* For two sided hypothesis test, double the smaller of the two one + sided hypothesis test Pvalues + +--- +## Revisiting an earlier example +- Suppose a friend has $8$ children, $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? 
+```{r} +choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 +pbinom(6, size = 8, prob = .5, lower.tail = FALSE) +``` + +--- +## Poisson example +- Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period. +- Assume that an infection rate of 0.05 is an important benchmark. +- Given the model, could the observed rate being larger than 0.05 be attributed to chance? +- Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$ +- Consider $H_a: \lambda > 0.05$. + +```{r} +ppois(9, 5, lower.tail = FALSE) +``` + + + diff --git a/06_StatisticalInference/10_pValues/index.html b/06_StatisticalInference/10_pValues/index.html new file mode 100644 index 000000000..2ea2aa210 --- /dev/null +++ b/06_StatisticalInference/10_pValues/index.html @@ -0,0 +1,281 @@ + + + + P-values + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

P-values

+

Statistical inference

+

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

P-values

+
+
+
    +
  • Most common measure of statistical significance
  • +
  • Their ubiquity, along with concern over their interpretation and use +makes them controversial among statisticians + +
      +
    • http://warnercnr.colostate.edu/~anderson/thompson1.html
    • +
    • Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall
    • +
    • Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman
    • +
    • The hilariously titled: The Earth is Round (p < .05) by Cohen.
    • +
  • +
  • Some positive comments + +
  • +
+ +
+ +
+ + +
+

What is a P-value?

+
+
+

Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?

+ +

Approach:

+ +
    +
  1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (null hypothesis)
  2. +
  3. Calculate the summary/statistic with the data we have (test statistic)
  4. +
  5. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (p-value)
  6. +
+ +
+ +
+ + +
+

P-values

+
+
+
    +
  • The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than that obtained
  • +
  • If the P-value is small, then either \(H_0\) is true and we have observed a rare event or \(H_0\) is false
  • +
  • Suppose that you get a \(T\) statistic of \(2.5\) for 15 df testing \(H_0:\mu = \mu_0\) +versus \(H_a : \mu > \mu_0\). + +
      +
    • What's the probability of getting a \(T\) statistic as large as \(2.5\)?
    • +
  • +
+ +
pt(2.5, 15, lower.tail = FALSE)
+
+ +
## [1] 0.01225
+
+ +
    +
  • Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under \(H_0\) is 0.0123
  • +
+ +
+ +
+ + +
+

The attained significance level

+
+
+
    +
  • Our test statistic was \(2\) for \(H_0 : \mu = 30\) versus \(H_a:\mu > 30\).
  • +
  • Notice that we rejected the one sided test when \(\alpha = 0.05\); would we reject if \(\alpha = 0.01\)? How about \(0.001\)?
  • +
  • The smallest value of \(\alpha\) for which you still reject the null hypothesis is called the attained significance level
  • +
  • This is equivalent to, but philosophically a little different from, the P-value
  • +
+ +
+ +
+ + +
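A sketch of the computation, using the Z statistic of 2 from the earlier example:

```r
# Attained significance level: smallest alpha at which the one sided Z test rejects
pnorm(2, lower.tail = FALSE)  # about 0.0228: reject at alpha = 0.05, not at 0.01
```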
+

Notes

+
+
+
    +
  • By reporting a P-value the reader can perform the hypothesis +test at whatever \(\alpha\) level he or she chooses
  • +
  • If the P-value is less than \(\alpha\) you reject the null hypothesis
  • +
  • For a two sided hypothesis test, double the smaller of the two one +sided hypothesis test P-values
  • +
+ +
+ +
+ + +
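A sketch of the doubling rule, applied to the earlier \(T\) statistic of 2.5 with 15 df:

```r
# Two sided P-value: double the smaller of the two one sided P-values
p_one <- pt(2.5, 15, lower.tail = FALSE)  # 0.01225
2 * min(p_one, 1 - p_one)                 # about 0.0245
```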
+

Revisiting an earlier example

+
+
+
    +
  • Suppose a friend has \(8\) children, \(7\) of which are girls and none are twins
  • +
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births?
  • +
+ +
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
+
+ +
## [1] 0.03516
+
+ +
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
+
+ +
## [1] 0.03516
+
+ +
+ +
+ + +
+

Poisson example

+
+
+
    +
  • Suppose that a hospital has an infection rate of 10 infections per 100 person-days at risk (a rate of 0.1) during the last monitoring period.
  • +
  • Assume that an infection rate of 0.05 is an important benchmark.
  • +
  • Given the model, could an observed rate larger than 0.05 be attributed to chance?
  • +
  • Under \(H_0: \lambda = 0.05\), so that \(100 \lambda_0 = 5\)
  • +
  • Consider \(H_a: \lambda > 0.05\).
  • +
+ +
ppois(9, 5, lower.tail = FALSE)
+
+ +
## [1] 0.03183
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/10_pValues/index.md b/06_StatisticalInference/10_pValues/index.md new file mode 100644 index 000000000..10d09758f --- /dev/null +++ b/06_StatisticalInference/10_pValues/index.md @@ -0,0 +1,118 @@ +--- +title : P-values +subtitle : Statistical inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## P-values + +* Most common measure of statistical significance +* Their ubiquity, along with concern over their interpretation and use + makes them controversial among statisticians + * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) + * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall + * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman + * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. +* Some positive comments + * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) + * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) + * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) + +--- + + +## What is a P-value? + +__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? + +__Approach__: + +1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) +2. 
Calculate the summary/statistic with the data we have (_test statistic_) +3. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) + +--- +## P-values +* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than that obtained +* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false +* Suppose that you get a $T$ statistic of $2.5$ for 15 df testing $H_0:\mu = \mu_0$ +versus $H_a : \mu > \mu_0$. + * What's the probability of getting a $T$ statistic as large as $2.5$? + +```r +pt(2.5, 15, lower.tail = FALSE) +``` + +``` +## [1] 0.01225 +``` + +* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is 0.0123 + +--- +## The attained significance level +* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. +* Notice that we rejected the one sided test when $\alpha = 0.05$. Would we reject if $\alpha = 0.01$? How about $0.001$? +* The smallest value of $\alpha$ for which you still reject the null hypothesis is called the *attained significance level* +* This is equivalent to, but philosophically a little different from, the *P-value* + +--- +## Notes +* By reporting a P-value the reader can perform the hypothesis + test at whatever $\alpha$ level he or she chooses +* If the P-value is less than $\alpha$ you reject the null hypothesis +* For a two sided hypothesis test, double the smaller of the two one + sided hypothesis test P-values + +--- +## Revisiting an earlier example +- Suppose a friend has $8$ children, $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? 
+ +```r +choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8 +``` + +``` +## [1] 0.03516 +``` + +```r +pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE) +``` + +``` +## [1] 0.03516 +``` + + +--- +## Poisson example +- Suppose that a hospital has an infection rate of 10 infections per 100 person-days at risk (rate of 0.1) during the last monitoring period. +- Assume that an infection rate of 0.05 is an important benchmark. +- Given the model, could the observed rate, which is larger than 0.05, be attributed to chance? +- Under $H_0: \lambda = 0.05$, so that $100 \lambda_0 = 5$ +- Consider $H_a: \lambda > 0.05$. + + +```r +ppois(9, 5, lower.tail = FALSE) +``` + +``` +## [1] 0.03183 +``` + + + + diff --git a/06_StatisticalInference/10_pValues/index.pdf b/06_StatisticalInference/10_pValues/index.pdf new file mode 100644 index 000000000..ba31db25c Binary files /dev/null and b/06_StatisticalInference/10_pValues/index.pdf differ diff --git a/06_StatisticalInference/11_Power/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/11_Power/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..1e08993c6 Binary files /dev/null and b/06_StatisticalInference/11_Power/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/03_04_Power/fig/unnamed-chunk-2.png b/06_StatisticalInference/11_Power/fig/unnamed-chunk-2.png similarity index 100% rename from 06_StatisticalInference/03_04_Power/fig/unnamed-chunk-2.png rename to 06_StatisticalInference/11_Power/fig/unnamed-chunk-2.png diff --git a/06_StatisticalInference/11_Power/index.Rmd b/06_StatisticalInference/11_Power/index.Rmd new file mode 100644 index 000000000..3b597112e --- /dev/null +++ b/06_StatisticalInference/11_Power/index.Rmd @@ -0,0 +1,160 @@ +--- +title : Power +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} 
+highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Power +- Power is the probability of rejecting the null hypothesis when it is false +- Ergo, power (as its name would suggest) is a good thing; you want more power +- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called $\beta$ +- Note Power $= 1 - \beta$ + +--- +## Notes +- Consider our previous example involving RDI +- $H_0: \mu = 30$ versus $H_a: \mu > 30$ +- Then power is +$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)$$ +- Note that this is a function that depends on the specific value of $\mu_a$! +- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ + + +--- +## Calculating power for Gaussian data +- We reject if $\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha}$ + - Equivalently if $\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$ +- Under $H_0 : \bar X \sim N(\mu_0, \sigma^2 / n)$ +- Under $H_a : \bar X \sim N(\mu_a, \sigma^2 / n)$ +- So we want +```{r, echo=TRUE,eval=FALSE} +alpha = 0.05 +z = qnorm(1 - alpha) +pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), + lower.tail = FALSE) +``` + +--- +## Example continued +- $\mu_a = 32$, $\mu_0 = 30$, $n = 16$, $\sigma = 4$ +```{r, echo=TRUE,eval=TRUE} +alpha = 0.05; mu0 = 30; mua = 32; sigma = 4; n = 16 +z = qnorm(1 - alpha) +pnorm(mu0 + z * sigma / sqrt(n), mean = mu0, sd = sigma / sqrt(n), + lower.tail = FALSE) +pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), + lower.tail = FALSE) +``` + +--- +## Plotting the power curve + +```{r, fig.align='center', fig.height=6, fig.width=12, echo=FALSE} +library(ggplot2) +nseq = c(8, 16, 32, 64, 128) +mua = seq(30, 35, by = 0.1) +z = 
qnorm(.95) +power = sapply(nseq, function(n) +pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), + lower.tail = FALSE) + ) +colnames(power) <- paste("n", nseq, sep = "") +d <- data.frame(mua, power) +library(reshape2) +d2 <- melt(d, id.vars = "mua") +names(d2) <- c("mua", "n", "power") +g <- ggplot(d2, + aes(x = mua, y = power, col = n)) + geom_line(size = 2) +g +``` + + +--- +## Graphical Depiction of Power +```{r, echo = TRUE, eval=FALSE} +library(manipulate) +mu0 = 30 +myplot <- function(sigma, mua, n, alpha){ + g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu)) + g = g + stat_function(fun=dnorm, geom = "line", + args = list(mean = mu0, sd = sigma / sqrt(n)), + size = 2, col = "red") + g = g + stat_function(fun=dnorm, geom = "line", + args = list(mean = mua, sd = sigma / sqrt(n)), + size = 2, col = "blue") + xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n) + g = g + geom_vline(xintercept=xitc, size = 3) + g +} +manipulate( + myplot(sigma, mua, n, alpha), + sigma = slider(1, 10, step = 1, initial = 4), + mua = slider(30, 35, step = 1, initial = 32), + n = slider(1, 50, step = 1, initial = 16), + alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05) + ) + +``` + + +--- +## Question +- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then +$$1 - \beta = P\left(\bar X > \mu_0 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a \right)$$ +- where $\bar X \sim N(\mu_a, \sigma^2 / n)$ +- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ +- Knowns: $\mu_0$, $\alpha$ +- Specify any 3 of the unknowns and you can solve for the remainder + +--- +## Notes +- The calculation for $H_a:\mu < \mu_0$ is similar +- For $H_a: \mu \neq \mu_0$ calculate the one sided power using + $\alpha / 2$ (this is only approximately right, it excludes the probability of + getting a large TS in the opposite direction of the truth) +- Power goes up as $\alpha$ gets larger +- Power of a one sided test is greater than the power of the + associated two sided test +- Power 
goes up as $\mu_1$ gets further away from $\mu_0$ +- Power goes up as $n$ goes up +- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$ + - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. + - Being unit free, it has some hope of interpretability across settings + +--- +## T-test power +- Consider calculating power for Gosset's $T$ test for our example +- The power is + $$ + P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right) + $$ +- Calculating this requires the non-central t distribution. +- `power.t.test` does this very well + - Omit one of the arguments and it solves for it + +--- +## Example +```{r} +power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power +power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power +power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power +``` + +--- +## Example +```{r} +power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n +power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n +power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n +``` + diff --git a/06_StatisticalInference/11_Power/index.html b/06_StatisticalInference/11_Power/index.html new file mode 100644 index 000000000..725b505bf --- /dev/null +++ b/06_StatisticalInference/11_Power/index.html @@ -0,0 +1,404 @@ + + + + Power + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Power

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Power

+
+
+
    +
  • Power is the probability of rejecting the null hypothesis when it is false
  • +
  • Ergo, power (as its name would suggest) is a good thing; you want more power
  • +
  • A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called \(\beta\)
  • +
  • Note Power \(= 1 - \beta\)
  • +
+ +
+ +
+ + +
+

Notes

+
+
+
    +
  • Consider our previous example involving RDI
  • +
  • \(H_0: \mu = 30\) versus \(H_a: \mu > 30\)
  • +
  • Then power is +\[P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)\]
  • +
  • Note that this is a function that depends on the specific value of \(\mu_a\)!
  • +
  • Notice as \(\mu_a\) approaches \(30\) the power approaches \(\alpha\)
  • +
+ +
+ +
+ + +
+

Calculating power for Gaussian data

+
+
+
    +
  • We reject if \(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha}\)
    + +
      +
    • Equivalently if \(\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}\)
    • +
  • +
  • Under \(H_0 : \bar X \sim N(\mu_0, \sigma^2 / n)\)
  • +
  • Under \(H_a : \bar X \sim N(\mu_a, \sigma^2 / n)\)
  • +
  • So we want
  • +
+ +
alpha = 0.05
+z = qnorm(1 - alpha)
+pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), 
+      lower.tail = FALSE)
+
+ +
+ +
+ + +
+

Example continued

+
+
+
    +
  • \(\mu_a = 32\), \(\mu_0 = 30\), \(n =16\), \(\sigma = 4\)
  • +
+ +
mu0 = 30; mua = 32; sigma = 4; n = 16
+z = qnorm(1 - alpha)
+
+ +
## Error: object 'alpha' not found
+
+ +
pnorm(mu0 + z * sigma / sqrt(n), mean = mu0, sd = sigma / sqrt(n), 
+      lower.tail = FALSE)
+
+ +
## Error: object 'z' not found
+
+ +
pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), 
+      lower.tail = FALSE)
+
+ +
## Error: object 'z' not found
+
+ +
+ +
+ + +
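The rendered chunk above stops with "object 'alpha' not found" because `alpha` is never defined inside that evaluated chunk (it only appears in the earlier `eval=FALSE` chunk). The intended numbers are easy to recover; below is a hedged stdlib-Python sketch of the same two `pnorm` calls, with every value (mu0 = 30, mua = 32, sigma = 4, n = 16, alpha = 0.05) taken from the slides themselves.

```python
from statistics import NormalDist  # exact normal cdf/quantile, no extra packages

mu0, mua, sigma, n, alpha = 30.0, 32.0, 4.0, 16, 0.05
se = sigma / n ** 0.5                               # standard error of the mean
crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se   # rejection threshold under H0

# P(Xbar > crit | mu = mu0): the type I error rate, alpha by construction
size = 1 - NormalDist(mu0, se).cdf(crit)
# P(Xbar > crit | mu = mua): the power against the alternative mua
power = 1 - NormalDist(mua, se).cdf(crit)

print(round(size, 2), round(power, 2))  # 0.05 0.64
```

The first probability just reproduces alpha, confirming the threshold; the second is the power the slide is after.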
+

Plotting the power curve

+
+
+

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Graphical Depiction of Power

+
+
+
library(manipulate)
+mu0 = 30
+myplot <- function(sigma, mua, n, alpha){
+    g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu))
+    g = g + stat_function(fun=dnorm, geom = "line", 
+                          args = list(mean = mu0, sd = sigma / sqrt(n)), 
+                          size = 2, col = "red")
+    g = g + stat_function(fun=dnorm, geom = "line", 
+                          args = list(mean = mua, sd = sigma / sqrt(n)), 
+                          size = 2, col = "blue")
+    xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
+    g = g + geom_vline(xintercept=xitc, size = 3)
+    g
+}
+manipulate(
+    myplot(sigma, mua, n, alpha),
+    sigma = slider(1, 10, step = 1, initial = 4),
+    mua = slider(30, 35, step = 1, initial = 32),
+    n = slider(1, 50, step = 1, initial = 16),
+    alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05)
+    )
+
+ +
+ +
+ + +
+

Question

+
+
+
    +
  • When testing \(H_a : \mu > \mu_0\), notice if power is \(1 - \beta\), then +\[1 - \beta = P\left(\bar X > \mu_0 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a \right)\]
  • +
  • where \(\bar X \sim N(\mu_a, \sigma^2 / n)\)
  • +
  • Unknowns: \(\mu_a\), \(\sigma\), \(n\), \(\beta\)
  • +
  • Knowns: \(\mu_0\), \(\alpha\)
  • +
  • Specify any 3 of the unknowns and you can solve for the remainder
  • +
+ +
+ +
+ + +
+

Notes

+
+
+
    +
  • The calculation for \(H_a:\mu < \mu_0\) is similar
  • +
  • For \(H_a: \mu \neq \mu_0\) calculate the one sided power using +\(\alpha / 2\) (this is only approximately right, it excludes the probability of +getting a large TS in the opposite direction of the truth)
  • +
  • Power goes up as \(\alpha\) gets larger
  • +
  • Power of a one sided test is greater than the power of the +associated two sided test
  • +
  • Power goes up as \(\mu_1\) gets further away from \(\mu_0\)
  • +
  • Power goes up as \(n\) goes up
  • +
  • Power doesn't need \(\mu_a\), \(\sigma\) and \(n\), instead only \(\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\) + +
      +
    • The quantity \(\frac{\mu_a - \mu_0}{\sigma}\) is called the effect size, the difference in the means in standard deviation units.
    • +
    • Being unit free, it has some hope of interpretability across settings
    • +
  • +
+ +
+ +
+ + +
+

T-test power

+
+
+
    +
  • Consider calculating power for Gosset's \(T\) test for our example
  • +
  • The power is +\[ +P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right) +\]
  • +
  • Calculating this requires the non-central t distribution.
  • +
  • power.t.test does this very well + +
      +
    • Omit one of the arguments and it solves for it
    • +
  • +
+ +
+ +
+ + +
+

Example

+
+
+
power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample",  alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
power.t.test(n = 16, delta = 2, sd=4, type = "one.sample",  alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
+ +
+ + +
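One way to check the effect-size remark from the Notes slide: all three calls above share delta/sd = 0.5, which is exactly why they return the same power. A hedged stdlib-Python sketch using the normal approximation (power.t.test itself uses the noncentral t distribution, so its 0.604 sits slightly below this approximation):

```python
from statistics import NormalDist

def approx_power(n, delta, sd, alpha=0.05):
    """One-sided, one-sample power under a normal approximation.
    power.t.test uses the noncentral t instead, so this runs slightly high."""
    z = NormalDist().inv_cdf(1 - alpha)
    shift = n ** 0.5 * delta / sd   # sqrt(n) * effect size is all that matters
    return 1 - NormalDist().cdf(z - shift)

# delta/sd = 0.5 in every call, so the three powers are identical
powers = [approx_power(16, d, s) for d, s in [(0.5, 1.0), (2.0, 4.0), (100.0, 200.0)]]
print([round(p, 3) for p in powers])  # [0.639, 0.639, 0.639]
```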
+

Example

+
+
+
power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample",  alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
power.t.test(power = .8, delta = 2, sd=4, type = "one.sample",  alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/11_Power/index.md b/06_StatisticalInference/11_Power/index.md new file mode 100644 index 000000000..71ed1ff8b --- /dev/null +++ b/06_StatisticalInference/11_Power/index.md @@ -0,0 +1,201 @@ +--- +title : Power +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Power +- Power is the probability of rejecting the null hypothesis when it is false +- Ergo, power (as its name would suggest) is a good thing; you want more power +- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called $\beta$ +- Note Power $= 1 - \beta$ + +--- +## Notes +- Consider our previous example involving RDI +- $H_0: \mu = 30$ versus $H_a: \mu > 30$ +- Then power is +$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)$$ +- Note that this is a function that depends on the specific value of $\mu_a$! 
+- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ + + +--- +## Calculating power for Gaussian data +- We reject if $\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha}$ + - Equivalently if $\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$ +- Under $H_0 : \bar X \sim N(\mu_0, \sigma^2 / n)$ +- Under $H_a : \bar X \sim N(\mu_a, \sigma^2 / n)$ +- So we want + +```r +alpha = 0.05 +z = qnorm(1 - alpha) +pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), + lower.tail = FALSE) +``` + +--- +## Example continued +- $\mu_a = 32$, $\mu_0 = 30$, $n =16$, $\sigma = 4$ + +```r +mu0 = 30; mua = 32; sigma = 4; n = 16 +z = qnorm(1 - alpha) +``` + +``` +## Error: object 'alpha' not found +``` + +```r +pnorm(mu0 + z * sigma / sqrt(n), mean = mu0, sd = sigma / sqrt(n), + lower.tail = FALSE) +``` + +``` +## Error: object 'z' not found +``` + +```r +pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), + lower.tail = FALSE) +``` + +``` +## Error: object 'z' not found +``` + +--- +## Plotting the power curve + +plot of chunk unnamed-chunk-3 + + +--- +## Graphical Depiction of Power + +```r +library(manipulate) +mu0 = 30 +myplot <- function(sigma, mua, n, alpha){ + g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu)) + g = g + stat_function(fun=dnorm, geom = "line", + args = list(mean = mu0, sd = sigma / sqrt(n)), + size = 2, col = "red") + g = g + stat_function(fun=dnorm, geom = "line", + args = list(mean = mua, sd = sigma / sqrt(n)), + size = 2, col = "blue") + xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n) + g = g + geom_vline(xintercept=xitc, size = 3) + g +} +manipulate( + myplot(sigma, mua, n, alpha), + sigma = slider(1, 10, step = 1, initial = 4), + mua = slider(30, 35, step = 1, initial = 32), + n = slider(1, 50, step = 1, initial = 16), + alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05) + ) +``` + + +--- +## Question +- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then +$$1 - \beta = P\left(\bar X > \mu_0 
+ z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a \right)$$ +- where $\bar X \sim N(\mu_a, \sigma^2 / n)$ +- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ +- Knowns: $\mu_0$, $\alpha$ +- Specify any 3 of the unknowns and you can solve for the remainder + +--- +## Notes +- The calculation for $H_a:\mu < \mu_0$ is similar +- For $H_a: \mu \neq \mu_0$ calculate the one sided power using + $\alpha / 2$ (this is only approximately right, it excludes the probability of + getting a large TS in the opposite direction of the truth) +- Power goes up as $\alpha$ gets larger +- Power of a one sided test is greater than the power of the + associated two sided test +- Power goes up as $\mu_1$ gets further away from $\mu_0$ +- Power goes up as $n$ goes up +- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$ + - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. + - Being unit free, it has some hope of interpretability across settings + +--- +## T-test power +- Consider calculating power for Gosset's $T$ test for our example +- The power is + $$ + P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right) + $$ +- Calculating this requires the non-central t distribution. 
+- `power.t.test` does this very well + - Omit one of the arguments and it solves for it + +--- +## Example + +```r +power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + +```r +power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + +```r +power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + +--- +## Example + +```r +power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + +```r +power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + +```r +power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + diff --git a/06_StatisticalInference/11_Power/index.pdf b/06_StatisticalInference/11_Power/index.pdf new file mode 100644 index 000000000..d4ef53661 Binary files /dev/null and b/06_StatisticalInference/11_Power/index.pdf differ diff --git a/06_StatisticalInference/03_05_MultipleTesting/Multiple testing.pdf b/06_StatisticalInference/12_MultipleTesting/Multiple testing.pdf similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/Multiple testing.pdf rename to 06_StatisticalInference/12_MultipleTesting/Multiple testing.pdf diff --git a/06_StatisticalInference/12_MultipleTesting/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/12_MultipleTesting/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..556c3a44b Binary files /dev/null and b/06_StatisticalInference/12_MultipleTesting/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/03_05_MultipleTesting/data/cd4.data b/06_StatisticalInference/12_MultipleTesting/data/cd4.data similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/data/cd4.data rename to 
06_StatisticalInference/12_MultipleTesting/data/cd4.data diff --git a/06_StatisticalInference/03_05_MultipleTesting/data/movies.txt b/06_StatisticalInference/12_MultipleTesting/data/movies.txt similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/data/movies.txt rename to 06_StatisticalInference/12_MultipleTesting/data/movies.txt diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/datasources.png b/06_StatisticalInference/12_MultipleTesting/fig/datasources.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/datasources.png rename to 06_StatisticalInference/12_MultipleTesting/fig/datasources.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/example10pvals.png b/06_StatisticalInference/12_MultipleTesting/fig/example10pvals.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/example10pvals.png rename to 06_StatisticalInference/12_MultipleTesting/fig/example10pvals.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/galton.png b/06_StatisticalInference/12_MultipleTesting/fig/galton.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/galton.png rename to 06_StatisticalInference/12_MultipleTesting/fig/galton.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/jellybeans1.png b/06_StatisticalInference/12_MultipleTesting/fig/jellybeans1.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/jellybeans1.png rename to 06_StatisticalInference/12_MultipleTesting/fig/jellybeans1.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/jellybeans2.png b/06_StatisticalInference/12_MultipleTesting/fig/jellybeans2.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/jellybeans2.png rename to 06_StatisticalInference/12_MultipleTesting/fig/jellybeans2.png diff --git 
a/06_StatisticalInference/03_05_MultipleTesting/fig/lowess.png b/06_StatisticalInference/12_MultipleTesting/fig/lowess.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/lowess.png rename to 06_StatisticalInference/12_MultipleTesting/fig/lowess.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/significant.png b/06_StatisticalInference/12_MultipleTesting/fig/significant.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/significant.png rename to 06_StatisticalInference/12_MultipleTesting/fig/significant.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/splines.png b/06_StatisticalInference/12_MultipleTesting/fig/splines.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/splines.png rename to 06_StatisticalInference/12_MultipleTesting/fig/splines.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-1.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-1.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-1.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-1.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-10.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-10.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-10.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-10.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-101.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-101.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-101.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-101.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-102.png 
b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-102.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-102.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-102.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-11.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-11.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-11.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-11.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-12.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-12.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-12.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-12.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-13.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-13.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-13.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-13.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-14.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-14.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-14.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-14.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-15.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-15.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-15.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-15.png diff --git 
a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-16.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-16.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-16.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-16.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-17.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-17.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-17.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-17.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-18.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-18.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-18.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-18.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-19.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-19.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-19.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-19.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-2.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-2.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-2.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-2.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-20.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-20.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-20.png rename to 
06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-20.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-21.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-21.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-21.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-21.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-22.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-22.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-22.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-22.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-23.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-23.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-23.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-23.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-24.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-24.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-24.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-24.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-3.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-3.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-3.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-3.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-4.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-4.png similarity index 100% rename from 
06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-4.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-4.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-5.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-5.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-5.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-5.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-6.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-6.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-6.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-6.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-7.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-7.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-7.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-7.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-8.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-8.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-8.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-8.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-9.png b/06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-9.png similarity index 100% rename from 06_StatisticalInference/03_05_MultipleTesting/fig/unnamed-chunk-9.png rename to 06_StatisticalInference/12_MultipleTesting/fig/unnamed-chunk-9.png diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.Rmd b/06_StatisticalInference/12_MultipleTesting/index.Rmd similarity index 90% rename from 
06_StatisticalInference/03_05_MultipleTesting/index.Rmd rename to 06_StatisticalInference/12_MultipleTesting/index.Rmd index 6c19901a0..4d5cc68a4 100644 --- a/06_StatisticalInference/03_05_MultipleTesting/index.Rmd +++ b/06_StatisticalInference/12_MultipleTesting/index.Rmd @@ -1,271 +1,253 @@ ---- -title : Multiple testing -subtitle : Statistical Inference -author : Brian Caffo, Jeffrey Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -``` - -## Key ideas - -* Hypothesis testing/significance analysis is commonly overused -* Correcting for multiple testing avoids false positives or discoveries -* Two key components - * Error measure - * Correction - - ---- - -## Three eras of statistics - -__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? 
- -The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? - -__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? - -[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) - ---- - -## Reasons for multiple testing - - - - ---- - -## Why correct for multiple tests? - - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - ---- - -## Why correct for multiple tests? - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - - ---- - -## Types of errors - -Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. -

- - | $\beta=0$ | $\beta\neq0$ | Hypotheses ---------------------|-------------|----------------|--------- -Claim $\beta=0$ | $U$ | $T$ | $m-R$ -Claim $\beta\neq 0$ | $V$ | $S$ | $R$ - Claims | $m_0$ | $m-m_0$ | $m$ - -

- -__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does - -__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't - - ---- - -## Error rates - -__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* - -__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ - -__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ - -* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) - ---- - -## Controlling the false positive rate - -If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. - -Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. - -Suppose that you call all $P < 0.05$ significant. - -The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. - -__How do we avoid so many false positives?__ - - ---- - -## Controlling family-wise error rate (FWER) - - -The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. - -__Basic idea__: -* Suppose you do $m$ tests -* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ -* Calculate P-values normally -* Set $\alpha_{fwer} = \alpha/m$ -* Call all $P$-values less than $\alpha_{fwer}$ significant - -__Pros__: Easy to calculate, conservative -__Cons__: May be very conservative - - ---- - -## Controlling false discovery rate (FDR) - -This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
- -__Basic idea__: -* Suppose you do $m$ tests -* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right]$ -* Calculate P-values normally -* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$ -* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant - -__Pros__: Still pretty easy to calculate, less conservative (maybe much less) - -__Cons__: Allows for more false positives, may behave strangely under dependence - ---- - -## Example with 10 P-values - - - -Controlling all error rates at $\alpha = 0.20$ - ---- - -## Adjusted P-values - -* One approach is to adjust the threshold $\alpha$ -* A different approach is to calculate "adjusted p-values" -* They _are not p-values_ anymore -* But they can be used directly without adjusting $\alpha$ - -__Example__: -* Suppose P-values are $P_1,\ldots,P_m$ -* You could adjust them by taking $P_i^{fwer} = \max{m \times P_i,1}$ for each P-value. -* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER. 
- ---- - -## Case study I: no true positives - -```{r createPvals,cache=TRUE} -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - y <- rnorm(20) - x <- rnorm(20) - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} - -# Controls false positive rate -sum(pValues < 0.05) -``` - ---- - -## Case study I: no true positives - -```{r, dependson="createPvals"} -# Controls FWER -sum(p.adjust(pValues,method="bonferroni") < 0.05) -# Controls FDR -sum(p.adjust(pValues,method="BH") < 0.05) -``` - - ---- - -## Case study II: 50% true positives - -```{r createPvals2,cache=TRUE} -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - x <- rnorm(20) - # First 500 beta=0, last 500 beta=2 - if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)} - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} -trueStatus <- rep(c("zero","not zero"),each=500) -table(pValues < 0.05, trueStatus) -``` - ---- - - -## Case study II: 50% true positives - -```{r, dependson="createPvals2"} -# Controls FWER -table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus) -# Controls FDR -table(p.adjust(pValues,method="BH") < 0.05,trueStatus) -``` - - ---- - - -## Case study II: 50% true positives - -__P-values versus adjusted P-values__ -```{r, dependson="createPvals2",fig.height=4,fig.width=8} -par(mfrow=c(1,2)) -plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19) -plot(pValues,p.adjust(pValues,method="BH"),pch=19) -``` - - ---- - - -## Notes and resources - -__Notes__: -* Multiple testing is an entire subfield -* A basic Bonferroni/BH correction is usually enough -* If there is strong dependence between tests there may be problems - * Consider method="BY" - -__Further resources__: -* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) -* [Statistical significance for genome-wide 
studies](http://www.pnas.org/content/100/16/9440.full) -* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) - +--- +title : Multiple testing +subtitle : Statistical Inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Key ideas + +* Hypothesis testing/significance analysis is commonly overused +* Correcting for multiple testing avoids false positives or discoveries +* Two key components + * Error measure + * Correction + + +--- + +## Three eras of statistics + +__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? + +The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? + +__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? 
+ +[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) + +--- + +## Reasons for multiple testing + + + + +--- + +## Why correct for multiple tests? + + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + +--- + +## Why correct for multiple tests? + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + + +--- + +## Types of errors + +Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+ + | $\beta=0$ | $\beta\neq0$ | Hypotheses +--------------------|-------------|----------------|--------- +Claim $\beta=0$ | $U$ | $T$ | $m-R$ +Claim $\beta\neq 0$ | $V$ | $S$ | $R$ + Claims | $m_0$ | $m-m_0$ | $m$ + +

+ +__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does + +__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't + + +--- + +## Error rates + +__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* + +__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ + +__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ + +* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) + +--- + +## Controlling the false positive rate + +If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. + +Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. + +Suppose that you call all $P < 0.05$ significant. + +The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. + +__How do we avoid so many false positives?__ + + +--- + +## Controlling family-wise error rate (FWER) + + +The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. + +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ +* Calculate P-values normally +* Set $\alpha_{fwer} = \alpha/m$ +* Call all $P$-values less than $\alpha_{fwer}$ significant + +__Pros__: Easy to calculate, conservative +__Cons__: May be very conservative + + +--- + +## Controlling false discovery rate (FDR) + +This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
+ +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right]$ +* Calculate P-values normally +* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$ +* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant + +__Pros__: Still pretty easy to calculate, less conservative (maybe much less) + +__Cons__: Allows for more false positives, may behave strangely under dependence + +--- + +## Example with 10 P-values + + + +Controlling all error rates at $\alpha = 0.20$ + +--- + +## Adjusted P-values + +* One approach is to adjust the threshold $\alpha$ +* A different approach is to calculate "adjusted p-values" +* They _are not p-values_ anymore +* But they can be used directly without adjusting $\alpha$ + +__Example__: +* Suppose P-values are $P_1,\ldots,P_m$ +* You could adjust them by taking $P_i^{fwer} = \max{m \times P_i,1}$ for each P-value. +* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER. 
+ +--- + +## Case study I: no true positives + +```{r createPvals,cache=TRUE} +set.seed(1010093) +pValues <- rep(NA,1000) +for(i in 1:1000){ + y <- rnorm(20) + x <- rnorm(20) + pValues[i] <- summary(lm(y ~ x))$coeff[2,4] +} + +# Controls false positive rate +sum(pValues < 0.05) +``` + +--- + +## Case study I: no true positives + +```{r, dependson="createPvals"} +# Controls FWER +sum(p.adjust(pValues,method="bonferroni") < 0.05) +# Controls FDR +sum(p.adjust(pValues,method="BH") < 0.05) +``` + + +--- + +## Case study II: 50% true positives + +```{r createPvals2,cache=TRUE} +set.seed(1010093) +pValues <- rep(NA,1000) +for(i in 1:1000){ + x <- rnorm(20) + # First 500 beta=0, last 500 beta=2 + if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)} + pValues[i] <- summary(lm(y ~ x))$coeff[2,4] +} +trueStatus <- rep(c("zero","not zero"),each=500) +table(pValues < 0.05, trueStatus) +``` + +--- + + +## Case study II: 50% true positives + +```{r, dependson="createPvals2"} +# Controls FWER +table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus) +# Controls FDR +table(p.adjust(pValues,method="BH") < 0.05,trueStatus) +``` + + +--- + + +## Case study II: 50% true positives + +__P-values versus adjusted P-values__ +```{r, dependson="createPvals2",fig.height=4,fig.width=8} +par(mfrow=c(1,2)) +plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19) +plot(pValues,p.adjust(pValues,method="BH"),pch=19) +``` + + +--- + + +## Notes and resources + +__Notes__: +* Multiple testing is an entire subfield +* A basic Bonferroni/BH correction is usually enough +* If there is strong dependence between tests there may be problems + * Consider method="BY" + +__Further resources__: +* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) +* [Statistical significance for genome-wide 
studies](http://www.pnas.org/content/100/16/9440.full) +* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) + diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.html b/06_StatisticalInference/12_MultipleTesting/index.html similarity index 88% rename from 06_StatisticalInference/03_05_MultipleTesting/index.html rename to 06_StatisticalInference/12_MultipleTesting/index.html index bb3a271db..dfc498eeb 100644 --- a/06_StatisticalInference/03_05_MultipleTesting/index.html +++ b/06_StatisticalInference/12_MultipleTesting/index.html @@ -1,581 +1,585 @@ - - - - Multiple testing - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Multiple testing

-

Statistical Inference

-

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Key ideas

-
-
-
    -
  • Hypothesis testing/significance analysis is commonly overused
  • -
  • Correcting for multiple testing avoids false positives or discoveries
  • -
  • Two key components - -
      -
    • Error measure
    • -
    • Correction
    • -
  • -
- -
- -
- - -
-

Three eras of statistics

-
-
-

The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?

- -

The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple Is treatment A better than treatment B?

- -

The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?

- -

http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf

- -
- -
- - -
-

Reasons for multiple testing

-
-
-

- -
- -
- - -
-

Why correct for multiple tests?

-
- - -
- - -
-

Why correct for multiple tests?

-
- - -
- - -
-

Types of errors

-
-
-

Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes. -

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
\(\beta=0\)\(\beta\neq0\)Hypotheses
Claim \(\beta=0\)\(U\)\(T\)\(m-R\)
Claim \(\beta\neq 0\)\(V\)\(S\)\(R\)
Claims\(m_0\)\(m-m_0\)\(m\)
- -



- -

Type I error or false positive (\(V\)) Say that the parameter does not equal zero when it does

- -

Type II error or false negative (\(T\)) Say that the parameter equals zero when it doesn't

- -
- -
- - -
-

Error rates

-
-
-

False positive rate - The rate at which false results (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)*

- -

Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)

- -

False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)

- - - -
- -
- - -
-

Controlling the false positive rate

-
-
-

If P-values are correctly calculated calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

- -

Problem: Suppose that you perform 10,000 tests and \(\beta = 0\) for all of them.

- -

Suppose that you call all \(P < 0.05\) significant.

- -

The expected number of false positives is: \(10,000 \times 0.05 = 500\) false positives.

- -

How do we avoid so many false positives?

- -
- -
- - -
-

Controlling family-wise error rate (FWER)

-
-
-

The Bonferroni correction is the oldest multiple testing correction.

- -

Basic idea:

- -
    -
  • Suppose you do \(m\) tests
  • -
  • You want to control FWER at level \(\alpha\) so \(Pr(V \geq 1) < \alpha\)
  • -
  • Calculate P-values normally
  • -
  • Set \(\alpha_{fwer} = \alpha/m\)
  • -
  • Call all \(P\)-values less than \(\alpha_{fwer}\) significant
  • -
- -

Pros: Easy to calculate, conservative -Cons: May be very conservative

- -
- -
- - -
-

Controlling false discovery rate (FDR)

-
-
-

This is the most popular correction when performing lots of tests say in genomics, imaging, astronomy, or other signal-processing disciplines.

- -

Basic idea:

- -
    -
  • Suppose you do \(m\) tests
  • -
  • You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right]\)
  • -
  • Calculate P-values normally
  • -
  • Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
  • -
  • Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
  • -
- -

Pros: Still pretty easy to calculate, less conservative (maybe much less)

- -

Cons: Allows for more false positives, may behave strangely under dependence

- -
- -
- - -
-

Example with 10 P-values

-
-
-

- -

Controlling all error rates at \(\alpha = 0.20\)

- -
- -
- - -
-

Adjusted P-values

-
-
-
    -
  • One approach is to adjust the threshold \(\alpha\)
  • -
  • A different approach is to calculate "adjusted p-values"
  • -
  • They are not p-values anymore
  • -
  • But they can be used directly without adjusting \(\alpha\)
  • -
- -

Example:

- -
    -
  • Suppose P-values are \(P_1,\ldots,P_m\)
  • -
  • You could adjust them by taking \(P_i^{fwer} = \max{m \times P_i,1}\) for each P-value.
  • -
  • Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
  • -
- -
- -
- - -
-

Case study I: no true positives

-
-
-
set.seed(1010093)
-pValues <- rep(NA,1000)
-for(i in 1:1000){
-  y <- rnorm(20)
-  x <- rnorm(20)
-  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
-}
-
-# Controls false positive rate
-sum(pValues < 0.05)
-
- -
[1] 51
-
- -
- -
- - -
-

Case study I: no true positives

-
-
-
# Controls FWER 
-sum(p.adjust(pValues,method="bonferroni") < 0.05)
-
- -
[1] 0
-
- -
# Controls FDR 
-sum(p.adjust(pValues,method="BH") < 0.05)
-
- -
[1] 0
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-
set.seed(1010093)
-pValues <- rep(NA,1000)
-for(i in 1:1000){
-  x <- rnorm(20)
-  # First 500 beta=0, last 500 beta=2
-  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
-  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
-}
-trueStatus <- rep(c("zero","not zero"),each=500)
-table(pValues < 0.05, trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE        0  476
-  TRUE       500   24
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-
# Controls FWER 
-table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE       23  500
-  TRUE       477    0
-
- -
# Controls FDR 
-table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE        0  487
-  TRUE       500   13
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-

P-values versus adjusted P-values

- -
par(mfrow=c(1,2))
-plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
-plot(pValues,p.adjust(pValues,method="BH"),pch=19)
-
- -
plot of chunk unnamed-chunk-3
- -
- -
- - -
-

Notes and resources

-
- - -
- - -
- - - - - - - - - - - - - - + + + + Multiple testing + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Multiple testing

+

Statistical Inference

+

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Key ideas

+
+
+
    +
  • Hypothesis testing/significance analysis is commonly overused
  • +
      • Correcting for multiple testing avoids false positives or false discoveries
    
  • +
  • Two key components + +
      +
    • Error measure
    • +
    • Correction
    • +
  • +
+ +
+ +
+ + +
+

Three eras of statistics

+
+
+

The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?

+ +

    
    The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?
    

+ +

The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?

+ +

http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf

+ +
+ +
+ + +
+

Reasons for multiple testing

+
+
+

+ +
+ +
+ + +
+

Why correct for multiple tests?

+
+ + +
+ + +
+

Why correct for multiple tests?

+
+ + +
+ + +
+

Types of errors

+
+
+

Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
\(\beta=0\)\(\beta\neq0\)Hypotheses
Claim \(\beta=0\)\(U\)\(T\)\(m-R\)
Claim \(\beta\neq 0\)\(V\)\(S\)\(R\)
Claims\(m_0\)\(m-m_0\)\(m\)
+ +



+ +

    
    Type I error or false positive (\(V\)): Say that the parameter does not equal zero when it does
    

+ +

    
    Type II error or false negative (\(T\)): Say that the parameter equals zero when it doesn't
    

+ +
+ +
+ + +
+

Error rates

+
+
+

False positive rate - The rate at which false results (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)*

+ +

Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)

+ +

False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)

+ + + +
+ +
+ + +
+

Controlling the false positive rate

+
+
+

If P-values are correctly calculated calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

+ +

Problem: Suppose that you perform 10,000 tests and \(\beta = 0\) for all of them.

+ +

Suppose that you call all \(P < 0.05\) significant.

+ +

The expected number of false positives is: \(10,000 \times 0.05 = 500\) false positives.

+ +

How do we avoid so many false positives?

+ +
+ +
+ + +
+

Controlling family-wise error rate (FWER)

+
+
+

The Bonferroni correction is the oldest multiple testing correction.

+ +

Basic idea:

+ +
    +
  • Suppose you do \(m\) tests
  • +
  • You want to control FWER at level \(\alpha\) so \(Pr(V \geq 1) < \alpha\)
  • +
  • Calculate P-values normally
  • +
  • Set \(\alpha_{fwer} = \alpha/m\)
  • +
  • Call all \(P\)-values less than \(\alpha_{fwer}\) significant
  • +
+ +

Pros: Easy to calculate, conservative +Cons: May be very conservative

+ +
+ +
+ + +
+

Controlling false discovery rate (FDR)

+
+
+

    
    This is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
    

+ +

Basic idea:

+ +
    +
  • Suppose you do \(m\) tests
  • +
      • You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right] \leq \alpha\)
    
  • +
  • Calculate P-values normally
  • +
  • Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
  • +
  • Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
  • +
+ +

Pros: Still pretty easy to calculate, less conservative (maybe much less)

+ +

Cons: Allows for more false positives, may behave strangely under dependence

+ +
+ +
+ + +
+

Example with 10 P-values

+
+
+

+ +

Controlling all error rates at \(\alpha = 0.20\)

+ +
+ +
+ + +
+

Adjusted P-values

+
+
+
    +
  • One approach is to adjust the threshold \(\alpha\)
  • +
  • A different approach is to calculate "adjusted p-values"
  • +
  • They are not p-values anymore
  • +
  • But they can be used directly without adjusting \(\alpha\)
  • +
+ +

Example:

+ +
    +
  • Suppose P-values are \(P_1,\ldots,P_m\)
  • +
      • You could adjust them by taking \(P_i^{fwer} = \min(m \times P_i, 1)\) for each P-value.
    
  • +
  • Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
  • +
+ +
+ +
+ + +
+

Case study I: no true positives

+
+
+
set.seed(1010093)
+pValues <- rep(NA, 1000)
+for (i in 1:1000) {
+    y <- rnorm(20)
+    x <- rnorm(20)
+    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
+}
+
+# Controls false positive rate
+sum(pValues < 0.05)
+
+ +
## [1] 51
+
+ +
+ +
+ + +
+

Case study I: no true positives

+
+
+
# Controls FWER
+sum(p.adjust(pValues, method = "bonferroni") < 0.05)
+
+ +
## [1] 0
+
+ +
# Controls FDR
+sum(p.adjust(pValues, method = "BH") < 0.05)
+
+ +
## [1] 0
+
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+
set.seed(1010093)
+pValues <- rep(NA, 1000)
+for (i in 1:1000) {
+    x <- rnorm(20)
+    # First 500 beta=0, last 500 beta=2
+    if (i <= 500) {
+        y <- rnorm(20)
+    } else {
+        y <- rnorm(20, mean = 2 * x)
+    }
+    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
+}
+trueStatus <- rep(c("zero", "not zero"), each = 500)
+table(pValues < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE        0  476
+##   TRUE       500   24
+
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+
# Controls FWER
+table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE       23  500
+##   TRUE       477    0
+
+ +
# Controls FDR
+table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE        0  487
+##   TRUE       500   13
+
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+

P-values versus adjusted P-values

+ +
par(mfrow = c(1, 2))
+plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
+plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
+
+ +

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Notes and resources

+
+ + +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/12_MultipleTesting/index.md b/06_StatisticalInference/12_MultipleTesting/index.md new file mode 100644 index 000000000..08f1afa2f --- /dev/null +++ b/06_StatisticalInference/12_MultipleTesting/index.md @@ -0,0 +1,308 @@ +--- +title : Multiple testing +subtitle : Statistical Inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Key ideas + +* Hypothesis testing/significance analysis is commonly overused +* Correcting for multiple testing avoids false positives or discoveries +* Two key components + * Error measure + * Correction + + +--- + +## Three eras of statistics + +__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? + +The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? + +__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. 
But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? + +[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) + +--- + +## Reasons for multiple testing + + + + +--- + +## Why correct for multiple tests? + + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + +--- + +## Why correct for multiple tests? + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + + +--- + +## Types of errors + +Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+
+                    | $\beta=0$ | $\beta\neq0$ | Hypotheses
+--------------------|-----------|--------------|-----------
+Claim $\beta=0$     | $U$       | $T$          | $m-R$
+Claim $\beta\neq 0$ | $V$       | $S$          | $R$
+Claims              | $m_0$     | $m-m_0$      | $m$
+

+ +__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does + +__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't + + +--- + +## Error rates + +__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* + +__Family-wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ + +__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ + +* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) + +--- + +## Controlling the false positive rate + +If P-values are correctly calculated, calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. + +Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. + +Suppose that you call all $P < 0.05$ significant. + +The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. + +__How do we avoid so many false positives?__ + + +--- + +## Controlling family-wise error rate (FWER) + + +The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. + +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ +* Calculate P-values normally +* Set $\alpha_{fwer} = \alpha/m$ +* Call all $P$-values less than $\alpha_{fwer}$ significant + +__Pros__: Easy to calculate, conservative +__Cons__: May be very conservative + + +--- + +## Controlling false discovery rate (FDR) + +This is the most popular correction when performing _lots_ of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
+ +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right] \leq \alpha$ +* Calculate P-values normally +* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$ +* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant + +__Pros__: Still pretty easy to calculate, less conservative (maybe much less) + +__Cons__: Allows for more false positives, may behave strangely under dependence + +--- + +## Example with 10 P-values + + + +Controlling all error rates at $\alpha = 0.20$ + +--- + +## Adjusted P-values + +* One approach is to adjust the threshold $\alpha$ +* A different approach is to calculate "adjusted p-values" +* They _are not p-values_ anymore +* But they can be used directly without adjusting $\alpha$ + +__Example__: +* Suppose P-values are $P_1,\ldots,P_m$ +* You could adjust them by taking $P_i^{fwer} = \min(m \times P_i, 1)$ for each P-value. +* Then if you call all $P_i^{fwer} < \alpha$ significant, you will control the FWER.
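Both adjustments above are simple enough to compute by hand. A minimal Python sketch of the Bonferroni and Benjamini-Hochberg adjusted p-values, mirroring what R's `p.adjust` returns (the function names here are ours, not from the slides):

```python
def bonferroni_adjust(pvalues):
    """Bonferroni adjusted p-values: min(m * p, 1) for each p-value."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def bh_adjust(pvalues):
    """Benjamini-Hochberg (step-up) adjusted p-values.

    Scale the i-th smallest p-value by m/i, then enforce monotonicity
    from the largest rank downward and cap everything at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk ranks m, m-1, ..., 1
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = min(running_min, 1.0)
    return adjusted
```

Calling all adjusted values below $\alpha$ significant then controls the FWER (Bonferroni) or the FDR (BH), as on the slide.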
+ +--- + +## Case study I: no true positives + + +```r +set.seed(1010093) +pValues <- rep(NA, 1000) +for (i in 1:1000) { + y <- rnorm(20) + x <- rnorm(20) + pValues[i] <- summary(lm(y ~ x))$coeff[2, 4] +} + +# Controls false positive rate +sum(pValues < 0.05) +``` + +``` +## [1] 51 +``` + + +--- + +## Case study I: no true positives + + +```r +# Controls FWER +sum(p.adjust(pValues, method = "bonferroni") < 0.05) +``` + +``` +## [1] 0 +``` + +```r +# Controls FDR +sum(p.adjust(pValues, method = "BH") < 0.05) +``` + +``` +## [1] 0 +``` + + + +--- + +## Case study II: 50% true positives + + +```r +set.seed(1010093) +pValues <- rep(NA, 1000) +for (i in 1:1000) { + x <- rnorm(20) + # First 500 beta=0, last 500 beta=2 + if (i <= 500) { + y <- rnorm(20) + } else { + y <- rnorm(20, mean = 2 * x) + } + pValues[i] <- summary(lm(y ~ x))$coeff[2, 4] +} +trueStatus <- rep(c("zero", "not zero"), each = 500) +table(pValues < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 0 476 +## TRUE 500 24 +``` + + +--- + + +## Case study II: 50% true positives + + +```r +# Controls FWER +table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 23 500 +## TRUE 477 0 +``` + +```r +# Controls FDR +table(p.adjust(pValues, method = "BH") < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 0 487 +## TRUE 500 13 +``` + + + +--- + + +## Case study II: 50% true positives + +__P-values versus adjusted P-values__ + +```r +par(mfrow = c(1, 2)) +plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19) +plot(pValues, p.adjust(pValues, method = "BH"), pch = 19) +``` + +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + + +--- + + +## Notes and resources + +__Notes__: +* Multiple testing is an entire subfield +* A basic Bonferroni/BH correction is usually enough +* If there is strong dependence between tests there may be problems + * Consider method="BY" + +__Further 
resources__: +* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) +* [Statistical significance for genome-wide studies](http://www.pnas.org/content/100/16/9440.full) +* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) + diff --git a/06_StatisticalInference/12_MultipleTesting/index.pdf b/06_StatisticalInference/12_MultipleTesting/index.pdf new file mode 100644 index 000000000..88d17ad14 Binary files /dev/null and b/06_StatisticalInference/12_MultipleTesting/index.pdf differ diff --git a/06_StatisticalInference/03_06_resampledInference/Bootstrapping.pdf b/06_StatisticalInference/13_Resampling/Bootstrapping.pdf similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/Bootstrapping.pdf rename to 06_StatisticalInference/13_Resampling/Bootstrapping.pdf diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..d1942083a Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-10.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-10.png new file mode 100644 index 000000000..7a15bddea Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-10.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-11.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-11.png new file mode 100644 index 000000000..9e8c3415a Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-11.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-12.png 
b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-12.png new file mode 100644 index 000000000..7dac51d10 Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-12.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..e352482dd Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..8c870b65b Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-4.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-4.png new file mode 100644 index 000000000..bedda710c Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-4.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-5.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-5.png new file mode 100644 index 000000000..8c870b65b Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-5.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-6.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-6.png new file mode 100644 index 000000000..f6dfe3779 Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-6.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-7.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-7.png new file mode 100644 index 000000000..bad3b9ea5 Binary files /dev/null and 
b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-7.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-8.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-8.png new file mode 100644 index 000000000..46e951432 Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-8.png differ diff --git a/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-9.png b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-9.png new file mode 100644 index 000000000..09ea388cd Binary files /dev/null and b/06_StatisticalInference/13_Resampling/assets/fig/unnamed-chunk-9.png differ diff --git a/06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-4.png b/06_StatisticalInference/13_Resampling/fig/unnamed-chunk-4.png similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-4.png rename to 06_StatisticalInference/13_Resampling/fig/unnamed-chunk-4.png diff --git a/06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-5.png b/06_StatisticalInference/13_Resampling/fig/unnamed-chunk-5.png similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-5.png rename to 06_StatisticalInference/13_Resampling/fig/unnamed-chunk-5.png diff --git a/06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-6.png b/06_StatisticalInference/13_Resampling/fig/unnamed-chunk-6.png similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-6.png rename to 06_StatisticalInference/13_Resampling/fig/unnamed-chunk-6.png diff --git a/06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-7.png b/06_StatisticalInference/13_Resampling/fig/unnamed-chunk-7.png similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/fig/unnamed-chunk-7.png rename to 
06_StatisticalInference/13_Resampling/fig/unnamed-chunk-7.png diff --git a/06_StatisticalInference/03_06_resampledInference/figure/unnamed-chunk-4.png b/06_StatisticalInference/13_Resampling/figure/unnamed-chunk-4.png similarity index 100% rename from 06_StatisticalInference/03_06_resampledInference/figure/unnamed-chunk-4.png rename to 06_StatisticalInference/13_Resampling/figure/unnamed-chunk-4.png diff --git a/06_StatisticalInference/13_Resampling/index.Rmd b/06_StatisticalInference/13_Resampling/index.Rmd new file mode 100644 index 000000000..027b2fa97 --- /dev/null +++ b/06_StatisticalInference/13_Resampling/index.Rmd @@ -0,0 +1,222 @@ +--- +title : Resampled inference +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## The bootstrap + +- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics +- For example, how would one derive a confidence interval for the median? 
+- The bootstrap procedure follows from the so-called bootstrap principle + +--- +## Sample of 50 die rolls + +```{r, echo = FALSE, fig.width=12, fig.height = 6, fig.align='center'} +library(ggplot2) +library(gridExtra) +nosim <- 1000 + +cfunc <- function(x, n) mean(x) +g1 = ggplot(data.frame(y = rep(1/6, 6), x = 1 : 6), aes(y = y, x = x)) +g1 = g1 + geom_bar(stat = "identity", fill = "lightblue", colour = "black") + +dat <- data.frame(x = apply(matrix(sample(1 : 6, nosim * 50, replace = TRUE), + nosim), 1, mean)) +g2 <- ggplot(dat, aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..)) + +grid.arrange(g1, g2, ncol = 2) + +``` + + +--- +## What if we only had one sample? +```{r, echo = FALSE, fig.width=9, fig.height = 6, fig.align='center'} +n = 50 +B = 1000 +## our data +x = sample(1 : 6, n, replace = TRUE) +## bootstrap resamples +resamples = matrix(sample(x, + n * B, + replace = TRUE), + B, n) +resampledMeans = apply(resamples, 1, mean) +g1 <- ggplot(as.data.frame(prop.table(table(x))), aes(x = x, y = Freq)) + geom_bar(colour = "black", fill = "lightblue", stat = "identity") +g2 <- ggplot(data.frame(x = resampledMeans), aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..)) +grid.arrange(g1, g2, ncol = 2) +``` + + +--- +## Consider a data set +```{r} +library(UsingR) +data(father.son) +x <- father.son$sheight +n <- length(x) +B <- 10000 +resamples <- matrix(sample(x, + n * B, + replace = TRUE), + B, n) +resampledMedians <- apply(resamples, 1, median) +``` + +--- +## A plot of the histogram of the resamples +```{r, fig.align='center', fig.height=6, fig.width=6, echo=FALSE, warning=FALSE} +g = ggplot(data.frame(x = resampledMedians), aes(x = x)) +g = g + geom_density(size = 2, fill = "red") +#g = g + geom_histogram(alpha = .20, binwidth=.3, colour = "black", fill = "blue", aes(y = ..density..)) +g = g + geom_vline(xintercept = median(x), size = 2) +g +``` + +--- + + +## The
bootstrap principle + +- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution +- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution + +--- + +## The bootstrap in practice + +- In practice, the bootstrap principle is always carried out using simulation +- We will cover only a few aspects of bootstrap resampling +- The general procedure follows by first simulating complete data sets from the observed data with replacement + + - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution + +- Calculate the statistic for each simulated data set +- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error + + +--- +## Nonparametric bootstrap algorithm example + +- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations + + i. Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set + + ii. Take the median of the simulated data set + + iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians + + iv. 
These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can + + - Draw a histogram of them + - Calculate their standard deviation to estimate the standard error of the median + - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median + +--- + +## Example code + +```{r} +B <- 10000 +resamples <- matrix(sample(x, + n * B, + replace = TRUE), + B, n) +medians <- apply(resamples, 1, median) +sd(medians) +quantile(medians, c(.025, .975)) +``` + +--- +## Histogram of bootstrap resamples + +```{r, fig.height=6, fig.width=6, echo=TRUE,fig.align='center', warning=FALSE} +g = ggplot(data.frame(medians = medians), aes(x = medians)) +g = g + geom_histogram(color = "black", fill = "lightblue", binwidth = 0.05) +g +``` + +--- + +## Notes on the bootstrap + +- The bootstrap is non-parametric +- Better percentile bootstrap confidence intervals correct for bias +- There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information + + +--- +## Group comparisons +- Consider comparing two independent groups.
+- Example, comparing sprays B and C + +```{r, fig.height=6, fig.width=8, echo=FALSE, fig.align='center'} +data(InsectSprays) +g = ggplot(InsectSprays, aes(spray, count, fill = spray)) +g = g + geom_boxplot() +g +``` + +--- +## Permutation tests +- Consider the null hypothesis that the distribution of the observations from each group is the same +- Then, the group labels are irrelevant +- Consider a data frame with count and spray +- Permute the spray (group) labels +- Recalculate the statistic + - Mean difference in counts + - Geometric means + - T statistic +- Calculate the percentage of simulations where +the simulated statistic was more extreme (toward +the alternative) than the observed + +--- +## Variations on permutation testing +Data type | Statistic | Test name +---|---|---| +Ranks | rank sum | rank sum test +Binary | hypergeometric prob | Fisher's exact test +Raw data | | ordinary permutation test + +- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation.
+- For matched data, one can randomize the signs + - For ranks, this results in the signed rank test +- Permutation strategies work for regression as well + - Permuting a regressor of interest +- Permutation tests work very well in multivariate settings + +--- +## Permutation test B v C +```{r} +subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),] +y <- subdata$count +group <- as.character(subdata$spray) +testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"]) +observedStat <- testStat(y, group) +permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group))) +observedStat +mean(permutations > observedStat) +``` + +--- +## Histogram of permutations B v C +```{r, echo= FALSE, fig.width=6, fig.height=6, fig.align='center'} +g = ggplot(data.frame(permutations = permutations), + aes(permutations)) +g = g + geom_histogram(fill = "lightblue", color = "black", binwidth = 1) +g = g + geom_vline(xintercept = observedStat, size = 2) +g +``` diff --git a/06_StatisticalInference/13_Resampling/index.html b/06_StatisticalInference/13_Resampling/index.html new file mode 100644 index 000000000..e7e900e93 --- /dev/null +++ b/06_StatisticalInference/13_Resampling/index.html @@ -0,0 +1,549 @@ + + + + Resampled inference + + + + + + + + + + + + + + + + + + + + + + + + + + +
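The label-permutation procedure in the chunk above translates directly to other languages. A Python sketch of the same two-group test; the counts below are made-up stand-ins for the spray B and C data, and the helper name is ours:

```python
import random


def permutation_test(y, group, label_a, label_b, n_perm=10000, seed=0):
    """Two-group permutation test for a difference in means.

    Returns the observed mean difference and the fraction of label
    permutations whose statistic is at least as large (including ties
    is the conservative convention).
    """
    rng = random.Random(seed)

    def stat(labels):
        a = [v for v, g in zip(y, labels) if g == label_a]
        b = [v for v, g in zip(y, labels) if g == label_b]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = stat(group)
    labels = list(group)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)            # permute the group labels
        if stat(labels) >= observed:
            count += 1
    return observed, count / n_perm


# Hypothetical counts standing in for sprays B and C
b = [11, 17, 21, 11, 16, 14]
c = [3, 5, 12, 6, 4, 3]
obs, pval = permutation_test(b + c, ["B"] * 6 + ["C"] * 6, "B", "C")
# obs is 9.5 for these made-up counts; pval is the permutation p-value
```

As in the R version, a small `pval` says the observed mean difference is rarely matched when the group labels are scrambled.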
+ + + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/13_Resampling/index.md b/06_StatisticalInference/13_Resampling/index.md new file mode 100644 index 000000000..00c62cfb0 --- /dev/null +++ b/06_StatisticalInference/13_Resampling/index.md @@ -0,0 +1,268 @@ +--- +title : Resampled inference +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## The bootstrap + +- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics +- For example, how would one derive a confidence interval for the median? +- The bootstrap procedure follows from the so called bootstrap principle + +--- +## Sample of 50 die rolls + + +``` +## Error: there is no package called 'gridExtra' +``` + +``` +## Error: could not find function "grid.arrange" +``` + + +--- +## What if we only had one sample? 
+ +``` +## Error: could not find function "grid.arrange" +``` + + +--- +## Consider a data set + +```r +library(UsingR) +``` + +``` +## Loading required package: MASS +## Loading required package: HistData +## Loading required package: Hmisc +## Loading required package: grid +## Loading required package: lattice +## Loading required package: survival +## Loading required package: splines +## Loading required package: Formula +## +## Attaching package: 'Hmisc' +## +## The following objects are masked from 'package:base': +## +## format.pval, round.POSIXt, trunc.POSIXt, units +## +## Loading required package: aplpack +## Loading required package: tcltk +## Loading required package: quantreg +## Loading required package: SparseM +## +## Attaching package: 'SparseM' +## +## The following object is masked from 'package:base': +## +## backsolve +## +## +## Attaching package: 'quantreg' +## +## The following object is masked from 'package:Hmisc': +## +## latex +## +## The following object is masked from 'package:survival': +## +## untangle.specials +## +## +## Attaching package: 'UsingR' +## +## The following object is masked from 'package:survival': +## +## cancer +## +## The following object is masked from 'package:ggplot2': +## +## movies +``` + +```r +data(father.son) +x <- father.son$sheight +n <- length(x) +B <- 10000 +resamples <- matrix(sample(x, + n * B, + replace = TRUE), + B, n) +resampledMedians <- apply(resamples, 1, median) +``` + +--- +## A plot of the histogram of the resamples +plot of chunk unnamed-chunk-4 + +--- + + +## The bootstrap principle + +- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution +- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution + +--- + +## The bootstrap in practice + +- In practice, the bootstrap principle is always carried out using simulation +- We will cover only a few aspects of bootstrap
resampling +- The general procedure follows by first simulating complete data sets from the observed data with replacement + + - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution + +- Calculate the statistic for each simulated data set +- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error + + +--- +## Nonparametric bootstrap algorithm example + +- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations + + i. Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set + + ii. Take the median of the simulated data set + + iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians + + iv. These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can + + - Draw a histogram of them + - Calculate their standard deviation to estimate the standard error of the median + - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median + +--- + +## Example code + + +```r +B <- 10000 +resamples <- matrix(sample(x, + n * B, + replace = TRUE), + B, n) +medians <- apply(resamples, 1, median) +sd(medians) +``` + +``` +## [1] 0.08424 +``` + +```r +quantile(medians, c(.025, .975)) +``` + +``` +## 2.5% 97.5% +## 68.43 68.81 +``` + +--- +## Histogram of bootstrap resamples + + +```r +g = ggplot(data.frame(medians = medians), aes(x = medians)) +g = g + geom_histogram(color = "black", fill = "lightblue", binwidth = 0.05) +g +``` + +plot of chunk unnamed-chunk-6 + +--- + +## Notes on the bootstrap + +- The bootstrap is non-parametric +- Better percentile bootstrap confidence intervals correct for bias +- There are lots of variations on bootstrap procedures; the book "An Introduction to the 
Bootstrap" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information
+
+
+---
+## Group comparisons
+- Consider comparing two independent groups.
+- For example, comparing sprays B and C
+
+plot of chunk unnamed-chunk-7
+
+---
+## Permutation tests
+- Consider the null hypothesis that the distribution of the observations from each group is the same
+- Then, the group labels are irrelevant
+- Consider a data frame with count and spray
+- Permute the spray (group) labels
+- Recalculate the statistic
+  - Mean difference in counts
+  - Geometric means
+  - T statistic
+- Calculate the percentage of simulations where
+the simulated statistic was more extreme (toward
+the alternative) than the observed
+
+---
+## Variations on permutation testing
+Data type | Statistic | Test name
+---|---|---|
+Ranks | rank sum | rank sum test
+Binary | hypergeometric prob | Fisher's exact test
+Raw data | | ordinary permutation test
+
+- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation.
+- For matched data, one can randomize the signs
+  - For ranks, this results in the signed rank test
+- Permutation strategies work for regression as well
+  - Permuting a regressor of interest
+- Permutation tests work very well in multivariate settings
+
+---
+## Permutation test B v C
+
+```r
+subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
+y <- subdata$count
+group <- as.character(subdata$spray)
+testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
+observedStat <- testStat(y, group)
+permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
+observedStat
+```
+
+```
+## [1] 13.25
+```
+
+```r
+mean(permutations > observedStat)
+```
+
+```
+## [1] 0
+```
+
+---
+## Histogram of permutations B v C
+plot of chunk unnamed-chunk-9
diff --git a/06_StatisticalInference/13_Resampling/index.pdf b/06_StatisticalInference/13_Resampling/index.pdf
new file mode 100644
index 000000000..ce8822a3c
Binary files /dev/null and b/06_StatisticalInference/13_Resampling/index.pdf differ
diff --git a/06_StatisticalInference/03_06_resampledInference/lecture12.tex b/06_StatisticalInference/13_Resampling/lecture12.tex
similarity index 100%
rename from 06_StatisticalInference/03_06_resampledInference/lecture12.tex
rename to 06_StatisticalInference/13_Resampling/lecture12.tex
diff --git a/06_StatisticalInference/Random Formulae/Random Formulae.pdf b/06_StatisticalInference/Random Formulae/Random Formulae.pdf
deleted file mode 100644
index 1d5418411..000000000
Binary files a/06_StatisticalInference/Random Formulae/Random Formulae.pdf and /dev/null differ
diff --git a/06_StatisticalInference/Random Formulae/index.Rmd b/06_StatisticalInference/Random Formulae/index.Rmd
deleted file mode 100644
index 3ca20a730..000000000
--- a/06_StatisticalInference/Random Formulae/index.Rmd
+++ /dev/null
@@ -1,141 +0,0 @@
----
-title : Random Formulae
-subtitle : Mathematical Biostatistics Boot Camp
-author : Brian Caffo, PhD
-job : Johns Hopkins Bloomberg School of 
Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## About this document - -This document contains random formulae images I used in the notes. - ---- - -$$A = \{1, 2\}$$ -$$B = \{1, 2, 3\}$$ - ---- - -$$ -\begin{eqnarray} -E[X^2] & = & \int_0^1 x^2 dx \\ - & = & \left. \frac{x^3}{3} \right|_0^1 = \frac{1}{3} -\end{eqnarray} -$$ - ---- - -$$\frac{|x - \mu|}{k\sigma} > 1$$ -Over the set $\{x : |x - \mu | > k\sigma\}$ -$$\frac{(x - \mu)^2}{k^2\sigma^2} > 1$$ -$$\frac{1}{k^2\sigma^2} \int_{-\infty}^\infty (x - \mu)^2 f(x) dx$$ -$$\frac{1}{k^2\sigma^2} E[(X - \mu)^2] = \frac{1}{k^2\sigma^2} Var(X)$$ - ---- - -$$P(A_1 \cup A_2 \cup A_3) = P\{A_1 \cup (A_2 \cup A_3)\} = P(A_1) + P(A_2 \cup A_3)$$ -$$P(A_1) + P(A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3)$$ - ---- - -$$P(\cup_{i=1}^n E_i) = P\left\{E_n \cup \left(\cup_{i=1}^{n-1} E_i \right) \right\}$$ - ---- - -$$ -(x_1, x_2, x_3, x_4) = (1, 0, 1, 1) -$$ -$$ -p^{(1 + 0 + 1 + 1)}(1 - p)^{\{4 - (1 + 0 + 1 + 1)\}} = p^3 (1 - p)^1 -$$ -$$ -\mathrm{SD}(X) \mathrm{SD}(Y) -$$ -$$ -Var(X) -$$ -$$ -Var(X) = E[X^2] - E[X]^2 \rightarrow E[X^2] = Var(X) + E[X]^2 = \sigma^2 + \mu^2 -$$ -$$ -Var(\bar X) = E[\bar X^2] - E[\bar X]^2 \rightarrow E[\bar X^2] = Var(\bar X) + E[\bar X]^2 = \sigma^2/n + \mu^2 -$$ -$$ -f(x | y = 5) = \frac{f_{xy}(x, 5)}{f_y(5)} -$$ - ---- - -$$ -P(A\cap B) -$$ -$$ -P(A) -$$ -$$ -P(A\cap B^c) -$$ - ---- - -$$ -\frac{10!}{1!9!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{9 \times 8 \times \ldots \times 1} = 10 -$$ - -$$ -\frac{10!}{2!8!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{2 \times 1 \times 8 \times 7 \times \ldots \times 1} = 45 -$$ - -In general - -$\left(\begin{array}{c}n \\ 
2\end{array}\right)= \frac{n \times (n - 1)}{2}$ - -$$ -\mu -$$ - -$$ -\sigma^2 -$$ - -$$ -E[Z] = E\left[\frac{X - \mu}{\sigma} \right] = \frac{E[X] - \mu}{\sigma} = 0 -$$ - ---- - -$$ -Var(Z) = Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu) = \frac{1}{\sigma^2} Var(X) = 1 -$$ - ---- - -$$ -E[X_i^2] = E[Y_i] = \sigma^2 + \mu^2 -$$ -$$ -\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^2 X_i^2 - n \bar X ^ 2 -$$ - ---- - -$$ -E[\chi^2_{df}] = df -$$ -$$ -E[S^2] = \sigma^2 -\rightarrow -E\left[\frac{(n-1)S^2}{\sigma^2}\right] = (n-1) -$$ -$$ -Var(\chi^2_{df}) = 2df -$$ \ No newline at end of file diff --git a/06_StatisticalInference/Random Formulae/index.html b/06_StatisticalInference/Random Formulae/index.html deleted file mode 100644 index 7ab6f0bff..000000000 --- a/06_StatisticalInference/Random Formulae/index.html +++ /dev/null @@ -1,288 +0,0 @@ - - - - Random Formulae - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Random Formulae

-

Mathematical Biostatistics Boot Camp

-

Brian Caffo, PhD
Johns Hopkins Bloomberg School of Public Health

-
-
- - - -
-

About this document

-
-
-

This document contains random formulae images I used in the notes.

- -
- -
- - -
- -
-
-

\[A = \{1, 2\}\] -\[B = \{1, 2, 3\}\]

- -
- -
- - -
- -
-
-

\[ -\begin{eqnarray} -E[X^2] & = & \int_0^1 x^2 dx \\ - & = & \left. \frac{x^3}{3} \right|_0^1 = \frac{1}{3} -\end{eqnarray} -\]

- -
- -
- - -
- -
-
-

\[\frac{|x - \mu|}{k\sigma} > 1\] -Over the set \(\{x : |x - \mu | > k\sigma\}\) -\[\frac{(x - \mu)^2}{k^2\sigma^2} > 1\] -\[\frac{1}{k^2\sigma^2} \int_{-\infty}^\infty (x - \mu)^2 f(x) dx\] -\[\frac{1}{k^2\sigma^2} E[(X - \mu)^2] = \frac{1}{k^2\sigma^2} Var(X)\]

- -
- -
- - -
- -
-
-

\[P(A_1 \cup A_2 \cup A_3) = P\{A_1 \cup (A_2 \cup A_3)\} = P(A_1) + P(A_2 \cup A_3)\] -\[P(A_1) + P(A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3)\]

- -
- -
- - -
- -
-
-

\[P(\cup_{i=1}^n E_i) = P\left\{E_n \cup \left(\cup_{i=1}^{n-1} E_i \right) \right\}\]

- -
- -
- - -
- -
-
-

\[ -(x_1, x_2, x_3, x_4) = (1, 0, 1, 1) -\] -\[ -p^{(1 + 0 + 1 + 1)}(1 - p)^{\{4 - (1 + 0 + 1 + 1)\}} = p^3 (1 - p)^1 -\] -\[ -\mathrm{SD}(X) \mathrm{SD}(Y) -\] -\[ -Var(X) -\] -\[ -Var(X) = E[X^2] - E[X]^2 \rightarrow E[X^2] = Var(X) + E[X]^2 = \sigma^2 + \mu^2 -\] -\[ -Var(\bar X) = E[\bar X^2] - E[\bar X]^2 \rightarrow E[\bar X^2] = Var(\bar X) + E[\bar X]^2 = \sigma^2/n + \mu^2 -\] -\[ -f(x | y = 5) = \frac{f_{xy}(x, 5)}{f_y(5)} -\]

- -
- -
- - -
- -
-
-

\[ -P(A\cap B) -\] -\[ -P(A) -\] -\[ -P(A\cap B^c) -\]

- -
- -
- - -
- -
-
-

\[ -\frac{10!}{1!9!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{9 \times 8 \times \ldots \times 1} = 10 -\]

- -

\[ -\frac{10!}{2!8!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{2 \times 1 \times 8 \times 7 \times \ldots \times 1} = 45 -\]

- -

In general

- -

\(\left(\begin{array}{c}n \\ 2\end{array}\right)= \frac{n \times (n - 1)}{2}\)

- -

\[ -\mu -\]

- -

\[ -\sigma^2 -\]

- -

\[ -E[Z] = E\left[\frac{X - \mu}{\sigma} \right] = \frac{E[X] - \mu}{\sigma} = 0 -\]

- -
- -
- - -
- -
-
-

\[ -Var(Z) = Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu) = \frac{1}{\sigma^2} Var(X) = 1 -\]

- -
- -
- - -
- -
-
-

\[ -E[X_i^2] = E[Y_i] = \sigma^2 + \mu^2 -\] -\[ -\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^2 X_i^2 - n \bar X ^ 2 -\]

- -
- -
- - -
- -
-
-

\[ -E[\chi^2_{df}] = df -\] -\[ -E[S^2] = \sigma^2 -\rightarrow -E\left[\frac{(n-1)S^2}{\sigma^2}\right] = (n-1) -\] -\[ -Var(\chi^2_{df}) = 2df -\]

- -
- -
- - -
- - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/06_StatisticalInference/Random Formulae/index.md b/06_StatisticalInference/Random Formulae/index.md deleted file mode 100644 index 4262bd6b5..000000000 --- a/06_StatisticalInference/Random Formulae/index.md +++ /dev/null @@ -1,141 +0,0 @@ ---- -title : Random Formulae -subtitle : Mathematical Biostatistics Boot Camp -author : Brian Caffo, PhD -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## About this document - -This document contains random formulae images I used in the notes. - ---- - -$$A = \{1, 2\}$$ -$$B = \{1, 2, 3\}$$ - ---- - -$$ -\begin{eqnarray} -E[X^2] & = & \int_0^1 x^2 dx \\ - & = & \left. 
\frac{x^3}{3} \right|_0^1 = \frac{1}{3} -\end{eqnarray} -$$ - ---- - -$$\frac{|x - \mu|}{k\sigma} > 1$$ -Over the set $\{x : |x - \mu | > k\sigma\}$ -$$\frac{(x - \mu)^2}{k^2\sigma^2} > 1$$ -$$\frac{1}{k^2\sigma^2} \int_{-\infty}^\infty (x - \mu)^2 f(x) dx$$ -$$\frac{1}{k^2\sigma^2} E[(X - \mu)^2] = \frac{1}{k^2\sigma^2} Var(X)$$ - ---- - -$$P(A_1 \cup A_2 \cup A_3) = P\{A_1 \cup (A_2 \cup A_3)\} = P(A_1) + P(A_2 \cup A_3)$$ -$$P(A_1) + P(A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3)$$ - ---- - -$$P(\cup_{i=1}^n E_i) = P\left\{E_n \cup \left(\cup_{i=1}^{n-1} E_i \right) \right\}$$ - ---- - -$$ -(x_1, x_2, x_3, x_4) = (1, 0, 1, 1) -$$ -$$ -p^{(1 + 0 + 1 + 1)}(1 - p)^{\{4 - (1 + 0 + 1 + 1)\}} = p^3 (1 - p)^1 -$$ -$$ -\mathrm{SD}(X) \mathrm{SD}(Y) -$$ -$$ -Var(X) -$$ -$$ -Var(X) = E[X^2] - E[X]^2 \rightarrow E[X^2] = Var(X) + E[X]^2 = \sigma^2 + \mu^2 -$$ -$$ -Var(\bar X) = E[\bar X^2] - E[\bar X]^2 \rightarrow E[\bar X^2] = Var(\bar X) + E[\bar X]^2 = \sigma^2/n + \mu^2 -$$ -$$ -f(x | y = 5) = \frac{f_{xy}(x, 5)}{f_y(5)} -$$ - ---- - -$$ -P(A\cap B) -$$ -$$ -P(A) -$$ -$$ -P(A\cap B^c) -$$ - ---- - -$$ -\frac{10!}{1!9!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{9 \times 8 \times \ldots \times 1} = 10 -$$ - -$$ -\frac{10!}{2!8!} = \frac{10\times 9 \times 8 \times \ldots \times 1}{2 \times 1 \times 8 \times 7 \times \ldots \times 1} = 45 -$$ - -In general - -$\left(\begin{array}{c}n \\ 2\end{array}\right)= \frac{n \times (n - 1)}{2}$ - -$$ -\mu -$$ - -$$ -\sigma^2 -$$ - -$$ -E[Z] = E\left[\frac{X - \mu}{\sigma} \right] = \frac{E[X] - \mu}{\sigma} = 0 -$$ - ---- - -$$ -Var(Z) = Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu) = \frac{1}{\sigma^2} Var(X) = 1 -$$ - ---- - -$$ -E[X_i^2] = E[Y_i] = \sigma^2 + \mu^2 -$$ -$$ -\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^2 X_i^2 - n \bar X ^ 2 -$$ - ---- - -$$ -E[\chi^2_{df}] = df -$$ -$$ -E[S^2] = \sigma^2 -\rightarrow -E\left[\frac{(n-1)S^2}{\sigma^2}\right] = (n-1) -$$ -$$ -Var(\chi^2_{df}) = 2df 
-$$
diff --git a/06_StatisticalInference/cp.R b/06_StatisticalInference/cp.R
new file mode 100644
index 000000000..2365695e8
--- /dev/null
+++ b/06_StatisticalInference/cp.R
@@ -0,0 +1,26 @@
+## A program for copying the index.pdf files and naming them
+## appropriately in the lectures directory
+## Brian Caffo
+##
+## Has to be run within the directory and won't overwrite
+## unless you change this to TRUE
+overwrite = FALSE
+
+## Get the directory names (they all start with 0)
+dirNames <- dir(pattern = "^[0-1][0-9]_[a-zA-Z]")
+
+## Loop over them and copy the pdf files
+sapply(dirNames, function(x)
+    file.copy(from = paste(x, "/index.pdf", sep = ""),
+              to = paste("lectures/", x, ".pdf", sep = ""),
+              overwrite = overwrite
+              )
+    )
+
+## Loop over them and copy the RMD files
+sapply(dirNames, function(x)
+    file.copy(from = paste(x, "/index.Rmd", sep = ""),
+              to = paste("rmd/", x, ".Rmd", sep = ""),
+              overwrite = overwrite
+              )
+)
diff --git a/06_StatisticalInference/grading.md b/06_StatisticalInference/grading.md
deleted file mode 100644
index c846b9967..000000000
--- a/06_StatisticalInference/grading.md
+++ /dev/null
@@ -1,17 +0,0 @@
-## Grading and logistics
-
-The grading in this class is very straightforward.
-
-1. There are four quizzes, each containing in the neighborhood of 10 questions.
-2. Each question is equally weighted as 1 point.
-3. Some require two answers, each giving half of a point (for a maximum total of 1 point for those questions).
-4. Your total points is the sum of the points questions across all quizzes that you answered correctly (using all of your quiz attempts).
-5. 70% or more of the total points is a pass for the class.
-6. 80% or more of the total points is a pass with distinction.
- - - - - - - diff --git a/06_StatisticalInference/homework/hw1.Rmd b/06_StatisticalInference/homework/hw1.Rmd index f5476f5c7..edba182ee 100644 --- a/06_StatisticalInference/homework/hw1.Rmd +++ b/06_StatisticalInference/homework/hw1.Rmd @@ -1,187 +1,188 @@ ---- -title : Homework 1 for Stat Inference -subtitle : Extra problems for Stat Inference -author : Brian Caffo -job : Johns Hopkins Bloomberg School of Public Health -framework : io2012 -highlighter : highlight.js -hitheme : tomorrow -#url: -# lib: ../../librariesNew #Remove new if using old slidify -# assets: ../../assets -widgets : [mathjax, quiz, bootstrap] -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -library(knitr) -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## About these slides -- These are some practice problems for Statistical Inference Quiz 1 -- They were created using slidify interactive which you will learn in -Creating Data Products -- Please help improve this with pull requests here -(https://github.com/bcaffo/courses) - - ---- &radio - -Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 15% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 10% while that the mother contracted the disease is 9%. What is the probability that both contracted influenza expressed as a whole number percentage? - -1. 15% -2. 10% -3. 9% -4. 
_4%_ - -*** .hint -$A = Father$, $P(A) = .10$, $B = Mother$, $P(B) = .09$ -$P(A\cup B) = .15$, - -*** .explanation -$P(A\cup B) = P(A) + P(B) - P(AB)$ thus -$$.15 = .10 + .09 - P(AB)$$ -```{r} -.10 + .09 - .15 -``` - ---- &radio - -A random variable, $X$, is uniform, a box from $0$ to $1$ of height $1$. (So that it's density is $f(x) = 1$ for $0\leq x \leq 1$.) What is it's median expressed to two decimal places?

- -1. 1.00 -2. 0.75 -3. _0.50_ -4. 0.25 - -*** .hint -The median is the point so that 50% of the density lies below it. - -*** .explanation -This density looks like a box. So, notice that $P(X \leq x) = width\times height = x$. -We want $.5 = P(X\leq x) = x$. - ---- &radio - -You are playing a game with a friend where you flip a coin and if it comes up heads you give her $X$ dollars and if it comes up tails she gives you $Y$ dollars. The odds that the coin is heads in $d$. What is your expected earnings? - -1. _$-X \frac{d}{1 + d} + Y \frac{1}{1+d} $_ -2. $X \frac{d}{1 + d} + Y \frac{1}{1+d} $ -3. $X \frac{d}{1 + d} - Y \frac{1}{1+d} $ -4. $-X \frac{d}{1 + d} - Y \frac{1}{1+d} $ - -*** .hint -The probability that you win on a given round is given by $p / (1 - p) = d$ which implies -that $p = d / (1 + d)$. - -*** .explanation -You lose $X$ with probability $p = d/(1 +d)$ and you win $Y$ with probability $1-p = 1/(1 + d)$. So your answer is -$$ --X \frac{d}{1 + d} + Y \frac{1}{1+d} -$$ - ---- &radio -A random variable takes the value -4 with probabability .2 and 1 with proabability .8. What -is the variance of this random variable? - -1. 0 -2. _4_ -3. 8 -4. 16 - -*** .hint -This random variable has mean 0. The variance would be given by $E[X^2]$ then. - -*** .explanation -$$E[X] = 0$$ -$$ -Var(X) = E[X^2] = (-4)^2 * .2 + (1)^2 * .8 -$$ -```{r} --4 * .2 + 1 * .8 -(-4)^2 * .2 + (1)^2 * .8 -``` - - ---- &radio -If $\bar X$ and $\bar Y$ are comprised of $n$ iid random variables arising from distributions -having means $\mu_x$ and $\mu_y$, respectively and common variance $\sigma^2$ -what is the variance $\bar X - \bar Y$? - -1. 0 -2. _$2\sigma^2/n$_ -3. $\mu_x$ - $\mu_y$ -4. $2\sigma^2$ - -*** .hint -Remember that $Var(\bar X) = Var(\bar Y) = \sigma^2 / n$. - -*** .explanation -$$ -Var(\bar X - \bar Y) = Var(\bar X) + Var(\bar Y) = \sigma^2 / n + \sigma^2 / n -$$ - ---- &radio -Let $X$ be a random variable having standard deviation $\sigma$. 
What can -be said about $X /\sigma$? - -1. Nothing -2. _It must have variance 1._ -3. It must have mean 0. -4. It must have variance 0. - -*** .hint -$Var(aX) = a^2 Var(X)$ - -*** .explanation -$$Var(X / \sigma) = Var(X) / \sigma^2 = 1$$ - - ---- &radio -If a continuous density that never touches the horizontal axis is symmetric about zero, can we say that its associated median is zero? - -1. _Yes_ -2. No. -3. It can not be determined given the information given. - -*** .explanation -This is a surprisingly hard problem. The easy explanation is that 50% of the probability -is below 0 and 50% is above so yes. However, it is predicated on the density not being -a flat line at 0 around 0. That's why the caveat that it never touches the horizontal axis -is important. - - ---- &radio - -Consider the following pmf given in R -```{r} -p <- c(.1, .2, .3, .4) -x <- 2 : 5 -``` -What is the variance expressed to 1 decimal place? - -1. _1.0_ -2. 4.0 -3. 6.0 -4. 17.0 - -*** .hint -The variance is $E[X^2] - E[X^2]$ - -*** .explanation -```{r} -sum(x ^ 2 * p) - sum(x * p) ^ 2 -``` +--- +title : Homework 1 for Stat Inference +subtitle : (Use arrow keys to navigate) +author : Brian Caffo +job : Johns Hopkins Bloomberg School of Public Health +framework : io2012 +highlighter : highlight.js +hitheme : tomorrow +#url: +# lib: ../../librariesNew #Remove new if using old slidify +# assets: ../../assets +widgets : [mathjax, quiz, bootstrap] +mode : selfcontained # {standalone, draft} +--- +```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} +# make this an external chunk that can be included in any file +library(knitr) +options(width = 100) +opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') + +options(xtable.type = 'html') +knit_hooks$set(inline = function(x) { + if(is.numeric(x)) { + round(x, getOption('digits')) + } else { + paste(as.character(x), 
collapse = ', ')
+  }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+
+## About these slides
+- These are some practice problems for Statistical Inference Quiz 1
+- They were created using slidify interactive which you will learn in
+Creating Data Products
+- Please help improve this with pull requests here
+(https://github.com/bcaffo/courses)
+
+--- &radio
+
+Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 15% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 10% and the probability that the mother contracted the disease is 9%. What is the probability that both contracted influenza expressed as a whole number percentage?
+[Watch a video solution](https://www.youtube.com/watch?v=CvnmoCuIN08&index=1&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L)
+
+1. 15%
+2. 10%
+3. 9%
+4. _4%_
+
+*** .hint
+$A = Father$, $P(A) = .10$, $B = Mother$, $P(B) = .09$
+$P(A\cup B) = .15$,
+
+*** .explanation
+$P(A\cup B) = P(A) + P(B) - P(AB)$ thus
+$$.15 = .10 + .09 - P(AB)$$
+```{r}
+.10 + .09 - .15
+```
+
+--- &radio
+
+A random variable, $X$, is uniform, a box from $0$ to $1$ of height $1$. (So that its density is $f(x) = 1$ for $0\leq x \leq 1$.) What is its median expressed to two decimal places?
+[Watch a video solution.](https://www.youtube.com/watch?v=UXcarD-1xAM&index=2&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L)

+
+1. 1.00
+2. 0.75
+3. _0.50_
+4. 0.25
+
+*** .hint
+The median is the point so that 50% of the density lies below it.
+
+*** .explanation
+This density looks like a box. So, notice that $P(X \leq x) = width\times height = x$.
+We want $.5 = P(X\leq x) = x$.
+
+--- &radio
+
+You are playing a game with a friend where you flip a coin and if it comes up heads you give her $X$ dollars and if it comes up tails she gives you $Y$ dollars. The odds that the coin is heads is $d$. What are your expected earnings? [Watch a video solution.](https://www.youtube.com/watch?v=5J88Zq0q81o&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=3)
+
+1. _$-X \frac{d}{1 + d} + Y \frac{1}{1+d} $_
+2. $X \frac{d}{1 + d} + Y \frac{1}{1+d} $
+3. $X \frac{d}{1 + d} - Y \frac{1}{1+d} $
+4. $-X \frac{d}{1 + d} - Y \frac{1}{1+d} $
+
+*** .hint
+The odds that you lose on a given round are given by $p / (1 - p) = d$ which implies
+that $p = d / (1 + d)$.
+
+*** .explanation
+You lose $X$ with probability $p = d/(1 +d)$ and you win $Y$ with probability $1-p = 1/(1 + d)$. So your answer is
+$$
+-X \frac{d}{1 + d} + Y \frac{1}{1+d}
+$$
+
+--- &radio
+A random variable takes the value -4 with probability .2 and 1 with probability .8. What
+is the variance of this random variable? [Watch a video solution.](https://www.youtube.com/watch?v=Em-xJeQO1rc&index=4&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L)
+
+1. 0
+2. _4_
+3. 8
+4. 16
+
+*** .hint
+This random variable has mean 0. The variance would be given by $E[X^2]$ then.
+
+*** .explanation
+$$E[X] = 0$$
+$$
+Var(X) = E[X^2] = (-4)^2 * .2 + (1)^2 * .8
+$$
+```{r}
+-4 * .2 + 1 * .8
+(-4)^2 * .2 + (1)^2 * .8
+```
+
+
+--- &radio
+If $\bar X$ and $\bar Y$ are comprised of $n$ iid random variables arising from distributions
+having means $\mu_x$ and $\mu_y$, respectively and common variance $\sigma^2$
+what is the variance of $\bar X - \bar Y$? 
[Watch a video solution of this problem.](https://www.youtube.com/watch?v=7zJhPzX6jns&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=5)
+
+1. 0
+2. _$2\sigma^2/n$_
+3. $\mu_x - \mu_y$
+4. $2\sigma^2$
+
+*** .hint
+Remember that $Var(\bar X) = Var(\bar Y) = \sigma^2 / n$.
+
+*** .explanation
+$$
+Var(\bar X - \bar Y) = Var(\bar X) + Var(\bar Y) = \sigma^2 / n + \sigma^2 / n
+$$
+
+--- &radio
+Let $X$ be a random variable having standard deviation $\sigma$. What can
+be said about $X /\sigma$? [Watch a video solution of this problem.](https://www.youtube.com/watch?v=0WUj18_BUPA&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=6)
+
+1. Nothing
+2. _It must have variance 1._
+3. It must have mean 0.
+4. It must have variance 0.
+
+*** .hint
+$Var(aX) = a^2 Var(X)$
+
+*** .explanation
+$$Var(X / \sigma) = Var(X) / \sigma^2 = 1$$
+
+
+--- &radio
+If a continuous density that never touches the horizontal axis is symmetric about zero, can we say that its associated median is zero? [Watch a video solution.](https://www.youtube.com/watch?v=sn48CGH_TXI&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=7)
+
+1. _Yes_
+2. No.
+3. It cannot be determined from the information given.
+
+*** .explanation
+This is a surprisingly hard problem. The easy explanation is that 50% of the probability
+is below 0 and 50% is above so yes. However, it is predicated on the density not being
+a flat line at 0 around 0. That's why the caveat that it never touches the horizontal axis
+is important.
+
+
+--- &radio
+
+Consider the following pmf given in R
+```{r}
+p <- c(.1, .2, .3, .4)
+x <- 2 : 5
+```
+What is the variance expressed to 1 decimal place? [Watch a solution to this problem.](https://www.youtube.com/watch?v=sn48CGH_TXI&list=PLpl-gQkQivXhHOcVeU3bSJg78zaDYbP9L&index=7)
+
+1. _1.0_
+2. 4.0
+3. 6.0
+4. 
17.0
+
+*** .hint
+The variance is $E[X^2] - E[X]^2$
+
+*** .explanation
+```{r}
+sum(x ^ 2 * p) - sum(x * p) ^ 2
+```
diff --git a/06_StatisticalInference/homework/hw1.html b/06_StatisticalInference/homework/hw1.html
index 3db02f1ae..ee036ef46 100644
--- a/06_StatisticalInference/homework/hw1.html
+++ b/06_StatisticalInference/homework/hw1.html
@@ -34,7 +34,7 @@

Homework 1 for Stat Inference

-

Extra problems for Stat Inference

+

(Use arrow keys to navigate)

Brian Caffo
Johns Hopkins Bloomberg School of Public Health

@@ -49,7 +49,7 @@

About these slides