Most due dates are 5 p.m. Friday for CrowdGrader assignments.
Most CrowdGrader peer reviews will be due by 11 p.m. on Tuesdays.
Field | Excellent (3) | Competent (2) | Needs Work (1) |
---|---|---|---|
Reproducible | All graphs, code, and answers are created from text files. Answers are never hardcoded but instead are inserted using inline R code. An automatically generated references section with properly formatted citations (when appropriate) and sessionInfo() are provided at the end of the document. | All graphs, code, and answers are created from text files. Answers are hardcoded. No sessionInfo() is provided at the end of the document. References are present but not cited properly or not automatically generated. | Document uses copy and paste with graphs or code. Answers are hardcoded, and references, when appropriate, are hardcoded. |
Statistical Understanding | Answers to questions demonstrate clear statistical understanding by comparing theoretical answers to simulated answers. When hypotheses are tested, classical methods are compared and contrasted to randomization methods. When confidence intervals are constructed, classical approaches are compared and contrasted with bootstrap procedures. The scope of inferential conclusions made is appropriate for the sampling method. | Theoretical and simulated answers are computed but no discussion is present comparing and contrasting the results. When hypotheses are tested, results for classical and randomization methods are presented but are not compared and contrasted. When confidence intervals are constructed, classical and bootstrap approaches are computed but the results are not compared and contrasted. The scope of inferential conclusions made is appropriate for the sampling method. | Theoretical and simulated answers are not computed correctly. No comparison between classical and randomization approaches is present when testing hypotheses. When confidence intervals are constructed, there is no comparison between classical and bootstrap confidence intervals. |
Graphics | Graphs for categorical data (barplot, mosaic plot, etc.) have appropriately labeled axes and titles. Graphs for quantitative data (histograms, density plots, violin plots, etc.) have appropriately labeled axes and titles. Multivariate graphs use appropriate legends and labels. Computer variable names are replaced with descriptive variable names. | Appropriate graphs for the type of data are used. Not all axes have appropriate labels or computer variable names are used in the graphs. | Inappropriate graphs are used for the type of data. Axes are not labeled and computer variable names appear in the graphs. |
Coding | Code (primarily R) produces correct answers. Non-standard or complex functions are commented. Code is formatted using a consistent standard. | Code produces correct answers. Commenting is not used with non-standard and complex functions. No consistent code formatting is used. | Code does not produce correct answers. Code has no comments and is not formatted. |
Clarity | Few errors of grammar and usage; any minor errors do not interfere with meaning. Language style and word choice are highly effective and enhance meaning. Style and word choice are appropriate for the assignment. | Some errors of grammar and usage; errors do not interfere with meaning. Language style and word choice are, for the most part, effective and appropriate for the assignment. | Major errors of grammar and usage make meaning unclear. Language style and word choice are ineffective and/or inappropriate. |
When you register for a free individual GitHub account, request a student discount to obtain a few private repositories as well as unlimited public repositories. Please use something similar to FirstNameLastName as your username when you register with GitHub. For example, my username on GitHub is alanarnholt. If you have a popular name such as John Smith, you may need to provide some other distinguishing characteristic in your username. Please use the same username for your account on RPubs.
Once you have a GitHub account, send an email to arnholtat@appstate.edu with a Subject line of STT 2810 - GitHub Username, and tell me in the body of your email your first name, last name, and your GitHub username. I will then manually add you as a team member to the repository in the STAT-ATA-ASU organization that has your name (LastName-FirstName). This repository will be where you store all of your work for this course. I will also change your repository to a private repository.
Sign up to audit the Coursera classes Introduction to Probability and Data, Inferential Statistics, and Linear Regression and Modeling—auditing these courses will give you access to some excellent videos.
Become familiar with the Appstate RStudio server. You will use your Appstate user name and password to log in to the server. You must be registered in the class to access the server.
Read Chapters 1-2 of bookdown: Authoring Books and Technical Documents with R Markdown
Follow the directions from Happy Git and GitHub for the useR to Introduce yourself to Git, Connect to GitHub, and Cache credentials for HTTPS. Note: Git, R, and the RStudio IDE have already been installed for you on the RStudio server.
Read the Git and GitHub chapter from Hadley Wickham’s book R Packages
Watch the following video:
You may want to install Git, R, RStudio, Zotero, and optionally LaTeX on your personal computer. If you do, you will want to follow Jenny Bryan’s excellent advice in Happy Git and GitHub for the useR. Note: Git, R, RStudio, and LaTeX are installed on the Appstate RStudio server.
Watch the following videos as appropriate:
Read chapters 1-3 of Reproducible Research with R and RStudio.
Watch all videos for week one of Introduction to Probability and Data.
Read sections 1.1-1.5 of OpenIntro Statistics, 3rd Edition.
Read/review Getting used to R, RStudio, and R Markdown
Read Chapters 1-2 of bookdown: Authoring Books and Technical Documents with R Markdown
Using Zotero will be covered.
We will discuss the LAB/PAPER due Feb 17 that will use the bookdown::html_document2 format. See Week 5 for details.
Clone the repository to your local machine using RStudio by following these instructions:

1. Select File > New Project > Version Control > Git.
2. Paste the repository URL (https://github.com/YourUserName/STT2810HonorsClassRepo.git) in the Repository URL: box.
3. Enter STT2810HonorsClassRepo in the Project directory name: box.
4. Choose where to store the project in the Create project as subdirectory of: box.
5. Click the Create Project button.

You should now have a local copy of the forked repository on your local machine. Congratulations!
Set the upstream remote in your fork to this repository with the command:

git remote add upstream https://github.com/STAT-ATA-ASU/STT2810HonorsClassRepo.git

Verify the remotes with git remote -v. To obtain updates from the upstream repository, run:

git pull upstream master

Note: if the default branch of the upstream repository is gh-pages, use gh-pages instead of master to obtain updates:

git pull upstream gh-pages
If there are conflicts, you will need to resolve them before proceeding.
Create a free account on DataCamp. Complete Introduction to R in Data Analysis and Statistical Inference
Complete the intro_to_r_ASU2.Rmd lab found in the StatsWithRLabs folder of your private repository, and submit your compiled *.html file no later than 5:00 p.m., January 27, to CrowdGrader. Make sure you commit and push your final product to your private repository. Install the statsr package by running the following from the R command line:
devtools::install_github('alanarnholt/statsr')
Identify variables as numerical or categorical.
library(ggplot2)
library(openintro)
ggplot(data = county, aes(x = poverty, y = fed_spend)) +
geom_point(alpha = 0.10, color = "blue") +
theme_bw() +
labs(x = "Poverty Rate (Percent)", y = "Federal Spending Per Capita")
Test yourself: Give one example of each type of variable you have learned.
Identify the explanatory variable in a pair of variables as the variable suspected of affecting the other. Note, however, that labeling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association between the two variables is identified.
Classify a study as observational or experimental. Determine and explain whether the study’s results can be generalized to the population and whether the results suggest an association or causation between the quantities studied.
Identify confounding variables and sources of bias in a given study.
Distinguish among simple random, stratified, and cluster sampling. Recognize the benefits and drawbacks of choosing one sampling scheme over another.
Identify the four principles of experimental design, and recognize their purposes: control any possible confounders; randomize into treatment and control groups; replicate by using a sufficiently large sample or repeating the experiment; and block any variables that might influence the response.
Identify if single or double blinding has been used in a study.
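As a concrete illustration of the sampling schemes above, a simple random sample and a stratified sample can be drawn with base R. The population below (200 students across four class years) is made up for the example:

```r
set.seed(13)
# Hypothetical population: 200 students across 4 class years (made-up data)
population <- data.frame(
  id = 1:200,
  year = rep(c("Fr", "So", "Jr", "Sr"), each = 50)
)

# Simple random sample of 20 students
srs <- population[sample(nrow(population), 20), ]

# Stratified sample: 5 students drawn from each class year
strata <- split(population, population$year)
stratified <- do.call(rbind, lapply(strata, function(d) d[sample(nrow(d), 5), ]))

table(stratified$year)  # exactly 5 per stratum by construction
```

Note that the stratified design guarantees representation from every class year, while the simple random sample may over- or under-represent some years by chance.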
Test yourself:
Watch all videos for week two of Introduction to Probability and Data.
Read sections 1.6-1.8 of OpenIntro Statistics, 3rd Edition.
Read Section 3.4 of bookdown: Authoring Books and Technical Documents with R Markdown
Complete Introduction to Data in Data Analysis and Statistical Inference
Complete the intro_to_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., February 3, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Integrating Zotero with R Markdown will be discussed during the first part of the week. See this document.
You may find these videos helpful.
ggplot(data = email50, aes(x = num_char, y = line_breaks)) +
geom_point(alpha = 0.50, color = "blue") +
theme_bw() +
labs(x = "Number of Characters (in thousands)", y = "Number of Lines")
library(dplyr)
ANS <- email50 %>%
summarize(MeanChar = mean(num_char), SDchar = sd(num_char),
MedianChar = median(num_char), IQRchar = IQR(num_char), n = n())
knitr::kable(ANS, caption = "Summary Statistics")
MeanChar | SDchar | MedianChar | IQRchar | n |
---|---|---|---|---|
11.59822 | 13.12526 | 6.8895 | 12.87525 | 50 |
Note that there are three commonly used measures of center and spread:
Identify the shape of a distribution as unimodal, bimodal, multimodal, or uniform; and if the shape is unimodal, further classify the distribution as symmetric, right skewed, or left skewed.
ggplot(data = email50, aes(x = num_char)) +
geom_histogram(fill = "lightblue", binwidth = 5, color = "black") +
theme_bw() +
labs(x = "Number of Characters (in thousands)", y = "Frequency")
Use histograms and box plots to visualize the shape, center, and spread of numerical distributions. Use intensity maps to visualize the spatial distribution of the data.
Define a robust statistic (e.g. median, IQR) as a statistic that is not heavily affected by skewness or extreme outliers. Determine when robust statistics are more appropriate measures of center and spread compared to non-robust statistics.
Recognize when transformations (e.g. log) can make the distribution of data more symmetric and easier to model.
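To see why the median and IQR are robust while the mean and standard deviation are not, consider this small made-up data set with one extreme outlier:

```r
# Hypothetical right-skewed data with one extreme outlier (values are made up)
x <- c(28, 31, 33, 35, 36, 38, 40, 42, 45, 480)

mean(x)    # pulled far to the right by the outlier
median(x)  # barely affected by the outlier
sd(x)      # inflated by the outlier
IQR(x)     # robust measure of spread

# A log transformation makes the distribution much more symmetric,
# bringing the mean and median close together
mean(log(x))
median(log(x))
```

Here the single extreme value drags the mean well above the median, while the median and IQR describe the bulk of the data.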
Test yourself:
xtabs(~number, data = email)
number
none small big
549 2827 545
ggplot(data = email, aes(x = number)) +
geom_bar(fill = "pink") +
theme_bw()
xtabs(~spam + number, data = email)
number
spam none small big
0 400 2659 495
1 149 168 50
ggplot(data = email, aes(x = number, fill = as.factor(spam))) +
geom_bar(position = "fill") +
theme_bw() +
labs(y = "Fraction") +
scale_fill_manual(values = c("purple", "lightblue"), name = "Spam")
ggplot(data = email, aes(x = number, fill = as.factor(spam))) +
geom_bar(position = "stack") +
theme_bw() +
labs(y = "Frequency") +
scale_fill_manual(values = c("purple", "lightblue"), name = "Spam")
library(vcd)
mosaic(~number + as.factor(spam), data = email, shade = TRUE)
ggplot(data = email, aes(x = number, y = num_char)) +
geom_boxplot() +
theme_bw()
Test yourself:
Watch all videos for week three of Introduction to Probability and Data.
Read sections 2.1-2.5 of OpenIntro Statistics, 3rd Edition.
Complete Probability in Data Analysis and Statistical Inference
Complete the probability_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., February 10, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the probability of an outcome as the proportion of times the outcome would occur if we observed the random process that gives rise to it an infinite number of times.
Explain why the long-run relative frequency of repeated independent events approaches the true probability as the number of trials increases, i.e. why the law of large numbers holds.
Define disjoint (mutually exclusive) events as events that cannot both happen at the same time:
Distinguish between disjoint and independent events.
If \(A\) and \(B\) are independent, then having information on \(A\) does not tell us anything about \(B\) (and vice versa).
If \(A\) and \(B\) are disjoint, then knowing that \(A\) occurs tells us that \(B\) cannot occur (and vice versa).
The statement “disjoint (mutually exclusive) events are always dependent, since if one event occurs we know the other one cannot” is not quite correct. NOTE: \(P(A \cap B) = 0\) if \(P(A)=0,\) and then \(A\) and \(B\) are both disjoint AND independent.
Draw Venn diagrams representing events and their probabilities.
Define a probability distribution as a list of the possible outcomes with corresponding probabilities that satisfies three rules:
Define complementary outcomes as mutually exclusive outcomes of the same random process whose probabilities add up to 1.
Distinguish between union of events (\(A\) or \(B\)) and intersection of events (\(A\) and \(B\)).
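The law of large numbers described above can be illustrated with a short simulation of fair coin flips, tracking the running proportion of heads:

```r
set.seed(42)
# Simulate 10,000 fair coin flips (1 = head, 0 = tail)
flips <- sample(c(0, 1), size = 10000, replace = TRUE)

# Running proportion of heads after each flip
running_prop <- cumsum(flips) / seq_along(flips)

# Early proportions wander; the running proportion settles near 0.5
running_prop[c(10, 100, 10000)]
plot(running_prop, type = "l", ylim = c(0, 1),
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)
```

The plot shows large swings for small numbers of flips and convergence toward the true probability 0.5 as the number of trials grows.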
Test yourself:
What is the probability of getting a head on the 6th coin flip if in the first 5 flips the coin landed on a head each time?
True / False: Being right handed and having blue eyes are mutually exclusive events.
Distinguish between marginal and conditional probabilities.
Construct tree diagrams to calculate conditional probabilities and probabilities of the intersection of non-independent events using Bayes’ theorem: \(P(A|B)=\frac{P(A \cap B)}{P(B)}.\)
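As a sketch of a tree-diagram calculation with Bayes’ theorem, consider these made-up diagnostic-testing numbers (a 1% prevalence, 95% sensitivity, and 90% specificity are assumed purely for illustration):

```r
# Hypothetical numbers for illustration only
p_D <- 0.01             # P(disease)
p_pos_given_D  <- 0.95  # sensitivity, P(+ | disease)
p_pos_given_nD <- 0.10  # 1 - specificity, P(+ | no disease)

# Total probability of a positive test: the two branches of the tree diagram
p_pos <- p_pos_given_D * p_D + p_pos_given_nD * (1 - p_D)

# Bayes' theorem: P(disease | +) = P(+ and disease) / P(+)
p_D_given_pos <- p_pos_given_D * p_D / p_pos
p_D_given_pos   # about 0.088
```

Even with a fairly accurate test, the low prevalence means most positives come from the much larger disease-free branch of the tree, so \(P(\text{disease} \mid +)\) is surprisingly small.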
Watch all videos for week four of Introduction to Probability and Data.
Read sections 3.1, 3.2, and 3.4 of OpenIntro Statistics, 3rd Edition.
Compile and read the exampleEDA.Rmd lab.
Complete your LAB/PAPER using the bookdown::html_document2 format. Create a folder named 01_MyInterests under your StatsWithRProjects folder. Store your LAB/PAPER *.Rmd file in the 01_MyInterests folder. Make sure you commit and push your final product to your private repository no later than 5:00 p.m., Feb 17. Your LAB/PAPER needs to have at least two graphs with captions, at least two numbered and referenced equations, at least one TABLE with a descriptive caption, and properly formatted References. Use this example if you need one.
Define the standardized score (\(z\)-score) of a data point as the number of standard deviations it is away from the mean: \(z = \frac{x - \mu}{\sigma}.\)
Use the \(z\!~\text{-score}\)
if the distribution is normal to determine the percentile score of a data point (using technology or normal probability tables);
regardless of the shape of the distribution to assess whether or not the particular observation is considered to be unusual (more than 2 standard deviations away from the mean).
Depending on the shape of the distribution, determine whether the median would have a negative, positive, or zero \(z\)-score, keeping in mind that the mean always has a \(z\)-score of 0.
Assess whether or not a distribution is nearly normal using either the 68-95-99.7% rule or graphical methods such as a normal probability plot.
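The \(z\)-score and percentile calculations above can be carried out with pnorm() and qnorm(); the mean and standard deviation below are made-up values for illustration:

```r
# Sketch, assuming a N(mean = 70, sd = 3) distribution (made-up values)
x <- 74
z <- (x - 70) / 3             # z-score: 1.33 SDs above the mean
pnorm(z)                      # percentile of this observation (about 0.91)
pnorm(74, mean = 70, sd = 3)  # same answer without standardizing first

# z cutoff leaving 2.5% in the upper tail (about 1.96)
qnorm(0.975)

# An observation more than 2 SDs from the mean would be considered unusual
abs(z) > 2   # FALSE here
```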
Determine if a random variable is binomial using the four conditions:
The trials are independent.
The number of trials, \(n\), is fixed.
Each trial outcome can be classified as either a success or a failure.
The probability of a success, \(p\), is the same for each trial.
Calculate the number of possible scenarios for obtaining \(k\) successes in n trials using the binomial coefficient: \(\binom{n}{k}=\frac{n!}{k!(n-k)!}.\)
Calculate probability of a given number of successes in a given number of trials using the binomial distribution: \(P(X=k)=\binom{n}{k}p^k(1-p)^{(n-k)}.\)
Calculate the expected value \((\mu = np)\) and standard deviation \(\left(\sigma = \sqrt{np(1-p)}\right)\) of a binomial distribution.
When the number of trials is sufficiently large (\(np \ge 10\) and \(n(1-p) \ge 10\)), use the normal approximation to calculate binomial probabilities. Explain why this approach works.
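The binomial formulas above correspond directly to choose(), dbinom(), and pbinom() in R; the values of \(n\) and \(p\) below are illustrative:

```r
# Binomial with n = 100 trials and success probability p = 0.3 (illustrative)
n <- 100
p <- 0.3

choose(n, 25)        # number of scenarios with exactly 25 successes
dbinom(25, n, p)     # P(X = 25) exactly

mu    <- n * p                 # expected value, 30
sigma <- sqrt(n * p * (1 - p)) # standard deviation, about 4.58

# Normal approximation is reasonable: np = 30 and n(1-p) = 70 are both >= 10
pbinom(25, n, p)           # exact P(X <= 25)
pnorm(25.5, mu, sigma)     # approximation with a continuity correction
```

The exact and approximate probabilities agree closely here, which is the point of the success-failure condition.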
Test yourself:
PDS package.
Watch all videos for week one of Inferential Statistics.
Read sections 4.1-4.2 of OpenIntro Statistics, 3rd Edition.
Complete Foundations for Inference: Sampling Distributions in Data Analysis and Statistical Inference
Complete the sampling_distributions_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., February 24, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define a statistic as a point estimate for a population parameter. For example, the sample mean is used to estimate the population mean. Note that “point estimate” and “statistic” are synonymous.
Recognize that point estimates (such as the sample mean) will vary from one sample to another. We define this variability as sampling variability (sometimes called sampling variation).
Calculate the sampling variability of the mean, the standard deviation of \(\bar{x}\), as \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\) where \(\sigma\) is the population standard deviation.
Distinguish between standard deviation (\(\sigma\) or \(s\)) and standard error (\(SE\)): standard deviation measures the variability in the data, while standard error measures the variability in point estimates from different samples of the same size and from the same population; it measures the sampling variability.
Recognize that when the sample size increases we would expect the sampling variability to decrease.
Conceptually: Imagine taking many samples from the population. When the size of each sample is large, the sample means will be much more consistent across samples than when the sample sizes are small.
Mathematically: Remember \(SE = \frac{s}{\sqrt{n}}\). Then, when \(n\) increases, \(SE\) will decrease since \(n\) is in the denominator.
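A quick simulation (with an arbitrary \(N(100, 15)\) population, chosen only for illustration) confirms that \(\sigma/\sqrt{n}\) describes the variability of sample means and that it shrinks as \(n\) grows:

```r
set.seed(7)
# Draw 1000 samples of each size from a N(100, 15) population and compare
# the spread of the resulting sample means
means_n25  <- replicate(1000, mean(rnorm(25,  mean = 100, sd = 15)))
means_n100 <- replicate(1000, mean(rnorm(100, mean = 100, sd = 15)))

sd(means_n25)   # close to 15 / sqrt(25)  = 3
sd(means_n100)  # close to 15 / sqrt(100) = 1.5
```

Quadrupling the sample size cuts the sampling variability of the mean in half, just as the formula predicts.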
Define a confidence interval as the plausible range of values for a population parameter.
Define the confidence level as the expected percentage of random samples which yield confidence intervals that capture the true population parameter.
Recognize that the Central Limit Theorem (CLT) is about the distribution of point estimates. Under certain conditions, this distribution will be nearly normal.
In the case of the mean, the CLT tells us that if the sample size is sufficiently large and the observations in the sample are independent, then the distribution of the sample mean will be nearly normal. The distribution will be centered at the true population mean and have a standard deviation of \(\frac{\sigma}{\sqrt{n}}\). The distribution of \(\bar{x}\) is written with symbols as
\[\begin{equation} \bar{X} \dot{\sim} N\left(\mu_{\bar{x}} = \mu, \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\right) \label{clt} \end{equation}\]Recall that independence of observations in a sample is provided by random sampling (in the case of observational studies) or random assignment (in the case of experiments).
Recognize that the nearly normal distribution of the point estimate (as suggested by the CLT) implies that a confidence interval can be calculated as \(\text{point estimate} \pm z_{1-\alpha/2}\cdot SE.\)
Define margin of error as the distance required to travel in either direction away from the point estimate when constructing a confidence interval, i.e. \(z_{1 - \alpha/2}\cdot \frac{\sigma}{\sqrt{n}}.\)
Notice that this corresponds to half the width of the confidence interval.
Interpret a confidence interval as “We are XX% confident that the true population parameter is in this interval,” where XX% is the desired confidence level.
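Putting the pieces above together, a confidence interval can be computed by hand in R; the summary statistics below (n = 50, sample mean 20, sample standard deviation 4) are made up for the example:

```r
# Made-up summary statistics for illustration
n    <- 50
xbar <- 20
s    <- 4

z_star <- qnorm(0.975)   # critical value for 95% confidence, about 1.96
SE     <- s / sqrt(n)    # standard error of the mean
ME     <- z_star * SE    # margin of error

c(lower = xbar - ME, upper = xbar + ME)   # 95% confidence interval
```

The margin of error is half the width of the interval, so reporting the point estimate and ME is equivalent to reporting the interval itself.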
Test yourself:
For each of the following situations, state whether the variable is categorical or numerical and whether the parameter of interest is a mean or a proportion.
Suppose heights of all women in the US have a mean of 63.7 inches and a random sample of 100 women’s heights yields a sample mean of 65.2 inches. Which values are the population parameter and the point estimate, respectively? Which one is \(\mu\) and which one is \(\bar{x}\)?
Suppose heights of all women in the US have a standard deviation of 2.7 inches and a random sample of 100 women’s heights yields a standard deviation of 4 inches. Which value is the population parameter and which value is the point estimate? Which one is \(\sigma\) and which one is \(s\)?
Explain, in plain English, what you see in Figure 4.8 of the book (page 166).
List the conditions necessary for the CLT to hold.
Confirm that \(z_{1-\alpha/2}\) for a 98% confidence level is 2.33. (Include a sketch of the normal curve in your response.)
Calculate a 95% confidence interval for the average height of US women using a random sample of 100 women where the sample mean is 63 inches and the sample standard deviation is 3 inches. Interpret this interval in context of the data.
Explain, in plain English, the difference between the standard error and the margin of error.
Watch all videos for week two of Inferential Statistics.
Read sections 4.3-4.6 of OpenIntro Statistics, 3rd Edition.
Complete Foundations for Inference: Confidence Intervals in Data Analysis and Statistical Inference
Complete the confidence_intervals_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., March 3, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Explain how the hypothesis testing framework resembles a court trial.
Recognize that in hypothesis testing we evaluate two competing claims:
the null hypothesis, which represents a skeptical perspective or the status quo, and
the alternative hypothesis, which represents an alternative under consideration and is often represented by a range of possible parameter values.
Construction of hypotheses:
Always construct hypotheses about population parameters (e.g. population mean, \(\mu\)) and not the sample statistics (e.g. sample mean, \(\bar{x}\)). Note that the population parameter is unknown while the statistic is measured using the observed data and hence there is no point in hypothesizing about it.
Define the null value as the value the parameter is set to equal in the null hypothesis.
The alternative hypothesis may be one-sided (\(\mu <\) or \(>\) the null value) or two-sided (\(\mu \ne\) the null value). The choice depends on the research question.
Define a \(p\!~\text{-value}\) as the conditional probability of obtaining a statistic at least as extreme as the one observed given that the null hypothesis is true.
Calculate a \(p\!~\text{-value}\) as the area under the normal curve beyond the observed sample mean (either in one tail or both, depending on the alternative hypothesis). Note that in doing so you can use a \(z\!~\text{-score}\), where
\(p\!~\text{-value} = P(\text{observed or more extreme statistic} \mid H_0~\text{true})\)
Always sketch the normal curve when calculating the \(p\!~\text{-value}\), and shade the appropriate area(s) depending on whether the alternative hypothesis is one- or two-sided.
Infer that if a confidence interval does not contain the null value, the null hypothesis should be rejected in favor of the alternative.
Compare the \(p\!~\text{-value}\) to the significance level to make a decision between the hypotheses:
If the \(p\!~\text{-value}\) < the significance level, reject the null hypothesis. This means that obtaining a statistic at least as extreme as the observed data is extremely unlikely to happen just by chance. Conclude that the data provides evidence for the alternative hypothesis.
If the \(p\!~\text{-value}\) > the significance level, fail to reject the null hypothesis. This means that obtaining a statistic at least as extreme as the observed data is likely to happen by chance. Conclude that the data does not provide evidence for the alternative hypothesis.
Note that we can never “accept” the null hypothesis since the hypothesis testing framework does not allow us to confirm it.
Note that the conclusion of a hypothesis test might be erroneous regardless of the decision we make.
Define a Type I error as rejecting the null hypothesis when the null hypothesis is actually true.
Define a Type II error as failing to reject the null hypothesis when the alternative hypothesis is actually true.
Use a smaller \(\alpha\) if a Type I error is more critical than a Type II error. Use a larger \(\alpha\) if a Type II error is more critical than a Type I error.
Recognize that sampling distributions of point estimates coming from samples that don’t meet the required conditions for the CLT (about sample size, skew, and independence) will not be normal.
Formulate the framework for statistical inference using hypothesis testing and nearly normal point estimates:
Set up the hypotheses first in plain language and then using appropriate notation.
Identify the appropriate statistic that can be used as a point estimate for the parameter of interest.
Verify that the conditions for the CLT hold.
Compute the \(SE\), sketch the sampling distribution, and shade area(s) representing the \(p\!~\text{-value}\).
Using the sketch and the normal model, calculate the \(p\!~\text{-value}\); and determine if the null hypothesis should be rejected or not. State your conclusion in context of the data and the research question.
If the conditions necessary for the CLT to hold are not met, note this and do not go forward with the analysis. (We will learn later about methods to use in these situations.)
Calculate the required sample size to obtain a given margin of error at a given confidence level by working backwards from the given margin of error.
Distinguish between statistical significance vs. practical significance.
Define power as the probability of correctly rejecting the null hypothesis (the complement of a Type II error).
Test yourself:
List the errors in the following hypotheses: \(H_0:\bar{x} \gt 20\) versus \(H_A:\bar{x} \ge 25.\)
What is wrong with the following statement: “If the \(p\!~\text{-value}\) is large, we accept the null hypothesis since a large \(p\!~\text{-value}\) implies that the observed difference between the null value and the sample statistic is quite likely to happen just by chance”?
Suppose a researcher is interested in evaluating the claim “the average height of adult males in the US is 69.1 inches,” and she believes this is an underestimate.
How should she set up her hypotheses?
Explain to her, in plain language, how she should collect data and carry out a hypothesis test.
Suppose she collects a random sample of 40 adult males where the average is 70.2 inches. A test returns a \(p\!~\text{-value}\) of 0.0082. What should she conclude?
Interpret this \(p\!~\text{-value}\) (as a conditional probability) in context of the question.
Suppose that the true average is in fact 69.1 inches. If the researcher rejects the null hypothesis, what type of an error is the researcher making? In order to avoid making such an error, should she have used a smaller or a larger significance level?
Describe the differences and similarities between testing a hypothesis using simulation and testing a hypothesis using theory. Discuss how the calculation of the \(p\!~\text{-value}\) changes while the definition of the \(p\!~\text{-value}\) stays the same.
In a random sample of 1,017 Americans, 60% said they do not trust the mass media when it comes to reporting the news fully, accurately, and fairly. The standard error associated with this estimate is 0.015 (1.5%). What is the margin of error of a 95% confidence level? Calculate a 95% confidence interval and interpret it in context.
Watch all videos for week three of Inferential Statistics.
Read sections 5.1-5.5 of OpenIntro Statistics, 3rd Edition.
Read Can A New Drug Reduce the Spread of Schistosomiasis?—Use Ctrl+F on the web page to find the appropriate text inside the notes.
Complete Inference for Numerical Data in Data Analysis and Statistical Inference
Complete the inf_for_numerical_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., March 10, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Use the \(t\)-distribution for inference on a single mean, mean of differences (paired groups), and difference of independent means.
Explain how the shape of the \(t\)-distribution accounts for the additional variability introduced by using \(s\) (sample standard deviation) in place of \(\sigma\) (population standard deviation).
Describe how the \(t\)-distribution is different from the normal distribution and what “heavy tail” means in this context.
Note that the \(t\)-distribution has a single parameter, degrees of freedom. As the number of degrees of freedom increases, this distribution approaches the normal distribution.
Use a \(t\)-statistic with degrees of freedom \(df=n-1\) for inference for a population mean:
CI: \(\bar{x} \pm t_{1-\alpha/2, n-1}\cdot SE_{\bar{x}}\)
HT: \(t_{n-1}=\frac{\bar{x}-\mu_0}{SE_{\bar{x}}}\) where \(SE_{\bar{x}} = \frac{s}{\sqrt{n}}.\)
Describe how to obtain a \(p\!~\text{-value}\) for a \(t\!~\text{-test}\) and a critical \(t_{1 - \alpha/2, n-1}\) value for a confidence interval.
Define observations as paired if each observation in one data set has a special correspondence or connection with exactly one observation in the other data set.
Carry out inference for paired data by first subtracting the paired observations from each other. Then, treat the set of differences as a new numerical variable on which to do inference (such as a confidence interval or hypothesis test for the average difference).
Calculate the standard error of the mean difference between two paired (dependent) samples as \(SE = \frac{s_{diff}}{\sqrt{n_{diff}}}.\) Use this standard error in hypothesis testing and confidence intervals comparing means of paired (dependent) groups.
Use a \(t\)-statistic, with degrees of freedom \(df=n_{diff}-1\) for inference with the mean difference in two paired (dependent) samples:
CI: \(\bar{x}_{diff} \pm t_{1-\alpha/2, n_{diff}-1}\cdot SE\)
HT: \(t_{n_{diff}-1}=\frac{\bar{x}_{diff}-\mu_0}{SE}\) where \(SE = \frac{s_{diff}}{\sqrt{n_{diff}}}.\)
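The paired \(t\) procedures above can be checked against R’s t.test(); the pre/post scores below are simulated purely for illustration:

```r
set.seed(3)
# Hypothetical paired data: pre- and post-test scores for 15 students,
# with a true average improvement of 4 points
pre  <- rnorm(15, mean = 70, sd = 8)
post <- pre + rnorm(15, mean = 4, sd = 5)

# Inference proceeds on the differences, treated as one numerical variable
diffs  <- post - pre
SE     <- sd(diffs) / sqrt(length(diffs))
t_stat <- mean(diffs) / SE              # t with df = n_diff - 1 = 14

# t.test() with paired = TRUE reproduces the hand calculation
res <- t.test(post, pre, paired = TRUE)
all.equal(unname(res$statistic), t_stat)  # TRUE
res$conf.int                              # CI for the mean difference
```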
Recognize that a good interpretation of a confidence interval for the difference between two parameters includes a comparative statement (mentioning which group has the larger parameter).
Recognize that a confidence interval for the difference between two parameters that doesn’t include 0 is in agreement with a hypothesis test where the null hypothesis that sets the two parameters equal to each other is rejected.
Calculate the standard error for the difference between means of two independent samples as \(SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2}}.\) Use this standard error in hypothesis testing and confidence intervals comparing means of independent groups.
Use a \(t\)-statistic with \(\nu\) degrees of freedom for conducting inference with the difference in two independent means:
CI: \(\bar{x}_1 - \bar{x}_2 \pm t_{1-\alpha/2, \nu}\cdot SE\)
HT: \(t_{\nu}=\frac{(\bar{x}_1 - \bar{x}_2) -(\mu_1 -\mu_2)}{SE}\) where \(SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2}}.\)
\(\nu = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}\)
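This degrees-of-freedom formula is what R’s t.test() uses by default (the Welch test); a quick check with two simulated samples of unequal spread:

```r
set.seed(5)
# Two hypothetical independent samples with unequal standard deviations
x1 <- rnorm(20, mean = 10, sd = 2)
x2 <- rnorm(30, mean = 12, sd = 5)

s1 <- sd(x1); s2 <- sd(x2)
n1 <- length(x1); n2 <- length(x2)

SE <- sqrt(s1^2 / n1 + s2^2 / n2)   # SE for the difference in means

# Welch-Satterthwaite degrees of freedom
nu <- (s1^2 / n1 + s2^2 / n2)^2 /
      ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))

res <- t.test(x1, x2)                   # Welch test is R's default
all.equal(unname(res$parameter), nu)    # TRUE: R reports the same df
```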
Calculate the power of a test for a given effect size and significance level in two steps: first, find the cutoff value(s) of the sample statistic that would lead to rejecting the null hypothesis at the given significance level; second, calculate the probability of observing a statistic beyond the cutoff(s) under the alternative hypothesis determined by the effect size.
Explain how power changes for changes in effect size, sample size, significance level, and standard error.
Define analysis of variance (ANOVA) as a statistical inference method that is used to determine, by simultaneously considering many groups at once, if the variability between the sample means is so large that it seems unlikely to be from chance alone.
Recognize that the null hypothesis in ANOVA sets all means equal to each other and that the alternative hypothesis suggests that at least one mean is different.
List the conditions necessary for performing ANOVA:
The observations should be independent within and across groups.
The data within each group should be nearly normal.
The variability across the groups should be roughly equal.
Use graphical diagnostics to check if these conditions are met.
Recognize that the test statistic for ANOVA, the \(F\)-statistic, is calculated as the ratio of the mean square between groups (\(MSG\), variability between groups) and mean square error (\(MSE\), variability within groups). Also recognize that the \(F\)-statistic has a right skewed distribution with two different measures of degrees of freedom: one for the numerator (\(df_G=k-1,\) where \(k\) is the number of groups) and one for the denominator (\(df_E=n-k,\) where \(n\) is the total sample size).
Describe why calculation of the \(p\!~\text{-value}\) for ANOVA is always “one sided.”
Describe why conducting many \(t\!~\text{-test}\)s for differences between each pair of means leads to an increased Type I Error rate and why we use a corrected significance level (Bonferroni correction, \(\alpha^*=\alpha/K\), where \(K\) is the number of comparisons being considered) to combat inflating this error rate.
Describe why it is possible to reject the null hypothesis in ANOVA but not find significant differences between groups when doing pairwise comparisons.
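In R, a one-way ANOVA followed by Bonferroni-corrected pairwise comparisons can be sketched with the built-in `chickwts` data (an illustrative workflow, not course output):

```r
fit <- aov(weight ~ feed, data = chickwts)
summary(fit)    # F statistic with df_G = k - 1 = 5 and df_E = n - k = 65

# K = choose(6, 2) = 15 pairwise comparisons; Bonferroni uses alpha* = alpha / K
pairwise.t.test(chickwts$weight, chickwts$feed, p.adjust.method = "bonferroni")
```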
Describe how bootstrap distributions are constructed. Recognize how bootstrap distributions are different from sampling distributions.
Construct bootstrap confidence intervals using one of the following methods:
Percentile method: a XX% confidence interval is the middle XX% of the bootstrap distribution.
Standard error method: If the standard error of the bootstrap distribution is known and the distribution is nearly normal, the bootstrap interval can also be calculated as \(\bar{x}_{boot} \pm t^{\star}_{df}\cdot SE_{boot}\).
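Both interval methods can be sketched in R (the right-skewed sample below is simulated, purely for illustration):

```r
set.seed(3)
x <- rexp(50, rate = 1 / 10)       # an arbitrary right-skewed sample

# Bootstrap distribution of the sample mean: resample with replacement
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile method: middle 95% of the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))

# Standard error method (when the bootstrap distribution is nearly normal)
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(boot_means)
```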
Test yourself:
What is the \(t\)-value for a 95% confidence interval for a mean where the sample size is 13?
What is the \(p\!~\text{-value}\) for a hypothesis test where the alternative hypothesis is two-sided, the sample size is 20, and the test statistic, \(t\), is calculated to be 1.75?
20 cardiac patients’ blood pressures are measured before and after taking a medication. For a given patient, are the before and after blood pressure measurements dependent (paired) or independent?
A random sample of 100 students was obtained and then randomly assigned into two equal-sized groups. One group rode on a roller coaster while the other rode a simulator at an amusement park. Afterwards, their blood pressure measurements were taken. Are the measurements dependent (paired) or independent?
Describe how the two sample means test is different from the paired means test.
A 95% confidence interval for the difference between the number of calories consumed by mature and juvenile cats \((\mu_{mat}-\mu_{juv})\) is (80 calories, 100 calories). Interpret this interval, and determine if it suggests a significant difference between the two means.
We would like to compare the average incomes of Americans who live in the Northeast, Midwest, South, and West. What are the appropriate hypotheses?
Suppose the sample in question 7 has 1000 observations. What are the degrees of freedom associated with the \(F\)-statistic?
Suppose the appropriate null hypothesis from question 7 is rejected. Describe how we would discover which regions’ averages are different from each other. Make sure to discuss how many pairwise comparisons we would need to make and what the corrected significance level would be.
What visualizations are useful for checking each of the conditions required for performing ANOVA?
How is a bootstrap distribution different from a sampling distribution?
Watch all videos for week four of Inferential Statistics.
Read sections 6.1-6.6 of OpenIntro Statistics, 3rd Edition.
Complete Inference for Categorical Data in Data Analysis and Statistical Inference
Complete the inf_for_categorical_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., March 24, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the population proportion as \(p\) (parameter) and sample proportion as \(\hat{p}\).
Calculate the sampling variability of the proportion, the standard deviation, as \(\sigma_{\hat{p}}=SD_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion.
Note that if the CLT doesn’t apply and the sample proportion is low (close to 0), the sampling distribution will likely be right skewed. If the sample proportion is high (close to 1), the sampling distribution will likely be left skewed.
Remember that confidence intervals are calculated as
\(\text{point estimate} \pm \text{margin of error}\)
and standardized test statistics are calculated as
\(Z = \frac{\text{statistic} - \mu_{\text{statistic}}}{\sigma_{\text{statistic}}}\)
\(T = \frac{\text{statistic} - \mu_{\text{statistic}}}{\hat{\sigma}_{\text{statistic}}}\)
Note that the standard error calculation for the confidence interval and the hypothesis test are different when dealing with proportions. With hypothesis testing we assume that the null hypothesis is true. Remember: \(p\!~\text{-value}\) = P(observed or more extreme test statistic | \(H_0\) true).
For confidence intervals use \(\hat{p}\) (observed sample proportion) when calculating the standard error and checking the success/failure condition:
\(SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\)
For hypothesis tests use \(p_0\) (null value) when calculating the standard error and checking the success/failure condition:
\(SE_{\hat{p}} = \sqrt{\frac{p_0(1 - p_0)}{n}}\)
Such a discrepancy does not exist when conducting inference for means since the mean does not factor into the calculation of the standard error.
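A small R sketch makes the distinction concrete (the counts and null value below are made up, purely for illustration):

```r
n <- 200; successes <- 30
p_hat <- successes / n                     # observed sample proportion
p0    <- 0.10                              # hypothesized null value

se_ci <- sqrt(p_hat * (1 - p_hat) / n)     # SE for the confidence interval
se_ht <- sqrt(p0 * (1 - p0) / n)           # SE for the hypothesis test

p_hat + c(-1, 1) * qnorm(0.975) * se_ci    # 95% CI uses se_ci
(p_hat - p0) / se_ht                       # Z statistic uses se_ht
```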
Explain why when calculating the required minimum sample size for a given margin of error at a given confidence level we use \(\hat{p} = 0.5\) if there are no previous studies suggesting a more accurate estimate.
Conceptually: When there is no additional information, 50% chance of success is a good guess for events with only two outcomes (success or failure).
Mathematically: Using \(\hat{p} = 0.5\) yields the most conservative (highest) estimate for the required sample size.
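For example, for a 4% margin of error at 95% confidence, solving \(ME = z^{\star}\sqrt{p(1-p)/n}\) for \(n\) gives the following R sketch (the 10% prior estimate is hypothetical):

```r
me     <- 0.04
z_star <- qnorm(0.975)

# n = (z*/ME)^2 * p(1 - p), rounded up to the next whole observation
ceiling((z_star / me)^2 * 0.5 * (1 - 0.5))   # no prior information: n = 601
ceiling((z_star / me)^2 * 0.1 * (1 - 0.1))   # prior estimate p = 0.1: n = 217
```

Note how the conservative \(\hat{p} = 0.5\) guess nearly triples the required sample size.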
Note that the calculation of the standard error for the distribution of the difference in two independent sample proportions differs for a confidence interval and a hypothesis test.
confidence interval and hypothesis test when \(H_0: p_1 - p_2 = \text{some value other than 0:}\) \(SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\)
hypothesis test when \(H_0:p_1 - p_2 = 0:\) \(SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_2}}\) where \(\hat{p}_{pool}\) is the overall rate of success: \(\hat{p}_{pool} = \frac{\text{number of successes in group 1 + number of successes in group 2}}{n_1 + n_2}\)
Note that the reason for the difference in calculations of standard error is the same as in the case of the single proportion. When the null hypothesis claims that the two population proportions are equal, we need to take that information into consideration when calculating the standard error for the hypothesis test. Consequently, we use a common proportion for both samples.
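A minimal R sketch of the pooled calculation (the group counts are illustrative only):

```r
x1 <- 11; n1 <- 100          # successes / sample size, group 1 (illustrative)
x2 <- 10; n2 <- 80           # successes / sample size, group 2 (illustrative)

p_pool <- (x1 + x2) / (n1 + n2)          # overall rate of success

se_ht <- sqrt(p_pool * (1 - p_pool) / n1 +
              p_pool * (1 - p_pool) / n2)
z <- (x1 / n1 - x2 / n2 - 0) / se_ht     # test statistic for H0: p1 - p2 = 0
```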
Use a chi-squared test of goodness of fit to evaluate if the distribution of levels of a single categorical variable follows a hypothesized distribution.
\(H_0:\) The distribution of observed counts follows the hypothesized distribution and any observed differences are due to chance.
\(H_A:\) The distribution of observed counts does not follow the hypothesized distribution.
Calculate the expected counts for a given level (cell) in a one-way table as the sample size times the hypothesized proportion for that level.
Calculate the chi-squared test statistic as
\[\chi^2 = \sum_{i=1}^k\frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}, \text{where } k \text{ is the number of cells.}\]
Note that the chi-squared statistic is always positive and follows a right skewed distribution with one parameter, which is the degrees of freedom.
Note that the degrees of freedom for the chi-squared statistic for the goodness of fit test is \(df=k-1\).
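The goodness of fit calculation can be done by hand and checked against `chisq.test()` (the counts and hypothesized proportions below are made up, purely for illustration):

```r
observed <- c(89, 37, 30, 28, 16)              # k = 5 cells (made-up counts)
p_hyp    <- c(0.40, 0.20, 0.20, 0.15, 0.05)    # hypothesized distribution

expected <- sum(observed) * p_hyp              # n * hypothesized proportion
chi2 <- sum((observed - expected)^2 / expected)
pchisq(chi2, df = length(observed) - 1, lower.tail = FALSE)  # upper tail only

chisq.test(observed, p = p_hyp)                # same statistic and p-value
```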
List the conditions necessary for performing a chi-squared test (goodness of fit or independence):
The observations should be independent.
Each cell should have an expected count of at least 5.
Describe how to use the chi-squared table to obtain a \(p\!~\text{-value}\).
When evaluating the independence of two categorical variables, where at least one has more than two levels, use a chi-squared test of independence.
\(H_0:\) The two variables are independent.
\(H_A:\) The two variables are dependent.
Calculate expected counts in two-way tables as
\[E =\frac{\text{row total}\times\text{column total}}{\text{grand total}}\]
Calculate the degrees of freedom for chi-squared test of independence as \(df=(R-1)\times(C-1)\), where \(R\) is the number of rows in a two-way table, and \(C\) is the number of columns.
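A test of independence on a small two-way table in R (the counts are made up, purely for illustration):

```r
tab <- matrix(c(30, 25, 20, 10, 25, 40), nrow = 2,
              dimnames = list(group = c("A", "B"),
                              response = c("low", "mid", "high")))

res <- chisq.test(tab)
res$expected      # row total * column total / grand total for each cell
res$parameter     # df = (R - 1)(C - 1) = (2 - 1)(3 - 1) = 2
```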
Note that there is no such thing as a chi-squared confidence interval for proportions.
Use simulation methods when sample size conditions are not met for inference for categorical variables.
In hypothesis testing
for one categorical variable, generate simulated samples under the null hypothesis. Next, calculate the proportion of simulated samples that are at least as extreme as the observed data.
for two categorical variables, use a randomization test.
Use bootstrap methods for confidence intervals for categorical variables with at most two levels.
Test yourself:
Suppose 10% of ASU students smoke. You collect many random samples of 100 ASU students at a time and calculate a sample proportion of students who smoke \((\hat{p})\) for each sample. What would you expect the distribution of \(\hat{p}\) to be? Describe its shape, center, and spread.
Suppose you want to construct a confidence interval with a margin of error no more than 4% for the proportion of ASU students who smoke. How would your calculation of the required sample size change if you do not know anything about the smoking habits of ASU students versus if you have a reliable previous study estimating that about 10% of ASU students smoke?
Suppose a 95% confidence interval for the difference between the proportion of male and the proportion of female ASU students who smoke is (-0.08, 0.11). Interpret this interval, making sure to incorporate a comparative statement about the two sexes of ASU students.
Does the above interval suggest a significant difference between the true proportions of smokers in the two groups?
Suppose you had a sample of 100 male ASU students where 11 of them smoke and a sample of 80 female ASU students where 10 of them smoke. Calculate \(\hat{p}_{pool}\).
When and why do we use \(\hat{p}_{pool}\) in the calculation of the standard error for the difference between two sample proportions?
Explain the different hypothesis tests one could use when assessing the distribution of a categorical variable (e.g. smoking status) with only two levels (e.g. levels: smoker and non-smoker) vs. more than two levels (e.g. levels: heavy smoker, moderate smoker, occasional smoker, non-smoker).
Why is the \(p\!~\text{-value}\) for chi-squared tests always “one sided”?
What are the null and alternative hypotheses in the chi-squared test of independence?
Suppose a chi-squared test of independence between two categorical variables (one with 5, the other with 3 levels) yields a test statistic of \(\chi^2 = 14\). What is the conclusion of the hypothesis test at a 5% significance level?
Complete the Project2Template.Rmd project found in the StatsWithRProjects folder of your private repository. Submit your compiled *.html file no later than 5:00 p.m., March 31, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Watch all videos for week one of Linear Regression and Modeling.
Read sections 7.1-7.2 of OpenIntro Statistics, 3rd Edition.
Complete Introduction to linear regression in Data Analysis and Statistical Inference
No Lab this week!
Define the explanatory variable as the independent variable (predictor) and the response variable as the dependent variable (predicted).
Plot the explanatory variable (\(x\)) on the \(x\)-axis and the response variable (\(y\)) on the \(y\)-axis, and fit a linear regression model
\[y = \beta_0 + \beta_1 x + \varepsilon,\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is random error.
Note that the point estimates (estimated from observed data) for \(\beta_0\) and \(\beta_1\) are \(b_0\) and \(b_1\), respectively.
When describing the association between two numerical variables, evaluate
direction: positive \((x\uparrow,y\uparrow)\), negative \((x\downarrow,y\uparrow)\)
form: linear or not
strength: determined by the scatter around the underlying relationship
Define correlation as the linear association between two numerical variables.
Note that the correlation coefficient (\(r\), also called Pearson’s \(r\)) has the following properties:
The magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables.
The sign of the correlation coefficient indicates the direction of association.
The correlation coefficient is always between -1 and 1.
\(r=0\) indicates no linear relationship.
The correlation coefficient is unit-less.
Since the correlation coefficient is unit-less, it is not affected by changes in the center or scale of either variable (such as unit conversions).
The correlation of \(X\) with \(Y\) is the same as the correlation of \(Y\) with \(X\).
The correlation coefficient is sensitive to outliers.
Recall that correlation does not imply causation.
Define residual (\(e\)) as the difference between the observed (\(y\)) and predicted (\(\hat{y}\)) values of the response variable.
\[e_i = y_i - \hat{y}_i\]
Define the least squares line as the line that minimizes the sum of the squared residuals. Recognize whether the conditions for using least squares regression (linearity, nearly normal residuals, constant variability, and independent observations) have been satisfied.
Define an indicator variable as a binary explanatory variable (with two levels).
Calculate the estimate for the slope (\(b_1\)) as
\[b_1 = r \frac{s_y}{s_x},\]
where \(r\) is the correlation coefficient, \(s_y\) is the standard deviation of the response variable, and \(s_x\) is the standard deviation of the explanatory variable.
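The slope formula can be verified against `lm()` using R's built-in `cars` data (an illustrative check, not course material):

```r
# Built-in cars data: speed (x) and stopping distance (y)
x <- cars$speed; y <- cars$dist

b1 <- cor(x, y) * sd(y) / sd(x)   # b1 = r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # line passes through (x-bar, y-bar)

fit <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```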
Interpret the slope as
When \(x\) is numerical: “For each unit increase in \(x\), we would expect \(y\) to be lower/higher on average by \(|b_1|\) units.”
When \(x\) is categorical: “The value of the response variable is predicted to be \(|b_1|\) units higher/lower between the baseline level and the other level of the explanatory variable.”
Note that whether the response variable increases or decreases is determined by the sign of \(b_1\).
Note that the least squares line always passes through the average of the response and explanatory variables \((\bar{x}, \bar{y}).\)
Use the above property to calculate the estimate for the intercept \((b_0)\) as
\[b_0 = \bar{y} - b_1\bar{x},\]
where \(b_1\) is the slope, \(\bar{y}\) is the average of the response variable, and \(\bar{x}\) is the average of the explanatory variable.
Interpret the intercept as
“When \(x=0\), we would expect \(y\) to equal, on average, \(b_0\)”—when \(x\) is numerical.
“The expected average value of the response variable for the reference level of the explanatory variable is \(b_0\)”—when \(x\) is categorical.
Define \(R^2\) as the percentage of the variability in the response variable explained by changes in the explanatory variable.
For a good model, we would like this number to be as close to 100% as possible.
\(R^2\) is calculated as the square of the correlation coefficient.
Test yourself:
A teaching assistant gives a quiz. There are 10 questions on the quiz and no partial credit is given. After grading the papers, the TA writes down the number of questions each student answered correctly. What is the correlation between the number of questions answered correctly and incorrectly? Hint: Make up some data for number of questions right, calculate number of questions wrong, and plot them against each other.
Suppose you fit a linear regression model predicting the score on an exam from the number of hours studied. Say you have studied for 4 hours. Would you prefer to be on the line, below the line, or above the line? What would the residual for your score be (0, negative, or positive)?
Derive the formula for \(b_0\) as a function of \(b_1\) given the fact that the linear model is \(\hat{y}=b_0 + b_1x\) and that the least squares line goes through \((\bar{x}, \bar{y})\).
One study on male college students found their average height to be 70 inches with a standard deviation of 2 inches. Their average weight was 140 pounds with a standard deviation of 25 pounds. The correlation between their height and weight was 0.60. Assuming that the two variables are linearly associated, write the linear model for predicting weight from height.
Is a male who is 72 inches tall and who weighs 115 pounds on the line, below the line, or above the line calculated in question 6?
What is an indicator variable, and what do levels 0 and 1 mean for such variables?
If the correlation between two variables \(y\) and \(x\) is 0.6, what percent of the variability in \(y\) do changes in \(x\) explain?
The model below predicts GPA based on an indicator variable (0: not premed, 1: premed). Interpret the intercept and slope estimates in context of the data.
Watch all videos for week two of Linear Regression and Modeling.
Read sections 7.3-7.4 of OpenIntro Statistics, 3rd Edition.
Review Introduction to linear regression in Data Analysis and Statistical Inference
Complete the simple_regression_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., April 14, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
Define an influential point as a point that influences (changes) the slope of the regression line.
Do not remove outliers from an analysis without good reason.
Be cautious about using a categorical explanatory variable when one of the levels has very few observations as these may act as influential points.
Determine whether an explanatory variable is a significant predictor for the response variable using the \(t\!~\text{-test}\) and the associated \(p\!~\text{-value}\) in the regression output.
When testing for the significance of the predictor, the hypotheses are \(H_0:\beta_1=0\) and \(H_A:\beta_1\ne0\). Recognize that standard software output returns a \(p\!~\text{-value}\) for the two-sided alternative hypothesis.
Calculate the \(T\!~\text{-score}\) for the hypothesis test as
\[T_{df} = \frac{b_1 - \text{null value}}{SE_{b_1}} \text{ with } df = n - 2.\]
Note that the \(T\!~\text{-score}\) has \(n-2\) degrees of freedom since we lose one degree of freedom for each parameter we estimate. In simple linear regression, we estimate the intercept and the slope.
Note that a hypothesis test for the intercept is often irrelevant since it is usually out of the range of the data.
Calculate a confidence interval for the slope as
\[b_1 \pm t_{1- \alpha/2, df}\cdot SE_{b_1}\]
where \(df=n-2\) and \(t_{1- \alpha/2, df}\) is the critical score associated with the given confidence level at the desired degrees of freedom.
Note that the standard error of the slope estimate \(SE_{b_1}\) can be found on the regression output.
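Pulling these pieces from regression output can be sketched in R with the built-in `cars` data (an illustrative check, not course material):

```r
fit <- lm(dist ~ speed, data = cars)       # built-in cars data
est <- summary(fit)$coefficients           # Estimate, Std. Error, t value, Pr(>|t|)

b1 <- est["speed", "Estimate"]
se <- est["speed", "Std. Error"]
df <- nrow(cars) - 2                       # n - 2 for simple linear regression

(b1 - 0) / se                              # T score, matches est["speed", "t value"]
ci <- b1 + c(-1, 1) * qt(0.975, df) * se   # 95% CI for the slope
all.equal(unname(confint(fit)["speed", ]), ci)   # TRUE
```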
Test yourself:
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 16.0839 | 3.0866 | 5.2109 | 0 |
x | 0.7339 | 0.0473 | 15.5241 | 0 |
Watch all videos for week three of Linear Regression and Modeling.
Read sections 8.1-8.3 of OpenIntro Statistics, 3rd Edition.
Complete Multiple linear regression in Data Analysis and Statistical Inference
Complete the multiple_regression_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., April 26, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the multiple linear regression model as \[\hat{y} = b_0 +b_1x_1+b_2x_2+\cdots+b_kx_k,\]
where there are \(k\) predictors (explanatory variables).
Interpret the estimate for the intercept \((b_0)\) as the expected value of \(y\) when all predictors are equal to 0.
Interpret the estimate for a slope (say \(b_1\)) as “All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(|b_1|\) units.”
Define collinearity as a high correlation between two independent variables such that the two variables contribute redundant information to the model, which is something we want to avoid in multiple linear regression.
Note that \(R^2\) will increase with each explanatory variable added to the model, regardless of whether or not the added variable is a meaningful predictor of the response variable. We use adjusted \(R^2\), which applies a penalty for the number of predictors included in the model, to assess the strength of a multiple linear regression model.
\[R^2_{adj}= 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}\] where \(n\) is the number of cases and \(k\) is the number of predictors. Note that \(R^2_{adj}\) will only increase if the added variable has a meaningful contribution to the amount of explained variability in \(y\), i.e. if the gains from adding the variable exceed the penalty.
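The adjusted \(R^2\) formula can be verified against `summary(lm())` using R's built-in `mtcars` data (an illustrative check; the choice of predictors is arbitrary):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)    # built-in mtcars data, k = 2 predictors

n <- nrow(mtcars); k <- 2
sse <- sum(residuals(fit)^2)               # sum of squared errors
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

r2_adj <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))
all.equal(summary(fit)$adj.r.squared, r2_adj)    # TRUE
```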
Define model selection as identifying the best model for predicting a given response variable.
Note that we usually prefer simpler (parsimonious) models over more complicated ones.
Define the full model as the model with all explanatory variables included as predictors.
The significance of the model as a whole is assessed using an \(F\)-test, where \(H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\) and \(H_A:\) at least one \(\beta_i \ne 0\).
Note that the \(p\!~\text{-values}\) associated with each predictor are conditional on other variables being included in the model, so they can be used to assess if a given predictor is significant, given that all others are in the model.
\(H_0:\beta_1=0\), given all other variables are included in the model
\(H_A:\beta_1\ne0\), given all other variables are included in the model
These \(p\!~\text{-values}\) are calculated based on a \(t\) distribution with \(n-k-1\) degrees of freedom
The same degrees of freedom can be used to construct a confidence interval for the slope parameter of each predictor:
\[b_i \pm t_{1-\alpha/2, n-k-1}\cdot SE_{b_i}\]
Stepwise model selection (backward or forward) can be done based on \(p\!~\text{-values}\) (drop variables that are not significant) or based on adjusted \(R^2\) (choose the model with highest adjusted \(R^2\)).
The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached.
Start with the full model.
Drop the variable with the highest \(p\!~\text{-value}\) and refit the model.
Repeat until all remaining variables are significant.
The general idea behind forward-selection is to start with only one variable and add one variable at a time until the ideal model is obtained.
Try all possible simple linear regression models predicting \(y\) using one explanatory variable at a time. Choose the model where the explanatory variable of choice has the lowest \(p\!~\text{-value}\).
Try all possible models adding one more explanatory variable at a time. Choose the model where the added explanatory variable has the lowest \(p\!~\text{-value}\).
Repeat until all added variables are significant.
The adjusted \(R^2\) method is more computationally intensive, but it is more reliable since it does not depend on an arbitrary significance level.
List the conditions for multiple linear regression as
linearity between each numerical explanatory variable and the response,
nearly normal residuals,
constant variability of residuals, and
independent residuals.
Note that no model is perfect, but even imperfect models can be useful.
Test yourself:
How are multiple linear regression and simple linear regression different?
What does “all else held constant” mean in the interpretation of a slope coefficient in multiple linear regression?
What is collinearity? Why do we want to avoid collinearity in multiple regression models?
Explain the difference between \(R^2\) and adjusted \(R^2\). Which one will be higher? Which one tells us the variability in \(y\) explained by the model? Which one is a better measure of the strength of a linear regression model?
Define the term “parsimonious model.”
Describe the backward-selection algorithm using adjusted \(R^2\) as the criterion for model selection.
Complete the Project3Template.Rmd project found in the StatsWithRProjects folder of your private repository. Submit your compiled *.html file no later than 5:00 p.m., May 11, to CrowdGrader. Make sure you commit and push your final product to your private repository.