Most due dates are 5 p.m. Friday for CrowdGrader assignments.
Most CrowdGrader peer reviews will be due by 11 p.m. on Tuesdays.
The grading rubric for submissions can be found in the grubric directory and below.
Field | Excellent (3) | Competent (2) | Needs Work (1) |
---|---|---|---|
Reproducible | All graphs, code, and answers are created from text files. Answers are never hardcoded but instead are inserted using inline R code. An automatically generated references section with properly formatted citations (when appropriate) and sessionInfo() are provided at the end of the document. | All graphs, code, and answers are created from text files. Answers are hardcoded. No sessionInfo() is provided at the end of the document. References are present but are not cited properly or not automatically generated. | Document uses copy and paste for graphs or code. Answers are hardcoded, and references, when appropriate, are hardcoded. |
Statistical Understanding | Answers to questions demonstrate clear statistical understanding by comparing theoretical answers to simulated answers. When hypotheses are tested, classical methods are compared and contrasted to randomization methods. When confidence intervals are constructed, classical approaches are compared and contrasted with bootstrap procedures. The scope of inferential conclusions made is appropriate for the sampling method. | Theoretical and simulated answers are computed but no discussion is present comparing and contrasting the results. When hypotheses are tested, results for classical and randomization methods are presented but are not compared and contrasted. When confidence intervals are constructed, classical and bootstrap approaches are computed but the results are not compared and contrasted. The scope of inferential conclusions made is appropriate for the sampling method. | Theoretical and simulated answers are not computed correctly. No comparison between classical and randomization approaches is present when testing hypotheses. When confidence intervals are constructed, there is no comparison between classical and bootstrap confidence intervals. |
Graphics | Graphs for categorical data (barplot, mosaic plot, etc.) have appropriately labeled axes and titles. Graphs for quantitative data (histograms, density plots, violin plots, etc.) have appropriately labeled axes and titles. Multivariate graphs use appropriate legends and labels. Computer variable names are replaced with descriptive variable names. | Appropriate graphs for the type of data are used. Not all axes have appropriate labels or computer variable names are used in the graphs. | Inappropriate graphs are used for the type of data. Axes are not labeled and computer variable names appear in the graphs. |
Coding | Code (primarily R) produces correct answers. Non-standard or complex functions are commented. Code is formatted using a consistent standard. | Code produces correct answers. Commenting is not used with non-standard and complex functions. No consistent code formatting is used. | Code does not produce correct answers. Code has no comments and is not formatted. |
Clarity | Few errors of grammar and usage; any minor errors do not interfere with meaning. Language style and word choice are highly effective and enhance meaning. Style and word choice are appropriate for the assignment. | Some errors of grammar and usage; errors do not interfere with meaning. Language style and word choice are, for the most part, effective and appropriate for the assignment. | Major errors of grammar and usage make meaning unclear. Language style and word choice are ineffective and/or inappropriate. |
When you register for a free individual GitHub account, request a student discount to obtain a few private repositories as well as unlimited public repositories. Please use something similar to FirstNameLastName as your username when you register with GitHub. For example, my username on GitHub is alanarnholt. If you have a popular name such as John Smith, you may need to provide some other distinguishing characteristic in your username. Please use the same username for your account on Rpubs.
Once you have a GitHub account, send an email to arnholtat@appstate.edu with a Subject line of STT 5811 - GitHub Username, and tell me in the body of your email your first name, last name, and your GitHub username. I will then manually add you as a team member to the repository in the STAT-ATA-ASU organization that has your name (LastName-FirstName). This repository will be where you store all of your work for this course. I will also change your repository to a private repository.
Sign up to audit the Coursera classes Introduction to Probability and Data, Inferential Statistics, and Linear Regression and Modeling—auditing these courses will give you access to some excellent videos.
Become familiar with the Appstate RStudio server. You will use your Appstate user name and password to log in to the server. You must be registered in the class to access the server.
Test drive RStudio by following the directions from Jenny Bryan’s STAT 545 course. Additional material can be found in the detailed Bookdown document Happy Git and GitHub for the useR. Chapters 8, 10-13 of Happy Git and GitHub for the useR will be helpful if you need more directions for test driving RStudio. Note: Git, R, and the RStudio IDE have already been installed for you on the RStudio server.
Read the Git and GitHub chapter from Hadley Wickham’s book R Packages
Read chapters 1-3 in Passion Driven Statistics, (PDS), and watch all linked videos in PDS.
Read chapters 1-3 of Reproducible Research with R and RStudio.
Watch the following video:
You may want to install Git, R, RStudio, Zotero, and optionally LaTeX on your personal computer. If you do, you will want to follow Jenny Bryan’s excellent advice for installing R and RStudio and installing Git. Jenny’s advice is also in chapters 6 and 7 of Happy Git and GitHub for the useR. Note: Git, R, RStudio, and LaTeX are installed on the Appstate RStudio server.
Watch the following videos as appropriate:
Watch all videos for week one of Introduction to Probability and Data.
Read sections 1.1-1.5 of OpenIntro Statistics, 3rd Edition.
Read chapters 4 and 5 of PDS (Conducting a Literature Review and Writing About Empirical Research).
Using Zotero will be covered.
Clone the repository to your local machine using RStudio by following these instructions:
File > New Project > Version Control > Git
Enter the URL of your fork (https://github.com/YourUserName/STT5811ClassRepo.git) in the Repository URL: box. Enter a directory name (e.g., UserNameSTT5811) in the Project directory name: box. Choose where to store the project in the Create project as subdirectory of: box. Click Create Project. You should now have a local copy of the forked repository on your local machine. Congratulations!
Set the upstream remote in your fork to this repository with the command
git remote add upstream https://github.com/STAT-ATA-ASU/STT5811ClassRepo.git
Verify with
git remote -v
To obtain updates from the upstream repository type
git pull upstream master
If the upstream repository is using gh-pages, use gh-pages instead of master to obtain updates:
git pull upstream gh-pages
If there are conflicts, you will need to resolve them before proceeding.
Create a free account on DataCamp. Complete chapter 1 of Data Analysis and Statistical Inference
Fork the StatsWithRLabs repository to your private GitHub account.
Complete the intro_to_r_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., August 26, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Install the statsr package by running the following from the R command line:
devtools::install_github('alanarnholt/statsr')
Define associated variables as variables that show some relationship with one another. Further categorize this relationship as a positive or negative association when possible.
Define variables that are not associated as independent.
Test yourself: Give one example of each type of variable you have learned.
Identify the explanatory variable in a pair of variables as the variable suspected of affecting the other, however note that labeling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if there is an association identified between the two variables.
Classify a study as observational or experimental. Determine and explain whether the study’s results can be generalized to the population and whether the results suggest an association or causation between the quantities studied.
Identify confounding variables and sources of bias in a given study.
Distinguish among simple random, stratified, and cluster sampling. Recognize the benefits and drawbacks of choosing one sampling scheme over another.
Identify the four principles of experimental design, and recognize their purposes: control any possible confounders; randomize into treatment and control groups; replicate by using a sufficiently large sample or repeating the experiment; and block any variables that might influence the response.
Identify if single or double blinding has been used in a study.
Test yourself:
Watch all videos for week two of Introduction to Probability and Data.
Read sections 1.6-1.8 of OpenIntro Statistics, 3rd Edition.
Read chapters 6, 8, and 9 of PDS.
Complete chapter 2 of Data Analysis and Statistical Inference
Complete the intro_to_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., September 2, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Integrating Zotero with R Markdown will be discussed during the first part of the week. An example document will be stored in this folder.
You may find these videos helpful.
Use scatterplots to display the relationship between two numerical variables, making sure to note the direction (positive or negative), form (linear or non-linear), and strength of the relationship as well as any unusual observations.
When describing the distribution of a numerical variable, mention its shape, center, and spread, as well as any unusual observations.
Note that there are three commonly used measures of center and spread:
Identify the shape of a distribution as unimodal, bimodal, multimodal, or uniform; and if the shape is unimodal, further classify the distribution as symmetric, right skewed, or left skewed.
Use histograms and box plots to visualize the shape, center, and spread of numerical distributions. Use intensity maps to visualize the spatial distribution of the data.
Define a robust statistic (e.g. median, IQR) as a statistic that is not heavily affected by skewness or extreme outliers. Determine when robust statistics are more appropriate measures of center and spread compared to non-robust statistics.
Recognize when transformations (e.g. log) can make the distribution of data more symmetric and easier to model.
Test yourself:
Use frequency tables and bar plots to describe the distribution of one categorical variable.
Use contingency tables and segmented bar plots or mosaic plots to assess the relationship between two categorical variables.
Use side-by-side box plots for assessing the relationship between a numerical and a categorical variable.
Test yourself:
Watch all videos for week three of Introduction to Probability and Data.
Read sections 2.1-2.5 of OpenIntro Statistics, 3rd Edition.
Read chapter 7 of PDS.
Complete chapter 3 of Data Analysis and Statistical Inference
Complete the probability_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., September 9, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the probability of an outcome as the proportion of times the outcome would occur if we observed the random process that gives rise to it an infinite number of times.
Explain why the long-run relative frequency of repeated independent events approaches the true probability as the number of trials increases, i.e. why the law of large numbers holds.
Define disjoint (mutually exclusive) events as events that cannot both happen at the same time:
Draw Venn diagrams representing events and their probabilities.
Define a probability distribution as a list of the possible outcomes with corresponding probabilities that satisfies three rules: the outcomes listed must be disjoint; each probability must be between 0 and 1; and the probabilities must total 1.
If \(A\) and \(B\) are mutually exclusive, \(P(A \cup B) = P(A) + P(B)\), since for mutually exclusive events \(P(A \cap B) = 0\).
If \(A\) and \(B\) are dependent, \(P(A \cap B) = P(A|B) \times P(B)\).
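These rules can be checked empirically. The following sketch (not part of the course materials; the fair-die setup is hypothetical) verifies the addition rule for disjoint events by simulation:

```r
# Simulate rolls of a fair die and check the addition rule for
# the disjoint events "roll a 1" and "roll a 2".
set.seed(13)
rolls <- sample(1:6, size = 10000, replace = TRUE)
p1 <- mean(rolls == 1)             # estimate of P(roll a 1)
p2 <- mean(rolls == 2)             # estimate of P(roll a 2)
p1or2 <- mean(rolls %in% c(1, 2))  # estimate of P(1 or 2)
# For disjoint events the estimates agree: P(1 or 2) = P(1) + P(2)
all.equal(p1or2, p1 + p2)
```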
Test yourself:
What is the probability of getting a head on the 6th coin flip if in the first 5 flips the coin landed on a head each time?
True / False: Being right handed and having blue eyes are mutually exclusive events.
\(P(A)=0.5\), \(P(B)=0.6\), and there are no other possible outcomes in the sample space. What is \(P(A \cap B)\)?
Distinguish between marginal and conditional probabilities.
Construct tree diagrams to calculate conditional probabilities and probabilities of the intersection of non-independent events using Bayes’ theorem: \(P(A|B)=\frac{P(A \cap B)}{P(B)}.\)
Test yourself: 50% of students in a class are social science majors and the rest are not. 70% of the social science students, and 40% of the non-social science students are in a biology class. Create a contingency table and a tree diagram summarizing these probabilities. Calculate the percentage of students in this class who are in a biology class.
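As a worked check of the exercise above (using only the numbers stated in the problem; the variable names are ours), the total probability and Bayes' theorem calculations can be carried out directly in R:

```r
# Probabilities from the problem statement
p_ss      <- 0.50   # P(social science major)
p_bio_ss  <- 0.70   # P(biology class | social science)
p_bio_not <- 0.40   # P(biology class | not social science)

# Total probability: P(biology class)
p_bio <- p_ss * p_bio_ss + (1 - p_ss) * p_bio_not
# Bayes' theorem: P(social science | biology class)
p_ss_given_bio <- (p_ss * p_bio_ss) / p_bio
p_bio           # 0.55
p_ss_given_bio
```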
Watch all videos for week four of Introduction to Probability and Data.
Read sections 3.1, 3.2, and 3.4 of OpenIntro Statistics, 3rd Edition.
Complete chapter 4 of Data Analysis and Statistical Inference
Compile and read the exampleEDA.Rmd lab.
Read chapter 9 of PDS.
Define the standardized score (\(z\)-score) of a data point as the number of standard deviations it is away from the mean: \(z = \frac{x - \mu}{\sigma}.\)
Use the \(z\!~\text{-score}\) of an observation to determine its percentile when the distribution is nearly normal.
Depending on the shape of the distribution, determine whether the median would have a negative, positive, or zero \(z\)-score, keeping in mind that the mean always has a \(z\)-score of 0.
Assess whether or not a distribution is nearly normal using either the 68-95-99.7% rule or graphical methods such as a normal probability plot.
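A minimal R sketch of the \(z\)-score and percentile calculations above (the values of x, mu, and sigma are hypothetical):

```r
x     <- 70     # observed value (hypothetical)
mu    <- 63.7   # population mean (hypothetical)
sigma <- 2.7    # population standard deviation (hypothetical)
z   <- (x - mu) / sigma   # number of SDs above the mean
pct <- pnorm(z)           # percentile, assuming near-normality
# pnorm(x, mean = mu, sd = sigma) gives the same percentile directly
```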
Test yourself: True/False: In a right skewed distribution the \(z\)-score of the median is positive.
Determine if a random variable is binomial using the four conditions:
The trials are independent.
The number of trials, \(n\), is fixed.
Each trial outcome can be classified as a success or a failure.
The probability of a success, \(p\), is the same for each trial.
Calculate the number of possible scenarios for obtaining \(k\) successes in \(n\) trials using the binomial coefficient: \(\binom{n}{k}=\frac{n!}{k!(n-k)!}.\)
Calculate probability of a given number of successes in a given number of trials using the binomial distribution: \(P(X=k)=\binom{n}{k}p^k(1-p)^{(n-k)}.\)
Calculate the expected value \((\mu = np)\) and standard deviation \(\left(\sigma = \sqrt{np(1-p)}\right)\) of a binomial distribution.
When the number of trials is sufficiently large (\(np \ge 10\) and \(n(1-p) \ge 10\)), use the normal approximation to calculate binomial probabilities. Explain why this approach works.
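All of these binomial formulas can be computed directly in R; the following sketch uses hypothetical values of n, p, and k:

```r
n <- 100; p <- 0.3; k <- 35            # hypothetical values
scenarios <- choose(n, k)              # number of ways to get k successes
exact_k   <- dbinom(k, size = n, prob = p)   # P(X = k), exact
mu    <- n * p                         # expected value, np = 30
sigma <- sqrt(n * p * (1 - p))         # SD, sqrt(np(1 - p))
# Normal approximation to P(X <= k); reasonable here since
# np >= 10 and n(1-p) >= 10
approx <- pnorm(k + 0.5, mean = mu, sd = sigma)  # continuity correction
exact  <- pbinom(k, size = n, prob = p)          # exact, for comparison
```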
Test yourself:
Install the PDS package.
Complete the Project1Template.Rmd project, and submit your compiled *.html file no later than 5:00 p.m., September 23, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Watch all videos for week one of Inferential Statistics.
Read sections 4.1-4.2 of OpenIntro Statistics, 3rd Edition.
Complete chapter 4 of Data Analysis and Statistical Inference
Complete the sampling_distributions_ASU2.Rmd lab and submit your compiled *.html file no later than ~~5:00 p.m., September 30~~ 11:00 p.m., October 2, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define a statistic as a point estimate for a population parameter. For example, the sample mean is used to estimate the population mean. Note that “point estimate” and “statistic” are synonymous.
Recognize that point estimates (such as the sample mean) will vary from one sample to another. We define this variability as sampling variability (sometimes called sampling variation).
Calculate the sampling variability of the mean, the standard deviation of \(\bar{x}\), as \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\) where \(\sigma\) is the population standard deviation.
Distinguish between standard deviation (\(\sigma\) or \(s\)) and standard error (\(SE\)): standard deviation measures the variability in the data, while standard error measures the variability in point estimates from different samples of the same size and from the same population; it measures the sampling variability.
Recognize that when the sample size increases we would expect the sampling variability to decrease.
Define a confidence interval as the plausible range of values for a population parameter.
Define the confidence level as the expected percentage of random samples which yield confidence intervals that capture the true population parameter.
Recognize that the Central Limit Theorem (CLT) is about the distribution of point estimates. Under certain conditions, this distribution will be nearly normal.
In the case of the mean, the CLT tells us that if the sample size is sufficiently large and the observations in the sample are independent, then the distribution of the sample mean will be nearly normal. The distribution will be centered at the true population mean and have a standard deviation of \(\frac{\sigma}{\sqrt{n}}\). The distribution of \(\bar{x}\) is written with symbols as
\[\begin{equation} \bar{X} \dot{\sim} N\left(\mu_{\bar{x}} = \mu, \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\right) \label{clt} \end{equation}\]
In addition, the sample should not be too large compared to the population. More precisely, the sample should be smaller than 10% of the population. Samples that are too large will likely contain observations that are not independent.
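A short simulation sketch (hypothetical values of mu, sigma, and n) illustrates the CLT claim that the standard deviation of the sample means is approximately \(\sigma/\sqrt{n}\):

```r
set.seed(42)
mu <- 10; sigma <- 4; n <- 25            # hypothetical population and sample size
# Draw 10,000 samples of size n and record each sample mean
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
sd(xbars)          # empirical SD of the 10,000 sample means
sigma / sqrt(n)    # CLT value: 4 / 5 = 0.8
```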
Notice that the margin of error (the critical value times the standard error) corresponds to half the width of the confidence interval.
Test yourself:
For each of the following situations, state whether the variable is categorical or numerical and whether the parameter of interest is a mean or a proportion.
Suppose heights of all women in the US have a mean of 63.7 inches and a random sample of 100 women’s heights yields a sample mean of 65.2 inches. Which values are the population parameter and the point estimate, respectively? Which one is \(\mu\) and which one is \(\bar{x}\)?
Suppose heights of all women in the US have a standard deviation of 2.7 inches and a random sample of 100 women’s heights yields a standard deviation of 4 inches. Which value is the population parameter and which value is the point estimate? Which one is \(\sigma\) and which one is \(s\)?
Explain, in plain English, what you see in Figure 4.8 of the book (page 166).
List the conditions necessary for the CLT to hold.
Confirm that \(z_{1-\alpha/2}\) for a 98% confidence level is 2.33. (Include a sketch of the normal curve in your response.)
Calculate a 95% confidence interval for the average height of US women using a random sample of 100 women where the sample mean is 63 inches and the sample standard deviation is 3 inches. Interpret this interval in context of the data.
Explain, in plain English, the difference between the standard error and the margin of error.
A little more challenging: Suppose heights of all men in the US have a mean of 69.1 inches and a standard deviation of 2.9 inches. What is the probability that a random sample of 100 men will yield a sample average less than 70 inches?
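If you want to check a few of your answers to the items above in R (the numbers are taken directly from the questions), something like the following works:

```r
# z* for a 98% confidence interval
zstar <- qnorm(1 - 0.02/2)                           # ~2.33
# 95% CI for average height: xbar = 63, s = 3, n = 100
ci <- 63 + c(-1, 1) * qnorm(0.975) * 3 / sqrt(100)   # roughly (62.4, 63.6)
# P(sample mean of 100 men < 70) when mu = 69.1, sigma = 2.9
prob <- pnorm(70, mean = 69.1, sd = 2.9 / sqrt(100))
zstar; ci; prob
```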
Watch all videos for week two of Inferential Statistics.
Read sections 4.3-4.6 of OpenIntro Statistics, 3rd Edition.
Read chapter 10 of PDS.
Complete chapter 5 of Data Analysis and Statistical Inference
Complete the confidence_intervals_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., October 7, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Explain how the hypothesis testing framework resembles a court trial.
Recognize that in hypothesis testing we evaluate two competing claims: the null hypothesis (\(H_0\)), often a skeptical claim of no difference, and the alternative hypothesis (\(H_A\)), the claim being investigated.
\(p\!~\text{-value} = P(\text{observed or more extreme statistic} \mid H_0~\text{true})\)
Always sketch the normal curve when calculating the \(p\!~\text{-value}\), and shade the appropriate area(s) depending on whether the alternative hypothesis is one- or two-sided.
Infer that if a confidence interval does not contain the null value, the null hypothesis should be rejected in favor of the alternative.
Compare the \(p\!~\text{-value}\) to the significance level to make a decision between the hypotheses: reject \(H_0\) if the \(p\!~\text{-value}\) is less than \(\alpha\); otherwise, fail to reject \(H_0\).
Use a smaller \(\alpha\) if a Type I error is more critical than a Type II error. Use a larger \(\alpha\) if a Type II error is more critical than a Type I error.
Recognize that sampling distributions of point estimates coming from samples that don’t meet the required conditions for the CLT (about sample size, skew, and independence) will not be normal.
Formulate the framework for statistical inference using hypothesis testing and nearly normal point estimates:
Using the sketch and the normal model, calculate the \(p\!~\text{-value}\); and determine if the null hypothesis should be rejected or not. State your conclusion in context of the data and the research question.
If the conditions necessary for the CLT to hold are not met, note this and do not go forward with the analysis. (We will learn later about methods to use in these situations.)
Calculate the required sample size to obtain a given margin of error at a given confidence level by working backwards from the given margin of error.
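The backwards calculation from a target margin of error can be sketched in R (all numbers here are hypothetical):

```r
sigma <- 3             # assumed population SD (hypothetical)
ME    <- 0.5           # desired margin of error (hypothetical)
zstar <- qnorm(0.975)  # critical value for 95% confidence
# Solve ME = zstar * sigma / sqrt(n) for n
n_raw <- (zstar * sigma / ME)^2
ceiling(n_raw)         # always round up to the next whole subject
```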
Distinguish between statistical significance vs. practical significance.
Define power as the probability of correctly rejecting the null hypothesis (the complement of a Type II error).
Test yourself:
List the errors in the following hypotheses: \(H_0:\bar{x} \gt 20\) versus \(H_A:\bar{x} \ge 25\)
What is wrong with the following statement: “If the \(p\!~\text{-value}\) is large, we accept the null hypothesis since a large \(p\!~\text{-value}\) implies that the observed difference between the null value and the sample statistic is quite likely to happen just by chance”?
Suppose a researcher is interested in evaluating the claim “the average height of adult males in the US is 69.1 inches,” and she believes this is an underestimate.
How should she set up her hypotheses?
Explain to her, in plain language, how she should collect data and carry out a hypothesis test.
Suppose she collects a random sample of 40 adult males where the average is 70.2 inches. A test returns a \(p\!~\text{-value}\) of 0.0082. What should she conclude?
Interpret this \(p\!~\text{-value}\) (as a conditional probability) in context of the question.
Suppose that the true average is in fact 69.1 inches. If the researcher rejects the null hypothesis, what type of an error is the researcher making? In order to avoid making such an error, should she have used a smaller or a larger significance level?
Describe the differences and similarities between testing a hypothesis using simulation and testing a hypothesis using theory. Discuss how the calculation of the \(p\!~\text{-value}\) changes while the definition of the \(p\!~\text{-value}\) stays the same.
In a random sample of 1,017 Americans, 60% said they do not trust the mass media when it comes to reporting the news fully, accurately, and fairly. The standard error associated with this estimate is 0.015 (1.5%). What is the margin of error of a 95% confidence level? Calculate a 95% confidence interval and interpret it in context.
If we want to decrease the margin of error, and hence have a more precise confidence interval, should we increase or decrease the sample size?
Watch all videos for week three of Inferential Statistics.
Read sections 5.1-5.5 of OpenIntro Statistics, 3rd Edition.
Read Can A New Drug Reduce the Spread of Schistosomiasis?—Use Ctrl+F on the web page to find the appropriate text inside the notes.
Read chapter 11 of PDS.
Complete chapter 6 of Data Analysis and Statistical Inference
Complete the inf_for_numerical_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., October 21, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Use the \(t\)-distribution for inference on a single mean, mean of differences (paired groups), and difference of independent means.
Explain how the shape of the \(t\)-distribution accounts for the additional variability introduced by using \(s\) (sample standard deviation) in place of \(\sigma\) (population standard deviation).
Describe how the \(t\)-distribution is different from the normal distribution and what “heavy tail” means in this context.
Note that the \(t\)-distribution has a single parameter, degrees of freedom. As the number of degrees of freedom increases, this distribution approaches the normal distribution.
Use a \(t\)-statistic with degrees of freedom \(df=n-1\) for inference for a population mean:
CI: \(\bar{x} \pm t_{1-\alpha/2, n-1}\cdot SE_{\bar{x}}\)
HT: \(t_{n-1}=\frac{\bar{x}-\mu_0}{SE_{\bar{x}}}\) where \(SE_{\bar{x}} = \frac{s}{\sqrt{n}}.\)
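The interval and test statistic above can be computed from summary statistics in R; the data summary in this sketch is hypothetical:

```r
xbar <- 63; s <- 3; n <- 13; mu0 <- 61   # hypothetical summary statistics
SE <- s / sqrt(n)
tstar <- qt(1 - 0.05/2, df = n - 1)      # critical value for a 95% CI
ci <- xbar + c(-1, 1) * tstar * SE       # confidence interval
tstat <- (xbar - mu0) / SE               # test statistic for H0: mu = mu0
pval  <- 2 * pt(-abs(tstat), df = n - 1) # two-sided p-value
ci; tstat; pval
```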
Describe how to obtain a \(p\!~\text{-value}\) for a \(t\!~\text{-test}\) and a critical \(t_{1 - \alpha/2, n-1}\) value for a confidence interval.
Define observations as paired if each observation in one dataset has a special correspondence or connection with exactly one observation in the other data set.
Carry out inference for paired data by first subtracting the paired observations from each other. Then, treat the set of differences as a new numerical variable on which to do inference (such as a confidence interval or hypothesis test for the average difference).
Calculate the standard error of the mean difference between two paired (dependent) samples as \(SE = \frac{s_{diff}}{\sqrt{n_{diff}}}.\) Use this standard error in hypothesis testing and confidence intervals comparing means of paired (dependent) groups.
Use a \(t\)-statistic, with degrees of freedom \(df=n_{diff}-1\) for inference with the mean difference in two paired (dependent) samples:
CI: \(\bar{x}_{diff} \pm t_{1-\alpha/2, n_{diff}-1}\cdot SE\)
HT: \(t_{n_{diff}-1}=\frac{\bar{x}_{diff}-\mu_0}{SE}\) where \(SE = \frac{s_{diff}}{\sqrt{n_{diff}}}.\)
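A sketch of paired inference with hypothetical before/after blood-pressure measurements; the hand calculation and t.test() agree:

```r
before <- c(120, 135, 128, 142, 131, 125)   # hypothetical data
after  <- c(115, 130, 130, 135, 129, 120)
d  <- before - after                 # work with the differences
n  <- length(d)
SE <- sd(d) / sqrt(n)
tstat <- mean(d) / SE                # test statistic for H0: mu_diff = 0
pval  <- 2 * pt(-abs(tstat), df = n - 1)
tt <- t.test(before, after, paired = TRUE)  # same result via t.test()
```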
Recognize that a good interpretation of a confidence interval for the difference between two parameters includes a comparative statement (mentioning which group has the larger parameter).
Recognize that a confidence interval for the difference between two parameters that doesn’t include 0 is in agreement with a hypothesis test where the null hypothesis that sets the two parameters equal to each other is rejected.
Calculate the standard error for the difference between means of two independent samples as \(SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2}}.\) Use this standard error in hypothesis testing and confidence intervals comparing means of independent groups.
Use a \(t\)-statistic with \(\nu\) degrees of freedom for conducting inference with the difference in two independent means:
CI: \(\bar{x}_1 - \bar{x}_2 \pm t_{1-\alpha/2, \nu}\cdot SE\)
HT: \(t_{\nu}=\frac{(\bar{x}_1 - \bar{x}_2) -(\mu_1 -\mu_2)}{SE}\) where \(SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2}}.\)
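The degrees of freedom \(\nu\) that software reports come from the Welch-Satterthwaite formula; a conservative hand alternative is \(\min(n_1, n_2) - 1\). A sketch with hypothetical summary statistics:

```r
xbar1 <- 70.2; s1 <- 2.8; n1 <- 40   # hypothetical group 1 summary
xbar2 <- 69.1; s2 <- 3.1; n2 <- 35   # hypothetical group 2 summary
SE <- sqrt(s1^2/n1 + s2^2/n2)
# Welch-Satterthwaite degrees of freedom (what t.test() uses)
nu <- (s1^2/n1 + s2^2/n2)^2 /
      ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
tstat <- (xbar1 - xbar2) / SE        # H0: mu1 - mu2 = 0
pval  <- 2 * pt(-abs(tstat), df = nu)
ci <- (xbar1 - xbar2) + c(-1, 1) * qt(0.975, df = nu) * SE
```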
Calculate the power of a test for a given effect size and significance level in two steps:
First, calculate the critical value of the point estimate that would lead to rejecting the null hypothesis at the given significance level.
Then, calculate the probability of obtaining a point estimate beyond that critical value given the effect size (i.e., under the alternative hypothesis).
Explain how power changes for changes in effect size, sample size, significance level, and standard error.
Define analysis of variance (ANOVA) as a statistical inference method that is used to determine, by simultaneously considering many groups at once, if the variability between the sample means is so large that it seems unlikely to be from chance alone.
Recognize that the null hypothesis in ANOVA sets all means equal to each other and that the alternative hypothesis suggests that at least one mean is different.
List the conditions necessary for performing ANOVA:
The observations should be independent within and across groups.
The data within each group should be nearly normal.
The variability across the groups should be roughly equal.
Use graphical diagnostics to check if these conditions are met.
Describe why calculation of the \(p\!~\text{-value}\) for ANOVA is always “one sided.”
Describe why conducting many \(t\!~\text{-test}\)s for differences between each pair of means leads to an increased Type I Error rate and why we use a corrected significance level (Bonferroni correction, \(\alpha^*=\alpha/K\), where \(K\) is the number of comparisons being considered) to combat inflating this error rate.
Describe why it is possible to reject the null hypothesis in ANOVA but not find significant differences between groups when doing pairwise comparisons.
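For example, comparing four groups (such as the four US regions in the self-test questions) involves \(\binom{4}{2}\) pairwise comparisons; the Bonferroni correction is a one-line calculation:

```r
groups <- 4                # e.g., Northeast, Midwest, South, West
K <- choose(groups, 2)     # number of pairwise comparisons: 6
alpha_star <- 0.05 / K     # Bonferroni-corrected significance level
K; alpha_star
```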
Describe how bootstrap distributions are constructed. Recognize how bootstrap distributions are different from sampling distributions.
Construct bootstrap confidence intervals using one of the following methods: the percentile method (take the middle 95% of the bootstrap distribution for a 95% interval) or the standard error method (point estimate \(\pm\) critical value \(\times\) bootstrap standard error).
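A sketch of a bootstrap confidence interval for a mean, using a hypothetical sample; both interval methods are shown:

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)    # hypothetical observed sample
# Resample x with replacement 10,000 times, recording each mean
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
# Percentile method: middle 95% of the bootstrap distribution
ci_pct <- quantile(boot_means, probs = c(0.025, 0.975))
# Standard error method: estimate +/- critical value * bootstrap SE
ci_se <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(boot_means)
ci_pct; ci_se
```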
Test yourself:
What is the \(t\)-value for a 95% confidence interval for a mean where the sample size is 13?
What is the \(p\!~\text{-value}\) for a hypothesis test where the alternative hypothesis is two-sided, the sample size is 20, and the test statistic, \(t\), is calculated to be 1.75?
20 cardiac patients’ blood pressures are measured before and after taking a medication. For a given patient, are the before and after blood pressure measurements dependent (paired) or independent?
A random sample of 100 students was obtained and then randomly assigned into two equal-sized groups. One group rode on a roller coaster while the other rode a simulator at an amusement park. Afterwards, their blood pressure measurements were taken. Are the measurements dependent (paired) or independent?
Describe how the two sample means test is different from the paired means test.
A 95% confidence interval for the difference between the number of calories consumed by mature and juvenile cats (\(\mu_{mat} - \mu_{juv}\)) is (80 calories, 100 calories). Interpret this interval, and determine if it suggests a significant difference between the two means.
We would like to compare the average incomes of Americans who live in the Northeast, Midwest, South, and West. What are the appropriate hypotheses?
Suppose the sample in question 7 has 1000 observations, what are the degrees of freedom associated with the F-statistic?
Suppose the appropriate null hypothesis from question 7 is rejected. Describe how we would discover which regions’ averages are different from each other. Make sure to discuss how many pairwise comparisons we would need to make and what the corrected significance level would be.
What visualizations are useful for checking each of the conditions required for performing ANOVA?
How is a bootstrap distribution different from a sampling distribution?
If a bootstrap distribution is constructed using 200 simulations, how would we find the 95% bootstrap confidence interval?
Watch all videos for week four of Inferential Statistics.
Read sections 6.1-6.6 of OpenIntro Statistics, 3rd Edition.
Read chapter 12 of PDS.
Complete chapter 7 of Data Analysis and Statistical Inference
Complete the inf_for_categorical_data_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., October 28, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the population proportion as \(p\) (parameter) and sample proportion as \(\hat{p}\).
Calculate the sampling variability of the proportion, the standard deviation, as \(\sigma_{\hat{p}}=SD_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion.
Note that if the CLT doesn’t apply and the sample proportion is low (close to 0), the sampling distribution will likely be right skewed. If the sample proportion is high (close to 1), the sampling distribution will likely be left skewed.
Remember that confidence intervals are calculated as
\(\text{statistic} \pm \text{critical value} \times SE_{\text{statistic}}\)
and standardized test statistics are calculated as
\(Z = \frac{\text{statistic} - \mu_{\text{statistic}}}{\sigma_{\text{statistic}}}\)
\(T = \frac{\text{statistic} - \mu_{\text{statistic}}}{\hat{\sigma}_{\text{statistic}}}\)
For confidence intervals use \(\hat{p}\) (observed sample proportion) when calculating the standard error and checking the success/failure condition:
\(SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\)
For hypothesis tests use \(p_0\) (null value) when calculating the standard error and checking the success/failure condition:
\(SE_{\hat{p}} = \sqrt{\frac{p_0(1 - p_0)}{n}}\)
Such a discrepancy does not exist when conducting inference for means since the mean does not factor into the calculation of the standard error.
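A sketch of the two standard errors side by side; the values of \(\hat{p}\), \(p_0\), and \(n\) below are hypothetical:

```r
# SE of p-hat: use the observed p-hat for a confidence interval,
# but the null value p0 for a hypothesis test.
n <- 200; p_hat <- 0.16; p0 <- 0.20   # hypothetical values

se_ci   <- sqrt(p_hat * (1 - p_hat) / n)  # used for the CI
se_test <- sqrt(p0 * (1 - p0) / n)        # used for the test

ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci  # 95% confidence interval
z  <- (p_hat - p0) / se_test                   # standardized test statistic
```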
Conceptually: When there is no additional information, 50% chance of success is a good guess for events with only two outcomes (success or failure).
Mathematically: Using \(\hat{p} = 0.5\) yields the most conservative (highest) estimate for the required sample size.
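The sample-size calculation can be sketched as follows; the 4% margin of error is a hypothetical target, and the helper `n_for` is not from any course material:

```r
# Required n for a margin of error ME at 95% confidence:
# solve z* sqrt(p (1 - p) / n) <= ME for n.
me <- 0.04                 # hypothetical target margin of error
z_star <- qnorm(0.975)     # critical value for 95% confidence

n_for <- function(p) ceiling(p * (1 - p) * (z_star / me)^2)

n_for(0.5)   # no prior information: most conservative
n_for(0.1)   # with a reliable prior estimate of 10%
```

Because \(p(1-p)\) is maximized at \(p = 0.5\), the first call always returns the larger required sample size.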
Calculate the standard error of the difference between two sample proportions as follows. For a confidence interval, or a hypothesis test when \(H_0: p_1 - p_2 = \text{some value other than 0}\): \(SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\)
For a hypothesis test when \(H_0:p_1 - p_2 = 0:\) \(SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_2}}\) where \(\hat{p}_{pool}\) is the overall rate of success: \(\hat{p}_{pool} = \frac{\text{number of successes in group 1 + number of successes in group 2}}{n_1 + n_2}\)
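A sketch of the pooled calculation with hypothetical counts (the group sizes and success counts below are made up for illustration):

```r
# Pooled proportion for testing H0: p1 - p2 = 0 (hypothetical counts)
x1 <- 45; n1 <- 120   # successes / sample size, group 1
x2 <- 30; n2 <- 100   # successes / sample size, group 2

p_pool <- (x1 + x2) / (n1 + n2)   # overall rate of success

se_pooled <- sqrt(p_pool * (1 - p_pool) / n1 +
                  p_pool * (1 - p_pool) / n2)
z <- (x1 / n1 - x2 / n2) / se_pooled
```

The same test (with a continuity correction applied by default) is available as `prop.test(c(x1, x2), c(n1, n2))`.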
\(H_0:\) The distribution of observed counts follows the hypothesized distribution and any observed differences are due to chance.
\(H_A:\) The distribution of observed counts does not follow the hypothesized distribution.
Calculate the expected counts for a given level (cell) in a one-way table as the sample size times the hypothesized proportion for that level.
Calculate the chi-squared test statistic as
\[\chi^2 = \sum_{i=1}^k\frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}, \text{where } k \text{ is the number of cells.}\]
Note that the chi-squared statistic is always positive and follows a right skewed distribution with one parameter, which is the degrees of freedom.
Note that the degrees of freedom for the chi-squared statistic for the goodness of fit test is \(df=k-1\).
List the conditions necessary for performing a chi-squared test (goodness of fit or independence):
Independence: the sampled observations must be independent of each other.
Sample size: each expected cell count must be at least 5.
The degrees of freedom should be at least two (if not, use methods for evaluating proportions).
Describe how to use the chi-squared table to obtain a \(p\!~\text{-value}\).
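A goodness-of-fit sketch with hypothetical counts and a hypothesized uniform distribution, computed by hand and checked against `chisq.test()`:

```r
# Chi-squared goodness of fit: observed counts vs. a hypothesized
# uniform distribution over k = 4 cells (hypothetical data).
observed <- c(28, 22, 30, 20)
p_hyp <- rep(0.25, 4)

expected <- sum(observed) * p_hyp                  # n * hypothesized proportion
chi_sq <- sum((observed - expected)^2 / expected)  # test statistic
df <- length(observed) - 1                         # k - 1 = 3

fit <- chisq.test(observed, p = p_hyp)  # same statistic, df, and p-value
```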
When evaluating the independence of two categorical variables, where at least one has more than two levels, use a chi-squared test of independence.
\(H_0:\) The two variables are independent.
\(H_A:\) The two variables are dependent.
Calculate the expected counts for a given cell in a two-way table as
\[E =\frac{\text{row total}\times\text{column total}}{\text{grand total}}\]
Calculate the degrees of freedom for chi-squared test of independence as \(df=(R-1)\times(C-1)\), where \(R\) is the number of rows in a two-way table, and \(C\) is the number of columns.
Note that there is no such thing as a chi-squared confidence interval for proportions.
Use simulation methods when sample size conditions are not met for inference for categorical variables.
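A sketch on a hypothetical 2×3 two-way table; setting `simulate.p.value = TRUE` swaps the chi-squared approximation for a simulated p-value when the expected-count condition fails:

```r
# Chi-squared test of independence on a hypothetical 2x3 table
tbl <- matrix(c(20, 15, 25,
                30, 20, 10), nrow = 2, byrow = TRUE)

fit <- chisq.test(tbl)
fit$expected    # row total * column total / grand total, per cell
fit$parameter   # df = (R - 1)(C - 1) = 2

# Simulation-based version for small expected counts
set.seed(1)     # hypothetical seed
fit_sim <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
```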
Test yourself:
Suppose 10% of ASU students smoke. You collect many random samples of 100 ASU students at a time and calculate a sample proportion of students who smoke \((\hat{p})\) for each sample. What would you expect the distribution of \(\hat{p}\) to be? Describe its shape, center, and spread.
Suppose you want to construct a confidence interval with a margin of error no more than 4% for the proportion of ASU students who smoke. How would your calculation of the required sample size change if you do not know anything about the smoking habits of ASU students versus if you have a reliable previous study estimating that about 10% of ASU students smoke?
Suppose a 95% confidence interval for the difference between the proportion of male and the proportion of female ASU students who smoke is (-0.08, 0.11). Interpret this interval, making sure to incorporate a comparative statement about the two sexes of ASU students.
Does the above interval suggest a significant difference between the true proportions of smokers in the two groups?
Suppose you had a sample of 100 male ASU students where 11 of them smoke and a sample of 80 female ASU students where 10 of them smoke. Calculate \(\hat{p}_{pool}\).
When and why do we use \(\hat{p}_{pool}\) in the calculation of the standard error for the difference between two sample proportions?
Explain the different hypothesis tests one could use when assessing the distribution of a categorical variable (e.g. smoking status) with only two levels (e.g. levels: smoker and non-smoker) vs. more than two levels (e.g. levels: heavy smoker, moderate smoker, occasional smoker, non-smoker).
Why is the \(p\)-value for chi-squared tests always “one sided”?
What are the null and alternative hypotheses in the chi-squared test of independence?
Suppose a chi-squared test of independence between two categorical variables (one with 5, the other with 3 levels) yields a test statistic of \(\chi^2 = 14\). What is the conclusion of the hypothesis test at a 5% significance level?
Suppose you want to estimate the proportion of ASU students who smoke. You collect a random sample of 100 students, where only 8 of them smoke. Should you use the theoretical methods (Z) to construct a confidence interval based on these data? If not, describe how you could calculate a 95% bootstrap confidence interval.
Complete the Project2Template.Rmd project found in your StatsWithRProjects forked repository. Submit your compiled *.html file no later than 5:00 p.m., November 4, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Watch all videos for week one of Linear Regression and Modeling.
Read sections 7.1-7.2 of OpenIntro Statistics, 3rd Edition.
Read chapter 13 of PDS.
Complete chapter 8 of Data Analysis and Statistical Inference.
No Lab this week!
Define the explanatory variable as the independent variable (predictor) and the response variable as the dependent variable (predicted).
Plot the explanatory variable (\(x\)) on the \(x\)-axis and the response variable (\(y\)) on the \(y\)-axis, and fit a linear regression model
\[y = \beta_0 + \beta_1 x + \varepsilon,\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is random error.
The correlation coefficient is always between -1 and 1.
\(r=0\) indicates no linear relationship.
The correlation coefficient is sensitive to outliers.
Recall that correlation does not imply causation.
Define residual (\(e\)) as the difference between the observed (\(y\)) and predicted (\(\hat{y}\)) values of the response variable.
\[e_i = y_i - \hat{y}_i\]
Define the least squares line as the line that minimizes the sum of the squared residuals. Recognize whether the conditions for using least squares:
linearity
nearly normal residuals
constant variability
have been satisfied.
Define an indicator variable as a binary explanatory variable (with two levels).
Calculate the estimate for the slope (\(b_1\)) as
\[b_1 = r \frac{s_y}{s_x},\]
where \(r\) is the correlation coefficient, \(s_y\) is the standard deviation of the response variable, and \(s_x\) is the standard deviation of the explanatory variable.
Note that the least squares line always passes through the average of the response and explanatory variables \((\bar{x}, \bar{y}).\)
Use the above property to calculate the estimate for the intercept \((b_0)\) as
\[b_0 = \bar{y} - b_1\bar{x},\]
where \(b_1\) is the slope, \(\bar{y}\) is the average of the response variable, and \(\bar{x}\) is the average of the explanatory variable.
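The two estimates can be computed by hand and checked against `lm()`; the data below are simulated with a hypothetical seed, not drawn from any course dataset:

```r
# Least squares by hand: b1 = r * s_y / s_x, b0 = ybar - b1 * xbar
set.seed(7)                        # hypothetical seed
x <- runif(30, 0, 10)
y <- 3 + 2 * x + rnorm(30)         # simulated linear relationship

b1 <- cor(x, y) * sd(y) / sd(x)    # slope
b0 <- mean(y) - b1 * mean(x)       # intercept: line passes through (xbar, ybar)

fit <- lm(y ~ x)
c(b0, b1)      # matches coef(fit)
```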
Predict the value of the response variable for a given value of the explanatory variable \((x^*)\) as
\[\hat{y} = b_0 + b_1 x^*\]
Test yourself:
A teaching assistant gives a quiz. There are 10 questions on the quiz and no partial credit is given. After grading the papers, the TA writes down the number of questions each student answered correctly. What is the correlation between the number of questions answered correctly and incorrectly? Hint: Make up some data for number of questions right, calculate number of questions wrong, and plot them against each other.
Suppose you fit a linear regression model predicting the score on an exam from the number of hours studied. Say you have studied for 4 hours. Would you prefer to be on the line, below the line, or above the line? What would the residual for your score be (0, negative, or positive)?
Someone hands you the scatter diagram shown below, but has forgotten to label the axes. Can you calculate the correlation coefficient?
Derive the formula for \(b_0\) as a function of \(b_1\) given the fact that the linear model is \(\hat{y}=b_0 + b_1x\) and that the least squares line goes through \((\bar{x}, \bar{y})\).
One study on male college students found their average height to be 70 inches with a standard deviation of 2 inches. Their average weight was 140 pounds with a standard deviation of 25 pounds. The correlation between their height and weight was 0.60. Assuming that the two variables are linearly associated, write the linear model for predicting weight from height.
Is a male who is 72 inches tall and who weighs 115 pounds on the line, below the line, or above the line calculated in question 6?
What is an indicator variable, and what do levels 0 and 1 mean for such variables?
If the correlation between two variables y and x is 0.6, what percent of the variability in y do changes in x explain?
The model below predicts GPA based on an indicator variable (0: not premed, 1: premed). Interpret the intercept and slope estimates in context of the data.
\[\widehat{\text{gpa}}= 3.57 − 0.01 \times \text{premed}\]
Watch all videos for week two of Linear Regression and Modeling.
Read sections 7.3-7.4 of OpenIntro Statistics, 3rd Edition.
Read chapter 15 of PDS.
Review chapter 8 of Data Analysis and Statistical Inference.
Complete the simple_regression_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., November 18, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
Define an influential point as a point that influences (changes) the slope of the regression line.
Do not remove outliers from an analysis without good reason.
Be cautious about using a categorical explanatory variable when one of the levels has very few observations as these may act as influential points.
Determine whether an explanatory variable is a significant predictor of the response variable using the \(t\!~\text{-test}\) and the associated \(p\!~\text{-value}\) in the regression output.
When testing for the significance of the predictor, the null hypothesis is \(H_0:\beta_1=0\). Recognize that standard software output returns a \(p\!~\text{-value}\) for a two-sided alternative hypothesis. Calculate the test statistic as
\[T_{df} = \frac{b_1 - \text{null value}}{SE_{b_1}} \text{ with } df = n - 2.\]
Note that a hypothesis test for the intercept is often irrelevant since it is usually out of the range of the data.
Calculate a confidence interval for the slope as
\[b_1 \pm t_{1- \alpha/2, df}\cdot SE_{b_1}\]
where \(df=n-2\) and \(t_{1- \alpha/2, df}\) is the critical score associated with the given confidence level at the desired degrees of freedom.
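Both the t statistic and the confidence interval can be reproduced from `summary()` output and checked against `confint()`; the data below are simulated with a hypothetical seed:

```r
# Inference for the slope: T = (b1 - 0) / SE_b1 with df = n - 2,
# and CI = b1 +/- t* SE_b1
set.seed(3)   # hypothetical seed
x <- runif(40, 0, 10)
y <- 1 + 0.5 * x + rnorm(40)   # simulated data, n = 40

fit <- lm(y ~ x)
est <- summary(fit)$coefficients["x", ]

t_stat <- est["Estimate"] / est["Std. Error"]   # null value is 0
ci <- est["Estimate"] +
  c(-1, 1) * qt(0.975, df = 40 - 2) * est["Std. Error"]
```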
Test yourself:
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 16.0839 | 3.0866 | 5.2109 | 0 |
x | 0.7339 | 0.0473 | 15.5241 | 0 |
Watch all videos for week three of Linear Regression and Modeling.
Read sections 8.1-8.3 of OpenIntro Statistics, 3rd Edition.
Complete chapter 9 of Data Analysis and Statistical Inference.
Complete the multiple_regression_ASU2.Rmd lab, and submit your compiled *.html file no later than 5:00 p.m., November 30, to CrowdGrader. Make sure you commit and push your final product to your private repository.
Define the multiple linear regression model as
\[\hat{y} = b_0 +b_1x_1+b_2x_2+\cdots+b_kx_k\]
where there are \(k\) predictors (explanatory variables).
Interpret the estimate for the intercept \((b_0)\) as the average value of \(y\) when all predictors are equal to 0.
Interpret the estimate for a slope (say \(b_1\)) as “All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(b_1\).”
Define collinearity as a high correlation between two independent variables such that the two variables contribute redundant information to the model, which is something we want to avoid in multiple linear regression.
Note that \(R^2\) will increase with each explanatory variable added to the model, regardless of whether or not the added variable is a meaningful predictor of the response variable. We use adjusted \(R^2\), which applies a penalty for the number of predictors included in the model, to assess the strength of a multiple linear regression model.
\[R^2_{adj}= 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}\]
where \(n\) is the number of cases and \(k\) is the number of predictors. Note that \(R^2_{adj}\) will only increase if the added variable has a meaningful contribution to the amount of explained variability in \(y\), i.e. if the gains from adding the variable exceed the penalty.
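Adjusted \(R^2\) can be computed directly from SSE and SST and checked against `summary(lm)`; the data below are simulated with a hypothetical seed:

```r
# Adjusted R^2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))
set.seed(5)   # hypothetical seed
n <- 50; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 0.5 * x2 + rnorm(n)   # simulated response

fit <- lm(y ~ x1 + x2)
sse <- sum(resid(fit)^2)       # unexplained variability
sst <- sum((y - mean(y))^2)    # total variability

r2_adj <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))
r2_adj   # matches summary(fit)$adj.r.squared
```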
Define model selection as identifying the best model for predicting a given response variable.
Note that we usually prefer simpler (parsimonious) models over more complicated ones.
Define the full model as the model with all explanatory variables included as predictors.
Calculate a confidence interval for a slope \((b_i)\) as
\[b_i \pm t_{1-\alpha/2, n-k-1}\cdot SE_{b_i}\]
Stepwise model selection (backward or forward) can be done based on \(p\!~\text{-values}\) (drop variables that are not significant) or based on adjusted \(R^2\) (choose the model with highest adjusted \(R^2\)).
The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached.
With the \(p\!~\text{-value}\) criterion, drop the variable with the highest \(p\!~\text{-value}\), refit, and repeat until all remaining variables are significant.
With the adjusted \(R^2\) criterion, drop one variable at a time, keep the candidate model with the highest adjusted \(R^2\), and repeat until the maximum possible adjusted \(R^2\) is reached.
The general idea behind forward-selection is to start with only one variable and add one variable at a time until the ideal model is obtained.
Try all possible simple linear regression models predicting \(y\) using one explanatory variable at a time. Choose the model where the explanatory variable of choice has the lowest \(p\!~\text{-value}\).
Try all possible models adding one more explanatory variable at a time. Choose the model where the added explanatory variable has the lowest \(p\!~\text{-value}\).
Repeat until all added variables are significant (\(p\!~\text{-value}\) criterion) or until the maximum possible adjusted \(R^2\) is reached (adjusted \(R^2\) criterion).
The adjusted \(R^2\) method is more computationally intensive, but it is more reliable, since it does not depend on an arbitrary significance level.
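One backward-selection step with the adjusted \(R^2\) criterion can be sketched on simulated data where `x3` is pure noise (all variable names and the seed are hypothetical):

```r
# Backward step: drop each predictor in turn, refit, and keep the
# candidate model with the highest adjusted R^2.
set.seed(9)   # hypothetical seed
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + d$x1 - d$x2 + rnorm(n)   # x3 has no real effect

full <- lm(y ~ x1 + x2 + x3, data = d)

adj_r2 <- sapply(c("x1", "x2", "x3"), function(v) {
  summary(update(full, as.formula(paste(". ~ . -", v))))$adj.r.squared
})
adj_r2                    # dropping the noise variable hurts least
names(which.max(adj_r2))  # the variable to eliminate this step
```

Repeating this step until no drop improves adjusted \(R^2\) gives the full backward-selection algorithm.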
List the conditions for multiple linear regression as
linear relationships between each (numerical) explanatory variable and the response—checked using plots of residuals vs. each explanatory variable
nearly normal residuals centered at 0—checked using a histogram or normal probability plot of the residuals
constant variability of residuals—checked using a plot of residuals vs. predicted values \((\hat{y})\)
independence of residuals (and hence observations)—checked using a scatterplot of residuals vs. order of data collection (will reveal non-independence if data have time series structure)
Note that no model is perfect, but even imperfect models can be useful.
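The conditions above map onto a standard set of diagnostic plots; a sketch with simulated data (hypothetical seed):

```r
# Diagnostic plots for a fitted linear model
set.seed(11)   # hypothetical seed
x <- runif(60, 0, 10)
y <- 2 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)

par(mfrow = c(2, 2))
plot(fitted(fit), resid(fit),
     main = "Residuals vs fitted")        # linearity, constant variability
abline(h = 0, lty = 2)
hist(resid(fit), main = "Residuals")      # nearly normal residuals
qqnorm(resid(fit)); qqline(resid(fit))    # nearly normal residuals
plot(resid(fit), type = "b",
     main = "Residuals vs order")         # independence
```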
Test yourself:
How are multiple linear regression and simple linear regression different?
What does “all else held constant” mean in the interpretation of a slope coefficient in multiple linear regression?
What is collinearity? Why do we want to avoid collinearity in multiple regression models?
Explain the difference between \(R^2\) and adjusted \(R^2\). Which one will be higher? Which one tells us the variability in \(y\) explained by the model? Which one is a better measure of the strength of a linear regression model?
Define the term “parsimonious model.”
Describe the backward-selection algorithm using adjusted \(R^2\) as the criterion for model selection.
If a residuals plot (residuals vs. x or residuals vs. \(\hat{y}\)) shows a fan shape, we worry about non-constant variability of residuals. What would the shape of these residuals be if the absolute values of the residuals are plotted against a predictor or \(\hat{y}\)?
Complete the Project3Template.Rmd project found in your StatsWithRProjects forked repository. Submit your compiled *.html file no later than 5:00 p.m., December 7, to CrowdGrader. Make sure you commit and push your final product to your private repository.