Statistical Reasoning with R Exam 2 Notes for Printing
Used to be printed on paper
Exam 2 notes
Name: Yi Yang
AndrewID: yiyang5
Example
To study changing one’s fate and causality, an observational study is conducted on people who have read Berserk.
The predictors are: (num.p) number of pages read in Berserk (continuous), unit is "pages"; (s8) sleeps more than 8 hours per day (dichotomous), unit is boolean (yes/no); (deg) has university education (dichotomous), unit is boolean (yes/no).
The outcome is: (ss) stress score at age 40 (continuous), unit is "points".
Linear Correlation
Linear Correlation:
The sum, over all data points, of the product of the two variables' z-scores, divided by (n-1). It tells us whether the relationship between the two variables is negative or positive under a linear assumption (e.g., a perfect U-shaped plot has the same correlation as a random plot, corr = 0). It also tells us the strength of the linear relationship: the closer the correlation is to -1 or 1, the stronger the linear relationship.
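The definition above can be sketched in a few lines. This is a minimal illustration in plain Python (not the course's R code); the function name `correlation` and the data are made up for the example.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (denominator n - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Multiply the two z-scores of each data point, then sum.
    z_products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

x = [-2, -1, 0, 1, 2]
r_line = correlation(x, [2 * v + 1 for v in x])  # points on an increasing line
r_u = correlation(x, [v ** 2 for v in x])        # points on a perfect U-shape
print(r_line, r_u)
```

The line gives a correlation of 1 and the U-shape gives 0 (up to floating-point rounding), even though the U-shape is a perfect, non-linear relationship.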
Sensitivity to outliers
- mean is sensitive, median is robust
- SD is sensitive, IQR is robust
- correlation is sensitive
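A quick way to see the first two points is to replace one value with an outlier and compare the summaries. The sketch below uses Python's `statistics` module with made-up data; the helper `iqr` is a name introduced here, and quartile conventions differ between tools, so the IQR value itself may not match R's.

```python
import statistics

def iqr(values):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # last value replaced by an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "SD:", round(statistics.stdev(data), 2),
          "IQR:", iqr(data))
```

In this example the mean and SD shift noticeably once the outlier is added, while the median and IQR stay the same.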
Simple Linear Regression
Linearity: The mean of the outcome changes linearly with the values of the predictor.
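Description | Formula |
---|---|
True Line | $\mu_{Y \mid X} = \beta_0 + \beta_1 X$ |
Fitted Value of $Y$ | $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ |
Estimated Residual | $\hat{\epsilon} = Y - \hat{Y}$ |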
mean |
$\hat{\epsilon}$: residual, the vertical distance between the observed value and the regression line. The mean of the residuals will always be zero.
- How far off a person’s stress score is from the number we expected based on the linear relationship with the number of pages of Berserk read.
$\hat{\beta}_1$: slope, can be scaled
- The slope tells us how much higher or lower the stress score at age 40 ($Y$) is, on average, for each additional page of Berserk read ($X$). In this case, stress scores are on average $\hat{\beta}_1$ points higher for each additional page of Berserk read, or equivalently, stress scores are on average $k \hat{\beta}_1$ points higher for each $k$ additional pages of Berserk read.
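- (Dichotomous) The slope is the difference in average stress score between the two groups, e.g. those who sleep more than 8 hours per day versus those who do not.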
$\hat{\beta}_0$: y-intercept
- The y-intercept, $\hat{\beta}_0$ (actual number), is the average stress score ($Y$) of those who have read 0 pages of Berserk ($X = 0$).
We can use the estimated regression line to predict an expected (mean) value for a new unit where we know its value of $X$.
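The pieces above (slope, intercept, residuals, prediction) can be tied together in a small sketch. It is written in plain Python rather than R, the helper `fit_line` is a name introduced here, and all data values are invented for illustration, not taken from a real study.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: pages of Berserk read (num.p) and stress score at age 40 (ss).
num_p = [0, 100, 200, 300, 400]
ss = [50, 46, 45, 40, 39]

b0, b1 = fit_line(num_p, ss)
fitted = [b0 + b1 * x for x in num_p]
residuals = [y - f for y, f in zip(ss, fitted)]
mean_residual = sum(residuals) / len(residuals)  # zero up to rounding error

# Expected (mean) stress score for a new person who has read 250 pages.
predicted_250 = b0 + b1 * 250

# Dichotomous predictor (s8 coded 0 = no, 1 = yes): the slope equals the
# difference between the two group means, and the intercept is the mean
# of the 0 group.
s8 = [0, 0, 1, 1]
ss_small = [50, 54, 40, 44]
d0, d1 = fit_line(s8, ss_small)

print(b0, b1, mean_residual, predicted_250, d0, d1)
```

With these invented numbers the slope is negative, so the interpretation would read "points lower" rather than "points higher" for each additional page.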
For summary
RMSE: How far are the actual stress scores of people at age 40 from the number we expected based on the number of pages of Berserk read?
- On average, people’s actual stress score is about (RMSE) points more or less than the stress score expected based on the number of Berserk pages read.
Interpret $R^2$:
- Put the answer into the units of the actual values.
- ($R^2 \times 100$)% of the variation in stress score is accounted for by the number of pages of Berserk read and its linear relationship with stress score.
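Both summary numbers can be computed by hand from the residuals. The sketch below is plain Python with made-up data; it assumes RMSE is computed with an (n - 2) denominator, as in the residual standard error reported by R's `summary()` for one predictor, which may differ from the convention used in class.

```python
import math

# Made-up data: pages of Berserk read and stress score at age 40.
num_p = [0, 100, 200, 300, 400]
ss = [50, 46, 45, 40, 39]

n = len(num_p)
mean_x = sum(num_p) / n
mean_y = sum(ss) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(num_p, ss))
         / sum((x - mean_x) ** 2 for x in num_p))
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(num_p, ss)]
sse = sum(e ** 2 for e in residuals)        # variation left unexplained
sst = sum((y - mean_y) ** 2 for y in ss)    # total variation in the outcome

rmse = math.sqrt(sse / (n - 2))  # assumption: n - 2 denominator
r_squared = 1 - sse / sst

print(round(rmse, 3), "points;", round(100 * r_squared, 1), "% of variation")
```

RMSE comes out in the units of the outcome (points), while $R^2$ is unitless and is usually reported as a percentage.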
Check the linearity assumption
- In a residual plot
- (If the residual plot for stress score shows a random spread around the center line) The mean of the residuals in each of the 3 or 4 vertical slices (each with approximately the same number of data points) appears to be close to zero. Hence, there are no clear violations of the linearity assumption.
- (If the residual plot for stress score shows a non-random, curvilinear spread around the center line) The linearity assumption appears to be violated. We see several regions of the residual plot where the mean of the residuals appears to differ from zero: the residuals appear to follow a U-shape. The mean of the residuals for those with the lowest estimated stress scores (under 2) and those with the highest estimated stress scores (over 8) appears to be greater than zero. The mean of the residuals for those with middle estimated stress scores (around 4 to 6) appears to be less than zero.
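The slice check can also be done numerically instead of by eye. The sketch below is plain Python with made-up curved data; `slice_means` is a hypothetical helper that sorts the residuals by fitted value, cuts them into vertical slices of roughly equal size, and averages each slice.

```python
def slice_means(fitted, residuals, k=3):
    """Mean residual in each of k slices, ordered by fitted value."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    n = len(ordered)
    means = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        means.append(sum(chunk) / len(chunk))
    return means

# Made-up data with a curved (quadratic) relationship.
xs = list(range(9))
ys = [x ** 2 for x in xs]

# Fit a straight line by least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

means = slice_means(fitted, residuals, k=3)
print([round(m, 2) for m in means])
```

Here the slice means are positive, negative, then positive, which is the U-shaped pattern that signals a violation of the linearity assumption.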
Categorical to Dichotomous: a categorical variable is converted into multiple boolean (dichotomous) variables.
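A minimal sketch of that conversion, in plain Python. The variable `education_level` and its categories are hypothetical and not part of the example study; `to_indicators` is a helper name introduced here.

```python
def to_indicators(values):
    """Map each category to a list of 0/1 flags, one flag per observation."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical categorical variable with three categories.
education_level = ["high school", "bachelor", "master", "bachelor"]

indicators = to_indicators(education_level)
for category, flags in indicators.items():
    print(category, flags)
```

Each observation has exactly one flag set to 1. In a regression, one category is typically left out as the baseline that the others are compared against.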