Statistical Reasoning with R: Exam 2 Notes for Printing

Used to be printed on paper


Exam 2 notes

Name: Yi Yang

AndrewID: yiyang5

Example

In an experiment on changing one’s fate and causality, an observational study is conducted on people who have read Berserk.

The predictors are: (num.p) number of pages read in Berserk (continuous), unit is "pages"; (s8) sleeps more than 8 hours per day (dichotomous), unit is boolean (yes/no); (deg) has university education (dichotomous), unit is boolean (yes/no).

The outcome is: (ss) stress score at age 40 (continuous), unit is "points".

Linear Correlation

Linear Correlation: Corr(A,B) = \frac{1}{n-1} \sum^{n}_{i=1} (\frac{a_i - \bar{a}}{SD_a})(\frac{b_i - \bar{b}}{SD_b})

The sum, over all data points, of the product of the two z-scores, divided by (n-1). The sign tells us whether the linear relationship between the two variables is negative or positive. Correlation only measures linear association (e.g. a perfect U-shaped plot can have the same correlation as a random scatter, corr = 0). It also tells us the strength of the linear relationship: the closer it is to -1 or 1, the stronger the linear relationship.
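A minimal sketch of the formula above, written in Python with only the standard library (an assumption; the course itself uses R). The function name corr and the data values are made up for illustration. It checks that a perfectly linear pattern gives a correlation of 1 while a perfect U-shape gives 0.

```python
from statistics import mean, stdev

def corr(a, b):
    """Sample correlation: sum of z-score products divided by (n - 1)."""
    n = len(a)
    a_bar, b_bar = mean(a), mean(b)
    sd_a, sd_b = stdev(a), stdev(b)  # stdev uses the (n - 1) denominator
    products = [((x - a_bar) / sd_a) * ((y - b_bar) / sd_b) for x, y in zip(a, b)]
    return sum(products) / (n - 1)

# Made-up values, for illustration only
x = [-2, -1, 0, 1, 2]
line = [2 * v + 1 for v in x]    # perfectly linear, positive slope
u_shape = [v ** 2 for v in x]    # perfect U-shape

r_line = corr(x, line)      # expected to be 1 up to floating-point error
r_u = corr(x, u_shape)      # expected to be 0 up to floating-point error
print(r_line, r_u)
```

The U-shape has an obvious pattern, yet its correlation is zero, because correlation only captures the linear part of a relationship.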

Sensitivity to outlier

  • mean is sensitive, median is robust
  • SD is sensitive, IQR is robust
  • correlation is sensitive
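A small sketch of the first two bullets, in Python with the standard library (an assumption; the data values and the helper name iqr are made up). One extreme value is swapped into a small data set to see which summaries move.

```python
from statistics import mean, median, stdev, quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Made-up values, for illustration only
clean = [2, 3, 4, 5, 6, 7, 8]
with_outlier = clean[:-1] + [80]   # replace the largest value with an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, "mean:", mean(data), "median:", median(data),
          "SD:", stdev(data), "IQR:", iqr(data))
```

In this example the mean and SD change noticeably, while the median and IQR stay the same.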

Simple Linear Regression

Linearity: The mean of the outcome changes linearly with the values of the predictor.

Description | Formula
True Line | ss_i = \alpha + \beta \times num.p_i + \varepsilon_i
Fitted Value of ss | \hat{ss_i} = \hat{\alpha} + \hat{\beta} \times num.p_i
Estimated Residual | \hat{\varepsilon}_i = ss_i - \hat{ss_i}
Mean of ss | mean(ss_i | num.p_i) = \alpha + \beta \times num.p_i

\varepsilon_i : residual, the vertical distance between the observed value ss_i and the regression line. For a least-squares line with an intercept, the mean of the estimated residuals is always zero.

  • How far off a person’s stress score is from the number we expected based on the linear relationship with the number of pages of Berserk read.
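A minimal sketch of fitting the line and computing estimated residuals, in Python with the standard library (an assumption). The helper fit_line uses the usual least-squares formulas, and the num_p and ss values are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Made-up data, for illustration only
num_p = [0, 100, 200, 300, 400]     # pages of Berserk read
ss = [3.0, 4.5, 4.0, 6.5, 7.0]      # stress score at age 40

alpha_hat, beta_hat = fit_line(num_p, ss)
fitted = [alpha_hat + beta_hat * p for p in num_p]
residuals = [s - f for s, f in zip(ss, fitted)]   # observed minus fitted

print(alpha_hat, beta_hat)
print(residuals, mean(residuals))
```

The residuals are individually nonzero, but their mean comes out as zero up to floating-point error.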

 \beta : slope, can be scaled

  • The slope tells us how much higher/lower the stress score at age 40 (ss) is, on average, for each additional page of Berserk read (num.p). In this case, stress scores are on average \hat{\beta} points higher for each additional page of Berserk read, or equivalently 10 \times \hat{\beta} points higher for each additional 10 pages of Berserk read.
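  • (Dichotomous) For a dichotomous predictor such as deg (coded 1 = yes, 0 = no), the slope is the difference in average stress score between those with and those without a university education.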
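A short sketch of two slope facts, in Python with the standard library (an assumption): rescaling the predictor rescales the slope, and with a 0/1 predictor the least-squares slope equals the difference between the two group means. The helper fit_line, the 0/1 coding of deg, and all data values are assumptions made for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat

# Made-up data, for illustration only
num_p = [0, 100, 200, 300, 400]
deg = [0, 0, 1, 1, 1]               # assumed coding: 1 = yes, 0 = no
ss = [6.0, 7.0, 4.0, 5.0, 3.0]

# Scaling: measure the predictor in units of 10 pages instead of 1 page
_, slope_per_page = fit_line(num_p, ss)
_, slope_per_10_pages = fit_line([p / 10 for p in num_p], ss)

# Dichotomous predictor: slope vs. difference in group means
alpha_deg, slope_deg = fit_line(deg, ss)
mean_no = mean([s for s, d in zip(ss, deg) if d == 0])
mean_yes = mean([s for s, d in zip(ss, deg) if d == 1])

print(slope_per_page, slope_per_10_pages)
print(slope_deg, mean_yes - mean_no, alpha_deg, mean_no)
```

With this coding the intercept also matches the mean of the deg = 0 group, which is the dichotomous version of "average outcome when the predictor is 0".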

 \alpha: y-intercept

  • The y-intercept, \alpha (actual number), is the average stress score (ss) of those who have read 0 pages of Berserk (num.p=0).

We can use the estimated regression line to predict an expected (mean) ss value for a new unit whose value of num.p we know.
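A minimal prediction sketch in Python (an assumption). The estimates alpha_hat and beta_hat are made-up numbers standing in for the output of a fitted regression, and predict_ss is a hypothetical helper name.

```python
# Made-up estimates standing in for a fitted line, for illustration only
alpha_hat = 3.0     # estimated intercept (points)
beta_hat = 0.01     # estimated slope (points per page)

def predict_ss(num_p_new):
    """Expected (mean) stress score for a new unit with the given pages read."""
    return alpha_hat + beta_hat * num_p_new

predicted = predict_ss(250)
print(predicted)
```

The result is the expected (mean) stress score for people who read that many pages, not a guaranteed value for one individual.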

For summary

RMSE: How far are the actual stress scores of people at age 40 from the number we expected based on the number of pages of Berserk read? RMSE = \sqrt{\frac{1}{n} SSR} = \sqrt{\frac{1}{n} \sum_i \hat{\varepsilon}_i^2}

  • On average, people’s actual stress score is \#RMSE(ss_i) points more or less than the stress score expected based on the number of Berserk pages read.
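A minimal sketch of the RMSE formula above, in Python with the standard library (an assumption), using the 1/n version given in these notes. The actual and fitted values are made up for illustration.

```python
from math import sqrt

def rmse(actual, fitted):
    """Root mean squared error: sqrt of SSR divided by n."""
    n = len(actual)
    ssr = sum((a - f) ** 2 for a, f in zip(actual, fitted))  # sum of squared residuals
    return sqrt(ssr / n)

# Made-up values, for illustration only
ss = [3.0, 4.5, 4.0, 6.5, 7.0]          # actual stress scores
ss_hat = [3.0, 4.0, 5.0, 6.0, 7.0]      # fitted stress scores

rmse_value = rmse(ss, ss_hat)
print(rmse_value)   # in the same units as ss (points)
```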

R^2 Interpret:

  • Put the answer into the units of the actual values.
  • \#R^2(ss_i) of the variation in stress score is accounted for by the number of pages read in Berserk through its linear relationship with stress score.
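A minimal sketch of computing R^2, in Python with the standard library (an assumption). It uses the standard definition R^2 = 1 - SSR/SST, which is not written out in these notes, and made-up actual and fitted values.

```python
from statistics import mean

def r_squared(actual, fitted):
    """R^2 = 1 - SSR / SST for fitted values from a least-squares line."""
    y_bar = mean(actual)
    ssr = sum((a - f) ** 2 for a, f in zip(actual, fitted))   # unexplained variation
    sst = sum((a - y_bar) ** 2 for a in actual)               # total variation
    return 1 - ssr / sst

# Made-up values, for illustration only
ss = [3.0, 4.5, 4.0, 6.5, 7.0]          # actual stress scores
ss_hat = [3.0, 4.0, 5.0, 6.0, 7.0]      # fitted stress scores

r2 = r_squared(ss, ss_hat)
print(r2)   # a proportion between 0 and 1, with no units
```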

Check the linearity assumption

  • In a residual plot
  • (If the residual plot of stress score shows a random spread around the center line) The mean of the residuals in each of the 3 or 4 vertical slices (each with approximately the same number of data points) appears to be close to zero. Hence, there are no clear violations of the linearity assumption.
  • (If the residual plot of stress score shows a non-random, curvilinear spread around the center line) The linearity assumption appears to be violated. We see several regions of the ss residual plot where the mean of the residuals appears to differ from zero - the residuals appear to follow a U-shape. The mean of the residuals for those with the lowest estimated stress score (under 2) and those with the highest estimated stress score (over 8) appears to be greater than zero. The mean of the residuals for those with the middle estimated stress score (around 4 - 6) appears to be less than zero.
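A rough numeric version of the slice check, in Python with the standard library (an assumption). The helper slice_means is hypothetical: it sorts points by fitted value, cuts them into k slices of about equal size, and reports the mean residual in each slice. The fitted values and residuals are made up to mimic a U-shape.

```python
from statistics import mean

def slice_means(fitted, residuals, k=3):
    """Mean residual in each of k vertical slices with roughly equal counts."""
    pairs = sorted(zip(fitted, residuals))             # order by fitted value
    n = len(pairs)
    bounds = [round(i * n / k) for i in range(k + 1)]  # slice boundaries
    return [mean([r for _, r in pairs[lo:hi]]) for lo, hi in zip(bounds, bounds[1:])]

# Made-up values that mimic a U-shaped residual plot, for illustration only
fitted = [1, 2, 3, 4, 5, 6, 7, 8, 9]
residuals = [0.9, 0.6, 0.3, -0.8, -1.0, -0.9, 0.2, 0.3, 0.4]

means = slice_means(fitted, residuals, k=3)
print(means)
```

Slice means that are positive at both ends and negative in the middle match the U-shape description above. Slice means all close to zero would be consistent with linearity. Looking at the plot itself is still the main check.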

Categorical to Dichotomous: a categorical variable is used as a predictor by converting it into multiple boolean (dichotomous) variables.
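A minimal sketch of this conversion in Python (an assumption). The variable edu, its levels, and the helper to_indicators are all hypothetical. One level is treated as the reference and gets no indicator of its own, which is a common convention when the regression has an intercept.

```python
def to_indicators(values, reference):
    """One 0/1 indicator per level, except the chosen reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical categorical variable, for illustration only
edu = ["none", "bachelor", "master", "bachelor", "none"]

indicators = to_indicators(edu, reference="none")
for level, column in indicators.items():
    print(level, column)
```

A unit in the reference level has 0 in every indicator, so each indicator's slope compares its level against the reference level.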
