Statistical Reasoning with R: Exam 1 Notes for Printing
Used to be printed on paper
Causal Inference
Specific causal question (SCQ)
- SCQ four components
- intervention ("what is the impact of xxx", x)
- event & purpose ("on xxx", x improves y)
- who ("for xxx", group_A, a group)
- alternative or control ("relative to xxx", x_control)
- Example output: What is the impact of x on improving y for group_A relative to x_control?
- Define y: a continuous variable; each individual in group_A has a value of y
Hypothesis
The researchers hypothesize that x will improve y for group_A.
Potential Outcomes of One Treatment (total count = 2)
- What would the outcome y be if an individual in group_A were subjected to x_control?
- What would the outcome y be if the same individual in group_A were subjected to x?
Average Factual Outcome (for treatment effect x)
What is the average y after individuals in group_A are subjected to x?
Average missing counterfactual (for treatment effect x)
What would the average y have been for individuals in group_A had they been subjected to x_control instead of x, with all else remaining the same?
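A minimal R sketch of the average factual outcome and the average missing counterfactual. The data frame `po`, its values, and the constant effect of +2 are made-up assumptions for illustration. In real data only one potential outcome per individual is observed, so the counterfactual column would be missing.

```r
# Hypothetical potential outcomes for six individuals in group_A.
# In practice only one of y_x / y_control is ever observed per individual.
po <- data.frame(
  id        = 1:6,
  y_control = c(10, 12, 9, 11, 13, 8)    # outcome under x_control
)
po$y_x <- po$y_control + 2               # outcome under x (assumed effect of +2)

# Suppose every individual actually received x.
avg_factual        <- mean(po$y_x)        # average factual outcome
avg_counterfactual <- mean(po$y_control)  # average missing counterfactual

# Average treatment effect = factual average minus counterfactual average
ate <- avg_factual - avg_counterfactual
ate
```

Because the counterfactual column is never available in practice, it has to be estimated, for example from a randomized control group.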
Randomization
- Randomization ensures that the average difference in y between the x group and the x_control group is due only to the treatment, because the two groups are on average identical to each other in all other pretreatment characteristics.
- Ensures internal validity
- May lack external validity: the conclusion can be generalized only to the sample and setting of this experiment.
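The randomization argument can be sketched with simulated data. Everything here is an assumption for illustration: the covariate `age`, the sample size, the seed, and the treatment effect of +3. The sketch checks that random assignment balances the pretreatment characteristic and that the difference in means comes out close to the assumed effect.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
n <- 10000
age       <- rnorm(n, mean = 40, sd = 10)    # hypothetical pretreatment characteristic
y_control <- 50 + 0.5 * age + rnorm(n)       # potential outcome under x_control
y_x       <- y_control + 3                   # potential outcome under x (assumed effect of +3)

# Randomly assign half of the individuals to x, the rest to x_control
treated <- sample(rep(c(TRUE, FALSE), each = n / 2))

# Only one potential outcome is observed per individual
y_obs <- ifelse(treated, y_x, y_control)

# Balance check: the pretreatment means should be close across groups
balance_gap <- mean(age[treated]) - mean(age[!treated])

# The difference in means estimates the average treatment effect
diff_means <- mean(y_obs[treated]) - mean(y_obs[!treated])
c(balance_gap = balance_gap, diff_means = diff_means)
```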
Observational
- Not randomized
- Estimates the average MCF (missing counterfactual) using members of group_A who received x_control, but cannot guarantee that the estimate is unbiased
- To be unbiased: no other features may systematically differ between those who were subjected to x and those who were subjected to x_control
Confounders, covariates
- Systematic difference
- Two conditions for a confounder
- it is related to or predicts the outcome y, and it is not observed
- it differs at baseline between the x group and the x_control group
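A hedged simulation of how a confounder biases an observational comparison. The variable `motivation`, the assignment rule, and the true effect of +1 are all assumptions made up for this sketch. The confounder is generated here so the bias can be shown, although in a real study it would typically be unobserved.

```r
set.seed(2)                                  # arbitrary seed
n <- 10000
motivation <- rnorm(n)                       # hypothetical confounder

# Not randomized: more motivated individuals are more likely to receive x
treated <- rbinom(n, size = 1, prob = plogis(2 * motivation)) == 1

# Condition 1: the confounder predicts the outcome (assumed true effect of x is +1)
y <- 5 + 2 * motivation + 1 * treated + rnorm(n)

# Condition 2: the confounder differs at baseline between the two groups
baseline_gap <- mean(motivation[treated]) - mean(motivation[!treated])

# The naive difference in means mixes the treatment effect with the confounder
naive_diff <- mean(y[treated]) - mean(y[!treated])
bias       <- naive_diff - 1                 # distance from the assumed true effect
c(baseline_gap = baseline_gap, naive_diff = naive_diff, bias = bias)
```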
Univariate Summary Statistics And Figures
Goal: find out what values a variable takes and the frequency of each value or range of values: central tendency, spread, shape, notable features.
Variables to identify
- continuous
- discrete
- dichotomous (use mean; for a 0/1 variable the mean is the proportion of 1s)
- categorical (use mode for central tendency)
- ordinal (use median): categorical with meaningful order
- continuous variable (use mean and median): numeric variable
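The measure of central tendency for each variable type can be sketched in R as follows. All data values and variable names are hypothetical. The ordinal median is taken through the integer codes of an ordered factor, which works cleanly here because the number of observations is odd.

```r
# Hypothetical data for illustration
employed  <- c(1, 0, 1, 1, 0)                               # dichotomous (0/1)
color     <- c("red", "blue", "red", "green", "red")        # categorical
education <- factor(c("low", "high", "medium", "low", "medium"),
                    levels = c("low", "medium", "high"),
                    ordered = TRUE)                          # ordinal
income    <- c(32.5, 41.0, 28.3, 55.2, 39.9)                # continuous

# Dichotomous: the mean of a 0/1 variable is the proportion of 1s
prop_employed <- mean(employed)

# Categorical: mode = most frequent category
color_mode <- names(which.max(table(color)))

# Ordinal: median of the ordered levels, via their integer codes (odd n)
edu_median <- levels(education)[median(as.integer(education))]

# Continuous: both mean and median are meaningful
income_mean   <- mean(income)
income_median <- median(income)
```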
Quantiles:
- median: the middle-most value, or the average of the two middle-most values when the number of observations is even
- mean: more sensitive to outliers than the median
- mode: the most frequently appearing value
- first quartile (lower quartile): the value below which about 25% of the data fall
- second quartile: the value below which about 50% of the data fall, which is the median
- third quartile (upper quartile): the value below which about 75% of the data fall
- interquartile range (IQR): the difference between the upper and lower quartiles; measures the spread
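- root mean square: the square root of the mean of the squared values; measures typical absolute magnitude, e.g. of a change in a proportion
- standard deviation: the root mean square of deviations from the mean; a typical value of y is approx. sd() away from its mean.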
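These summary statistics can be computed in base R as sketched below. The vector `y` is made-up data with one outlier, and `rms()` is a small helper defined here, not a base R function. Note that `quantile()` interpolates by default (type = 7) and that `sd()` divides by n - 1, so it differs slightly from the plain root mean square of deviations.

```r
y <- c(2, 4, 4, 5, 7, 9, 11, 30)    # hypothetical data with one outlier (30)

y_mean   <- mean(y)      # pulled upward by the outlier
y_median <- median(y)    # average of the two middle-most values (n is even)

# Quartiles; the default method (type = 7) interpolates between data values
q     <- quantile(y, probs = c(0.25, 0.50, 0.75))
y_iqr <- IQR(y)          # upper quartile minus lower quartile

# Root mean square: square root of the mean of the squared values
rms <- function(v) sqrt(mean(v^2))

# sd() divides by n - 1; the RMS of deviations from the mean divides by n
y_sd      <- sd(y)
y_rms_dev <- rms(y - mean(y))
c(mean = y_mean, median = y_median, IQR = y_iqr, sd = y_sd, rms_dev = y_rms_dev)
```

The mean (9) sits well above the median (6) because of the single outlier, which illustrates why the median is the more robust measure.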
Plots
- Bar plots: summarize the distribution, as proportions (0.3, 0.4, etc.), of a dichotomous variable or a character variable with multiple categories.
- Histogram (for numeric values): first, discretize by creating bins; second, calculate the density of each bin (proportion of observations divided by bin width); third, use the density as the height of the bin.
- Box plot: shows the distribution of numeric values; best for showing variables side-by-side; visualizes the median, upper quartile, lower quartile, and IQR together.
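The three histogram steps can be reproduced by hand and compared with `hist()`, followed by a bar plot and a box plot. This is a sketch on simulated data: the variables `x` and `grp`, the bin width of 5, and the seed are arbitrary assumptions.

```r
set.seed(3)                                  # arbitrary seed
x <- rnorm(200, mean = 50, sd = 10)          # hypothetical numeric variable

# Step 1: discretize by creating bins (bin width of 5 is an arbitrary choice)
breaks <- seq(floor(min(x) / 5) * 5, ceiling(max(x) / 5) * 5, by = 5)
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)

# Step 2: density of each bin = proportion of observations / bin width
density_by_hand <- as.numeric(table(bins)) / length(x) / diff(breaks)

# Step 3: hist() uses the density as the bar height when freq = FALSE
h <- hist(x, breaks = breaks, freq = FALSE, main = "Histogram of x")

# Bar plot of proportions for a categorical variable
grp <- sample(c("a", "b", "c"), 200, replace = TRUE)   # hypothetical categories
barplot(prop.table(table(grp)), ylab = "Proportion")

# Side-by-side box plots of x, one per category
boxplot(x ~ grp, ylab = "x")
```

Because the height is a density, the bar areas (height times bin width) add up to 1.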
Bivariate Relationships
- Scatter plot: bivariate only; shows the relationship between two continuous variables.
Types:
- dichotomous vs. categorical: grouped bar plot
- dichotomous vs. continuous: two box plots, or overlaid or side-by-side histograms
- categorical vs. continuous: side-by-side box plots, one per category
- continuous vs. continuous: scatter plot
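A base R sketch of one plot per pairing listed above. The data are simulated and every variable name (`treated`, `region`, `age`, `y`) is a made-up assumption. The dichotomous vs. continuous case uses overlaid histograms on a shared set of bins.

```r
set.seed(4)                                                        # arbitrary seed
n <- 300
treated <- sample(c("x", "x_control"), n, replace = TRUE)         # dichotomous
region  <- sample(c("north", "south", "west"), n, replace = TRUE) # categorical
age     <- runif(n, 20, 60)                                       # continuous
y       <- 10 + 0.3 * age + rnorm(n)                              # continuous

# dichotomous vs. categorical: grouped bar plot of within-group proportions
tab <- prop.table(table(treated, region), margin = 1)  # each row sums to 1
barplot(tab, beside = TRUE, legend.text = TRUE, ylab = "Proportion")

# dichotomous vs. continuous: overlaid histograms on a common set of bins
brks <- pretty(y, n = 20)
hist(y[treated == "x_control"], breaks = brks, freq = FALSE,
     col = rgb(0, 0, 1, 0.4), main = "y by treatment", xlab = "y")
hist(y[treated == "x"], breaks = brks, freq = FALSE,
     col = rgb(1, 0, 0, 0.4), add = TRUE)

# categorical vs. continuous: side-by-side box plots
boxplot(y ~ region, ylab = "y")

# continuous vs. continuous: scatter plot
plot(age, y, xlab = "age", ylab = "y")
```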