Chapter 9 The basics: t-tests, ANOVA, and linear regression
The most basic statistical analyses a GRA will use are t-tests, analysis of variance (ANOVA), and linear regression. Here are some tips and references for getting started and creating well-formatted output:
9.1 t-tests & ANOVA
The best R package I have used for doing t-tests and ANOVA is gtsummary, a package that makes beautiful tables for summarizing data.
Here is an example of a t-test with gtsummary, using the built-in ToothGrowth data set (its structure is shown first):
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
library(gtsummary)
# Note: I am assuming here that you have checked the appropriate diagnostics
# (e.g., check residuals for normality, check for outliers, check variances in both groups, etc.)
tbl_summary(ToothGrowth, # data set
by = "supp", # factor/grouping variable
include = "len", # numeric outcome
label = list("len" ~ "Length"), # make a readable label
# t-test is a test of means, so let's summarize data with mean (sd) form
statistic = list(all_continuous() ~ "{mean} ({sd})"),
# choose reasonable number of digits to print
digits = list(everything() ~ 2)) |>
# indicate that you want a t.test
add_p(test = list(all_continuous() ~ "t.test")) |>
# format labels
bold_labels() |>
# remove default 'Characteristic' header
modify_header(label = "")
 | OJ N = 30¹ | VC N = 30¹ | p-value² |
---|---|---|---|
Length | 20.66 (6.61) | 16.96 (8.27) | 0.061 |
¹ Mean (SD)
² Welch Two Sample t-test
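To sanity-check the p-value in the table, the same Welch test can be run directly in base R. This is only a minimal sketch alongside the gtsummary workflow, and the object name `welch_fit` is my own choice for illustration.

```r
# Welch two-sample t-test on the built-in ToothGrowth data (base R only).
# t.test() does not assume equal variances by default (var.equal = FALSE).
welch_fit <- t.test(len ~ supp, data = ToothGrowth)

# Group means, to compare with the Mean (SD) columns of the table
round(welch_fit$estimate, 2)

# p-value, to compare with the p-value column of the table
round(welch_fit$p.value, 3)
```

If the numbers here and in the table disagree, the usual culprit is a different test or a different grouping variable in `add_p()`.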
For ANOVA, we could do this:
tbl_summary(ToothGrowth,
by = dose, # factor/grouping variable is now 'dose'
include = "len",
label = list("len" ~ "Length"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(everything() ~ 2)) |>
# indicate that you want a one-way ANOVA test here, not assuming equal variances
add_p(test = list(all_continuous() ~ "oneway.test")) |>
bold_labels() |>
modify_header(label = "")
 | 0.5 N = 20¹ | 1 N = 20¹ | 2 N = 20¹ | p-value² |
---|---|---|---|---|
Length | 10.61 (4.50) | 19.74 (4.42) | 26.10 (3.77) | <0.001 |
¹ Mean (SD)
² One-way analysis of means (not assuming equal variances)
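The same kind of cross-check works for the ANOVA table. The sketch below runs `oneway.test()` from base R with `var.equal = FALSE`, which is the "not assuming equal variances" version named in the table footnote. The names `tg` and `welch_aov` are hypothetical, chosen just for this example.

```r
# dose is stored as a number in ToothGrowth, so make it a factor explicitly
tg <- transform(ToothGrowth, dose = factor(dose))

# One-way analysis of means, not assuming equal variances (base R only)
welch_aov <- oneway.test(len ~ dose, data = tg, var.equal = FALSE)
welch_aov

# Group means by dose, to compare with the table
# (the last digit may differ slightly because of rounding conventions)
dose_means <- tapply(tg$len, tg$dose, mean)
round(dose_means, 2)
```

If the group variances look similar, setting `var.equal = TRUE` gives the classic one-way ANOVA F-test instead.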
9.2 Linear regression
gtsummary will also create summary tables for linear regression, like so:
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
my_model <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# again, I am assuming you have checked the appropriate linear model diagnostics
tbl_regression(my_model) |>
modify_header(label = "**Feature**", estimate = "**Estimate**") |>
bold_labels()
Feature | Estimate | 95% CI¹ | p-value |
---|---|---|---|
Species | | | |
setosa | — | — | |
versicolor | 2.2 | 2.1, 2.3 | <0.001 |
virginica | 3.1 | 2.9, 3.3 | <0.001 |
Sepal.Length | 0.63 | 0.54, 0.72 | <0.001 |
¹ CI = Confidence Interval
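It can be reassuring to confirm where the numbers in the table come from. As a rough sketch, the estimates and 95% confidence intervals can be pulled straight from the fitted model with base R. The names `est` and `ci` are just illustrative.

```r
# Refit the model so this example runs on its own
my_model <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)

# Coefficient estimates and 95% confidence intervals
est <- coef(my_model)
ci <- confint(my_model, level = 0.95)

# Side-by-side view; the intercept appears here but not in the gtsummary table
round(cbind(Estimate = est, ci), 2)
```

The rows for versicolor and virginica are differences from the reference level (setosa), which is why setosa has no estimate of its own.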
For logistic regression, you will need to do some further customization of the output to get the odds ratio estimates to appear in the table; take a look at this help documentation for examples.
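As a starting point, here is a minimal sketch of one such customization: the `exponentiate = TRUE` argument of `tbl_regression()`, which reports exponentiated coefficients (odds ratios for a logistic model). The choice of the built-in mtcars data and the model `am ~ wt` are my own assumptions for illustration, not taken from the help documentation, and the table step is skipped if the required packages are not installed.

```r
# Logistic regression on the built-in mtcars data:
# transmission type (am, coded 0/1) as a function of weight (wt)
logit_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Odds ratios are the exponentiated coefficients
odds_ratios <- exp(coef(logit_model))
odds_ratios

# tbl_regression() relies on broom and broom.helpers to tidy the model,
# so only build the table when all three packages are available
needed <- c("gtsummary", "broom", "broom.helpers")
if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  gtsummary::tbl_regression(logit_model, exponentiate = TRUE) |>
    gtsummary::bold_labels()
}
```

Without `exponentiate = TRUE`, the table shows coefficients on the log-odds scale, which are harder to interpret for most readers.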
For more on diagnostics/checking assumptions: this article on STHDA gives some examples of checking assumptions for linear regression in R.
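For a quick first pass before reading further, the sketch below runs a few common base R checks on the model from this section. This is my own short selection of diagnostics, not a summary of the STHDA article, and it is not a complete checklist.

```r
# Refit the model so this example runs on its own
my_model <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)

# Four standard diagnostic plots: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
old_par <- par(mfrow = c(2, 2))
plot(my_model)
par(old_par)

# Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(my_model))

# The largest Cook's distances flag potentially influential observations
head(sort(cooks.distance(my_model), decreasing = TRUE), 3)
```

The plots are usually more informative than the formal test, since with larger samples the Shapiro-Wilk test can flag departures from normality that are too small to matter.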