Combining (lots of) numerical and categorical predictors

ANCOVAs and beyond

So far we’ve used linear models to consider how categorical predictors and continuous numerical predictors can predict continuous outcomes. We’ve already considered having multiple categorical predictors (remember factorial ANOVAs). Now we’ll extend those ideas to models that combine multiple categorical and/or continuous predictors. In doing so we’ll introduce (and connect) ANCOVAs, multiple regression, and model selection.

Back to the iris data

We will motivate this (again!) with an example from our favorite iris data. So far we have considered how species impacts sepal length (abbreviated analysis below):

library(Rmisc)
function_output <- summarySE(iris, measurevar="Sepal.Length", groupvars =
                               c("Species"))

library(ggplot2)
ggplot(function_output, aes(y=Sepal.Length, x=Species, fill=Species)) +
  geom_col(aes(fill=Species)) +
    geom_errorbar(aes(ymin=Sepal.Length-ci, ymax=Sepal.Length+ci)) +
  labs(title=expression(paste("Sepal lengths of ",italic("I. species"))),
       y= "Sepal length (cm)",
       x= "Species")

iris_anova <- lm(Sepal.Length~Species, iris)
plot(iris_anova)

library(car)
Anova(iris_anova, type="III")
Anova Table (Type III tests)

Response: Sepal.Length
             Sum Sq  Df F value    Pr(>F)    
(Intercept) 1253.00   1 4728.16 < 2.2e-16 ***
Species       63.21   2  119.26 < 2.2e-16 ***
Residuals     38.96 147                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp)
compare_cont_tukey <- glht(iris_anova, linfct = mcp(Species = "Tukey"))
summary(compare_cont_tukey)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal.Length ~ Species, data = iris)

Linear Hypotheses:
                            Estimate Std. Error t value Pr(>|t|)    
versicolor - setosa == 0       0.930      0.103   9.033   <1e-08 ***
virginica - setosa == 0        1.582      0.103  15.366   <1e-08 ***
virginica - versicolor == 0    0.652      0.103   6.333   <1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

We’ve also considered how petal length influences sepal length (abbreviated analysis below):

ggplot(iris, aes(y=Sepal.Length, x=Petal.Length)) +
  geom_point() +
  labs(title="Sepal length increases with petal length",
       y= "Sepal length (cm)",
       x= "Petal Length (cm)") +
  geom_smooth(method = "lm",se = F)
`geom_smooth()` using formula = 'y ~ x'

iris_regression <- lm(Sepal.Length ~ Petal.Length, iris)
plot(iris_regression)

Anova(iris_regression, type = "III")
Anova Table (Type III tests)

Response: Sepal.Length
             Sum Sq  Df F value    Pr(>F)    
(Intercept)  500.16   1 3018.28 < 2.2e-16 ***
Petal.Length  77.64   1  468.55 < 2.2e-16 ***
Residuals     24.53 148                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_regression)

Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24675 -0.29657 -0.01515  0.27676  1.00269 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.30660    0.07839   54.94   <2e-16 ***
Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

However, what if we wanted to consider the combined (and potentially interacting) effects of petal length and species on sepal length? This is like our 2-way ANOVA but with one predictor being continuous. This is often called an ANCOVA, but it’s just another linear model! Overall, we are decomposing the variance of each data point among various factors. Our use of type III sums of squares lets us ask how much any given factor explains given that other factors are in the model (we’ll explore this more below). Put another way, we might want to know if a factor adds explanatory power to the model (and is not redundant or subsumed by another factor).

Continuing with our focal iris dataset, we can use the same format we used for factorial (2-way) ANOVAs to add the factors.

iris_ancova <- lm(Sepal.Length~Species*Petal.Length, iris)

This model includes an interaction and corresponds to the following null hypotheses (note 2 are copied from previous sections!):

\[ \begin{split} H_O: \mu_{sepal \ length, \ setosa} = \mu_{sepal \ length, \ virginica} = \mu_{sepal \ length, \ versicolor}\\ H_O: \beta_\textrm{(coefficient between sepal and petal length)} = 0\\\\ H_O: \textrm{relationship between petal length and sepal length does not differ among species}\\ \end{split} \]

Once we develop the model, we can visually check the assumptions, which remain

\[ \epsilon \sim i.i.d.\ N(0,\sigma) \]

plot(iris_ancova)

If assumptions appear to be met (as they do here), we can consider impacts of factors. Given the presence of categorical predictors, an Anova table may be informative

Anova(iris_ancova, type="III")
Anova Table (Type III tests)

Response: Sepal.Length
                      Sum Sq  Df  F value    Pr(>F)    
(Intercept)          12.1053   1 106.9378 < 2.2e-16 ***
Species               2.8991   2  12.8054 7.611e-06 ***
Petal.Length          0.4346   1   3.8392    0.0520 .  
Species:Petal.Length  0.3810   2   1.6828    0.1895    
Residuals            16.3007 144                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the interaction is not significant (F2,144 = 1.68, p = 0.19), we have the same 2 options we noted in factorial ANOVAs.

Interpret results with the interaction

We can read results from the full model. These show no interaction between species and petal length (F2,144 = 1.68, p = 0.19) and no impact of petal length (F1,144 = 3.84, p = 0.052), but a significant impact of species (F2,144 = 12.81, p < 0.001). We can follow this up with a post-hoc test:

compare_ancova_tukey <- glht(iris_ancova, linfct = mcp(Species = "Tukey"))
Warning in mcp2matrix(model, linfct = linfct): covariate interactions found --
default contrast might be inappropriate
summary(compare_ancova_tukey)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal.Length ~ Species * Petal.Length, data = iris)

Linear Hypotheses:
                            Estimate Std. Error t value Pr(>|t|)    
versicolor - setosa == 0     -1.8056     0.5984  -3.017  0.00836 ** 
virginica - setosa == 0      -3.1535     0.6341  -4.973  < 1e-04 ***
virginica - versicolor == 0  -1.3479     0.6544  -2.060  0.10175    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

which suggests I. versicolor and I. virginica are similar to each other but different from I. setosa.

However, note the glht output warns that interactions are still present in the model, so this approach may be inappropriate. For ANOVAs we noted we could instead compare means for each combination:

library(emmeans)
emmeans(iris_ancova, pairwise ~ Species*Petal.Length)
$emmeans
 Species    Petal.Length emmean    SE  df lower.CL upper.CL
 setosa             3.76   6.25 0.637 144     4.99     7.51
 versicolor         3.76   5.52 0.070 144     5.38     5.66
 virginica          3.76   4.80 0.163 144     4.48     5.12

Confidence level used: 0.95 

$contrasts
 contrast                                                   estimate    SE  df
 setosa Petal.Length3.758 - versicolor Petal.Length3.758       0.731 0.641 144
 setosa Petal.Length3.758 - virginica Petal.Length3.758        1.449 0.658 144
 versicolor Petal.Length3.758 - virginica Petal.Length3.758    0.719 0.178 144
 t.ratio p.value
   1.140  0.4911
   2.203  0.0740
   4.044  0.0003

P value adjustment: tukey method for comparing a family of 3 estimates 

This approach, however, is less useful for numerical predictors. By default the emmeans package evaluates contrasts at the mean value of the covariate (note that

mean(iris$Petal.Length)
[1] 3.758

is why we see 3.758 spread throughout the output). While we can specify other values for the covariate (a sketch follows), this highlights the difference between combining categorical variable impacts and combining effects of categorical and continuous variables.
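If other covariate values are more meaningful, they can be supplied through the at argument; a minimal sketch (the petal lengths here are arbitrary choices for illustration):

emmeans(iris_ancova, pairwise ~ Species | Petal.Length,
        at = list(Petal.Length = c(2, 4, 6)))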

Remove the interaction

Since the interaction isn’t significant, we can also remove it.

iris_ancova_updated <- lm(Sepal.Length~Species+Petal.Length, iris)
plot(iris_ancova_updated)
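As an aside, the same reduced model can be produced by dropping the term from the existing fit; a minimal sketch using update() (the same helper used for model selection later in this chapter):

# equivalent: drop the interaction term from the existing ANCOVA fit
iris_ancova_updated <- update(iris_ancova, . ~ . - Species:Petal.Length)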

Anova(iris_ancova_updated, type="III")
Anova Table (Type III tests)

Response: Sepal.Length
              Sum Sq  Df  F value    Pr(>F)    
(Intercept)  137.726   1 1205.394 < 2.2e-16 ***
Species        7.843   2   34.323 6.053e-13 ***
Petal.Length  22.275   1  194.950 < 2.2e-16 ***
Residuals     16.682 146                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows us that both petal length and species impact sepal length.

Why did these factors “suddenly” become significant? The issue is that they are not fully independent. Note

petal_anova <- lm(Petal.Length ~ Species, iris)
plot(petal_anova)

Anova(petal_anova, type ="III")
Anova Table (Type III tests)

Response: Petal.Length
            Sum Sq  Df F value    Pr(>F)    
(Intercept) 106.87   1   577.1 < 2.2e-16 ***
Species     437.10   2  1180.2 < 2.2e-16 ***
Residuals    27.22 147                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(petal_anova)

Call:
lm(formula = Petal.Length ~ Species, data = iris)

Residuals:
   Min     1Q Median     3Q    Max 
-1.260 -0.258  0.038  0.240  1.348 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.46200    0.06086   24.02   <2e-16 ***
Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4303 on 147 degrees of freedom
Multiple R-squared:  0.9414,    Adjusted R-squared:  0.9406 
F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16

This shows that species explains a lot (94%, in fact) of the variation in petal length. Thus interactions between species and petal length are hard to define, as each species’ petal lengths mostly differ from the others’.

This relates to an issue we will soon see. When you consider the impacts of multiple factors on a variable, you are assuming they are not related. That is rarely true (except when we set up factorial ANOVAs), so we have to decide how much relatedness is acceptable. As factors become more related, the linear model approach breaks down. On one level, it’s hard to split variance correctly among two similar columns. Mathematically, it also makes the design matrix harder to invert.
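One quick way to see this overlap for the iris model is to correlate the columns of the design matrix itself; a minimal sketch (the column selection drops the intercept):

# correlations among design-matrix columns; the species dummy variables
# are strongly related to petal length
round(cor(model.matrix(~ Species + Petal.Length, iris)[, -1]), 2)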

How does this relate to types of sums of squares?

This also relates to the “types” of sums of squares. Type I sums of squares depend on the order of factors in the model. The function anova uses this type. Note the values differ (but remain significant) depending on the order.

anova(iris_ancova)
Analysis of Variance Table

Response: Sepal.Length
                      Df Sum Sq Mean Sq  F value Pr(>F)    
Species                2 63.212 31.6061 279.2076 <2e-16 ***
Petal.Length           1 22.275 22.2745 196.7730 <2e-16 ***
Species:Petal.Length   2  0.381  0.1905   1.6828 0.1895    
Residuals            144 16.301  0.1132                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and

iris_ancova_reversed <- lm(Sepal.Length~Petal.Length*Species, iris)
anova(iris_ancova_reversed)
Analysis of Variance Table

Response: Sepal.Length
                      Df Sum Sq Mean Sq  F value    Pr(>F)    
Petal.Length           1 77.643  77.643 685.8998 < 2.2e-16 ***
Species                2  7.843   3.922  34.6441 5.206e-13 ***
Petal.Length:Species   2  0.381   0.190   1.6828    0.1895    
Residuals            144 16.301   0.113                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is because Type I sums of squares remove all the variation that can be explained by the first factor (even if it could be attributed to a related factor), then do the same for the second, and so forth. So related factors will have different p-values depending on order (unbalanced designs also impact this).

Type II sums of squares do not depend on order, but they ignore interactions. In doing so they leave out any variation that could be attributed to multiple factors.

Anova(iris_ancova, type="II")
Anova Table (Type II tests)

Response: Sepal.Length
                      Sum Sq  Df  F value    Pr(>F)    
Species               7.8434   2  34.6441 5.206e-13 ***
Petal.Length         22.2745   1 196.7730 < 2.2e-16 ***
Species:Petal.Length  0.3810   2   1.6828    0.1895    
Residuals            16.3007 144                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type III sums of squares include interactions, but ask how much a given factor contributes to the explanatory power of a model given that main effects of other factors and interactions, including those with the focal factor, are already present. This marginal approach may seem odd (e.g., asking if a factor should be included when you are already including interactions with said factor), and it also means the sums of squares we get from decomposing the variance no longer add up to the total sum of squares. However, from a conceptual standpoint they work and are thus commonly used.
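As a quick check of that last point (a sketch, reusing the iris_ancova fit from above), we can compare the summed Type III sums of squares to the total sum of squares for sepal length:

# Type III sums of squares (dropping the intercept row), summed...
sum(Anova(iris_ancova, type = "III")$"Sum Sq"[-1])
# ...versus the total sum of squares; the two need not match
sum((iris$Sepal.Length - mean(iris$Sepal.Length))^2)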

What would interactions look like?

So, what would interactions actually look like, and how would you interpret them? To illustrate this, let’s pretend we visit another valley and sample 3 new iris species (I. baruch, I. hunter, and I. york). We want to see how species and petal length impact sepal length in these species. Let’s make some data to suggest potential outcomes.
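The simulation code for iris_example_species is not shown here; a minimal sketch of how such a data frame could be built (the seed, coefficients, and noise levels are assumptions for illustration only):

set.seed(42)
iris_example_species <- data.frame(
  Species = factor(rep(c("baruch", "hunter", "york"), each = 25)),
  Petal_Length = runif(75, min = 2, max = 4))
# no impact of anything: pure noise
iris_example_species$Sepal_no_impacts <- rnorm(75, mean = 4)
# petal length matters; species does not
iris_example_species$Sepal_no_impact_species <-
  2 * iris_example_species$Petal_Length + rnorm(75)
# species matters; petal length does not
iris_example_species$Sepal_no_relationship_petal <-
  rep(c(2, 4, 6), each = 25) + rnorm(75)
# both matter, with the same slope for every species
iris_example_species$Sepal_no_interaction <-
  2 * iris_example_species$Petal_Length +
  rep(c(2, 4, 6), each = 25) + rnorm(75)
# both matter, and the slope differs among species (an interaction)
iris_example_species$Sepal_interaction <-
  rep(c(-2, 0.5, 5), each = 25) * iris_example_species$Petal_Length +
  rnorm(75)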

No impact of petal length or species

First, we might find no relationship between the variables. That might look something like this:

ggplot(iris_example_species, aes(x= Petal_Length, y = Sepal_no_impacts, color = Species)) +
  geom_point()+
  ylab("Sepal Length") +
  xlab("Petal Length") +
  ggtitle("Impact of petal length and species on sepal length") +
    geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

Analysis would indicate (assumption plots not shown here to allow focus on interpreting interactions)

Anova(lm( Sepal_no_impacts~ Petal_Length * Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_impacts
                      Sum Sq Df F value    Pr(>F)    
(Intercept)          15.5910  1 50.7890 7.838e-10 ***
Petal_Length          0.2812  1  0.9162    0.3418    
Species               0.4198  2  0.6838    0.5081    
Petal_Length:Species  0.4781  2  0.7788    0.4630    
Residuals            21.1813 69                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

no impact of interaction, so we could drop it

Anova(lm( Sepal_no_impacts~ Petal_Length + Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_impacts
             Sum Sq Df  F value Pr(>F)    
(Intercept)  61.439  1 201.3986 <2e-16 ***
Petal_Length  0.007  1   0.0229 0.8801    
Species       0.043  2   0.0700 0.9325    
Residuals    21.659 71                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

but we still find no main effects.

Impact of petal length but not species

Alternatively, we might find a situation where there is a relationship between petal length and sepal length, but no impact of species

ggplot(iris_example_species, aes(x= Petal_Length, y = Sepal_no_impact_species, color = Species)) +
  geom_point()+
  ylab("Sepal Length") +
  xlab("Petal Length") +
  ggtitle("Impact of petal length and species on sepal length") +
    geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

Analysis would indicate (assumption plots not shown here to allow focus on interpreting interactions)

Anova(lm( Sepal_no_impact_species~ Petal_Length * Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_impact_species
                     Sum Sq Df F value    Pr(>F)    
(Intercept)           0.233  1  0.2086    0.6493    
Petal_Length         35.555  1 31.8760 3.375e-07 ***
Species               1.233  2  0.5528    0.5778    
Petal_Length:Species  1.769  2  0.7931    0.4565    
Residuals            76.964 69                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

no impact of interaction, so we could drop it

Anova(lm( Sepal_no_impact_species~ Petal_Length + Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_impact_species
             Sum Sq Df F value    Pr(>F)    
(Intercept)   0.311  1  0.2803    0.5981    
Petal_Length 89.619  1 80.8172 2.451e-13 ***
Species       2.196  2  0.9901    0.3766    
Residuals    78.733 71                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and note only petal length impacts sepal length. We could use summary to see the impact

summary(lm( Sepal_no_impact_species~ Petal_Length + Species, iris_example_species))

Call:
lm(formula = Sepal_no_impact_species ~ Petal_Length + Species, 
    data = iris_example_species)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1913 -0.5082 -0.0111  0.6568  2.4913 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.3519     0.6646   0.529    0.598    
Petal_Length    1.9663     0.2187   8.990 2.45e-13 ***
Specieshunter  -0.3474     0.2979  -1.166    0.247    
Speciesyork    -0.3805     0.3015  -1.262    0.211    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.053 on 71 degrees of freedom
Multiple R-squared:  0.5375,    Adjusted R-squared:  0.518 
F-statistic: 27.51 on 3 and 71 DF,  p-value: 6.509e-12

Impact of species but not petal length

We could also see an impact of species on sepal length, but no relationship between petal length and sepal length.

ggplot(iris_example_species, aes(x= Petal_Length, y = Sepal_no_relationship_petal, color = Species)) +
  geom_point()+
  ylab("Sepal Length") +
  xlab("Petal Length") +
  ggtitle("Impact of petal length and species on sepal length") +
    geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

Analysis would indicate (assumption plots not shown here to allow focus on interpreting interactions)

Anova(lm( Sepal_no_relationship_petal~ Petal_Length * Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_relationship_petal
                     Sum Sq Df F value   Pr(>F)   
(Intercept)           0.928  1  0.7644 0.384996   
Petal_Length          0.373  1  0.3070 0.581305   
Species              12.282  2  5.0583 0.008914 **
Petal_Length:Species  4.690  2  1.9314 0.152699   
Residuals            83.771 69                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

no impact of interaction, so we could drop it

Anova(lm( Sepal_no_relationship_petal~ Petal_Length + Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_relationship_petal
              Sum Sq Df F value    Pr(>F)    
(Intercept)    8.924  1  7.1629  0.009236 ** 
Petal_Length   0.052  1  0.0421  0.837988    
Species      230.679  2 92.5730 < 2.2e-16 ***
Residuals     88.461 71                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and we find only species impacts sepal length. In that case, we need a post-hoc follow up.

summary(glht(lm( Sepal_no_relationship_petal~ Petal_Length + Species, iris_example_species), 
             linfct =mcp(Species = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal_no_relationship_petal ~ Petal_Length + Species, 
    data = iris_example_species)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
hunter - baruch == 0   2.2703     0.3157   7.190  < 1e-08 ***
york - baruch == 0     4.3452     0.3196  13.597  < 1e-08 ***
york - hunter == 0     2.0750     0.3190   6.505 1.77e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Impact of species and petal length, but no interaction

We also might see a difference among the species on sepal length and a relationship between petal length and sepal length, but find the relationship is the same for all species

ggplot(iris_example_species, aes(x= Petal_Length, y = Sepal_no_interaction, color = Species)) +
  geom_point()+
  ylab("Sepal Length") +
  xlab("Petal Length") +
  ggtitle("Impact of petal length and species on sepal length") +
    geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

Analysis would indicate (assumption plots not shown here to allow focus on interpreting interactions)

Anova(lm( Sepal_no_interaction~ Petal_Length * Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_interaction
                     Sum Sq Df F value    Pr(>F)    
(Intercept)           2.019  1  2.0815 0.1536187    
Petal_Length         33.579  1 34.6162 1.305e-07 ***
Species              17.167  2  8.8488 0.0003794 ***
Petal_Length:Species  1.916  2  0.9876 0.3776779    
Residuals            66.933 69                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

no impact of interaction, so we could drop it

Anova(lm( Sepal_no_interaction~ Petal_Length + Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_no_interaction
              Sum Sq Df  F value Pr(>F)    
(Intercept)    6.700  1   6.9096 0.0105 *  
Petal_Length 111.746  1 115.2368 <2e-16 ***
Species      225.299  2 116.1681 <2e-16 ***
Residuals     68.849 71                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given this, we could focus post-hoc tests on which species differ from which others (since the relationship between sepal and petal length is the same for all species).

summary(glht(lm( Sepal_no_interaction~ Petal_Length + Species, iris_example_species), 
             linfct =mcp(Species = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Sepal_no_interaction ~ Petal_Length + Species, data = iris_example_species)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
hunter - baruch == 0  -0.3612     0.2785  -1.297    0.402    
york - baruch == 0     3.5372     0.2819  12.546   <1e-04 ***
york - hunter == 0     3.8983     0.2814  13.852   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Here we would note that only I. york is different from the other species.

Impact of species and petal length that differs among species

Finally, we could find a relationship between petal length and sepal length that differs among the species.

ggplot(iris_example_species, aes(x= Petal_Length, y = Sepal_interaction, color = Species)) +
  geom_point()+
  ylab("Sepal Length") +
  xlab("Petal Length") +
  ggtitle("Impact of petal length and species on sepal length") +
    geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

Analysis would indicate (assumption plots not shown here to allow focus on interpreting interactions)

Anova(lm( Sepal_interaction~ Petal_Length * Species, iris_example_species), 
      type = "III")
Anova Table (Type III tests)

Response: Sepal_interaction
                      Sum Sq Df  F value    Pr(>F)    
(Intercept)            7.076  1   8.0267  0.006038 ** 
Petal_Length          38.452  1  43.6177  6.88e-09 ***
Species                3.353  2   1.9015  0.157092    
Petal_Length:Species 227.334  2 128.9368 < 2.2e-16 ***
Residuals             60.828 69                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions do exist. This means we can’t interpret a single “general” relationship, so we need to examine each species separately using regression.
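As an aside, the emtrends function in the emmeans package can estimate and compare the species-specific slopes directly; a sketch (output not shown):

# estimate the petal-length slope for each species and compare them pairwise
emtrends(lm(Sepal_interaction ~ Petal_Length * Species, iris_example_species),
         pairwise ~ Species, var = "Petal_Length")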

summary(lm(Sepal_interaction ~ Petal_Length, 
         iris_example_species[iris_example_species$Species == "baruch",]))

Call:
lm(formula = Sepal_interaction ~ Petal_Length, data = iris_example_species[iris_example_species$Species == 
    "baruch", ])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7194 -0.5504 -0.1860  0.4736  1.7067 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.9734     0.9767   3.044  0.00576 ** 
Petal_Length  -2.3663     0.3335  -7.097 3.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8738 on 23 degrees of freedom
Multiple R-squared:  0.6865,    Adjusted R-squared:  0.6728 
F-statistic: 50.36 on 1 and 23 DF,  p-value: 3.144e-07
Anova(lm(Sepal_interaction ~ Petal_Length, 
         iris_example_species[iris_example_species$Species == "baruch",]), 
      type="III")
Anova Table (Type III tests)

Response: Sepal_interaction
             Sum Sq Df F value    Pr(>F)    
(Intercept)   7.076  1  9.2676  0.005758 ** 
Petal_Length 38.452  1 50.3604 3.144e-07 ***
Residuals    17.561 23                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(Sepal_interaction ~ Petal_Length, 
           iris_example_species[iris_example_species$Species == "hunter",]))

Call:
lm(formula = Sepal_interaction ~ Petal_Length, data = iris_example_species[iris_example_species$Species == 
    "hunter", ])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89221 -0.58055  0.00876  0.47006  2.49756 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.2564     1.1463   1.096    0.284
Petal_Length   0.4962     0.3895   1.274    0.215

Residual standard error: 0.9902 on 23 degrees of freedom
Multiple R-squared:  0.06589,   Adjusted R-squared:  0.02528 
F-statistic: 1.622 on 1 and 23 DF,  p-value: 0.2155
Anova(lm(Sepal_interaction ~ Petal_Length, 
         iris_example_species[iris_example_species$Species == "hunter",]), 
      type="III")
Anova Table (Type III tests)

Response: Sepal_interaction
              Sum Sq Df F value Pr(>F)
(Intercept)   1.1779  1  1.2014 0.2844
Petal_Length  1.5907  1  1.6224 0.2155
Residuals    22.5503 23               
summary(lm(Sepal_interaction ~ Petal_Length, 
           iris_example_species[iris_example_species$Species == "york",]))

Call:
lm(formula = Sepal_interaction ~ Petal_Length, data = iris_example_species[iris_example_species$Species == 
    "york", ])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.45150 -0.69660  0.02717  0.83006  1.64698 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.0617     0.9550   4.253    3e-04 ***
Petal_Length   4.9642     0.3024  16.417  3.4e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9491 on 23 degrees of freedom
Multiple R-squared:  0.9214,    Adjusted R-squared:  0.918 
F-statistic: 269.5 on 1 and 23 DF,  p-value: 3.401e-14
Anova(lm(Sepal_interaction ~ Petal_Length, 
         iris_example_species[iris_example_species$Species == "york",]), 
      type="III")
Anova Table (Type III tests)

Response: Sepal_interaction
              Sum Sq Df F value    Pr(>F)    
(Intercept)   16.292  1  18.087 0.0002998 ***
Petal_Length 242.770  1 269.527 3.401e-14 ***
Residuals     20.717 23                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that there is a significant negative relationship (F1,23 = 50.36, p < 0.001) between sepal and petal length for I. baruch, a significant positive relationship (F1,23 = 269.53, p < 0.001) between sepal and petal length for I. york, and no relationship (F1,23 = 1.62, p = 0.22) between sepal and petal length for I. hunter.

Extensions to multiple regression

Just as we can extend the 2-way ANOVA ideas to ANCOVA, it turns out we can extend these ideas further to include even more variables (and their interactions, if we want) using our linear model framework. At each stage we are continuing to partition variance among factors and ask (using our type III sums of squares) how much better a given factor makes a model.

For example, we can return to our FEV data from the previous chapter’s practice problems. Remember, we investigated the impact of age,

#fev data####
fev <- read.table("http://www.statsci.org/data/general/fev.txt", header = T,
                  stringsAsFactors = T)
head(fev)
    ID Age   FEV Height    Sex Smoker
1  301   9 1.708   57.0 Female    Non
2  451   8 1.724   67.5 Female    Non
3  501   7 1.720   54.5 Female    Non
4  642   9 1.558   53.0   Male    Non
5  901   9 1.895   57.0   Male    Non
6 1701   8 2.336   61.0 Female    Non
fev_age <- lm(FEV ~ Age, fev)
plot(fev_age)

Anova(fev_age, type = "III")
Anova Table (Type III tests)

Response: FEV
            Sum Sq  Df F value    Pr(>F)    
(Intercept)   9.89   1  30.707 4.359e-08 ***
Age         280.92   1 872.184 < 2.2e-16 ***
Residuals   210.00 652                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fev_age)

Call:
lm(formula = FEV ~ Age, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.57539 -0.34567 -0.04989  0.32124  2.12786 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.431648   0.077895   5.541 4.36e-08 ***
Age         0.222041   0.007518  29.533  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5675 on 652 degrees of freedom
Multiple R-squared:  0.5722,    Adjusted R-squared:  0.5716 
F-statistic: 872.2 on 1 and 652 DF,  p-value: < 2.2e-16
#age plot####
ggplot(fev, aes(x=Age, y=FEV)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  labs(y="FEV (liters)",
       x= "Age (years)",
       title ="FEV increases with age")
`geom_smooth()` using formula = 'y ~ x'

height,

fev_height <- lm(FEV ~ Height, fev)
plot(fev_height)

Anova(fev_height, type = "III")
Anova Table (Type III tests)

Response: FEV
            Sum Sq  Df F value    Pr(>F)    
(Intercept) 166.25   1  896.33 < 2.2e-16 ***
Height      369.99   1 1994.73 < 2.2e-16 ***
Residuals   120.93 652                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fev_height)

Call:
lm(formula = FEV ~ Height, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75167 -0.26619 -0.00401  0.24474  2.11936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.432679   0.181460  -29.94   <2e-16 ***
Height       0.131976   0.002955   44.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4307 on 652 degrees of freedom
Multiple R-squared:  0.7537,    Adjusted R-squared:  0.7533 
F-statistic:  1995 on 1 and 652 DF,  p-value: < 2.2e-16
#height plot####
ggplot(fev, aes(x=Height, y=FEV)) +
  geom_point() +
  geom_smooth(method = "lm") +
    labs(y="FEV (liters)",
       x= "Height (inches)",
       title ="FEV increases with height")
`geom_smooth()` using formula = 'y ~ x'

and gender

fev_gender <- lm(FEV ~ Sex, fev)
plot(fev_gender) #anova is fine

Anova(fev_gender, type = "III")
Anova Table (Type III tests)

Response: FEV
             Sum Sq  Df  F value    Pr(>F)    
(Intercept) 1910.62   1 2652.756 < 2.2e-16 ***
Sex           21.32   1   29.607 7.496e-08 ***
Residuals    469.60 652                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fev_gender)

Call:
lm(formula = FEV ~ Sex, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.01645 -0.69420 -0.06367  0.58233  2.98055 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.45117    0.04759  51.505  < 2e-16 ***
SexMale      0.36128    0.06640   5.441  7.5e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8487 on 652 degrees of freedom
Multiple R-squared:  0.04344,   Adjusted R-squared:  0.04197 
F-statistic: 29.61 on 1 and 652 DF,  p-value: 7.496e-08
#gender plot ####

#bar chart with error bars ####
library(Rmisc)
function_output <- summarySE(fev, measurevar="FEV", groupvars =
                               c("Sex"))

ggplot(function_output, aes(x=Sex, y=FEV)) +
  geom_col() +
  ylab("FEV") +
  geom_errorbar(aes(ymin=FEV-ci, ymax=FEV+ci), linewidth=1.5) 

on outcomes and found all were significant predictors. However, these variables are correlated. For example, there is a relationship between height and gender

height_gender <- lm(Height ~ Sex, fev)
plot(height_gender) #anova is fine

Anova(height_gender, type = "III")
Anova Table (Type III tests)

Response: Height
             Sum Sq  Df   F value    Pr(>F)    
(Intercept) 1152902   1 36305.026 < 2.2e-16 ***
Sex             537   1    16.917 4.405e-05 ***
Residuals     20705 652                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(height_gender)

Call:
lm(formula = Height ~ Sex, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.0253  -3.7119   0.7881   4.2881  11.9747 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.2119     0.3160 190.539  < 2e-16 ***
SexMale       1.8133     0.4409   4.113  4.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.635 on 652 degrees of freedom
Multiple R-squared:  0.02529,   Adjusted R-squared:  0.0238 
F-statistic: 16.92 on 1 and 652 DF,  p-value: 4.405e-05
#gender plot ####

#bar chart with error bars ####
function_output <- summarySE(fev, measurevar="Height", groupvars =
                               c("Sex"))

ggplot(function_output, aes(x=Sex, y=Height)) +
  geom_col() +
  ylab("Height") +
  geom_errorbar(aes(ymin=Height-ci, ymax=Height+ci)) 

So we may want to know if we can do better with a larger model.

To begin with, notice we are slightly changing the question. We are moving from hypothesis-based analysis to a decision to find the “best” model. However, we are still trying to use a linear model framework, so our predictor variables need to be independent (or at least somewhat independent). We can consider this by examining the relationships among all predictor variables. The pairs.panels function from the psych package is one way to do this; it shows r values (correlations) among variables, and the stars=T argument marks significant relationships with asterisks (any asterisk indicates a p-value < .05):

library(psych)
pairs.panels(fev, stars=T)

Note we see significant relationships between FEV and several variables (e.g., age), but also among several predictor variables (e.g., height and age). This makes sense (people tend to grow for their first 20ish years!), but the r2 between height and age is only about 63%.

cor.test(~ Age + Height, fev)

    Pearson's product-moment correlation

data:  Age and Height
t = 33.118, df = 652, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7615128 0.8188906
sample estimates:
      cor 
0.7919436 
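Squaring that correlation gives the shared variance; a quick check:

# r^2 between age and height is ~0.63, below the cutoffs discussed next
cor(fev$Age, fev$Height)^2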

While there is no hard limit, if 2 variables share an r2 value greater than 80%, only one should be included in the model; stricter cutoffs of 60-70% are even more conservative.

For now, let’s retain all factors and consider how we can carry out model selection.

Option 1: Run full model and interpret outcomes

One option is to construct a large model,

fev_full <- lm(FEV ~ Age * Height * Sex * Smoker, fev)

check it,

plot(fev_full)

and interpret outcomes if assumptions are met.

Anova(fev_full, type="III")
Anova Table (Type III tests)

Response: FEV
                      Sum Sq  Df F value  Pr(>F)  
(Intercept)            0.547   1  3.7336 0.05377 .
Age                    0.557   1  3.8008 0.05167 .
Height                 0.426   1  2.9068 0.08869 .
Sex                    0.595   1  4.0635 0.04424 *
Smoker                 0.661   1  4.5134 0.03402 *
Age:Height             0.555   1  3.7859 0.05213 .
Age:Sex                0.256   1  1.7478 0.18663  
Height:Sex             0.589   1  4.0174 0.04546 *
Age:Smoker             0.536   1  3.6616 0.05613 .
Height:Smoker          0.620   1  4.2352 0.04000 *
Sex:Smoker             0.711   1  4.8534 0.02795 *
Age:Height:Sex         0.250   1  1.7090 0.19159  
Age:Height:Smoker      0.495   1  3.3772 0.06657 .
Age:Sex:Smoker         0.423   1  2.8874 0.08976 .
Height:Sex:Smoker      0.708   1  4.8324 0.02829 *
Age:Height:Sex:Smoker  0.428   1  2.9204 0.08795 .
Residuals             93.465 638                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This (and the following) approach assumes you have chosen legitimate predictor variables (you have a reason/mechanism for explaining their impact).

Option 2: Remove or add variables using p-values

Another option is to start with the full model and remove factors (or their interactions) until all are “significant”. You can do this by building models manually or using automated approaches like the drop1 function.

drop1(fev_full, test="F")
Single term deletions

Model:
FEV ~ Age * Height * Sex * Smoker
                      Df Sum of Sq    RSS     AIC F value  Pr(>F)  
<none>                             93.465 -1240.4                  
Age:Height:Sex:Smoker  1   0.42783 93.893 -1239.4  2.9204 0.08795 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice it considers the highest-order interaction first. Since the 4-way interaction term is not significant, we drop it and continue

fev_full_working <- update(fev_full, .~.- Age:Height:Sex:Smoker)
drop1(fev_full_working, test = "F")
Single term deletions

Model:
FEV ~ Age + Height + Sex + Smoker + Age:Height + Age:Sex + Height:Sex + 
    Age:Smoker + Height:Smoker + Sex:Smoker + Age:Height:Sex + 
    Age:Height:Smoker + Age:Sex:Smoker + Height:Sex:Smoker
                  Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                         93.893 -1239.4                      
Age:Height:Sex     1   1.67744 95.570 -1229.8 11.4160 0.0007724 ***
Age:Height:Smoker  1   0.08082 93.974 -1240.8  0.5500 0.4585756    
Age:Sex:Smoker     1   0.00474 93.898 -1241.3  0.0322 0.8575529    
Height:Sex:Smoker  1   1.20821 95.101 -1233.0  8.2226 0.0042735 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we can drop another interaction term

fev_full_working <- update(fev_full_working, .~.- Age:Sex:Smoker)
drop1(fev_full_working, test = "F")
Single term deletions

Model:
FEV ~ Age + Height + Sex + Smoker + Age:Height + Age:Sex + Height:Sex + 
    Age:Smoker + Height:Smoker + Sex:Smoker + Age:Height:Sex + 
    Age:Height:Smoker + Height:Sex:Smoker
                  Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                         93.898 -1241.3                      
Age:Height:Sex     1    1.6786 95.576 -1231.8 11.4412 0.0007621 ***
Age:Height:Smoker  1    0.0811 93.979 -1242.8  0.5528 0.4574650    
Height:Sex:Smoker  1    1.2755 95.173 -1234.5  8.6939 0.0033091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and another

fev_full_working <- update(fev_full_working, .~.- Age:Height:Smoker)
drop1(fev_full_working, test = "F")
Single term deletions

Model:
FEV ~ Age + Height + Sex + Smoker + Age:Height + Age:Sex + Height:Sex + 
    Age:Smoker + Height:Smoker + Sex:Smoker + Age:Height:Sex + 
    Height:Sex:Smoker
                  Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                         93.979 -1242.8                      
Age:Smoker         1    1.4059 95.385 -1235.1  9.5893 0.0020425 ** 
Age:Height:Sex     1    1.7363 95.715 -1232.8 11.8424 0.0006166 ***
Height:Sex:Smoker  1    1.1946 95.173 -1236.5  8.1478 0.0044507 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and so on until all remaining interactions or main effects are significant (as we see now). This approach requires nested models (you compare a model without a given factor to one that has it - think type III sums of squares!).

Alternatively, you can build a simple model

fev_under <- lm(FEV ~ 1, fev)

and add factors to it until none of the remaining additions are significant (not fully shown here).

add1(fev_under, ~ Age + Height + Sex, test = "F")
Single term additions

Model:
FEV ~ 1
       Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
<none>              490.92  -185.58                       
Age     1    280.92 210.00  -738.94  872.184 < 2.2e-16 ***
Height  1    369.99 120.93 -1099.86 1994.731 < 2.2e-16 ***
Sex     1     21.32 469.60  -212.63   29.607 7.496e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fev_under_a <- update(fev_under, .~. + Age)
add1(fev_under_a, ~ . + Height + Sex, test = "F")
Single term additions

Model:
FEV ~ Age
       Df Sum of Sq    RSS      AIC F value    Pr(>F)    
<none>              210.00  -738.94                      
Height  1    95.326 114.67 -1132.62 541.157 < 2.2e-16 ***
Sex     1    17.066 192.94  -792.37  57.583 1.125e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Option 3: Compare models (nested or not) using AIC

You may have noted the AIC values above. AIC (“an information criterion”, formally the Akaike information criterion) uses maximum likelihood to compare models.

Remember this, from the chapter on comparing proportions?

For an easy example of likelihood, let’s go back to a one-sample example of comparing a proportion to a sample and focus on finch data. We have 9 old and 9 young birds, so we have a signal of .5 for p. We can use likelihood to calculate how likely our data was under multiple values of p (ranging from 0 - 1, the only options here) and compare the likelihood of those outcomes White (n.d.).

Similar to calculating sum square errors from models, what is most likely is what we saw, but we know there is always noise in the data. Thankfully, it turns out the ratio of likelihood values follow a \(\chi^2\) distribution and can thus provide a p-value to compare possible models. We will return to likelihood-based approaches later, in fact, as they can be used for any dataset we can generate a model for and can be used to compare multiple models.
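For that finch example, the likelihood calculations take one line in R; a sketch using dbinom (9 birds in one class out of 18 total):

# likelihood of the observed data under candidate values of p;
# the maximum occurs at p = 0.5, the proportion we actually observed
p_candidates <- seq(0, 1, by = 0.1)
dbinom(9, size = 18, prob = p_candidates)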

To do so, it obtains a likelihood value for a model and takes the log of it. It then multiplies that by -2 and adds a penalty parameter multiplied by the number of parameters in the model. Why the penalty? Because an extra parameter will always make a model fit better (or at least no worse).

The true AIC process uses a penalty parameter of 2; other approaches vary this. The Bayesian information criterion (BIC), for example, uses a penalty parameter of log(n), where n is the number of observations. Given this formula, the “best” model has the lowest AIC value.
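Putting that together, with \(\hat{L}\) the maximized likelihood, \(k\) the number of parameters, and \(n\) the number of observations:

\[ AIC = -2\ln(\hat{L}) + 2k \qquad BIC = -2\ln(\hat{L}) + k\ln(n) \]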

AIC and other information criteria can be used to compare nested models (the value of adding or removing a single variable). This can be done in a few ways. The default for drop1 focuses on AIC

drop1(fev_full)
Single term deletions

Model:
FEV ~ Age * Height * Sex * Smoker
                      Df Sum of Sq    RSS     AIC
<none>                             93.465 -1240.4
Age:Height:Sex:Smoker  1   0.42783 93.893 -1239.4

but we can also automate this process using the stepAIC function from the MASS package

library(MASS)
stepAIC(fev_full)
Start:  AIC=-1240.37
FEV ~ Age * Height * Sex * Smoker

                        Df Sum of Sq    RSS     AIC
<none>                               93.465 -1240.4
- Age:Height:Sex:Smoker  1   0.42783 93.893 -1239.4

Call:
lm(formula = FEV ~ Age * Height * Sex * Smoker, data = fev)

Coefficients:
                 (Intercept)                           Age  
                    25.71051                      -1.95406  
                      Height                       SexMale  
                    -0.34974                     -31.17265  
                   SmokerNon                    Age:Height  
                   -28.32508                       0.03009  
                 Age:SexMale                Height:SexMale  
                     1.60183                       0.47343  
               Age:SmokerNon              Height:SmokerNon  
                     1.92939                       0.42323  
           SexMale:SmokerNon            Age:Height:SexMale  
                    34.15472                      -0.02410  
        Age:Height:SmokerNon         Age:SexMale:SmokerNon  
                    -0.02860                      -2.07178  
    Height:SexMale:SmokerNon  Age:Height:SexMale:SmokerNon  
                    -0.52090                       0.03172  

AIC and other information criteria can also be used to compare non-nested models. For example, the dredge function in the MuMIn package takes a large model and compares all possible sub-models.

library(MuMIn)
options(na.action = "na.fail")
auto <- dredge(fev_full)
Fixed term is "(Intercept)"
library(rmarkdown)
paged_table(auto)

The output shows us which factors lead to the smallest AIC value (here using the small sample correction, the AICc value). One benefit of this approach is we may find multiple models have similar AIC values. Delta (difference) values of <2 mean models are supported (Burnham and Anderson 2004).

Models having \(\Delta_i\) ≤ 2 have substantial support (evidence), those in which 4 ≤ \(\Delta_i\) ≤ 7 have considerably less support, and models having \(\Delta_i\) > 10 have essentially no support.

We can actually average supported models using their weights to find a “final” model; here two component models meet the delta < 2 criterion and are averaged.

model.avg(auto, subset = delta < 2)

Call:
model.avg(object = auto, subset = delta < 2)

Component models: 
'1+2+3+4+5+6+7+8+9+10+11+13'    '1+2+3+4+5+6+7+8+9+10+11+12+13'

Coefficients: 
       (Intercept)        Age       Height   SexMale SmokerNon  Age:Height
full      3.103673 -0.2360365 -0.001240504 -4.865981  -5.58118 0.003591144
subset    3.103673 -0.2360365 -0.001240504 -4.865981  -5.58118 0.003591144
       Age:SexMale Age:SmokerNon Height:SexMale Height:SmokerNon
full    -0.4448419     0.1907352     0.07042723       0.07253359
subset  -0.4448419     0.1907352     0.07042723       0.07253359
       SexMale:SmokerNon Age:Height:SexMale Height:SexMale:SmokerNon
full            7.681415        0.007223447               -0.1152972
subset          7.681415        0.007223447               -0.1152972
       Age:Height:SmokerNon
full           -0.001775413
subset         -0.005591655

While AIC and other information criteria are useful approaches, a few caveats should be noted.

  • AIC values are relative only to models fit with the same data in the same way.

    • So there is no “good” AIC value

    • Different programs may use different additive components, so be careful when using different functions or pieces of software to calculate AIC for models you want to compare. In R, an easy example is that the following functions (both in base R) give different output (as noted in their respective help files)

      AIC(fev_full)
      [1] 617.602
      extractAIC(fev_full)
      [1]    16.00 -1240.37
  • AIC outcomes actually correspond to using an \(\alpha\) level of .15 (Steyerberg et al. 2000), so AIC approaches typically favor larger models

  • AIC relies on large-sample approximations

    • AICc corrects for small samples and can always be used.
  • Models that are being compared must focus on the same set of outcomes. One model can’t use fewer rows due to missing data, and different responses (we’ll cover transformations and glms in a few chapters) can’t be compared. Note, however, models may differ in the number of explanatory variables (a quick sanity check is sketched below).
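A minimal sanity check before comparing information criteria, assuming two fitted models named model_a and model_b (hypothetical names):

# both models must be fit to exactly the same observations
stopifnot(nobs(model_a) == nobs(model_b))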

Option 4: Compare r2 (or adjusted r2) values

This option is noted here because it was more common in the past; it should not be used now. If it is used, however, one should compare adjusted r2 values, since r2 always increases as predictors are added and can thus favor larger but less useful models.

Final model checks

Regardless of the method used, the linear models still need to meet basic assumptions. For this reason, the final model (and ideally all those developed en route to it, though this rarely happens) should be assessed for assumptions.

A new approach we have not used previously focuses on variance inflation factors (VIFs) to ensure collinearity (more on this in the next section) isn’t an issue. In general you want these values to be less than 5. When working with normal linear models, especially those with factors that have more than 2 levels, focusing on the generalized form of the VIF for predictors is appropriate.

vif(fev_full_working, type="predictor")
GVIFs computed for predictors
       GVIF Df GVIF^(1/(2*Df))      Interacts With Other Predictors
Age       1 12               1 Height, Sex, Smoker             --  
Height    1 12               1    Age, Sex, Smoker             --  
Sex       1 12               1 Age, Height, Smoker             --  
Smoker    1 12               1    Age, Height, Sex             --  

Important differences

One major difference between ANCOVA/ANOVA and multiple regression is that multiple regression, especially with larger models, is often more focused on outcomes/prediction than explanation. This differs from the focus on null hypothesis significance testing (NHST) in previous chapters, but the mathematical approach remains the same via the linear model.

Next steps

These methods can be extended to other models that are used when linear model assumptions are not met, which is the focus of the next chapter.

References

Burnham, Kenneth P., and David R. Anderson, eds. 2004. Model Selection and Multimodel Inference. New York, NY: Springer. https://doi.org/10.1007/b97636.
Steyerberg, E. W., M. J. Eijkemans, F. E. Harrell, and J. D. Habbema. 2000. “Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets.” Statistics in Medicine 19 (8): 1059–79. https://doi.org/10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0.
White, John Myles. n.d. “Doing Maximum Likelihood Estimation by Hand in r.” http://www.johnmyleswhite.com/notebook/2010/04/21/doing-maximum-likelihood-estimation-by-hand-in-r/.