Linear model extensions

Remember you should

  1. In a study considering how the presence of sea stars changed snail growth patterns, ~25 snails were grown in containers containing 0,1, or 2 seastars.
    Since non-consumptive effects are often threshold based, these treatments levels should be considered as groups (not as a continuous variable!). The data is available at

https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/snail_modified_for_class.csv

FL is the final length of measured snails, and the treatment (coded 1-3) correspond to [1=Control (no predators). 2=1 predator treatment,3=2 predator treatment).

What method would you use to analyze this data and why? Carry out your test, stating your null hypothesis, test assumptions, p-value, and interpretation.
Describe any necessary steps and provide graphics and values as needed. If needed, can you determine which treatments differ from each other?

snail <- read.csv("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/snail_modified_for_class.csv")
head(snail)
  Container Treatment    FL
1         1         3 22.50
2         1         3 20.61
3         1         3 23.13
4         1         3 23.71
5         1         3 24.40
6         1         3 23.62
snail$Treatment <- as.factor(snail$Treatment)
library(plyr)
snail$Treatment_new <- revalue(snail$Treatment, c("1" = "Control", "2" = "Single predator",
                                                  "3" = "Two predators"))

library(lme4)
snail_mm <- lmer(FL ~ Treatment_new + (1|Container), snail)
summary(snail_mm)
Linear mixed model fit by REML ['lmerMod']
Formula: FL ~ Treatment_new + (1 | Container)
   Data: snail

REML criterion at convergence: 1163.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.62147 -0.68438  0.04799  0.62360  2.93899 

Random effects:
 Groups    Name        Variance Std.Dev.
 Container (Intercept) 0.6509   0.8068  
 Residual              8.2124   2.8657  
Number of obs: 234, groups:  Container, 12

Fixed effects:
                             Estimate Std. Error t value
(Intercept)                   23.7426     0.5178  45.857
Treatment_newSingle predator  -2.2159     0.7295  -3.037
Treatment_newTwo predators    -1.8844     0.7353  -2.563

Correlation of Fixed Effects:
            (Intr) Trt_Sp
Trtmnt_nwSp -0.710       
Trtmnt_nwTp -0.704  0.500
plot(snail_mm)

check_mixed_model <- function (model, model_name = NULL) {
  #collection of things you might check for mixed model
  par(mfrow = c(2,3))
  #not sure what this does with mutliple random effects, so stop with 1 for now
  if(length(names(ranef(model))<2)){
    qqnorm(ranef(model, drop = T)[[1]], pch = 19, las = 1, cex = 1.4, main= paste(model_name, 
                                                                                  "\n Random effects Q-Q plot"))
  }
  plot(fitted(model),residuals(model), main = paste(model_name, 
                                                    "\n residuals vs fitted"))
  qqnorm(residuals(model), main =paste(model_name, 
                                       "\nresiduals q-q plot"))
  qqline(residuals(model))
  hist(residuals(model), main = paste(model_name, 
                                      "\nresidual histogram"))
}

check_mixed_model(snail_mm)


library(car)
Anova(snail_mm, type = "III")
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: FL
                 Chisq Df Pr(>Chisq)    
(Intercept)   2102.876  1  < 2.2e-16 ***
Treatment_new   10.681  2   0.004792 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp)
snail_comparison <- glht(snail_mm, linfct = mcp(Treatment_new = "Tukey"))
summary(snail_comparison)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lmer(formula = FL ~ Treatment_new + (1 | Container), data = snail)

Linear Hypotheses:
                                     Estimate Std. Error z value Pr(>|z|)   
Single predator - Control == 0        -2.2159     0.7295  -3.037  0.00665 **
Two predators - Control == 0          -1.8844     0.7353  -2.563  0.02793 * 
Two predators - Single predator == 0   0.3315     0.7326   0.453  0.89329   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
#graph using Rmisc
library(Rmisc)
library(ggplot2)
graph_output <- summarySE(snail, measurevar = "FL", groupvars = "Treatment_new")
bar_graph_with_error_bars <- ggplot(graph_output, 
                                     aes_string(x="Treatment_new", 
                                                y = "FL")) +
  geom_col() + 
  geom_errorbar(aes(ymin = FL - ci, 
                    ymax = FL + ci))+
  xlab("Treatment")+
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))+
ylim(c(0, 30))
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
bar_graph_with_error_bars

Since multiple oysters were measured in each cage, we need to use a random effect to account for cages. You could also block by cages- it takes up more degrees of freedom, but you have plenty here. Results show a significant differce among treatments (Chi^2~2=10.681, p <.01), so I used a Tukey post hoc test to determine which groups differed from others while controlling for the family wise error rate. REsults indicate the presence of a predator impacts length but not the density.

  1. (From OZDasl) The data give the ambient temperature and the number of primary O-rings damaged for 23 of the 24 space shuttle launches before the launch of the space shuttle Challenger on January 20, 1986. (Challenger was the 25th shuttle. One engine was lost at sea and could not be examined.) Each space shuttle contains 6 primary O-rings.

Note these are counts. We can analyze this data using a Poisson distribution or binomial. Make sure you understand why each one is possible, which one is better, and carry out the analysis. Data is available @

http://www.statsci.org/data/general/challenger.txt

rings <- read.table("http://www.statsci.org/data/general/challenger.txt", 
                    header = T)
#can do as poisson
rings_poisson <- glm(Damaged ~ Temp, rings, family = "poisson")
summary(rings_poisson)

Call:
glm(formula = Damaged ~ Temp, family = "poisson", data = rings)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   5.9691     2.7628   2.161   0.0307 *
Temp         -0.1034     0.0430  -2.405   0.0162 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22.434  on 22  degrees of freedom
Residual deviance: 16.834  on 21  degrees of freedom
AIC: 36.061

Number of Fisher Scoring iterations: 6
#note dispersion is ok
library(car)
Anova(rings_poisson, type = "III")
Analysis of Deviance Table (Type III tests)

Response: Damaged
     LR Chisq Df Pr(>Chisq)  
Temp   5.6004  1    0.01796 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#or binomial (preffered as we can add info (number damaged and not!))
rings_binomial <- glm(cbind(Damaged, 6 - Damaged) ~ Temp, rings, family = "binomial")
summary(rings_binomial)

Call:
glm(formula = cbind(Damaged, 6 - Damaged) ~ Temp, family = "binomial", 
    data = rings)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  5.08498    3.05247   1.666   0.0957 .
Temp        -0.11560    0.04702  -2.458   0.0140 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24.230  on 22  degrees of freedom
Residual deviance: 18.086  on 21  degrees of freedom
AIC: 35.647

Number of Fisher Scoring iterations: 5
#note dispersion is ok
Anova(rings_binomial, type = "III")
Analysis of Deviance Table (Type III tests)

Response: cbind(Damaged, 6 - Damaged)
     LR Chisq Df Pr(>Chisq)  
Temp    6.144  1    0.01319 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#compare to lm
rings_lm <- lm(Damaged ~ Temp, rings)
summary(rings_lm)

Call:
lm(formula = Damaged ~ Temp, data = rings)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5608 -0.3944 -0.0854  0.1056  1.8671 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  3.69841    1.21951   3.033  0.00633 **
Temp        -0.04754    0.01744  -2.725  0.01268 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5774 on 21 degrees of freedom
Multiple R-squared:  0.2613,    Adjusted R-squared:  0.2261 
F-statistic: 7.426 on 1 and 21 DF,  p-value: 0.01268
#note dispersion is ok
Anova(rings_lm, type = "III")
Anova Table (Type III tests)

Response: Damaged
            Sum Sq Df F value  Pr(>F)   
(Intercept) 3.0667  1  9.1973 0.00633 **
Temp        2.4762  1  7.4264 0.01268 * 
Residuals   7.0021 21                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since these are counts we need to use a glm to model the data. We could use a Poisson, but the binomial actually includes more information (like how many did not fail!). Both models indicate a significant relationship between temperature and the number or proportion of failed rings. Results are compared to a linear model.

  1. Returning to the whelk length-mass relationship from class, try fitting an exponential curve to the data. As a hint, try
nls(Mass ~ exp(b0 + b1 * Shell.Length), whelk, 
                   start = list(b0 =1, b1=0), na.action = na.omit)

Compare this model to those that assume a linear and power relationship. Data is available @

https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/whelk.csv

whelk <- read.csv("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/whelk.csv")
head(whelk)
   Location   Mass Sex Shell.Length
1 San Diego 126.10   F           NA
2 San Diego 119.92   M       106.21
3 San Diego  40.07   M        75.58
4 San Diego 140.82   F       107.59
5 San Diego  49.70   M        76.23
6 San Diego     NA   M        70.10
summary(whelk)
   Location              Mass             Sex             Shell.Length   
 Length:473         Min.   :  9.906   Length:473         Min.   : 43.58  
 Class :character   1st Qu.: 87.352   Class :character   1st Qu.: 90.50  
 Mode  :character   Median :150.325   Mode  :character   Median :109.89  
                    Mean   :152.590                      Mean   :106.56  
                    3rd Qu.:209.476                      3rd Qu.:122.30  
                    Max.   :403.892                      Max.   :155.28  
                    NA's   :28                           NA's   :33      
library(ggplot2)
whelk_plot <- ggplot(whelk, aes_string(x="Shell.Length", y = "Mass")) +
  geom_point(aes_string(colour = "Location")) + 
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))
whelk_plot
Warning: Removed 61 rows containing missing values or values outside the scale range
(`geom_point()`).

#power fit
whelk_lm <- lm(Mass ~ Shell.Length, whelk, na.action = na.omit)

whelk_power <- nls(Mass ~ b0 * Shell.Length^b1, whelk, 
                   start = list(b0 = 1, b1=3), na.action = na.omit)
whelk_exponential <- nls(Mass ~ exp(b0 + b1 * Shell.Length), whelk, 
                         start = list(b0 =1, b1=0), na.action = na.omit)
library(MuMIn)
AICc(whelk_lm, whelk_power, whelk_exponential)
                  df     AICc
whelk_lm           3 3947.579
whelk_power        3 3800.106
whelk_exponential  3 3841.422
#plot
whelk_plot + geom_smooth(method = "lm", se = FALSE, size = 1.5, color = "orange")+ 
  geom_smooth(method="nls", 
              # look at whelk_power$call
              formula = y ~ b0 * x^b1, 
              method.args = list(start = list(b0 = 1, 
                                              b1 = 3)), 
              se=FALSE, size = 1.5, color = "blue") +
  geom_smooth(method="nls", 
              # look at whelk_exponential$call
              formula = y ~ exp(b0 + b1 * x), 
              method.args = list(start = list(b0 = 1, 
                                              b1 = 0)), 
              se=FALSE, size = 1.5, color = "green")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 61 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 61 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 61 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 61 rows containing missing values or values outside the scale range
(`geom_point()`).

We can use the nls model to consider exponential curve to the data. Various fits may be compared using AIC methods. In this case it appears that the power fit is the best (lowest AIC value).

  1. Going back to the TEAM dataset, remember we found that elevation had no impact on carbon storage. But that was a linear fit. Use a gam (generalized additive model) to see if elevation can be related to carbon storage in an additive model. Note we can use the gamm (generalized additive mixed model) function in the mgcv package to denote mixed effects. For example (from help file)
b2 <- gamm(y~s(x0)+s(x1)+s(x2),family=poisson,
           data=dat,random=list(fac=~1))

Team data is available @

https://raw.github.com/jsgosnell/CUNY-BioStats/blob/master/datasets/team_data_no_spaces.csv

library(mgcv)
library(MuMIn) #for AICc
team <- read.csv("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/team_data_no_spaces.csv", stringsAsFactors = T)
elevation_linear <- gam(PlotCarbon.tonnes ~ Elevation, data = team)
elevation_gam <- gam(PlotCarbon.tonnes ~ s(Elevation), data = team)
elevation_gamm <- gamm(PlotCarbon.tonnes ~s(Elevation), random = list(Site.Name = ~ 1), data = team)
AICc(elevation_gam, elevation_gamm, elevation_linear)
                 df     AICc
elevation_gam     3 648.2139
elevation_gamm    5 634.4652
elevation_linear  3 648.2139

A generalized additive model fits a curve to the dataset (spline in this case). AIC comparison indicates the gam model with a random effect for site is the best fit.