More ANOVAs

Remember you should

Overview

This practice reviews the More ANOVAs lecuture.

Examples

If interaction is significant

Following the memory example from class, read in and check data

memory <- read.table("http://www.statsci.org/data/general/eysenck.txt", header = T,
                     stringsAsFactors = T)
str(memory)
'data.frame':   100 obs. of  3 variables:
 $ Age    : Factor w/ 2 levels "Older","Younger": 2 2 2 2 2 2 2 2 2 2 ...
 $ Process: Factor w/ 5 levels "Adjective","Counting",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Words  : num  8 6 4 6 7 6 5 7 9 7 ...

Let’s put younger level first

library(plyr)
memory$Age <- relevel(memory$Age, "Younger")

and graph

library(Rmisc)
function_output <- summarySE(memory, measurevar="Words", groupvars =
                               c("Age", "Process"), na.rm = T)
library(ggplot2)
ggplot(function_output, aes(x=Age, y=Words,color=Process, 
                                   shape = Process)) +
  geom_line(aes(group=Process, linetype = Process), size=2) +
    geom_point(size = 5) +
  ylab("Words remembered")+ 
  xlab("Age") + 
  ggtitle("Process type interacts with age to impact memory")+
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

There appears to be some interactions. Let’ build a model

memory_interactions <- lm(Words ~ Age * Process, memory)

and check assumptions.

par(mfrow=c(2,2))
plot(memory_interactions)

These appear to be met, so look at output

library(car)
Anova(memory_interactions, type = "III")
Anova Table (Type III tests)

Response: Words
            Sum Sq Df  F value    Pr(>F)    
(Intercept) 2190.4  1 272.9281 < 2.2e-16 ***
Age           72.2  1   8.9963 0.0034984 ** 
Process     1353.7  4  42.1690 < 2.2e-16 ***
Age:Process  190.3  4   5.9279 0.0002793 ***
Residuals    722.3 90                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since interaction is significant, analyze subsets. For example,

memory_interactions_young <- lm(Words ~ Process, memory[memory$Age == "Younger",])
plot(memory_interactions_young)

Anova(memory_interactions_young, type = "III")
Anova Table (Type III tests)

Response: Words
            Sum Sq Df F value    Pr(>F)    
(Intercept) 2190.4  1 343.442 < 2.2e-16 ***
Process     1353.7  4  53.064 < 2.2e-16 ***
Residuals    287.0 45                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant difference in words recalled based on process, but why? Investigate with post-hoc tests.

library(multcomp)
comp_young <- glht(memory_interactions_young, linfct = mcp(Process = "Tukey"))
summary(comp_young)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Words ~ Process, data = memory[memory$Age == "Younger", 
    ])

Linear Hypotheses:
                             Estimate Std. Error t value Pr(>|t|)    
Counting - Adjective == 0      -8.300      1.129  -7.349  < 0.001 ***
Imagery - Adjective == 0        2.800      1.129   2.479  0.11357    
Intentional - Adjective == 0    4.500      1.129   3.984  0.00221 ** 
Rhyming - Adjective == 0       -7.200      1.129  -6.375  < 0.001 ***
Imagery - Counting == 0        11.100      1.129   9.828  < 0.001 ***
Intentional - Counting == 0    12.800      1.129  11.333  < 0.001 ***
Rhyming - Counting == 0         1.100      1.129   0.974  0.86545    
Intentional - Imagery == 0      1.700      1.129   1.505  0.56454    
Rhyming - Imagery == 0        -10.000      1.129  -8.854  < 0.001 ***
Rhyming - Intentional == 0    -11.700      1.129 -10.359  < 0.001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Blocking example

Following feather color example from class:

# more than 2? ####
feather <-  read.csv("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/wiebe_2002_example.csv", stringsAsFactors = T)
str(feather)
'data.frame':   32 obs. of  3 variables:
 $ Bird       : Factor w/ 16 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Feather    : Factor w/ 2 levels "Odd","Typical": 2 2 2 2 2 2 2 2 2 2 ...
 $ Color_index: num  -0.255 -0.213 -0.19 -0.185 -0.045 -0.025 -0.015 0.003 0.015 0.02 ...
set.seed(25)
special <- data.frame(Bird = LETTERS[1:16], Feather = "Special", 
                      Color_index= feather[feather$Feather == "Typical", "Color_index"] +
                        .3 +runif(16,1,1)*.01)
feather <- merge(feather, special, all = T)


Anova(lm(Color_index ~ Feather + Bird, data=feather), type= "III")
Anova Table (Type III tests)

Response: Color_index
             Sum Sq Df  F value    Pr(>F)    
(Intercept) 0.36392  1  59.9538 1.224e-08 ***
Feather     1.67906  2 138.3093 7.208e-16 ***
Bird        0.34649 15   3.8055 0.0008969 ***
Residuals   0.18210 30                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp)
compare <- glht(lm(Color_index ~ Feather + Bird, data=feather), linfct = mcp("Feather" = "Tukey"))
summary(compare)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Color_index ~ Feather + Bird, data = feather)

Linear Hypotheses:
                       Estimate Std. Error t value Pr(>|t|)    
Typical - Odd == 0      0.13712    0.02755   4.978   <1e-04 ***
Special - Odd == 0      0.44712    0.02755  16.232   <1e-04 ***
Special - Typical == 0  0.31000    0.02755  11.254   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
#note comparison doesn't work
Anova(lm(Color_index ~ Feather * Bird, data=feather), type= "III")
Error in Anova.lm(lm(Color_index ~ Feather * Bird, data = feather), type = "III"): residual df = 0

Practice

1

A survey was conducted to see if athletes and non-athletes deal with anger in the same way. Data is @

angry <- read.csv(“https://docs.google.com/spreadsheets/d/e/2PACX-1vSaawG37o1ZUEs1B4keIJpZAY2c5tuljf29dWnzqQ0tHNCzfbz85AlWobYzBQ3nPPXJBLP-FWe4BNZB/pub?gid=1784556512&single=true&output=csv”, stringsAsFactors = T)

and more information is at

http://onlinestatbook.com/case_studies/angry_moods.html.

Focus on the following variables:

Sports 1 = athletes, 2 = non-athletes Gender 1 = males, 2 = females Expression (AE) index of general anger expression: (Anger-Out) + (Anger-In) - (Control-Out) - (Control-In) + 48

Is there any evidence that gender or athlete status impact how anger is expressed?

angry <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSaawG37o1ZUEs1B4keIJpZAY2c5tuljf29dWnzqQ0tHNCzfbz85AlWobYzBQ3nPPXJBLP-FWe4BNZB/pub?gid=1784556512&single=true&output=csv", stringsAsFactors = T)
str(angry)
'data.frame':   78 obs. of  7 variables:
 $ Gender          : int  2 2 2 2 1 1 1 2 2 2 ...
 $ Sports          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Anger.Out       : int  18 14 13 17 16 16 12 13 16 12 ...
 $ Anger.In        : int  13 17 14 24 17 22 12 16 16 16 ...
 $ Control.Out     : int  23 25 28 23 26 25 31 22 22 29 ...
 $ Control.In      : int  20 24 28 23 28 23 27 31 24 29 ...
 $ Anger_Expression: int  36 30 19 43 27 38 14 24 34 18 ...
angry$Gender <- as.factor(angry$Gender)
library(plyr)
angry$Gender <- revalue(angry$Gender, c("1" = "male", 
                                        "2" = "female"))
angry$Sports <- as.factor(angry$Sports)
angry$Sports <- revalue(angry$Sports, c("1" = "athlete",
                                        "2" = "non-athlete"))
library(Rmisc)
anger_summary <- summarySE(angry, measurevar="Anger_Expression", groupvars =
                               c("Sports", "Gender"), na.rm = T)
library(ggplot2)
ggplot(anger_summary, aes(x=Gender, y=Anger_Expression, color=Sports, 
                                   shape = Sports)) +
  geom_point(size = 3) +
  geom_line(aes(group=Sports, linetype =Sports), size=2) +
  geom_errorbar(aes(ymin=Anger_Expression-ci, ymax=Anger_Expression+ci), size=1.5) +
  ylab("Anger level")+ 
  xlab("Gender") + 
  scale_shape_discrete(guide=FALSE)+
  scale_linetype_discrete(guide=FALSE)+
  ggtitle("Anger level among groups")+
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))
Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
ggplot2 3.3.4.
ℹ Please use "none" instead.

I first read in and recoded some data for ease and plotting. I then produced a plot to consider the null hypotheses that

  • the sport an athlete plays does not influence anger level
  • the gender of an athlete does not influence anger level
  • the sport an athlete plays and their gender do not interact to influence anger level
angry_gender <- lm(Anger_Expression ~ Sports * Gender, angry)
plot(angry_gender)

library(car)
Anova(angry_gender, type = "III")
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted

Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted
Anova Table (Type III tests)

Response: Anger_Expression
               Sum Sq Df F value    Pr(>F)    
(Intercept)   11200.1  1 71.8690 1.617e-12 ***
Sports          480.1  1  3.0807   0.08336 .  
Gender           17.7  1  0.1135   0.73711    
Sports:Gender     5.2  1  0.0336   0.85505    
Residuals     11532.2 74                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#remove interaction since not significant
angry_gender <- lm(Anger_Expression ~ Sports + Gender, angry)
plot(angry_gender)

Anova(angry_gender, type = "III") #only differs among those who play sports
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted
Anova Table (Type III tests)

Response: Anger_Expression
             Sum Sq Df  F value    Pr(>F)    
(Intercept) 17367.1  1 112.8964 < 2.2e-16 ***
Sports       1357.2  1   8.8227  0.003995 ** 
Gender         16.3  1   0.1061  0.745501    
Residuals   11537.4 75                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I then analyzed the data using factorial ANOVA. The outcome is continuous and both explanatory variables are categorical. The design is also fully randomized. Resildual plots indicated all assumptions were met (there is no pattern in the residuals and they are normally distributed). Analysis shows an insignificant interaction (F1,74=.04, p=.855) between sport and gender, so I removed the interaction term. The reduced model showed anger levels differed among athletes and non-athletes but not by gender. There was no need for post-hoc tests (only 2 levels/groups for each categorial variable.)

2

A professor carried out a long-term study to see how various factors impacted pulse rate before and after exercise. Data can be found at http://www.statsci.org/data/oz/ms212.txt With more info at http://www.statsci.org/data/oz/ms212.html. Is there evidence that frequency of exercise (Exercise column) and gender impact change in pulse rate for students who ran (Ran column = 1)?

pulse <- read.table("http://www.statsci.org/data/oz/ms212.txt", header = T, 
                    stringsAsFactors = T)
pulse$Exercise <- factor(pulse$Exercise)
library(plyr)
pulse$Exercise <- revalue(pulse$Exercise, c("1" = "high", 
                                            "2" = "moderate", 
                                            "3" = "low"))
pulse$Gender <- factor(pulse$Gender)
pulse$Gender <- revalue (pulse$Gender, c("1" = "male", "2" = "female"))
pulse$change <- pulse$Pulse2 - pulse$Pulse1
change_summary <- summarySE(pulse[pulse$Ran == 1, ], measurevar="change", groupvars =
                               c("Exercise", "Gender"), na.rm = T)
Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
ggplot(change_summary, aes(x=Gender, shape = Exercise, color = Exercise,
                           y=change)) +
  geom_point(size = 3) +
  geom_line(aes(group=Exercise, linetype =Exercise), size=2) +
  geom_errorbar(aes(ymin=change-ci, ymax=change+ci), size=1.5) +
  ylab("Change in pulse \n (beats per minute)") +
  scale_color_discrete(name = "Exercise level")+
  scale_shape_discrete(guide=FALSE)+
  scale_linetype_discrete(guide=FALSE)+
  ggtitle("Change in pulse does \n not differ among groups") +
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))

I first read in and recoded some data for ease and plotting. I then produced a plot to consider the null hypotheses that

  • exercise level does not influence change in pulse rate
  • gender does not influence change in pulse rate
  • gender and exercise level do not interact to influence change in pulse rate
exercise <- lm(change ~ Gender * Exercise, pulse[pulse$Ran == 1, ])
summary(exercise)

Call:
lm(formula = change ~ Gender * Exercise, data = pulse[pulse$Ran == 
    1, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-44.231 -11.300   1.769  10.083  48.444 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     45.000      8.746   5.145 7.44e-06 ***
Genderfemale                    35.000     23.139   1.513   0.1383    
Exercisemoderate                 9.231     10.573   0.873   0.3879    
Exerciselow                     12.400     12.972   0.956   0.3449    
Genderfemale:Exercisemoderate  -38.231     24.678  -1.549   0.1292    
Genderfemale:Exerciselow       -46.844     26.043  -1.799   0.0796 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.42 on 40 degrees of freedom
Multiple R-squared:  0.0828,    Adjusted R-squared:  -0.03185 
F-statistic: 0.7222 on 5 and 40 DF,  p-value: 0.6107
Anova(exercise, type = "III")
Anova Table (Type III tests)

Response: change
                 Sum Sq Df F value    Pr(>F)    
(Intercept)     12150.0  1 26.4739 7.444e-06 ***
Gender           1050.0  1  2.2879    0.1383    
Exercise          496.3  2  0.5407    0.5865    
Gender:Exercise  1488.8  2  1.6220    0.2102    
Residuals       18357.7 40                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#rerun without interaction
exercise <- lm(change ~ Gender + Exercise, pulse[pulse$Ran == 1, ])
summary(exercise)

Call:
lm(formula = change ~ Gender + Exercise, data = pulse[pulse$Ran == 
    1, ])

Residuals:
   Min     1Q Median     3Q    Max 
-43.99 -14.89   2.23  11.91  45.19 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        50.391      8.273   6.091 2.94e-07 ***
Genderfemale       -2.738      6.770  -0.404    0.688    
Exercisemoderate    3.603      9.572   0.376    0.708    
Exerciselow         1.155     10.617   0.109    0.914    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.74 on 42 degrees of freedom
Multiple R-squared:  0.008416,  Adjusted R-squared:  -0.06241 
F-statistic: 0.1188 on 3 and 42 DF,  p-value: 0.9485
Anova(exercise, type = "III") #no significance
Anova Table (Type III tests)

Response: change
             Sum Sq Df F value    Pr(>F)    
(Intercept) 17532.0  1 37.1019 2.937e-07 ***
Gender         77.3  1  0.1636    0.6879    
Exercise       97.1  2  0.1028    0.9025    
Residuals   19846.5 42                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I then analyzed the data using factorial ANOVA. The outcome is continuous and both explanatory variables are categorical. The design is also fully randomized. Residual plots indicated all assumptions were met (there is no pattern in the residuals and they are normally distributed). Analysis shows an insignificant interaction (F2,40=1.2, p=.21) between exercise level and gender, so I removed the interaction term. The reduced model showed neither gender (F1,42=.16, p =.69) or exercise level (F2,42=.1, p=.90) influenced change in pulse rate, so I failed to reject the related null hypotheses.

3

Data from Valdez et al 2023 is available @ https://docs.google.com/spreadsheets/d/e/2PACX-1vT2gaLu6pyRMlcbzarn3ej4bFmT_iHvrlNWJYSdrsLdUWIjcJi7rU11-ipvYpGnqD9qLDnbhNd2sDUW/pub?gid=1707080634&single=true&output=csv.

Import it into to R and

  • determine how the snail grazing and nitrogen levels impact number of flowering shoots(Shoot.density..m2)
  • construct a plot to showcase your analysis
valdez_2023 <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT2gaLu6pyRMlcbzarn3ej4bFmT_iHvrlNWJYSdrsLdUWIjcJi7rU11-ipvYpGnqD9qLDnbhNd2sDUW/pub?gid=1707080634&single=true&output=csv", stringsAsFactors = T)
shoot_model <-lm( Shoot.density..m2~Snail.Level + Nitrogen.level + Snail.Level:Nitrogen.level, valdez_2023[valdez_2023$Snail.Level != "uncaged",])
plot(shoot_model)

Anova(shoot_model, type="III")
Anova Table (Type III tests)

Response: Shoot.density..m2
                           Sum Sq Df  F value    Pr(>F)    
(Intercept)                196608  1 271.0588 1.334e-09 ***
Snail.Level                 36238  2  24.9804 5.277e-05 ***
Nitrogen.level               4267  1   5.8824    0.0320 *  
Snail.Level:Nitrogen.level   3100  2   2.1373    0.1607    
Residuals                    8704 12                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No interaction, so I can remove or not.

shoot_model_reduced <- update(shoot_model, .~.- Snail.Level:Nitrogen.level)
Anova(shoot_model_reduced)
Anova Table (Type II tests)

Response: Shoot.density..m2
               Sum Sq Df F value    Pr(>F)    
Snail.Level     45596  2  27.039 1.556e-05 ***
Nitrogen.level  11150  1  13.224  0.002696 ** 
Residuals       11804 14                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both have impacts.

snail_post_hoc <- glht(shoot_model_reduced, linfct = mcp(Snail.Level = "Tukey"))
summary(snail_post_hoc)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Shoot.density..m2 ~ Snail.Level + Nitrogen.level, 
    data = valdez_2023[valdez_2023$Snail.Level != "uncaged", 
        ])

Linear Hypotheses:
                                     Estimate Std. Error t value Pr(>|t|)    
removal - control snails == 0           50.67      16.76   3.022  0.02306 *  
snail addition - control snails == 0   -72.00      16.76  -4.295  0.00201 ** 
snail addition - removal == 0         -122.67      16.76  -7.317  < 0.001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

significant difference among all snails.

shoot_summary <- summarySE(valdez_2023[valdez_2023$Snail.Level != "uncaged",], measurevar = "Shoot.density..m2", groupvars = c("Snail.Level", "Nitrogen.level"))
shoot_summary
     Snail.Level Nitrogen.level N Shoot.density..m2        sd        se
1 control snails     Fertilized 3          256.0000 27.712813 16.000000
2 control snails        without 3          202.6667  9.237604  5.333333
3        removal     Fertilized 3          320.0000 16.000000  9.237604
4        removal        without 3          240.0000 27.712813 16.000000
5 snail addition     Fertilized 3          165.3333 36.950417 21.333333
6 snail addition        without 3          149.3333 33.306656 19.229607
        ci
1 68.84244
2 22.94748
3 39.74620
4 68.84244
5 91.78992
6 82.73832
shoot_summary$Snail.Level <- relevel(shoot_summary$Snail.Level, "removal")
shoot_summary$Snail.Level <- relevel(shoot_summary$Snail.Level, "uncaged")


ggplot(shoot_summary, aes(x=Snail.Level, 
                           y=Shoot.density..m2,
                           fill=Nitrogen.level)) +
  geom_col(color="black", position=position_dodge()) +
  geom_errorbar(aes(ymin=Shoot.density..m2, ymax=Shoot.density..m2+ci), position = position_dodge()) +
  labs(title="Grazing impacts depend on nitrogen levels",
       x= "Grazing level",
       y= expression(paste("# of shoots/" ,m^{-2})))

4

Find an example of a factorial ANOVA from a paper that is related to your work. Make sure you understand the connections between the methods, results, and graphs. Briefly answer the following questions

  • What was the dependent variable?
  • What were the independent variables?
  • Was the interaction significant?
    • If so, how did they interpret findings
    • If not, were the main effects significant?