Remember you should
add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
use visual mode or render your file to produce a version that you can see!
render your file to make sure it runs (and that you haven’t been working out of order)
save your work often
commit it via git!
push updates to github
Examples
If interaction is significant
Following the memory example from class, read in and check data
memory <- read.table ("http://www.statsci.org/data/general/eysenck.txt" , header = T,
stringsAsFactors = T)
str (memory)
'data.frame': 100 obs. of 3 variables:
$ Age : Factor w/ 2 levels "Older","Younger": 2 2 2 2 2 2 2 2 2 2 ...
$ Process: Factor w/ 5 levels "Adjective","Counting",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Words : num 8 6 4 6 7 6 5 7 9 7 ...
Let’s put younger level first
library (plyr)
memory$ Age <- relevel (memory$ Age, "Younger" )
and graph
library (Rmisc)
function_output <- summarySE (memory, measurevar= "Words" , groupvars =
c ("Age" , "Process" ), na.rm = T)
library (ggplot2)
ggplot (function_output, aes (x= Age, y= Words,color= Process,
shape = Process)) +
geom_line (aes (group= Process, linetype = Process), size= 2 ) +
geom_point (size = 5 ) +
ylab ("Words remembered" )+
xlab ("Age" ) +
ggtitle ("Process type interacts with age to impact memory" )+
theme (axis.title.x = element_text (face= "bold" , size= 28 ),
axis.title.y = element_text (face= "bold" , size= 28 ),
axis.text.y = element_text (size= 20 ),
axis.text.x = element_text (size= 20 ),
legend.text = element_text (size= 20 ),
legend.title = element_text (size= 20 , face= "bold" ),
plot.title = element_text (hjust = 0.5 , face= "bold" , size= 32 ))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
There appears to be some interactions. Let’ build a model
memory_interactions <- lm (Words ~ Age * Process, memory)
and check assumptions.
par (mfrow= c (2 ,2 ))
plot (memory_interactions)
These appear to be met, so look at output
library (car)
Anova (memory_interactions, type = "III" )
Anova Table (Type III tests)
Response: Words
Sum Sq Df F value Pr(>F)
(Intercept) 2190.4 1 272.9281 < 2.2e-16 ***
Age 72.2 1 8.9963 0.0034984 **
Process 1353.7 4 42.1690 < 2.2e-16 ***
Age:Process 190.3 4 5.9279 0.0002793 ***
Residuals 722.3 90
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since interaction is significant, analyze subsets. For example,
memory_interactions_young <- lm (Words ~ Process, memory[memory$ Age == "Younger" ,])
plot (memory_interactions_young)
Anova (memory_interactions_young, type = "III" )
Anova Table (Type III tests)
Response: Words
Sum Sq Df F value Pr(>F)
(Intercept) 2190.4 1 343.442 < 2.2e-16 ***
Process 1353.7 4 53.064 < 2.2e-16 ***
Residuals 287.0 45
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference in words recalled based on process, but why? Investigate with post-hoc tests.
library (multcomp)
comp_young <- glht (memory_interactions_young, linfct = mcp (Process = "Tukey" ))
summary (comp_young)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = Words ~ Process, data = memory[memory$Age == "Younger",
])
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Counting - Adjective == 0 -8.300 1.129 -7.349 < 0.001 ***
Imagery - Adjective == 0 2.800 1.129 2.479 0.11357
Intentional - Adjective == 0 4.500 1.129 3.984 0.00221 **
Rhyming - Adjective == 0 -7.200 1.129 -6.375 < 0.001 ***
Imagery - Counting == 0 11.100 1.129 9.828 < 0.001 ***
Intentional - Counting == 0 12.800 1.129 11.333 < 0.001 ***
Rhyming - Counting == 0 1.100 1.129 0.974 0.86545
Intentional - Imagery == 0 1.700 1.129 1.505 0.56454
Rhyming - Imagery == 0 -10.000 1.129 -8.854 < 0.001 ***
Rhyming - Intentional == 0 -11.700 1.129 -10.359 < 0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
Blocking example
Following feather color example from class:
# more than 2? ####
feather <- read.csv ("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/wiebe_2002_example.csv" , stringsAsFactors = T)
str (feather)
'data.frame': 32 obs. of 3 variables:
$ Bird : Factor w/ 16 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feather : Factor w/ 2 levels "Odd","Typical": 2 2 2 2 2 2 2 2 2 2 ...
$ Color_index: num -0.255 -0.213 -0.19 -0.185 -0.045 -0.025 -0.015 0.003 0.015 0.02 ...
set.seed (25 )
special <- data.frame (Bird = LETTERS[1 : 16 ], Feather = "Special" ,
Color_index= feather[feather$ Feather == "Typical" , "Color_index" ] +
.3 + runif (16 ,1 ,1 )* .01 )
feather <- merge (feather, special, all = T)
Anova (lm (Color_index ~ Feather + Bird, data= feather), type= "III" )
Anova Table (Type III tests)
Response: Color_index
Sum Sq Df F value Pr(>F)
(Intercept) 0.36392 1 59.9538 1.224e-08 ***
Feather 1.67906 2 138.3093 7.208e-16 ***
Bird 0.34649 15 3.8055 0.0008969 ***
Residuals 0.18210 30
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library (multcomp)
compare <- glht (lm (Color_index ~ Feather + Bird, data= feather), linfct = mcp ("Feather" = "Tukey" ))
summary (compare)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = Color_index ~ Feather + Bird, data = feather)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Typical - Odd == 0 0.13712 0.02755 4.978 <1e-04 ***
Special - Odd == 0 0.44712 0.02755 16.232 <1e-04 ***
Special - Typical == 0 0.31000 0.02755 11.254 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
#note comparison doesn't work
Anova (lm (Color_index ~ Feather * Bird, data= feather), type= "III" )
Error in Anova.lm(lm(Color_index ~ Feather * Bird, data = feather), type = "III"): residual df = 0
Practice
1
A survey was conducted to see if athletes and non-athletes deal with anger in the same way. Data is @
angry <- read.csv(“https://docs.google.com/spreadsheets/d/e/2PACX-1vSaawG37o1ZUEs1B4keIJpZAY2c5tuljf29dWnzqQ0tHNCzfbz85AlWobYzBQ3nPPXJBLP-FWe4BNZB/pub?gid=1784556512&single=true&output=csv ”, stringsAsFactors = T)
and more information is at
http://onlinestatbook.com/case_studies/angry_moods.html .
Focus on the following variables:
Sports 1 = athletes, 2 = non-athletes Gender 1 = males, 2 = females Expression (AE) index of general anger expression: (Anger-Out) + (Anger-In) - (Control-Out) - (Control-In) + 48
Is there any evidence that gender or athlete status impact how anger is expressed?
angry <- read.csv ("https://docs.google.com/spreadsheets/d/e/2PACX-1vSaawG37o1ZUEs1B4keIJpZAY2c5tuljf29dWnzqQ0tHNCzfbz85AlWobYzBQ3nPPXJBLP-FWe4BNZB/pub?gid=1784556512&single=true&output=csv" , stringsAsFactors = T)
str (angry)
'data.frame': 78 obs. of 7 variables:
$ Gender : int 2 2 2 2 1 1 1 2 2 2 ...
$ Sports : int 1 1 1 1 1 1 1 1 1 1 ...
$ Anger.Out : int 18 14 13 17 16 16 12 13 16 12 ...
$ Anger.In : int 13 17 14 24 17 22 12 16 16 16 ...
$ Control.Out : int 23 25 28 23 26 25 31 22 22 29 ...
$ Control.In : int 20 24 28 23 28 23 27 31 24 29 ...
$ Anger_Expression: int 36 30 19 43 27 38 14 24 34 18 ...
angry$ Gender <- as.factor (angry$ Gender)
library (plyr)
angry$ Gender <- revalue (angry$ Gender, c ("1" = "male" ,
"2" = "female" ))
angry$ Sports <- as.factor (angry$ Sports)
angry$ Sports <- revalue (angry$ Sports, c ("1" = "athlete" ,
"2" = "non-athlete" ))
library (Rmisc)
anger_summary <- summarySE (angry, measurevar= "Anger_Expression" , groupvars =
c ("Sports" , "Gender" ), na.rm = T)
library (ggplot2)
ggplot (anger_summary, aes (x= Gender, y= Anger_Expression, color= Sports,
shape = Sports)) +
geom_point (size = 3 ) +
geom_line (aes (group= Sports, linetype = Sports), size= 2 ) +
geom_errorbar (aes (ymin= Anger_Expression- ci, ymax= Anger_Expression+ ci), size= 1.5 ) +
ylab ("Anger level" )+
xlab ("Gender" ) +
scale_shape_discrete (guide= FALSE )+
scale_linetype_discrete (guide= FALSE )+
ggtitle ("Anger level among groups" )+
theme (axis.title.x = element_text (face= "bold" , size= 28 ),
axis.title.y = element_text (face= "bold" , size= 28 ),
axis.text.y = element_text (size= 20 ),
axis.text.x = element_text (size= 20 ),
legend.text = element_text (size= 20 ),
legend.title = element_text (size= 20 , face= "bold" ),
plot.title = element_text (hjust = 0.5 , face= "bold" , size= 32 ))
Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
ggplot2 3.3.4.
ℹ Please use "none" instead.
I first read in and recoded some data for ease and plotting. I then produced a plot to consider the null hypotheses that
the sport an athlete plays does not influence anger level
the gender of an athlete does not influence anger level
the sport an athlete plays and their gender do not interact to influence anger level
angry_gender <- lm (Anger_Expression ~ Sports * Gender, angry)
plot (angry_gender)
library (car)
Anova (angry_gender, type = "III" )
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
Anova Table (Type III tests)
Response: Anger_Expression
Sum Sq Df F value Pr(>F)
(Intercept) 11200.1 1 71.8690 1.617e-12 ***
Sports 480.1 1 3.0807 0.08336 .
Gender 17.7 1 0.1135 0.73711
Sports:Gender 5.2 1 0.0336 0.85505
Residuals 11532.2 74
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#remove interaction since not significant
angry_gender <- lm (Anger_Expression ~ Sports + Gender, angry)
plot (angry_gender)
Anova (angry_gender, type = "III" ) #only differs among those who play sports
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
Anova Table (Type III tests)
Response: Anger_Expression
Sum Sq Df F value Pr(>F)
(Intercept) 17367.1 1 112.8964 < 2.2e-16 ***
Sports 1357.2 1 8.8227 0.003995 **
Gender 16.3 1 0.1061 0.745501
Residuals 11537.4 75
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I then analyzed the data using factorial ANOVA. The outcome is continuous and both explanatory variables are categorical. The design is also fully randomized. Resildual plots indicated all assumptions were met (there is no pattern in the residuals and they are normally distributed). Analysis shows an insignificant interaction (F1,74 =.04, p=.855) between sport and gender, so I removed the interaction term. The reduced model showed anger levels differed among athletes and non-athletes but not by gender. There was no need for post-hoc tests (only 2 levels/groups for each categorial variable.)
2
A professor carried out a long-term study to see how various factors impacted pulse rate before and after exercise. Data can be found at http://www.statsci.org/data/oz/ms212.txt With more info at http://www.statsci.org/data/oz/ms212.html . Is there evidence that frequency of exercise (Exercise column) and gender impact change in pulse rate for students who ran (Ran column = 1)?
pulse <- read.table ("http://www.statsci.org/data/oz/ms212.txt" , header = T,
stringsAsFactors = T)
pulse$ Exercise <- factor (pulse$ Exercise)
library (plyr)
pulse$ Exercise <- revalue (pulse$ Exercise, c ("1" = "high" ,
"2" = "moderate" ,
"3" = "low" ))
pulse$ Gender <- factor (pulse$ Gender)
pulse$ Gender <- revalue (pulse$ Gender, c ("1" = "male" , "2" = "female" ))
pulse$ change <- pulse$ Pulse2 - pulse$ Pulse1
change_summary <- summarySE (pulse[pulse$ Ran == 1 , ], measurevar= "change" , groupvars =
c ("Exercise" , "Gender" ), na.rm = T)
Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
ggplot (change_summary, aes (x= Gender, shape = Exercise, color = Exercise,
y= change)) +
geom_point (size = 3 ) +
geom_line (aes (group= Exercise, linetype = Exercise), size= 2 ) +
geom_errorbar (aes (ymin= change- ci, ymax= change+ ci), size= 1.5 ) +
ylab ("Change in pulse \n (beats per minute)" ) +
scale_color_discrete (name = "Exercise level" )+
scale_shape_discrete (guide= FALSE )+
scale_linetype_discrete (guide= FALSE )+
ggtitle ("Change in pulse does \n not differ among groups" ) +
theme (axis.title.x = element_text (face= "bold" , size= 28 ),
axis.title.y = element_text (face= "bold" , size= 28 ),
axis.text.y = element_text (size= 20 ),
axis.text.x = element_text (size= 20 ),
legend.text = element_text (size= 20 ),
legend.title = element_text (size= 20 , face= "bold" ),
plot.title = element_text (hjust = 0.5 , face= "bold" , size= 32 ))
I first read in and recoded some data for ease and plotting. I then produced a plot to consider the null hypotheses that
exercise level does not influence change in pulse rate
gender does not influence change in pulse rate
gender and exercise level do not interact to influence change in pulse rate
exercise <- lm (change ~ Gender * Exercise, pulse[pulse$ Ran == 1 , ])
summary (exercise)
Call:
lm(formula = change ~ Gender * Exercise, data = pulse[pulse$Ran ==
1, ])
Residuals:
Min 1Q Median 3Q Max
-44.231 -11.300 1.769 10.083 48.444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.000 8.746 5.145 7.44e-06 ***
Genderfemale 35.000 23.139 1.513 0.1383
Exercisemoderate 9.231 10.573 0.873 0.3879
Exerciselow 12.400 12.972 0.956 0.3449
Genderfemale:Exercisemoderate -38.231 24.678 -1.549 0.1292
Genderfemale:Exerciselow -46.844 26.043 -1.799 0.0796 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.42 on 40 degrees of freedom
Multiple R-squared: 0.0828, Adjusted R-squared: -0.03185
F-statistic: 0.7222 on 5 and 40 DF, p-value: 0.6107
Anova (exercise, type = "III" )
Anova Table (Type III tests)
Response: change
Sum Sq Df F value Pr(>F)
(Intercept) 12150.0 1 26.4739 7.444e-06 ***
Gender 1050.0 1 2.2879 0.1383
Exercise 496.3 2 0.5407 0.5865
Gender:Exercise 1488.8 2 1.6220 0.2102
Residuals 18357.7 40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#rerun without interaction
exercise <- lm (change ~ Gender + Exercise, pulse[pulse$ Ran == 1 , ])
summary (exercise)
Call:
lm(formula = change ~ Gender + Exercise, data = pulse[pulse$Ran ==
1, ])
Residuals:
Min 1Q Median 3Q Max
-43.99 -14.89 2.23 11.91 45.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.391 8.273 6.091 2.94e-07 ***
Genderfemale -2.738 6.770 -0.404 0.688
Exercisemoderate 3.603 9.572 0.376 0.708
Exerciselow 1.155 10.617 0.109 0.914
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.74 on 42 degrees of freedom
Multiple R-squared: 0.008416, Adjusted R-squared: -0.06241
F-statistic: 0.1188 on 3 and 42 DF, p-value: 0.9485
Anova (exercise, type = "III" ) #no significance
Anova Table (Type III tests)
Response: change
Sum Sq Df F value Pr(>F)
(Intercept) 17532.0 1 37.1019 2.937e-07 ***
Gender 77.3 1 0.1636 0.6879
Exercise 97.1 2 0.1028 0.9025
Residuals 19846.5 42
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I then analyzed the data using factorial ANOVA. The outcome is continuous and both explanatory variables are categorical. The design is also fully randomized. Residual plots indicated all assumptions were met (there is no pattern in the residuals and they are normally distributed). Analysis shows an insignificant interaction (F2,40 =1.2, p=.21) between exercise level and gender, so I removed the interaction term. The reduced model showed neither gender (F1,42 =.16, p =.69) or exercise level (F2,42 =.1, p=.90) influenced change in pulse rate, so I failed to reject the related null hypotheses.
3
Data from Valdez et al 2023 is available @ https://docs.google.com/spreadsheets/d/e/2PACX-1vT2gaLu6pyRMlcbzarn3ej4bFmT_iHvrlNWJYSdrsLdUWIjcJi7rU11-ipvYpGnqD9qLDnbhNd2sDUW/pub?gid=1707080634&single=true&output=csv .
Import it into to R and
determine how the snail grazing and nitrogen levels impact number of flowering shoots(Shoot.density..m2)
construct a plot to showcase your analysis
valdez_2023 <- read.csv ("https://docs.google.com/spreadsheets/d/e/2PACX-1vT2gaLu6pyRMlcbzarn3ej4bFmT_iHvrlNWJYSdrsLdUWIjcJi7rU11-ipvYpGnqD9qLDnbhNd2sDUW/pub?gid=1707080634&single=true&output=csv" , stringsAsFactors = T)
shoot_model <- lm ( Shoot.density..m2~ Snail.Level + Nitrogen.level + Snail.Level: Nitrogen.level, valdez_2023[valdez_2023$ Snail.Level != "uncaged" ,])
plot (shoot_model)
Anova (shoot_model, type= "III" )
Anova Table (Type III tests)
Response: Shoot.density..m2
Sum Sq Df F value Pr(>F)
(Intercept) 196608 1 271.0588 1.334e-09 ***
Snail.Level 36238 2 24.9804 5.277e-05 ***
Nitrogen.level 4267 1 5.8824 0.0320 *
Snail.Level:Nitrogen.level 3100 2 2.1373 0.1607
Residuals 8704 12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
No interaction, so I can remove or not.
shoot_model_reduced <- update (shoot_model, .~ .- Snail.Level: Nitrogen.level)
Anova (shoot_model_reduced)
Anova Table (Type II tests)
Response: Shoot.density..m2
Sum Sq Df F value Pr(>F)
Snail.Level 45596 2 27.039 1.556e-05 ***
Nitrogen.level 11150 1 13.224 0.002696 **
Residuals 11804 14
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both have impacts.
snail_post_hoc <- glht (shoot_model_reduced, linfct = mcp (Snail.Level = "Tukey" ))
summary (snail_post_hoc)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = Shoot.density..m2 ~ Snail.Level + Nitrogen.level,
data = valdez_2023[valdez_2023$Snail.Level != "uncaged",
])
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
removal - control snails == 0 50.67 16.76 3.022 0.02306 *
snail addition - control snails == 0 -72.00 16.76 -4.295 0.00201 **
snail addition - removal == 0 -122.67 16.76 -7.317 < 0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
significant difference among all snails.
shoot_summary <- summarySE (valdez_2023[valdez_2023$ Snail.Level != "uncaged" ,], measurevar = "Shoot.density..m2" , groupvars = c ("Snail.Level" , "Nitrogen.level" ))
shoot_summary
Snail.Level Nitrogen.level N Shoot.density..m2 sd se
1 control snails Fertilized 3 256.0000 27.712813 16.000000
2 control snails without 3 202.6667 9.237604 5.333333
3 removal Fertilized 3 320.0000 16.000000 9.237604
4 removal without 3 240.0000 27.712813 16.000000
5 snail addition Fertilized 3 165.3333 36.950417 21.333333
6 snail addition without 3 149.3333 33.306656 19.229607
ci
1 68.84244
2 22.94748
3 39.74620
4 68.84244
5 91.78992
6 82.73832
shoot_summary$ Snail.Level <- relevel (shoot_summary$ Snail.Level, "removal" )
shoot_summary$ Snail.Level <- relevel (shoot_summary$ Snail.Level, "uncaged" )
ggplot (shoot_summary, aes (x= Snail.Level,
y= Shoot.density..m2,
fill= Nitrogen.level)) +
geom_col (color= "black" , position= position_dodge ()) +
geom_errorbar (aes (ymin= Shoot.density..m2, ymax= Shoot.density..m2+ ci), position = position_dodge ()) +
labs (title= "Grazing impacts depend on nitrogen levels" ,
x= "Grazing level" ,
y= expression (paste ("# of shoots/" ,m^ {- 2 })))
4
Find an example of a factorial ANOVA from a paper that is related to your work. Make sure you understand the connections between the methods, results, and graphs. Briefly answer the following questions
What was the dependent variable?
What were the independent variables?
Was the interaction significant?
If so, how did they interpret findings
If not, were the main effects significant?