<- data.frame(Survived = c("Y","N","Y", "N"),
everest Oxygen = c("Used", "Used", "Not used", "Not used"),
Number = c(1045, 32, 88, 8))
Compare proportions among groups
Remember you should
- add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
- render your file to produce a markdown version that you can see!
- save your work often
- commit it via git!
- push updates to github
Overview
This practice reviews the Compare means among groups lecture.
Examples
Issue is we often get data in spreadsheet format (expanded/long or wide/summarized, each shown below), but we need to get a vector or matrix for chisq.test and related functions.
The data
Following the Everest example from class. Assume data is in a dataframe where each row is a group data point.
Assume data is in a dataframe where each row is an individual data point.
library(mirt)
Loading required package: stats4
Loading required package: lattice
<- expand.table(everest) everest_expand
Tests
First, let’s ask if the same amount of people used or did not use oxygen. WE can use the table command to summarize. Note the chisq.test, by default, assumes each group is equally likely!
table(everest_expand$Oxygen)
Not used Used
96 1077
chisq.test(table(everest_expand$Oxygen))
Chi-squared test for given probabilities
data: table(everest_expand$Oxygen)
X-squared = 820.43, df = 1, p-value < 2.2e-16
Dong this with summarized data is actually harder
aggregate(Number~Oxygen, everest, sum)$Number
[1] 96 1077
chisq.test(aggregate(Number~Oxygen, everest, sum)$Number)
Chi-squared test for given probabilities
data: aggregate(Number ~ Oxygen, everest, sum)$Number
X-squared = 820.43, df = 1, p-value < 2.2e-16
But this is better!
binom.test(table(everest_expand$Oxygen))
Exact binomial test
data: table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.06679216 0.09902483
sample estimates:
probability of success
0.08184143
What if we wanted to compare to past years where 10% of climbers did not use oxygen? Note table function splits into alphabetical order.
binom.test(table(everest_expand$Oxygen), p=.1)
Exact binomial test
data: table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value = 0.04075
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
0.06679216 0.09902483
sample estimates:
probability of success
0.08184143
What if we want to determine if using oxygen impacts surival?
chisq.test(table(everest_expand$Oxygen, everest_expand$Survived))
Warning in chisq.test(table(everest_expand$Oxygen, everest_expand$Survived)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: table(everest_expand$Oxygen, everest_expand$Survived)
X-squared = 6.1524, df = 1, p-value = 0.01312
Issue (which we’ll address), but note same as
chisq.test(table(everest_expand$Survived, everest_expand$Oxygen))
Warning in chisq.test(table(everest_expand$Survived, everest_expand$Oxygen)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: table(everest_expand$Survived, everest_expand$Oxygen)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
Key is first argument must be all the info. This is different from (incorrect) approach like
chisq.test(everest$Survived,everest$Oxygen)
Warning in chisq.test(everest$Survived, everest$Oxygen): Chi-squared
approximation may be incorrect
Pearson's Chi-squared test
data: everest$Survived and everest$Oxygen
X-squared = 0, df = 1, p-value = 1
This is comparing split among Survived and not to split (expected) using Oxygen!
So order has minimal input with 2 groups. Other test options necessitated by the warning
fisher.test(table(everest_expand$Oxygen, everest_expand$Survived))
Fisher's Exact Test for Count Data
data: table(everest_expand$Oxygen, everest_expand$Survived)
p-value = 0.01284
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.144791 6.826869
sample estimates:
odds ratio
2.964765
library(DescTools)
GTest(table(everest_expand$Oxygen, everest_expand$Survived))
Log likelihood ratio (G-test) test of independence without correction
data: table(everest_expand$Oxygen, everest_expand$Survived)
G = 5.7466, X-squared df = 1, p-value = 0.01652
What if we added another group? Like Enriched, Regular, None for oxygen.
<- data.frame(Survived = c("Y","N","Y", "N", "Y", "N"),
everest_enriched Oxygen = c("Regular", "Regular", "None", "None", rep("Enriched", 2)),
Number = c(1045, 32, 88, 8, 15, 2))
<- expand.table(everest_enriched) everest_enriched_expand
Now we compare
table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
Enriched None Regular
N 2 8 32
Y 15 88 1045
chisq.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(table(everest_enriched_expand$Survived,
everest_enriched_expand$Oxygen)): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
X-squared = 10.879, df = 2, p-value = 0.004343
Fisher again due to size
fisher.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Fisher's Exact Test for Count Data
data: table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
p-value = 0.00586
alternative hypothesis: two.sided
Now we follow-up, and rows/columns matter. Note default is row and fdr method. I order results for ease of view
library(rcompanion)
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)) everest_expand_correct_fdr
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq p.adj.Chisq
1 N : Y 0.00586 0.00586 0.0189 0.0189 0.00434 0.00434
Not quite what we wanted. How about
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
everest_expand_correct_fdr compare = "col")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3 None : Regular 0.0128 0.0384 0.0165 0.0495 0.0131
2 Enriched : Regular 0.0953 0.1430 0.1080 0.1620 0.1710
1 Enriched : None 0.6450 0.6450 0.6580 0.6580 1.0000
p.adj.Chisq
3 0.0393
2 0.2560
1 1.0000
and you can change methods
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
everest_expand_correct_fdr compare = "col",
method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3 None : Regular 0.0128 0.0384 0.0165 0.0495 0.0131
2 Enriched : Regular 0.0953 0.1910 0.1080 0.2160 0.1710
1 Enriched : None 0.6450 0.6450 0.6580 0.6580 1.0000
p.adj.Chisq
3 0.0393
2 0.3420
1 1.0000
To put in manually, we need a few extra things
<- as.table(matrix(c(2,8,32,15,88,1045), nrow = 2, byrow = T))
everest_table rownames(everest_table) = c("N", "Y")
colnames(everest_table) = c("Enriched", "None", "Regular")
everest_table
Enriched None Regular
N 2 8 32
Y 15 88 1045
Swirl lesson
Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.
THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN! FOR THIS REASONS, THE CODE BELOW IS DISPLAYED BUT NOT RUN WHEN THIS FILE IS RENDERED
install the “swirl” package
run the following code once on the computer to install a new course
library(swirl) install_course_github("jsgosnell", "JSG_swirl_lessons")
start swirl!
swirl()
- swirl()
then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
- Here we will focus on the Compare proportions among groups lesson
TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by
exiting swirl using escape key or bye() function
bye()
uninstalling and reinstalling courses
uninstall_all_courses() install_course_github("jsgosnell", "JSG_swirl_lessons")
when you restart swirl with swirl(), you may need to select
- No. Let me start something new
Let’s practice
Heart attacks
1
Let’s look at some heart attack data. Read in the data using
<- read.table("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/heartatk4R.txt",header=T, stringsAsFactors = T) heart_attacks
Every entry is a person that has suffered a heart attack. More information on the dataset can be found at
http://statland.org/Software_Help/DataDesk/datafile.htm
We want to again test if heart attacks occur equally across genders.
table(heart_attacks$SEX)
F M
5065 7779
binom.test(7779, 7779+5065)
Exact binomial test
data: 7779 and 7779 + 5065
number of successes = 7779, number of trials = 12844, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5971385 0.6141184
sample estimates:
probability of success
0.6056524
If I assume males compose 50% of the population, I can test the null hypothesis that 50% of heart attacks occur in males using a binom.test to conduct a binomial test. I used the table command to determine the number of heart attacks in males and females and then used binom.test. The alternative is less than or greater than 50% of hear attacks occur in males. With a p-value of <.001, I reject the null hypothesis. Data suggest that males are more likely to have heart attacks. Note this is better than
chisq.test(table(heart_attacks$SEX), p=c(.50, .50))
Chi-squared test for given probabilities
data: table(heart_attacks$SEX)
X-squared = 573.48, df = 1, p-value < 2.2e-16
which is an approximate test.
- What if we know that males actually make up 50.8% of the population?
table(heart_attacks$SEX)
F M
5065 7779
binom.test(7779, 7779+5065, .508)
Exact binomial test
data: 7779 and 7779 + 5065
number of successes = 7779, number of trials = 12844, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.508
95 percent confidence interval:
0.5971385 0.6141184
sample estimates:
probability of success
0.6056524
Note I can amend the test proportion as noted here. Results do not change.
2
Still using the heart attack data, is survival independent of gender?
#note what this does
table(heart_attacks$SEX, heart_attacks$DIED)
0 1
F 4298 767
M 7136 643
#then feed it to chisq.test (notice order here does not matter for 2x2 table)
chisq.test(table(heart_attacks$SEX, heart_attacks$DIED))
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_attacks$SEX, heart_attacks$DIED)
X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(table(heart_attacks$DIED, heart_attacks$SEX))
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_attacks$DIED, heart_attacks$SEX)
X-squared = 147.76, df = 1, p-value < 2.2e-16
I used a chi2 to consider if survival was independent of sex. Our null hypothesis is that survival does not differ based on sex. the alternative is that it does. I found a chi21=147.76, which corresponds to a p-value of <.001, so i reject the null hypothesis.
3
For people that have a heart attack before they turn 30, is survival independent of gender?
chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"],
$AGE <30, "DIED"])) heart_attacks[heart_attacks
Warning in chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], :
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_attacks[heart_attacks$AGE < 30, "SEX"], heart_attacks[heart_attacks$AGE < 30, "DIED"])
X-squared = 3.2597e-30, df = 1, p-value = 1
#note warning on approximation, so check it
chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"],
$AGE <30, "DIED"]))$expected heart_attacks[heart_attacks
Warning in chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], :
Chi-squared approximation may be incorrect
0 1
F 7.8 0.2
M 31.2 0.8
#several <1, so use fisher.test
fisher.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"],
$AGE <30, "DIED"])) heart_attacks[heart_attacks
Fisher's Exact Test for Count Data
data: table(heart_attacks[heart_attacks$AGE < 30, "SEX"], heart_attacks[heart_attacks$AGE < 30, "DIED"])
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.006425781 Inf
sample estimates:
odds ratio
Inf
#so if you are young, no difference
I amended the previous question’s code to only focus on individuals who were under the age of 30 when they suffered a heart. Otherwise the hypotheses remain the same. I attempted to use a chi2 test but was warned the approximation may be incorrect. Remember that no cells can have expected values of <1 and <20% should have expected values <5. Upon checking 2 cells have expected values <1, so I instead used Fisher’s Test.
I found a p-value of 1, thus I fail to reject the null hypothesis.
Dolphins
4
Data on dolphin behavior was collected off the coast of Iceland. Data is @
http://www.statsci.org/data/general/dolpacti.txt
Since this is a .txt file, not a .csv, you’ll need to use something like
<- read.table("http://www.statsci.org/data/general/dolpacti.txt", sep="", header = T, stringsAsFactors = T) dolphin
More info on data @
http://www.statsci.org/data/general/dolpacti.html
Is traveling independent of time of day? You’ll need to consider traveling vs not traveling due to different number of groups observed in each period. Carry out post-hoc tests if needed.
I looked at the data and then just made a table manually
dolphin
Activity Period Groups
1 Travel Morning 6
2 Feed Morning 28
3 Social Morning 38
4 Travel Noon 6
5 Feed Noon 4
6 Social Noon 5
7 Travel Afternoon 14
8 Feed Afternoon 0
9 Social Afternoon 9
10 Travel Evening 13
11 Feed Evening 56
12 Social Evening 10
<- as.table(matrix(c(6, 28+ 38, 6, 9, 14, 9, 13, 66), nrow = 4, byrow = T))
travel_table #Adding in row and column names will make everything easier to read at end.
colnames(travel_table) = c("travel", "not_travel")
rownames(travel_table) = c("morning", "noon", "afternoon", "night")
#now look at it
travel_table
travel not_travel
morning 6 66
noon 6 9
afternoon 14 9
night 13 66
chisq.test(travel_table)
Warning in chisq.test(travel_table): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: travel_table
X-squared = 33.665, df = 3, p-value = 2.331e-07
#check outcome given warning
chisq.test(travel_table)$expected
Warning in chisq.test(travel_table): Chi-squared approximation may be incorrect
travel not_travel
morning 14.857143 57.14286
noon 3.095238 11.90476
afternoon 4.746032 18.25397
night 16.301587 62.69841
fisher.test(travel_table)
Fisher's Exact Test for Count Data
data: travel_table
p-value = 9.192e-07
alternative hypothesis: two.sided
library(rcompanion)
pairwiseNominalIndependence(travel_table, compare = "row", method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
1 morning : noon 4.96e-03 1.98e-02 3.94e-03 1.58e-02 4.74e-03
2 morning : afternoon 7.88e-07 4.73e-06 4.01e-07 2.41e-06 3.65e-07
3 morning : night 1.49e-01 2.98e-01 1.28e-01 2.56e-01 2.09e-01
4 noon : afternoon 3.20e-01 3.20e-01 2.07e-01 2.56e-01 3.54e-01
5 noon : night 7.22e-02 2.17e-01 5.16e-02 1.55e-01 8.35e-02
6 afternoon : night 6.83e-05 3.42e-04 4.98e-05 2.49e-04 6.88e-05
p.adj.Chisq
1 1.90e-02
2 2.19e-06
3 4.18e-01
4 4.18e-01
5 2.50e-01
6 3.44e-04
I tested the null hypothesis that traveling is independent of time of day (compared to the alternative hypothesis that it is not, and thus differs aross time periods) using chi2 test. However, a warning and subsequent check indicated too many cells had low expected values, so I instead used a Fisher’s test. A resulting p-value of <.001 led me reject the null hypothesis. Since I was comparing more than two groups, I used a post-hoc test to see which periods were different and found that travel differed between morning and noon, morning and afternoon, and afternoon and night (using the p.adj.Fisher column).
Smoking
5
Use data on smoking and exercise from
http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence
to determine if smoking is independent of exercise. You’ll need to input data manually. Carry out post-hoc tests if needed.
I created a table from the data and tested it using a chi2 test to determine if smoking was independent of exercise (null hypothesis) or differed based on exercise levels (alternative). However, a warning led me to see the expected cell values were too small, so I instead used a Fisher’s test.
<- chisq.test(matrix(c(7, 1, 3, #spacing just for visual use
smoke 87,18,84,
12,3,4,
9,1,7), nrow = 4, byrow = T))
Warning in chisq.test(matrix(c(7, 1, 3, 87, 18, 84, 12, 3, 4, 9, 1, 7), :
Chi-squared approximation may be incorrect
$expected #too small! smoke
[,1] [,2] [,3]
[1,] 5.360169 1.072034 4.567797
[2,] 92.097458 18.419492 78.483051
[3,] 9.258475 1.851695 7.889831
[4,] 8.283898 1.656780 7.059322
fisher.test(matrix(c(7, 1, 3, #spacing just for visuals
87,18,84,
12,3,4,
9,1,7), nrow = 4, byrow = T))
Fisher's Exact Test for Count Data
data: matrix(c(7, 1, 3, 87, 18, 84, 12, 3, 4, 9, 1, 7), nrow = 4, byrow = T)
p-value = 0.4138
alternative hypothesis: two.sided
A p-value of .4138 meant I failed to reject the null hypothesis.