<- data.frame(Survived = c("Y","N","Y", "N"),
everest Oxygen = c("Used", "Used", "Not used", "Not used"),
Number = c(1045, 32, 88, 8))
Compare proportions among groups
Remember you should
- add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
- render your file to produce a markdown version that you can see!
- save your work often
- commit it via git!
- push updates to github
Overview
This practice reviews the Compare means among groups lecture.
Examples
Issue is we often get data in spreadsheet format (expanded/long or wide/summarized, each shown below), but we need to get a vector or matrix for chisq.test and related functions.
The data
Following the Everest example from class. Assume data is in a dataframe where each row is a group data point.
Assume data is in a dataframe where each row is an individual data point.
library(mirt)
Loading required package: stats4
Loading required package: lattice
<- expand.table(everest) everest_expand
Tests
First, let’s ask if the same amount of people used or did not use oxygen. WE can use the table command to summarize. Note the chisq.test, by default, assumes each group is equally likely!
table(everest_expand$Oxygen)
Not used Used
96 1077
chisq.test(table(everest_expand$Oxygen))
Chi-squared test for given probabilities
data: table(everest_expand$Oxygen)
X-squared = 820.43, df = 1, p-value < 2.2e-16
Dong this with summarized data is actually harder
aggregate(Number~Oxygen, everest, sum)$Number
[1] 96 1077
chisq.test(aggregate(Number~Oxygen, everest, sum)$Number)
Chi-squared test for given probabilities
data: aggregate(Number ~ Oxygen, everest, sum)$Number
X-squared = 820.43, df = 1, p-value < 2.2e-16
But this is better!
binom.test(table(everest_expand$Oxygen))
Exact binomial test
data: table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.06679216 0.09902483
sample estimates:
probability of success
0.08184143
What if we wanted to compare to past years where 10% of climbers did not use oxygen? Note table function splits into alphabetical order.
binom.test(table(everest_expand$Oxygen), p=.1)
Exact binomial test
data: table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value = 0.04075
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
0.06679216 0.09902483
sample estimates:
probability of success
0.08184143
What if we want to determine if using oxygen impacts surival?
chisq.test(table(everest_expand$Oxygen, everest_expand$Survived))
Warning in chisq.test(table(everest_expand$Oxygen, everest_expand$Survived)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: table(everest_expand$Oxygen, everest_expand$Survived)
X-squared = 6.1524, df = 1, p-value = 0.01312
Issue (which we’ll address), but note same as
chisq.test(table(everest_expand$Survived, everest_expand$Oxygen))
Warning in chisq.test(table(everest_expand$Survived, everest_expand$Oxygen)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: table(everest_expand$Survived, everest_expand$Oxygen)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
Key is first argument must be all the info. This is different from (incorrect) approach like
chisq.test(everest$Survived,everest$Oxygen)
Warning in chisq.test(everest$Survived, everest$Oxygen): Chi-squared
approximation may be incorrect
Pearson's Chi-squared test
data: everest$Survived and everest$Oxygen
X-squared = 0, df = 1, p-value = 1
This is comparing split among Survived and not to split (expected) using Oxygen!
So order has minimal input with 2 groups. Other test options necessitated by the warning
fisher.test(table(everest_expand$Oxygen, everest_expand$Survived))
Fisher's Exact Test for Count Data
data: table(everest_expand$Oxygen, everest_expand$Survived)
p-value = 0.01284
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.144791 6.826869
sample estimates:
odds ratio
2.964765
library(DescTools)
GTest(table(everest_expand$Oxygen, everest_expand$Survived))
Log likelihood ratio (G-test) test of independence without correction
data: table(everest_expand$Oxygen, everest_expand$Survived)
G = 5.7466, X-squared df = 1, p-value = 0.01652
What if we added another group? Like Enriched, Regular, None for oxygen.
<- data.frame(Survived = c("Y","N","Y", "N", "Y", "N"),
everest_enriched Oxygen = c("Regular", "Regular", "None", "None", rep("Enriched", 2)),
Number = c(1045, 32, 88, 8, 15, 2))
<- expand.table(everest_enriched) everest_enriched_expand
Now we compare
table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
Enriched None Regular
N 2 8 32
Y 15 88 1045
chisq.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(table(everest_enriched_expand$Survived,
everest_enriched_expand$Oxygen)): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
X-squared = 10.879, df = 2, p-value = 0.004343
Fisher again due to size
fisher.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Fisher's Exact Test for Count Data
data: table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
p-value = 0.00586
alternative hypothesis: two.sided
Now we follow-up, and rows/columns matter. Note default is row and fdr method. I order results for ease of view
library(rcompanion)
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)) everest_expand_correct_fdr
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq p.adj.Chisq
1 N : Y 0.00586 0.00586 0.0189 0.0189 0.00434 0.00434
Not quite what we wanted. How about
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
everest_expand_correct_fdr compare = "col")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3 None : Regular 0.0128 0.0384 0.0165 0.0495 0.0131
2 Enriched : Regular 0.0953 0.1430 0.1080 0.1620 0.1710
1 Enriched : None 0.6450 0.6450 0.6580 0.6580 1.0000
p.adj.Chisq
3 0.0393
2 0.2560
1 1.0000
and you can change methods
<- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
everest_expand_correct_fdr compare = "col",
method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
order(everest_expand_correct_fdr$p.adj.Fisher),] everest_expand_correct_fdr[
Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3 None : Regular 0.0128 0.0384 0.0165 0.0495 0.0131
2 Enriched : Regular 0.0953 0.1910 0.1080 0.2160 0.1710
1 Enriched : None 0.6450 0.6450 0.6580 0.6580 1.0000
p.adj.Chisq
3 0.0393
2 0.3420
1 1.0000
To put in manually, we need a few extra things
<- as.table(matrix(c(2,8,32,15,88,1045), nrow = 2, byrow = T))
everest_table rownames(everest_table) = c("N", "Y")
colnames(everest_table) = c("Enriched", "None", "Regular")
everest_table
Enriched None Regular
N 2 8 32
Y 15 88 1045
Swirl lesson
Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.
THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN! FOR THIS REASONS, THE CODE BELOW IS DISPLAYED BUT NOT RUN WHEN THIS FILE IS RENDERED
install the “swirl” package
run the following code once on the computer to install a new course
library(swirl) install_course_github("jsgosnell", "JSG_swirl_lessons")
start swirl!
swirl()
- swirl()
then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
- Here we will focus on the Compare proportions among groups lesson
TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by
exiting swirl using escape key or bye() function
bye()
uninstalling and reinstalling courses
uninstall_all_courses() install_course_github("jsgosnell", "JSG_swirl_lessons")
when you restart swirl with swirl(), you may need to select
- No. Let me start something new
Let’s practice
Heart attacks
1
Let’s look at some heart attack data. Read in the data using
<- read.table("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/heartatk4R.txt",header=T, stringsAsFactors = T) heart_attacks
Every entry is a person that has suffered a heart attack. More information on the dataset can be found at
http://statland.org/Software_Help/DataDesk/datafile.htm
We want to again test if heart attacks occur equally across genders.
- What if we know that males actually make up 50.8% of the population?
2
Still using the heart attack data, is survival independent of gender?
3
For people that have a heart attack before they turn 30, is survival independent of gender?
Dolphins
4
Data on dolphin behavior was collected off the coast of Iceland. Data is @
http://www.statsci.org/data/general/dolpacti.txt
Since this is a .txt file, not a .csv, you’ll need to use something like
<- read.table("http://www.statsci.org/data/general/dolpacti.txt", sep="", header = T, stringsAsFactors = T) dolphin
More info on data @
http://www.statsci.org/data/general/dolpacti.html
Is travelling independent of time of day? You’ll need to consider traveling vs not traveling due to different number of groups observed in each period. Carry out post-hoc tests if needed.
Smoking
5
Use data on smoking and exercise from
http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence
to determine if smoking is independent of exercise. You’ll need to input data manually. Carry out post-hoc tests if needed.