Compare proportions among groups

Remember you should

Overview

This practice reviews the Compare means among groups lecture.

Examples

Issue is we often get data in spreadsheet format (expanded/long or wide/summarized, each shown below), but we need to get a vector or matrix for chisq.test and related functions.

The data

Following the Everest example from class. Assume data is in a dataframe where each row is a group data point.

everest <- data.frame(Survived = c("Y","N","Y", "N"),
                      Oxygen = c("Used", "Used", "Not used", "Not used"),
                      Number = c(1045, 32, 88, 8))

Assume data is in a dataframe where each row is an individual data point.

library(mirt)
Loading required package: stats4
Loading required package: lattice
everest_expand <- expand.table(everest)

Tests

First, let’s ask if the same amount of people used or did not use oxygen. WE can use the table command to summarize. Note the chisq.test, by default, assumes each group is equally likely!

table(everest_expand$Oxygen)

Not used     Used 
      96     1077 
chisq.test(table(everest_expand$Oxygen)) 

    Chi-squared test for given probabilities

data:  table(everest_expand$Oxygen)
X-squared = 820.43, df = 1, p-value < 2.2e-16

Dong this with summarized data is actually harder

aggregate(Number~Oxygen, everest, sum)$Number
[1]   96 1077
chisq.test(aggregate(Number~Oxygen, everest, sum)$Number) 

    Chi-squared test for given probabilities

data:  aggregate(Number ~ Oxygen, everest, sum)$Number
X-squared = 820.43, df = 1, p-value < 2.2e-16

But this is better!

binom.test(table(everest_expand$Oxygen))

    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143 

What if we wanted to compare to past years where 10% of climbers did not use oxygen? Note table function splits into alphabetical order.

binom.test(table(everest_expand$Oxygen), p=.1)

    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value = 0.04075
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143 

What if we want to determine if using oxygen impacts surival?

chisq.test(table(everest_expand$Oxygen, everest_expand$Survived))
Warning in chisq.test(table(everest_expand$Oxygen, everest_expand$Survived)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
X-squared = 6.1524, df = 1, p-value = 0.01312

Issue (which we’ll address), but note same as

chisq.test(table(everest_expand$Survived, everest_expand$Oxygen))
Warning in chisq.test(table(everest_expand$Survived, everest_expand$Oxygen)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Survived, everest_expand$Oxygen)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 32, 88,  8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312

Key is first argument must be all the info. This is different from (incorrect) approach like

chisq.test(everest$Survived,everest$Oxygen)
Warning in chisq.test(everest$Survived, everest$Oxygen): Chi-squared
approximation may be incorrect

    Pearson's Chi-squared test

data:  everest$Survived and everest$Oxygen
X-squared = 0, df = 1, p-value = 1

This is comparing split among Survived and not to split (expected) using Oxygen!

So order has minimal input with 2 groups. Other test options necessitated by the warning

fisher.test(table(everest_expand$Oxygen, everest_expand$Survived))

    Fisher's Exact Test for Count Data

data:  table(everest_expand$Oxygen, everest_expand$Survived)
p-value = 0.01284
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.144791 6.826869
sample estimates:
odds ratio 
  2.964765 
library(DescTools)
GTest(table(everest_expand$Oxygen, everest_expand$Survived))

    Log likelihood ratio (G-test) test of independence without correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
G = 5.7466, X-squared df = 1, p-value = 0.01652

What if we added another group? Like Enriched, Regular, None for oxygen.

everest_enriched <- data.frame(Survived = c("Y","N","Y", "N", "Y", "N"),
                      Oxygen = c("Regular", "Regular", "None", "None", rep("Enriched", 2)),
                      Number = c(1045, 32, 88, 8, 15, 2))
everest_enriched_expand <- expand.table(everest_enriched)

Now we compare

table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
   
    Enriched None Regular
  N        2    8      32
  Y       15   88    1045
chisq.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(table(everest_enriched_expand$Survived,
everest_enriched_expand$Oxygen)): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
X-squared = 10.879, df = 2, p-value = 0.004343

Fisher again due to size

fisher.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))

    Fisher's Exact Test for Count Data

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
p-value = 0.00586
alternative hypothesis: two.sided

Now we follow-up, and rows/columns matter. Note default is row and fdr method. I order results for ease of view

library(rcompanion)
everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
  Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq p.adj.Chisq
1      N : Y  0.00586      0.00586  0.0189      0.0189 0.00434     0.00434

Not quite what we wanted. How about

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1430  0.1080      0.1620  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.2560
1      1.0000

and you can change methods

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col",
                                                          method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1910  0.1080      0.2160  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.3420
1      1.0000

To put in manually, we need a few extra things

everest_table <- as.table(matrix(c(2,8,32,15,88,1045), nrow = 2, byrow = T))
rownames(everest_table) = c("N", "Y")
colnames(everest_table) = c("Enriched", "None", "Regular")
everest_table
  Enriched None Regular
N        2    8      32
Y       15   88    1045

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN! FOR THIS REASONS, THE CODE BELOW IS DISPLAYED BUT NOT RUN WHEN THIS FILE IS RENDERED

  • install the “swirl” package

  • run the following code once on the computer to install a new course

    library(swirl)
    install_course_github("jsgosnell", "JSG_swirl_lessons")
  • start swirl!

    swirl()
    • swirl()
  • then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want

    • Here we will focus on the Compare proportions among groups lesson
  • TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by

    • exiting swirl using escape key or bye() function

      bye()
    • uninstalling and reinstalling courses

      uninstall_all_courses()
      install_course_github("jsgosnell", "JSG_swirl_lessons")
    • when you restart swirl with swirl(), you may need to select

      • No. Let me start something new

Let’s practice

Heart attacks

1

Let’s look at some heart attack data. Read in the data using

heart_attacks <- read.table("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/heartatk4R.txt",header=T, stringsAsFactors = T)

Every entry is a person that has suffered a heart attack. More information on the dataset can be found at

http://statland.org/Software_Help/DataDesk/datafile.htm

We want to again test if heart attacks occur equally across genders.

table(heart_attacks$SEX)

   F    M 
5065 7779 
binom.test(7779, 7779+5065)

    Exact binomial test

data:  7779 and 7779 + 5065
number of successes = 7779, number of trials = 12844, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5971385 0.6141184
sample estimates:
probability of success 
             0.6056524 

If I assume males compose 50% of the population, I can test the null hypothesis that 50% of heart attacks occur in males using a binom.test to conduct a binomial test. I used the table command to determine the number of heart attacks in males and females and then used binom.test. The alternative is less than or greater than 50% of hear attacks occur in males. With a p-value of <.001, I reject the null hypothesis. Data suggest that males are more likely to have heart attacks. Note this is better than

chisq.test(table(heart_attacks$SEX), p=c(.50, .50))

    Chi-squared test for given probabilities

data:  table(heart_attacks$SEX)
X-squared = 573.48, df = 1, p-value < 2.2e-16

which is an approximate test.

  • What if we know that males actually make up 50.8% of the population?
table(heart_attacks$SEX)

   F    M 
5065 7779 
binom.test(7779, 7779+5065, .508)

    Exact binomial test

data:  7779 and 7779 + 5065
number of successes = 7779, number of trials = 12844, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.508
95 percent confidence interval:
 0.5971385 0.6141184
sample estimates:
probability of success 
             0.6056524 

Note I can amend the test proportion as noted here. Results do not change.

2

Still using the heart attack data, is survival independent of gender?

#note what this does
table(heart_attacks$SEX, heart_attacks$DIED)
   
       0    1
  F 4298  767
  M 7136  643
#then feed it to chisq.test (notice order here does not matter for 2x2 table)
chisq.test(table(heart_attacks$SEX, heart_attacks$DIED))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(heart_attacks$SEX, heart_attacks$DIED)
X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(table(heart_attacks$DIED, heart_attacks$SEX))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(heart_attacks$DIED, heart_attacks$SEX)
X-squared = 147.76, df = 1, p-value < 2.2e-16

I used a chi2 to consider if survival was independent of sex. Our null hypothesis is that survival does not differ based on sex. the alternative is that it does. I found a chi21=147.76, which corresponds to a p-value of <.001, so i reject the null hypothesis.

3

For people that have a heart attack before they turn 30, is survival independent of gender?

chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], 
                 heart_attacks[heart_attacks$AGE <30, "DIED"]))
Warning in chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], :
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(heart_attacks[heart_attacks$AGE < 30, "SEX"], heart_attacks[heart_attacks$AGE <     30, "DIED"])
X-squared = 3.2597e-30, df = 1, p-value = 1
#note warning on approximation, so check it
chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], 
                 heart_attacks[heart_attacks$AGE <30, "DIED"]))$expected
Warning in chisq.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], :
Chi-squared approximation may be incorrect
   
       0   1
  F  7.8 0.2
  M 31.2 0.8
#several <1, so use fisher.test
fisher.test(table(heart_attacks[heart_attacks$AGE < 30, "SEX"], 
                  heart_attacks[heart_attacks$AGE <30, "DIED"]))

    Fisher's Exact Test for Count Data

data:  table(heart_attacks[heart_attacks$AGE < 30, "SEX"], heart_attacks[heart_attacks$AGE < 30, "DIED"])
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.006425781         Inf
sample estimates:
odds ratio 
       Inf 
#so if you are young, no difference

I amended the previous question’s code to only focus on individuals who were under the age of 30 when they suffered a heart. Otherwise the hypotheses remain the same. I attempted to use a chi2 test but was warned the approximation may be incorrect. Remember that no cells can have expected values of <1 and <20% should have expected values <5. Upon checking 2 cells have expected values <1, so I instead used Fisher’s Test.
I found a p-value of 1, thus I fail to reject the null hypothesis.

Dolphins

4

Data on dolphin behavior was collected off the coast of Iceland. Data is @

http://www.statsci.org/data/general/dolpacti.txt

Since this is a .txt file, not a .csv, you’ll need to use something like

dolphin <- read.table("http://www.statsci.org/data/general/dolpacti.txt", sep="", header = T, stringsAsFactors = T)

More info on data @

http://www.statsci.org/data/general/dolpacti.html

Is traveling independent of time of day? You’ll need to consider traveling vs not traveling due to different number of groups observed in each period. Carry out post-hoc tests if needed.

I looked at the data and then just made a table manually

dolphin
   Activity    Period Groups
1    Travel   Morning      6
2      Feed   Morning     28
3    Social   Morning     38
4    Travel      Noon      6
5      Feed      Noon      4
6    Social      Noon      5
7    Travel Afternoon     14
8      Feed Afternoon      0
9    Social Afternoon      9
10   Travel   Evening     13
11     Feed   Evening     56
12   Social   Evening     10
travel_table <- as.table(matrix(c(6, 28+ 38, 6, 9, 14, 9, 13, 66), nrow = 4, byrow = T))
#Adding in row and column names will make everything easier to read at end.
colnames(travel_table) = c("travel", "not_travel")
rownames(travel_table) = c("morning", "noon", "afternoon", "night")
#now look at it
travel_table
          travel not_travel
morning        6         66
noon           6          9
afternoon     14          9
night         13         66
chisq.test(travel_table)
Warning in chisq.test(travel_table): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  travel_table
X-squared = 33.665, df = 3, p-value = 2.331e-07
#check outcome given warning
chisq.test(travel_table)$expected
Warning in chisq.test(travel_table): Chi-squared approximation may be incorrect
             travel not_travel
morning   14.857143   57.14286
noon       3.095238   11.90476
afternoon  4.746032   18.25397
night     16.301587   62.69841
fisher.test(travel_table)

    Fisher's Exact Test for Count Data

data:  travel_table
p-value = 9.192e-07
alternative hypothesis: two.sided
library(rcompanion)
pairwiseNominalIndependence(travel_table, compare = "row", method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
           Comparison p.Fisher p.adj.Fisher  p.Gtest p.adj.Gtest  p.Chisq
1      morning : noon 4.96e-03     1.98e-02 3.94e-03    1.58e-02 4.74e-03
2 morning : afternoon 7.88e-07     4.73e-06 4.01e-07    2.41e-06 3.65e-07
3     morning : night 1.49e-01     2.98e-01 1.28e-01    2.56e-01 2.09e-01
4    noon : afternoon 3.20e-01     3.20e-01 2.07e-01    2.56e-01 3.54e-01
5        noon : night 7.22e-02     2.17e-01 5.16e-02    1.55e-01 8.35e-02
6   afternoon : night 6.83e-05     3.42e-04 4.98e-05    2.49e-04 6.88e-05
  p.adj.Chisq
1    1.90e-02
2    2.19e-06
3    4.18e-01
4    4.18e-01
5    2.50e-01
6    3.44e-04

I tested the null hypothesis that traveling is independent of time of day (compared to the alternative hypothesis that it is not, and thus differs aross time periods) using chi2 test. However, a warning and subsequent check indicated too many cells had low expected values, so I instead used a Fisher’s test. A resulting p-value of <.001 led me reject the null hypothesis. Since I was comparing more than two groups, I used a post-hoc test to see which periods were different and found that travel differed between morning and noon, morning and afternoon, and afternoon and night (using the p.adj.Fisher column).

Smoking

5

Use data on smoking and exercise from

http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence

to determine if smoking is independent of exercise. You’ll need to input data manually. Carry out post-hoc tests if needed.

I created a table from the data and tested it using a chi2 test to determine if smoking was independent of exercise (null hypothesis) or differed based on exercise levels (alternative). However, a warning led me to see the expected cell values were too small, so I instead used a Fisher’s test.

smoke <- chisq.test(matrix(c(7, 1, 3, #spacing just for visual use
                             87,18,84,
                             12,3,4,
                             9,1,7), nrow = 4, byrow = T))
Warning in chisq.test(matrix(c(7, 1, 3, 87, 18, 84, 12, 3, 4, 9, 1, 7), :
Chi-squared approximation may be incorrect
smoke$expected #too small!
          [,1]      [,2]      [,3]
[1,]  5.360169  1.072034  4.567797
[2,] 92.097458 18.419492 78.483051
[3,]  9.258475  1.851695  7.889831
[4,]  8.283898  1.656780  7.059322
fisher.test(matrix(c(7, 1, 3, #spacing just for visuals
                     87,18,84,
                     12,3,4,
                     9,1,7), nrow = 4, byrow = T))

    Fisher's Exact Test for Count Data

data:  matrix(c(7, 1, 3, 87, 18, 84, 12, 3, 4, 9, 1, 7), nrow = 4, byrow = T)
p-value = 0.4138
alternative hypothesis: two.sided

A p-value of .4138 meant I failed to reject the null hypothesis.