Compare proportions among groups

Remember you should

Overview

This practice reviews the Compare means among groups lecture.

Examples

Issue is we often get data in spreadsheet format (expanded/long or wide/summarized, each shown below), but we need to get a vector or matrix for chisq.test and related functions.

The data

Following the Everest example from class. Assume data is in a dataframe where each row is a group data point.

everest <- data.frame(Survived = c("Y","N","Y", "N"),
                      Oxygen = c("Used", "Used", "Not used", "Not used"),
                      Number = c(1045, 32, 88, 8))

Assume data is in a dataframe where each row is an individual data point.

library(mirt)
Loading required package: stats4
Loading required package: lattice
everest_expand <- expand.table(everest)

Tests

First, let’s ask if the same amount of people used or did not use oxygen. WE can use the table command to summarize. Note the chisq.test, by default, assumes each group is equally likely!

table(everest_expand$Oxygen)

Not used     Used 
      96     1077 
chisq.test(table(everest_expand$Oxygen)) 

    Chi-squared test for given probabilities

data:  table(everest_expand$Oxygen)
X-squared = 820.43, df = 1, p-value < 2.2e-16

Dong this with summarized data is actually harder

aggregate(Number~Oxygen, everest, sum)$Number
[1]   96 1077
chisq.test(aggregate(Number~Oxygen, everest, sum)$Number) 

    Chi-squared test for given probabilities

data:  aggregate(Number ~ Oxygen, everest, sum)$Number
X-squared = 820.43, df = 1, p-value < 2.2e-16

But this is better!

binom.test(table(everest_expand$Oxygen))

    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143 

What if we wanted to compare to past years where 10% of climbers did not use oxygen? Note table function splits into alphabetical order.

binom.test(table(everest_expand$Oxygen), p=.1)

    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value = 0.04075
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143 

What if we want to determine if using oxygen impacts surival?

chisq.test(table(everest_expand$Oxygen, everest_expand$Survived))
Warning in chisq.test(table(everest_expand$Oxygen, everest_expand$Survived)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
X-squared = 6.1524, df = 1, p-value = 0.01312

Issue (which we’ll address), but note same as

chisq.test(table(everest_expand$Survived, everest_expand$Oxygen))
Warning in chisq.test(table(everest_expand$Survived, everest_expand$Oxygen)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Survived, everest_expand$Oxygen)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312
chisq.test(x = matrix(c(1045, 32, 88,  8), 2, 2, byrow = T))
Warning in chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312

Key is first argument must be all the info. This is different from (incorrect) approach like

chisq.test(everest$Survived,everest$Oxygen)
Warning in chisq.test(everest$Survived, everest$Oxygen): Chi-squared
approximation may be incorrect

    Pearson's Chi-squared test

data:  everest$Survived and everest$Oxygen
X-squared = 0, df = 1, p-value = 1

This is comparing split among Survived and not to split (expected) using Oxygen!

So order has minimal input with 2 groups. Other test options necessitated by the warning

fisher.test(table(everest_expand$Oxygen, everest_expand$Survived))

    Fisher's Exact Test for Count Data

data:  table(everest_expand$Oxygen, everest_expand$Survived)
p-value = 0.01284
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.144791 6.826869
sample estimates:
odds ratio 
  2.964765 
library(DescTools)
GTest(table(everest_expand$Oxygen, everest_expand$Survived))

    Log likelihood ratio (G-test) test of independence without correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
G = 5.7466, X-squared df = 1, p-value = 0.01652

What if we added another group? Like Enriched, Regular, None for oxygen.

everest_enriched <- data.frame(Survived = c("Y","N","Y", "N", "Y", "N"),
                      Oxygen = c("Regular", "Regular", "None", "None", rep("Enriched", 2)),
                      Number = c(1045, 32, 88, 8, 15, 2))
everest_enriched_expand <- expand.table(everest_enriched)

Now we compare

table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
   
    Enriched None Regular
  N        2    8      32
  Y       15   88    1045
chisq.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(table(everest_enriched_expand$Survived,
everest_enriched_expand$Oxygen)): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
X-squared = 10.879, df = 2, p-value = 0.004343

Fisher again due to size

fisher.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))

    Fisher's Exact Test for Count Data

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
p-value = 0.00586
alternative hypothesis: two.sided

Now we follow-up, and rows/columns matter. Note default is row and fdr method. I order results for ease of view

library(rcompanion)
everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
  Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq p.adj.Chisq
1      N : Y  0.00586      0.00586  0.0189      0.0189 0.00434     0.00434

Not quite what we wanted. How about

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1430  0.1080      0.1620  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.2560
1      1.0000

and you can change methods

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col",
                                                          method = "holm")
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]
          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1910  0.1080      0.2160  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.3420
1      1.0000

To put in manually, we need a few extra things

everest_table <- as.table(matrix(c(2,8,32,15,88,1045), nrow = 2, byrow = T))
rownames(everest_table) = c("N", "Y")
colnames(everest_table) = c("Enriched", "None", "Regular")
everest_table
  Enriched None Regular
N        2    8      32
Y       15   88    1045

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN! FOR THIS REASONS, THE CODE BELOW IS DISPLAYED BUT NOT RUN WHEN THIS FILE IS RENDERED

  • install the “swirl” package

  • run the following code once on the computer to install a new course

    library(swirl)
    install_course_github("jsgosnell", "JSG_swirl_lessons")
  • start swirl!

    swirl()
    • swirl()
  • then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want

    • Here we will focus on the Compare proportions among groups lesson
  • TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by

    • exiting swirl using escape key or bye() function

      bye()
    • uninstalling and reinstalling courses

      uninstall_all_courses()
      install_course_github("jsgosnell", "JSG_swirl_lessons")
    • when you restart swirl with swirl(), you may need to select

      • No. Let me start something new

Let’s practice

Heart attacks

1

Let’s look at some heart attack data. Read in the data using

heart_attacks <- read.table("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/heartatk4R.txt",header=T, stringsAsFactors = T)

Every entry is a person that has suffered a heart attack. More information on the dataset can be found at

http://statland.org/Software_Help/DataDesk/datafile.htm

We want to again test if heart attacks occur equally across genders.

  • What if we know that males actually make up 50.8% of the population?

2

Still using the heart attack data, is survival independent of gender?

3

For people that have a heart attack before they turn 30, is survival independent of gender?

Dolphins

4

Data on dolphin behavior was collected off the coast of Iceland. Data is @

http://www.statsci.org/data/general/dolpacti.txt

Since this is a .txt file, not a .csv, you’ll need to use something like

dolphin <- read.table("http://www.statsci.org/data/general/dolpacti.txt", sep="", header = T, stringsAsFactors = T)

More info on data @

http://www.statsci.org/data/general/dolpacti.html

Is travelling independent of time of day? You’ll need to consider traveling vs not traveling due to different number of groups observed in each period. Carry out post-hoc tests if needed.

Smoking

5

Use data on smoking and exercise from

http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence

to determine if smoking is independent of exercise. You’ll need to input data manually. Carry out post-hoc tests if needed.