Compare proportions among groups

Remember you should

add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
render your file to produce a markdown version that you can see!
save your work often
- commit it via git!
- push updates to github

Overview

This practice reviews the Compare means among groups lecture.

Examples

Issue is we often get data in spreadsheet format (expanded/long or wide/summarized, each shown below), but we need to get a vector or matrix for chisq.test and related functions.

The data

Following the Everest example from class. Assume data is in a dataframe where each row is a group data point.

everest <- data.frame(Survived = c("Y","N","Y", "N"),
                      Oxygen = c("Used", "Used", "Not used", "Not used"),
                      Number = c(1045, 32, 88, 8))

Assume data is in a dataframe where each row is an individual data point.

library(mirt)

Loading required package: stats4

Loading required package: lattice

everest_expand <- expand.table(everest)

Tests

First, let’s ask if the same amount of people used or did not use oxygen. WE can use the table command to summarize. Note the chisq.test, by default, assumes each group is equally likely!

table(everest_expand$Oxygen)


Not used     Used 
      96     1077

chisq.test(table(everest_expand$Oxygen))


    Chi-squared test for given probabilities

data:  table(everest_expand$Oxygen)
X-squared = 820.43, df = 1, p-value < 2.2e-16

Dong this with summarized data is actually harder

aggregate(Number~Oxygen, everest, sum)$Number

[1]   96 1077

chisq.test(aggregate(Number~Oxygen, everest, sum)$Number)


    Chi-squared test for given probabilities

data:  aggregate(Number ~ Oxygen, everest, sum)$Number
X-squared = 820.43, df = 1, p-value < 2.2e-16

But this is better!

binom.test(table(everest_expand$Oxygen))


    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143

What if we wanted to compare to past years where 10% of climbers did not use oxygen? Note table function splits into alphabetical order.

binom.test(table(everest_expand$Oxygen), p=.1)


    Exact binomial test

data:  table(everest_expand$Oxygen)
number of successes = 96, number of trials = 1173, p-value = 0.04075
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
 0.06679216 0.09902483
sample estimates:
probability of success 
            0.08184143

What if we want to determine if using oxygen impacts surival?

chisq.test(table(everest_expand$Oxygen, everest_expand$Survived))

Warning in chisq.test(table(everest_expand$Oxygen, everest_expand$Survived)):
Chi-squared approximation may be incorrect


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
X-squared = 6.1524, df = 1, p-value = 0.01312

Issue (which we’ll address), but note same as

chisq.test(table(everest_expand$Survived, everest_expand$Oxygen))

Warning in chisq.test(table(everest_expand$Survived, everest_expand$Oxygen)):
Chi-squared approximation may be incorrect


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(everest_expand$Survived, everest_expand$Oxygen)
X-squared = 6.1524, df = 1, p-value = 0.01312

chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T))

Warning in chisq.test(x = matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect


    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 88, 32, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312

chisq.test(x = matrix(c(1045, 32, 88,  8), 2, 2, byrow = T))

Warning in chisq.test(x = matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)):
Chi-squared approximation may be incorrect


    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(1045, 32, 88, 8), 2, 2, byrow = T)
X-squared = 6.1524, df = 1, p-value = 0.01312

Key is first argument must be all the info. This is different from (incorrect) approach like

chisq.test(everest$Survived,everest$Oxygen)

Warning in chisq.test(everest$Survived, everest$Oxygen): Chi-squared
approximation may be incorrect


    Pearson's Chi-squared test

data:  everest$Survived and everest$Oxygen
X-squared = 0, df = 1, p-value = 1

This is comparing split among Survived and not to split (expected) using Oxygen!

So order has minimal input with 2 groups. Other test options necessitated by the warning

fisher.test(table(everest_expand$Oxygen, everest_expand$Survived))


    Fisher's Exact Test for Count Data

data:  table(everest_expand$Oxygen, everest_expand$Survived)
p-value = 0.01284
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.144791 6.826869
sample estimates:
odds ratio 
  2.964765

library(DescTools)
GTest(table(everest_expand$Oxygen, everest_expand$Survived))


    Log likelihood ratio (G-test) test of independence without correction

data:  table(everest_expand$Oxygen, everest_expand$Survived)
G = 5.7466, X-squared df = 1, p-value = 0.01652

What if we added another group? Like Enriched, Regular, None for oxygen.

everest_enriched <- data.frame(Survived = c("Y","N","Y", "N", "Y", "N"),
                      Oxygen = c("Regular", "Regular", "None", "None", rep("Enriched", 2)),
                      Number = c(1045, 32, 88, 8, 15, 2))
everest_enriched_expand <- expand.table(everest_enriched)

Now we compare

table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)

   
    Enriched None Regular
  N        2    8      32
  Y       15   88    1045

chisq.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))

Warning in chisq.test(table(everest_enriched_expand$Survived,
everest_enriched_expand$Oxygen)): Chi-squared approximation may be incorrect


    Pearson's Chi-squared test

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
X-squared = 10.879, df = 2, p-value = 0.004343

Fisher again due to size

fisher.test(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))


    Fisher's Exact Test for Count Data

data:  table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen)
p-value = 0.00586
alternative hypothesis: two.sided

Now we follow-up, and rows/columns matter. Note default is row and fdr method. I order results for ease of view

library(rcompanion)
everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen))

Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect

everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]

  Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq p.adj.Chisq
1      N : Y  0.00586      0.00586  0.0189      0.0189 0.00434     0.00434

Not quite what we wanted. How about

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col")

Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect

everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]

          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1430  0.1080      0.1620  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.2560
1      1.0000

and you can change methods

everest_expand_correct_fdr <- pairwiseNominalIndependence(table(everest_enriched_expand$Survived, everest_enriched_expand$Oxygen),
                                                          compare = "col",
                                                          method = "holm")

Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect

everest_expand_correct_fdr[order(everest_expand_correct_fdr$p.adj.Fisher),]

          Comparison p.Fisher p.adj.Fisher p.Gtest p.adj.Gtest p.Chisq
3     None : Regular   0.0128       0.0384  0.0165      0.0495  0.0131
2 Enriched : Regular   0.0953       0.1910  0.1080      0.2160  0.1710
1    Enriched : None   0.6450       0.6450  0.6580      0.6580  1.0000
  p.adj.Chisq
3      0.0393
2      0.3420
1      1.0000

To put in manually, we need a few extra things

everest_table <- as.table(matrix(c(2,8,32,15,88,1045), nrow = 2, byrow = T))
rownames(everest_table) = c("N", "Y")
colnames(everest_table) = c("Enriched", "None", "Regular")
everest_table

  Enriched None Regular
N        2    8      32
Y       15   88    1045

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN! FOR THIS REASONS, THE CODE BELOW IS DISPLAYED BUT NOT RUN WHEN THIS FILE IS RENDERED

install the “swirl” package

run the following code once on the computer to install a new course

library(swirl)
install_course_github("jsgosnell", "JSG_swirl_lessons")

start swirl!
```
swirl()
```
- swirl()
then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
- Here we will focus on the Compare proportions among groups lesson
TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by
- exiting swirl using escape key or bye() function
```
bye()
```
- uninstalling and reinstalling courses
```
uninstall_all_courses()
install_course_github("jsgosnell", "JSG_swirl_lessons")
```
- when you restart swirl with swirl(), you may need to select
  - No. Let me start something new

Let’s practice

Heart attacks

1

Let’s look at some heart attack data. Read in the data using

heart_attacks <- read.table("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/datasets/heartatk4R.txt",header=T, stringsAsFactors = T)

Every entry is a person that has suffered a heart attack. More information on the dataset can be found at

http://statland.org/Software_Help/DataDesk/datafile.htm

We want to again test if heart attacks occur equally across genders.

What if we know that males actually make up 50.8% of the population?

2

Still using the heart attack data, is survival independent of gender?

3

For people that have a heart attack before they turn 30, is survival independent of gender?

Dolphins

4

Data on dolphin behavior was collected off the coast of Iceland. Data is @

http://www.statsci.org/data/general/dolpacti.txt

Since this is a .txt file, not a .csv, you’ll need to use something like

dolphin <- read.table("http://www.statsci.org/data/general/dolpacti.txt", sep="", header = T, stringsAsFactors = T)

More info on data @

http://www.statsci.org/data/general/dolpacti.html

Is travelling independent of time of day? You’ll need to consider traveling vs not traveling due to different number of groups observed in each period. Carry out post-hoc tests if needed.

Smoking

5

Use data on smoking and exercise from

http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence

to determine if smoking is independent of exercise. You’ll need to input data manually. Carry out post-hoc tests if needed.