Tests for continuous data from one sample

Remember you should

- add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
- render your file to produce a markdown version that you can see!
- save your work often
  - commit it via git!
  - push updates to github

Overview

This practice reviews the Tests for continuous data from one sample lecture.

Examples

From lecture! Consider whether the average height of males training at the Australian Institute of Sport differs from the average for the human population.

These are all one sample tests, but they differ in what we know. If we know the variance of our population, we use a z test (function in BSDA package).

sport <- read.table("http://www.statsci.org/data/oz/ais.txt", header = T)
library(BSDA)

Loading required package: lattice


Attaching package: 'BSDA'

The following object is masked from 'package:datasets':

    Orange

z.test(sport[sport$Sex == "male", "Ht"], mu = 175.6, sigma.x=7)


    One-sample z-Test

data:  sport[sport$Sex == "male", "Ht"]
z = 14.292, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 184.1474 186.8643
sample estimates:
mean of x 
 185.5059

If we don’t, we use a t-test

t.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    One Sample t-test

data:  sport[sport$Sex == "male", "Ht"]
t = 12.658, df = 101, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 183.9535 187.0583
sample estimates:
mean of x 
 185.5059

These both assume the means of the data are normal! If we want to relax that assumption, we can use the Wilcoxon signed rank test (a rank-transform test; the Mann-Whitney name properly refers to the two-sample Wilcoxon rank-sum test). This assumes the distribution of the data is symmetric.

wilcox.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    Wilcoxon signed rank test with continuity correction

data:  sport[sport$Sex == "male", "Ht"]
V = 5052, p-value = 5.714e-16
alternative hypothesis: true location is not equal to 175.6

or the sign test/median test.

SIGN.test(sport[sport$Sex == "male", "Ht"], md = 175.6)


    One-sample Sign-Test

data:  sport[sport$Sex == "male", "Ht"]
s = 90, p-value = 8.882e-16
alternative hypothesis: true median is not equal to 175.6
95 percent confidence interval:
 183.9000 187.4684
sample estimates:
median of x 
     185.55 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level L.E.pt   U.E.pt
Lower Achieved CI     0.9406  183.9 187.3000
Interpolated CI       0.9500  183.9 187.4684
Upper Achieved CI     0.9629  183.9 187.7000

Note this is just transforming data to 1/0 and doing a binomial test!

above_175.6 <- nrow(sport[sport$Sex == "male" & sport$Ht > 175.6,])
binom.test(above_175.6, nrow(sport[sport$Sex == "male",]))


    Exact binomial test

data:  above_175.6 and nrow(sport[sport$Sex == "male", ])
number of successes = 90, number of trials = 102, p-value = 6.125e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.8035103 0.9377091
sample estimates:
probability of success 
             0.8823529

We can also bootstrap the data.

number_of_simulations <- 1000
library(ggplot2)
boostrap_data<- sport[sport$Sex == "male", "Ht"]
boostrap_outcomes <- data.frame(mean = rep(NA, number_of_simulations), sd = NA)
for (i in seq_len(number_of_simulations)) {
  # resample the observed heights with replacement
  resample <- sample(boostrap_data, length(boostrap_data), replace = TRUE)
  # store the mean and standard deviation of this resample
  boostrap_outcomes$mean[i] <- mean(resample)
  boostrap_outcomes$sd[i] <- sd(resample)
}
ggplot(boostrap_outcomes, aes(x=mean)) +
  geom_histogram(color="black") +
  labs(title=expression(paste("Bootstrapped means")),
       x= "Mean value",
       y= "Frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

and find associated quantile-based 95% confidence intervals:

quantile(boostrap_outcomes$mean, probs=c(.025, .975) )

    2.5%    97.5% 
184.0188 187.1022
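
The resampling loop above can also be written more compactly with `replicate()`. The sketch below is only illustrative: it uses simulated heights as a stand-in for the AIS data so that it runs without a download, and the names `heights`, `boot_means`, and `boot_ci`, as well as the seed, are assumptions rather than part of the lecture code.

```r
# Illustrative sketch: simulated stand-in data, not the AIS heights
set.seed(42)                                # arbitrary seed so the resamples are repeatable
heights <- rnorm(102, mean = 185, sd = 8)   # hypothetical sample of 102 heights

# Draw 1000 bootstrap resamples (with replacement) and keep the mean of each
boot_means <- replicate(1000, mean(sample(heights, replace = TRUE)))

# Percentile-based 95% confidence interval for the mean
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```

Setting a seed is optional, but without one the interval will change slightly every time the code is run.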

or using functions in the boot library

library(boot)


Attaching package: 'boot'

The following object is masked from 'package:lattice':

    melanoma

results <- boot(data=boostrap_data, statistic = function(x, inds) mean(x[inds]),
   R=number_of_simulations)
ggplot(data.frame(results$t), aes(x=results.t)) +
  geom_histogram(color="black") +
  labs(title=expression(paste("Bootstrapped means")),
       x= "Mean value",
       y= "Frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

quantile( results$t, probs=c(.025, .975) )

    2.5%    97.5% 
183.8642 187.0591

boot.ci(results)

Warning in boot.ci(results): bootstrap variances needed for studentized
intervals

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results)

Intervals : 
Level      Normal              Basic         
95%   (183.9, 187.1 )   (183.9, 187.2 )  

Level     Percentile            BCa          
95%   (183.8, 187.1 )   (183.8, 187.0 )  
Calculations and Intervals on Original Scale

library(MKinfer)
boot.t.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    Bootstrap One Sample t-test

data:  sport[sport$Sex == "male", "Ht"]
number of bootstrap samples:  9999
bootstrap p-value < 1e-04 
bootstrap mean of x (SE) = 185.503 (0.7769018) 
95 percent bootstrap percentile confidence interval:
 184.0029 187.0168

Results without bootstrap:
t = 12.658, df = 101, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 183.9535 187.0583
sample estimates:
mean of x 
 185.5059

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN!

install the “swirl” package

run the following code once on the computer to install a new course

library(swirl)
install_course_github("jsgosnell", "JSG_swirl_lessons")

start swirl!
```
swirl()
```
then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
- Here we will focus on the Tests for continuous data from one sample lesson
TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by
- exiting swirl using escape key or bye() function
```
bye()
```
- uninstalling and reinstalling courses
```
uninstall_all_courses()
install_course_github("jsgosnell", "JSG_swirl_lessons")
```
- when you restart swirl with swirl(), you may need to select
  - No. Let me start something new

Let’s practice!

Recognizing and assessing normality

1

Using the qqplot_example.R code, examine the following distributions and, for the continuous distributions (marked with a “*”), observe how a normal probability plot (qqplot) can be used to visually test for approximate normality.

Normal* (µ = 0; σ² = 1, 10, 100)
Student’s t* (df = 1, 10, 30, & 100)
Chi-square* (df = 1, 2, 5, 30, 50)
Bernoulli (P = 0.1, 0.5, & 0.9)
Binomial (P=0.05; N= 2, 5, 25, & 50); (P=0.25; N= 2, 5, 25, & 50); (P=0.50; N= 2, 5, 25, & 50); (P=0.75; N= 2, 5, 25, & 50); (P=0.95; N= 2, 5, 25, & 50)
Poisson (µ = 2, 5, 10, 30, & 50)

For this question, it’s easiest to just source the main file and see what happens. When you source a script, it is run in R without showing any console output (but graphs and objects are still produced!). Try source("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/code_examples/qqplot_example.R")

Notice that the spread of the DATA in most of these distributions tends towards normality as the sample size (N, df, or mean) increases.
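
To see what a qqplot is showing without sourcing the full script, a minimal sketch with simulated data may help. The seed, the sample size of 200, and the helper `qq_cor()` are assumptions made for illustration; they are not taken from qqplot_example.R.

```r
# Illustrative sketch: compare normal and right-skewed simulated data
set.seed(1)                            # arbitrary seed
normal_draws <- rnorm(200)             # points should fall near the reference line
skewed_draws <- rchisq(200, df = 1)    # strongly right-skewed, so points curve away

par(mfrow = c(1, 2))                   # two plots side by side
qqnorm(normal_draws, main = "Normal"); qqline(normal_draws)
qqnorm(skewed_draws, main = "Chi-square (df = 1)"); qqline(skewed_draws)
par(mfrow = c(1, 1))                   # restore the default layout

# Numeric companion to the plots: correlation between sample and
# theoretical quantiles (closer to 1 means closer to a straight line)
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}
c(normal = qq_cor(normal_draws), skewed = qq_cor(skewed_draws))
```

The correlation is only a rough summary; the plot itself is still the main tool for judging approximate normality.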

2

Review the central_limit_theorem.R code (remember to load the VGAM package first, as below)

library(VGAM)

Loading required package: stats4

Loading required package: splines


Attaching package: 'VGAM'

The following objects are masked from 'package:boot':

    logit, simplex

source("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/code_examples/central_limit_theorem.R")

Press [enter] to continue

Press [enter] to continue

Press [enter] to continue

Press [enter] to continue

Press [enter] to continue

if you need to convince/remind yourself how common normality of means is for even non-normal data.

Here we are focused on how the means look as sample size increases
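
A small simulation can make the same point without the full script. This is a sketch under stated assumptions: the exponential distribution, the sample sizes, the 5000 replicates, and the helper `skewness()` are choices made here for illustration and are not taken from central_limit_theorem.R.

```r
# Illustrative sketch: means of skewed (exponential) data as sample size grows
set.seed(2)                                   # arbitrary seed

# Simple moment-based skewness; near 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# For each sample size, draw 5000 samples and record the mean of each one
sample_sizes <- c(2, 10, 50)
skew_of_means <- sapply(sample_sizes, function(n) {
  means <- replicate(5000, mean(rexp(n, rate = 1)))
  skewness(means)
})
names(skew_of_means) <- paste0("n=", sample_sizes)
skew_of_means
```

The skewness of the sample means should shrink towards 0 as n grows, even though the underlying data stay strongly skewed.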

Working with data (note some sample sizes may be too small for these to all be good ideas!)

Make sure you are comfortable with null and alternative hypotheses for all examples. You should also feel comfortable graphing the data.

3

Seven observers were shown, for a brief period, a grill with 161 flies impaled and were asked to estimate the number. The results are given by Cochran (1954). Based on five estimates, they were 183.2, 149.0, 154.0, 167.2, 187.2, 158.0, and 143.0. Test the null hypothesis that the mean of the estimates is 161 flies.

Assuming variance = 275

flies <- c(183.2, 149.0, 154.0, 167.2, 187.2, 158.0, 143.0)
library(BSDA)
z.test(x=flies, mu = 161, sigma.x=sqrt(275))


    One-sample z-Test

data:  flies
z = 0.33276, p-value = 0.7393
alternative hypothesis: true mean is not equal to 161
95 percent confidence interval:
 150.8010 175.3704
sample estimates:
mean of x 
 163.0857

Using a z-test, I found a test statistic of z = 0.33. This corresponds to a p-value of 0.74. This p-value is >.05, so I fail to reject the null hypothesis that the mean of the estimates is 161 flies.
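
The z statistic and p-value reported by `z.test` can be checked by hand in base R. The variable names below (`mu_0`, `sigma`, `standard_error`) are just labels chosen for this sketch.

```r
# Illustrative sketch: the z-test for the fly estimates, computed step by step
flies <- c(183.2, 149.0, 154.0, 167.2, 187.2, 158.0, 143.0)
mu_0 <- 161                 # mean under the null hypothesis
sigma <- sqrt(275)          # known population standard deviation

# Standard error of the mean when the population variance is known
standard_error <- sigma / sqrt(length(flies))
z <- (mean(flies) - mu_0) / standard_error

# Two-sided p-value: probability of a z at least this extreme in either tail
p_value <- 2 * pnorm(-abs(z))
c(z = z, p_value = p_value)
```

The same steps with `sd(flies)` in place of `sigma` and `pt()` with 6 degrees of freedom in place of `pnorm()` give the one-sample t-test.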

Estimating the variance from the data

t.test(x=flies,mu = 161)


    One Sample t-test

data:  flies
t = 0.32656, df = 6, p-value = 0.7551
alternative hypothesis: true mean is not equal to 161
95 percent confidence interval:
 147.4576 178.7138
sample estimates:
mean of x 
 163.0857

Using a t-test, which is appropriate when the variance must be estimated from the sample and the means of the data may be assumed to follow a normal distribution, I found a test statistic of t₆ = 0.33. This corresponds to a p-value of 0.76. This p-value is >.05, so I fail to reject the null hypothesis that the mean of the estimates is 161 flies.

Using rank transform analysis

wilcox.test(flies, mu=161)


    Wilcoxon signed rank exact test

data:  flies
V = 15, p-value = 0.9375
alternative hypothesis: true location is not equal to 161

Using a Wilcoxon signed rank test, which is appropriate when normality assumptions can’t be met and the distribution of the data appears to be symmetric, I found a test statistic of V = 15. This corresponds to a p-value of 0.94. This p-value is >.05, so I fail to reject the null hypothesis that the mean of the estimates is 161 flies.

Using binary transform analysis

SIGN.test(flies, md=161)


    One-sample Sign-Test

data:  flies
s = 3, p-value = 1
alternative hypothesis: true median is not equal to 161
95 percent confidence interval:
 144.8857 185.9429
sample estimates:
median of x 
        158 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level   L.E.pt   U.E.pt
Lower Achieved CI     0.8750 149.0000 183.2000
Interpolated CI       0.9500 144.8857 185.9429
Upper Achieved CI     0.9844 143.0000 187.2000

Using a sign test, which is appropriate when the data are continuous and other assumptions can’t be met, I found a test statistic of s = 3. This corresponds to a p-value of 1. This p-value is >.05, so I fail to reject the null hypothesis that the median (Note change here) of the estimates is 161 flies.

Note there are several ways to load the data! You can make a vector (since the data set is short):

flies <- c(183.2, 149.0, 154.0, 167.2, 187.2, 158.0, 143.0 )

or make a dataframe in spreadsheet software (e.g., Excel, Google Sheets) and then upload it using a read.csv command. We did this in your introduction to R!

4

Yields of 10 strawberry plants in a uniformity trial are given by Baker and Baker (1953) as 239, 176, 235, 217, 234, 216, 318, 190, 181, and 225 g. Test the hypothesis that µ = 205.

Assuming variance = 1500

strawberries <- c(239, 176, 235, 217, 234, 216, 318, 190, 181, 225)
z.test(x=strawberries,mu = 205, sigma.x=sqrt(1500))


    One-sample z-Test

data:  strawberries
z = 1.4779, p-value = 0.1394
alternative hypothesis: true mean is not equal to 205
95 percent confidence interval:
 199.0954 247.1046
sample estimates:
mean of x 
    223.1

Using a z-test, I found a test statistic of z = 1.48. This corresponds to a p-value of 0.14. This p-value is >.05, so I fail to reject the null hypothesis that the population mean is equal to 205.

Estimating the variance from the data

t.test(x=strawberries,mu = 205)


    One Sample t-test

data:  strawberries
t = 1.4164, df = 9, p-value = 0.1903
alternative hypothesis: true mean is not equal to 205
95 percent confidence interval:
 194.1922 252.0078
sample estimates:
mean of x 
    223.1

Using a t-test, which is appropriate when the variance must be estimated from the sample and the means of the data may be assumed to follow a normal distribution, I found a test statistic of t₉ = 1.42. This corresponds to a p-value of 0.19. This p-value is >.05, so I fail to reject the null hypothesis that the population mean is equal to 205.

Using rank transform analysis

wilcox.test(strawberries, mu=205)

Warning in wilcox.test.default(strawberries, mu = 205): cannot compute exact
p-value with ties


    Wilcoxon signed rank test with continuity correction

data:  strawberries
V = 40.5, p-value = 0.2023
alternative hypothesis: true location is not equal to 205

Using a Wilcoxon signed rank test, which is appropriate when normality assumptions can’t be met and the distribution of the data appears to be symmetric, I found a test statistic of V = 40.5. This corresponds to a p-value of 0.20. This p-value is >.05, so I fail to reject the null hypothesis that the population mean is equal to 205.

Using binary transform analysis

SIGN.test(strawberries, md=205)


    One-sample Sign-Test

data:  strawberries
s = 7, p-value = 0.3437
alternative hypothesis: true median is not equal to 205
95 percent confidence interval:
 183.9200 237.7022
sample estimates:
median of x 
        221 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level L.E.pt   U.E.pt
Lower Achieved CI     0.8906 190.00 235.0000
Interpolated CI       0.9500 183.92 237.7022
Upper Achieved CI     0.9785 181.00 239.0000

Using a sign test, which is appropriate when the data are continuous and other assumptions can’t be met, I found a test statistic of s = 7. This corresponds to a p-value of 0.34. This p-value is >.05, so I fail to reject the null hypothesis that the population median (Note change here) is equal to 205.

5

Evolutionary geneticists predict the family sex ratio will be 80% female in broods of eagles that successfully fledge >3 young. Nests that fledge 3 or more chicks are very rare, but a sample of 30 chicks is obtained from such nests, and they yield 25 females and 5 males. Test the hypotheses that:

a) the sex ratio is 50% females
b) the sex ratio is 80% females

binom.test(25,30, p=.5)


    Exact binomial test

data:  25 and 30
number of successes = 25, number of trials = 30, p-value = 0.0003249
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.6527883 0.9435783
sample estimates:
probability of success 
             0.8333333

A binomial test was used as we are comparing an observed proportion against a set value. Given a p-value of <.001, I reject the null hypothesis that the proportion of females is equal to .5.

b) the sex ratio is 80% females.

binom.test(25,30, .8)


    Exact binomial test

data:  25 and 30
number of successes = 25, number of trials = 30, p-value = 0.8205
alternative hypothesis: true probability of success is not equal to 0.8
95 percent confidence interval:
 0.6527883 0.9435783
sample estimates:
probability of success 
             0.8333333

A binomial test was used as we are comparing an observed proportion against a set value. Given a p-value of 0.82, I fail to reject the null hypothesis that the proportion of females is equal to .8.
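
The exact p-value for the 80% hypothesis can be rebuilt from the binomial probabilities themselves. This is a sketch of the "sum every outcome no more likely than the observed one" definition of a two-sided exact test; the variable names and the small floating-point tolerance are choices made for illustration.

```r
# Illustrative sketch: exact two-sided binomial p-value for 25 females out of 30
n_chicks <- 30
n_females <- 25
p_null <- 0.8

# Probability of every possible count of females if the null hypothesis is true
outcome_probs <- dbinom(0:n_chicks, size = n_chicks, prob = p_null)
observed_prob <- dbinom(n_females, size = n_chicks, prob = p_null)

# Sum the probabilities of all outcomes no more likely than the observed one
# (the small tolerance guards against floating-point ties)
p_manual <- sum(outcome_probs[outcome_probs <= observed_prob * (1 + 1e-7)])
c(manual = p_manual,
  binom.test = binom.test(n_females, n_chicks, p = p_null)$p.value)
```

Both values should agree, which shows that the large p-value comes from 25 of 30 being close to what an 80% ratio predicts (24 of 30).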

6

Studies of flying snakes have led researchers to posit the mean undulation rate is 1.4 Hz. You wish to test this hypothesis using the small sample of undulation rates shown below. Create a small dataset of the paradise tree snake undulation rates and choose and justify a test you can use to assess the data.

Undulation rates (in Hz): 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6

Using a t-test, which is appropriate when the variance must be estimated from the sample and the means of the data may be assumed to follow a normal distribution, I found a test statistic of t₇ = -0.22. This corresponds to a p-value of 0.83. This p-value is >.05, so I fail to reject the null hypothesis that the mean undulation rate is 1.4 Hz.
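
One way to build the small dataset and run the test described above is sketched here. The object names `undulation_rates` and `snake_test` are hypothetical labels chosen for this example.

```r
# Illustrative sketch: one-sample t-test for the undulation rates
undulation_rates <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)  # in Hz

# Test the null hypothesis that the mean undulation rate is 1.4 Hz
snake_test <- t.test(undulation_rates, mu = 1.4)
snake_test
```

With only 8 observations it is hard to assess normality, so a Wilcoxon signed rank test or sign test could also be justified here.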

7

Using data from Australian athletes (http://www.statsci.org/data/oz/ais.html for details), determine if the average male training at the Australian Institute of Sport differs in weight from the average Australian male (85.9 kg) using bootstrapping techniques. Data at

sport <- read.table("http://www.statsci.org/data/oz/ais.txt", header = T, 
                    stringsAsFactors = T)

library(MKinfer)
boot.t.test(sport[sport$Sex == "male", "Wt"], mu= 85.9)


    Bootstrap One Sample t-test

data:  sport[sport$Sex == "male", "Wt"]
number of bootstrap samples:  9999
bootstrap p-value = 0.011 
bootstrap mean of x (SE) = 82.51936 (1.217846) 
95 percent bootstrap percentile confidence interval:
 80.15684 84.96265

Results without bootstrap:
t = -2.7487, df = 101, p-value = 0.007089
alternative hypothesis: true mean is not equal to 85.9
95 percent confidence interval:
 80.08671 84.96035
sample estimates:
mean of x 
 82.52353

Using a bootstrap test with 9,999 samples, we found a p-value of 0.011; we thus reject the null hypothesis that males training at the AIS have the same weight as the average Australian male. Data indicated they weigh less.