Tests for continuous data from one sample

Remember you should

- add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
- render your file to produce a markdown version that you can see!
- save your work often
  - commit it via git!
  - push updates to github

Overview

This practice reviews the Tests for continuous data from one sample lecture.

Examples

From lecture! Consider whether the average height of males training at the Australian Institute of Sport differs from the average of the human population.

These are all one-sample tests, but they differ in what we know. If we know the variance of our population, we use a z test (the z.test function in the BSDA package). Note that the sigma.x argument takes the population standard deviation, not the variance.

sport <- read.table("http://www.statsci.org/data/oz/ais.txt", header = TRUE)
library(BSDA)

Loading required package: lattice


Attaching package: 'BSDA'

The following object is masked from 'package:datasets':

    Orange

z.test(sport[sport$Sex == "male", "Ht"], mu = 175.6, sigma.x=7)


    One-sample z-Test

data:  sport[sport$Sex == "male", "Ht"]
z = 14.292, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 184.1474 186.8643
sample estimates:
mean of x 
 185.5059
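
To see what z.test is computing, the statistic can be sketched by hand in base R. This is only a minimal sketch: the `heights` vector is made-up illustration data (not the AIS sample), and the names `mu_0` and `sigma` are placeholders. Only the hypothesized mean of 175.6 and the standard deviation of 7 come from the call above.

```r
# Hypothetical heights (cm) for illustration; NOT the AIS data
heights <- c(181.2, 190.5, 178.9, 186.4, 192.1, 184.7, 188.3, 179.6)
mu_0 <- 175.6   # hypothesized population mean (from the example above)
sigma <- 7      # population standard deviation, assumed known

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
standard_error <- sigma / sqrt(length(heights))
z_stat <- (mean(heights) - mu_0) / standard_error

# two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))

# 95% confidence interval for the mean, using the known sigma
ci <- mean(heights) + c(-1, 1) * qnorm(0.975) * standard_error

z_stat
p_value
ci
```

Because sigma is treated as known, the reference distribution is the standard normal rather than a t distribution.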

If we don’t know the population variance, we estimate it from the sample and use a t-test:

t.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    One Sample t-test

data:  sport[sport$Sex == "male", "Ht"]
t = 12.658, df = 101, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 183.9535 187.0583
sample estimates:
mean of x 
 185.5059
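
The t statistic differs from the z statistic only in using the sample standard deviation and a t distribution with n - 1 degrees of freedom. A rough sketch, again on a made-up `heights` vector (an assumption for illustration, not the AIS data), checks the hand calculation against t.test:

```r
# Hypothetical heights (cm) for illustration; NOT the AIS data
heights <- c(181.2, 190.5, 178.9, 186.4, 192.1, 184.7, 188.3, 179.6)
mu_0 <- 175.6   # hypothesized population mean (from the example above)
n <- length(heights)

# standard error now uses the sample standard deviation
standard_error <- sd(heights) / sqrt(n)
t_stat <- (mean(heights) - mu_0) / standard_error

# two-sided p-value from a t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# compare with the built-in test; both comparisons should return TRUE
built_in <- t.test(heights, mu = mu_0)
all.equal(t_stat, unname(built_in$statistic))
all.equal(p_value, built_in$p.value)
```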

Both of these tests assume the sample mean is normally distributed! If we want to relax that assumption, we can use the Wilcoxon signed-rank test (a rank transform analysis; note that the Mann-Whitney name usually refers to the related two-sample rank-sum test). This assumes the data are symmetrically distributed around the median.

wilcox.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    Wilcoxon signed rank test with continuity correction

data:  sport[sport$Sex == "male", "Ht"]
V = 5052, p-value = 5.714e-16
alternative hypothesis: true location is not equal to 175.6
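
The V statistic reported by wilcox.test is the sum of the ranks of the positive differences from the hypothesized value. The sketch below works this out by hand; the vector `x` is made-up illustration data (not the AIS sample) chosen to have no ties and no values equal to the hypothesized 175.6.

```r
# Hypothetical heights (cm) for illustration; NOT the AIS data
x <- c(172.4, 181.2, 190.5, 178.9, 186.4, 174.1, 184.7, 188.3)
mu_0 <- 175.6   # hypothesized location (from the example above)

# rank the absolute differences from the hypothesized value
differences <- x - mu_0
ranks <- rank(abs(differences))

# V is the sum of the ranks belonging to positive differences
v_stat <- sum(ranks[differences > 0])

# compare with the built-in test; this should return TRUE
built_in <- wilcox.test(x, mu = mu_0)
v_stat == unname(built_in$statistic)
```

With real data, ties or values exactly equal to the hypothesized value need extra handling, which wilcox.test does for you.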

We can also use the sign test (a binary transform analysis, sometimes called a median test).

SIGN.test(sport[sport$Sex == "male", "Ht"], md = 175.6)


    One-sample Sign-Test

data:  sport[sport$Sex == "male", "Ht"]
s = 90, p-value = 8.882e-16
alternative hypothesis: true median is not equal to 175.6
95 percent confidence interval:
 183.9000 187.4684
sample estimates:
median of x 
     185.55 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level L.E.pt   U.E.pt
Lower Achieved CI     0.9406  183.9 187.3000
Interpolated CI       0.9500  183.9 187.4684
Upper Achieved CI     0.9629  183.9 187.7000

Note this is just transforming data to 1/0 and doing a binomial test!

above_175.6 <- nrow(sport[sport$Sex == "male" & sport$Ht > 175.6,])
binom.test(above_175.6, nrow(sport[sport$Sex == "male",]))


    Exact binomial test

data:  above_175.6 and nrow(sport[sport$Sex == "male", ])
number of successes = 90, number of trials = 102, p-value = 6.125e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.8035103 0.9377091
sample estimates:
probability of success 
             0.8823529

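We can also bootstrap the data, resampling the observed heights with replacement to approximate the sampling distribution of the mean.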

number_of_simulations <- 1000
library(ggplot2)
# heights of the male athletes, resampled below
boostrap_data <- sport[sport$Sex == "male", "Ht"]
boostrap_outcomes <- data.frame(mean = rep(NA, number_of_simulations), sd = NA)
for (i in seq_len(number_of_simulations)) {
  # draw a resample of the same size, with replacement
  resample <- sample(boostrap_data, length(boostrap_data), replace = TRUE)
  boostrap_outcomes$mean[i] <- mean(resample)
  boostrap_outcomes$sd[i] <- sd(resample)
}
ggplot(boostrap_outcomes, aes(x=mean)) +
  geom_histogram(color="black") +
  labs(title=expression(paste("Bootstrapped means")),
       x= "Mean value",
       y= "Frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

and find associated quantile-based 95% confidence intervals:

quantile(boostrap_outcomes$mean, probs=c(.025, .975) )

    2.5%    97.5% 
183.9968 186.9917
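
The same resampling idea can be written more compactly with replicate(). This is a sketch under stated assumptions: the `heights` vector is made-up illustration data (not the AIS sample), and the seed is arbitrary and only there to make the run reproducible.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical heights (cm) for illustration; NOT the AIS data
heights <- c(181.2, 190.5, 178.9, 186.4, 192.1, 184.7, 188.3, 179.6)
number_of_simulations <- 1000

# each replicate resamples the data with replacement and stores the mean
bootstrap_means <- replicate(number_of_simulations,
                             mean(sample(heights, replace = TRUE)))

# quantile-based (percentile) 95% confidence interval
percentile_ci <- quantile(bootstrap_means, probs = c(0.025, 0.975))
percentile_ci
```

When sample() is given a vector with more than one element and no size, it draws as many values as the vector contains, which is what a bootstrap resample needs.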

We can also use functions in the boot library:

library(boot)


Attaching package: 'boot'

The following object is masked from 'package:lattice':

    melanoma

results <- boot(data=boostrap_data, statistic = function(x, inds) mean(x[inds]),
   R=number_of_simulations)
ggplot(data.frame(results$t), aes(x=results.t)) +
  geom_histogram(color="black") +
  labs(title=expression(paste("Bootstrapped means")),
       x= "Mean value",
       y= "Frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

quantile( results$t, probs=c(.025, .975) )

    2.5%    97.5% 
184.0492 187.0179

boot.ci(results)

Warning in boot.ci(results): bootstrap variances needed for studentized
intervals

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results)

Intervals : 
Level      Normal              Basic         
95%   (184.0, 187.0 )   (183.9, 187.0 )  

Level     Percentile            BCa          
95%   (184.0, 187.1 )   (184.1, 187.1 )  
Calculations and Intervals on Original Scale

The MKinfer package also provides a bootstrap version of the t-test:

library(MKinfer)
boot.t.test(sport[sport$Sex == "male", "Ht"], mu = 175.6)


    Bootstrap One Sample t-test

data:  sport[sport$Sex == "male", "Ht"]
number of bootstrap samples:  9999
bootstrap p-value < 1e-04 
bootstrap mean of x (SE) = 185.5021 (0.7769843) 
95 percent bootstrap percentile confidence interval:
 183.9735 187.0345

Results without bootstrap:
t = 12.658, df = 101, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 175.6
95 percent confidence interval:
 183.9535 187.0583
sample estimates:
mean of x 
 185.5059

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN!

- install the “swirl” package (for example, with install.packages("swirl"))
- run the following code once on the computer to install a new course

library(swirl)
install_course_github("jsgosnell", "JSG_swirl_lessons")

- start swirl!
```
swirl()
```
- then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
  - Here we will focus on the Tests for continuous data from one sample lesson

TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download them by
- exiting swirl using escape key or bye() function
```
bye()
```
- uninstalling and reinstalling courses
```
uninstall_all_courses()
install_course_github("jsgosnell", "JSG_swirl_lessons")
```
- when you restart swirl with swirl(), you may need to select
  - No. Let me start something new

Let’s practice!

Recognizing and assessing normality

1

Using the qqplot_example.R code, examine the following distributions and, for the continuous distributions (marked with a “*”), observe how a normal probability plot (qqplot) can be used to visually test for approximate normality.

- Normal* (µ = 0; σ² = 1, 10, 100)
- Student’s t* (df = 1, 10, 30, & 100)
- Chi-square* (df = 1, 2, 5, 30, 50)
- Bernoulli (P = 0.1, 0.5, & 0.9)
- Binomial (P = 0.05; N = 2, 5, 25, & 50); (P = 0.25; N = 2, 5, 25, & 50); (P = 0.50; N = 2, 5, 25, & 50); (P = 0.75; N = 2, 5, 25, & 50); (P = 0.95; N = 2, 5, 25, & 50)
- Poisson (µ = 2, 5, 10, 30, & 50)

For this question, it’s easiest to just source the main file and see what happens. When you source a script, it is run in R without showing any console output (but graphs and objects are still produced!). Try source("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/code_examples/qqplot_example.R")
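
For a quick look at what a normal probability plot shows, the sketch below draws one for a simulated normal sample and one for a right-skewed sample using base R's qqnorm and qqline. The sample sizes, seed, and object names are arbitrary choices for illustration and are not taken from qqplot_example.R.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# one sample that really is normal, one that is right-skewed
normal_sample <- rnorm(100, mean = 0, sd = 1)
skewed_sample <- rchisq(100, df = 2)

# side-by-side normal probability plots
par(mfrow = c(1, 2))
qq_normal <- qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)   # reference line through the quartiles
qq_skewed <- qqnorm(skewed_sample, main = "Chi-square (df = 2) sample")
qqline(skewed_sample)
par(mfrow = c(1, 1))    # restore the single-panel layout
```

Points from approximately normal data should fall close to the reference line, while skewed data tend to curve away from it in the tails.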

2

Review the central_limit_theorem.R code (remember to load the VGAM package first) if you need to convince/remind yourself how common normality of means is for even non-normal data.

library(VGAM)

Loading required package: stats4

Loading required package: splines


Attaching package: 'VGAM'

The following objects are masked from 'package:boot':

    logit, simplex

source("https://raw.githubusercontent.com/jsgosnell/CUNY-BioStats/master/code_examples/central_limit_theorem.R")

Press [enter] to continue
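
A small simulation can make the same point without the course script. This is a sketch, not the contents of central_limit_theorem.R: the exponential distribution, sample size of 30, seed, and object names are all arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

number_of_samples <- 1000
sample_size <- 30

# draw many samples from a strongly skewed distribution and keep each mean
sample_means <- replicate(number_of_samples,
                          mean(rexp(sample_size, rate = 1)))

# the means should look roughly bell-shaped even though the raw data do not
hist(sample_means, main = "Means of 30 exponential draws",
     xlab = "Sample mean")

# compare the simulated spread with the theoretical standard error
mean(sample_means)
sd(sample_means)
1 / sqrt(sample_size)
```

An exponential distribution with rate 1 has mean 1 and standard deviation 1, so the means should centre near 1 with a spread close to 1 / sqrt(30).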

Working with data (note some sample sizes may be too small for these to all be good ideas!)

Make sure you are comfortable with null and alternative hypotheses for all examples. You should also feel comfortable graphing the data.

3

Seven observers were shown, for a brief period, a grill with 161 flies impaled and were asked to estimate the number. The results are given by Cochran (1954). Based on five estimates, they were 183.2, 149.0, 154.0, 167.2, 187.2, 158.0, and 143.0. Test the null hypothesis that the mean of the estimates is 161 flies.

- Assuming variance = 275
- Estimating the variance from the data
- Using rank transform analysis
- Using binary transform analysis

Note there are several ways to load the data! You can make a vector (since the set of values is short):

flies <- c(183.2, 149.0, 154.0, 167.2, 187.2, 158.0, 143.0 )

or make a dataframe in spreadsheet software (e.g., Excel, Google Sheets) and then upload it using a read.csv command. We did this in your introduction to R!
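
The spreadsheet route can be sketched as follows. To keep the example self-contained, it writes the values to a temporary CSV file instead of using a file exported from a spreadsheet; the column name `estimate` and the object names are assumptions made for illustration.

```r
# the fly estimates given in the question
flies <- c(183.2, 149.0, 154.0, 167.2, 187.2, 158.0, 143.0)

# stand-in for a file saved from a spreadsheet: write a temporary CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(estimate = flies), csv_path, row.names = FALSE)

# read it back in as a dataframe, as you would with your own file
flies_data <- read.csv(csv_path)
str(flies_data)

# the column can then be passed to any of the one-sample tests
flies_data$estimate
```

With your own file, replace `csv_path` with the path to the CSV you saved.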

4

Yields of 10 strawberry plants in a uniformity trial are given by Baker and Baker (1953) as 239, 176, 235, 217, 234, 216, 318, 190, 181, and 225 g. Test the hypothesis that µ = 205 g.

- Assuming variance = 1500
- Estimating the variance from the data
- Using rank transform analysis
- Using binary transform analysis

5

Evolutionary geneticists predict the family sex ratio will be 80% female in broods of eagles that successfully fledge >3 young. Nests that fledge 3 or more chicks are very rare, but a sample of 30 chicks is obtained from such nests, and they yield 25 females and 5 males. Test the hypotheses that:

- a) the sex ratio is 50% females
- b) the sex ratio is 80% females

6

Studies of flying snakes have led researchers to posit the mean undulation rate is 1.4 Hz. You wish to test this hypothesis using the small sample of undulation rates shown below. Create a small dataset of the paradise tree snake undulation rates and choose and justify a test you can use to assess the data.

Undulation rates (in Hz): 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6

7

Using data from Australian athletes (see http://www.statsci.org/data/oz/ais.html for details), determine whether the average male training at the Australian Institute of Sport differs in weight from the average Australian male (85.9 kg) using bootstrapping techniques. The data can be loaded with

sport <- read.table("http://www.statsci.org/data/oz/ais.txt", header = TRUE,
                    stringsAsFactors = TRUE)