library(ggplot2)
ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) +
geom_point(size = 3) +
ylab("Sepal Length")+ggtitle("Sepal length increases \n with petal length")+
theme(axis.title.x = element_text(face="bold", size=28),
axis.title.y = element_text(face="bold", size=28),
axis.text.y = element_text(size=20),
axis.text.x = element_text(size=20),
legend.text =element_text(size=20),
legend.title = element_text(size=20, face="bold"),
plot.title = element_text(hjust = 0.5, face="bold", size=32))+
xlab("Petal length (cm)") +
ylab("Sepal length (cm)")
Relationships among numerical variables
Remember you should
- add code chunks by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I to answer the questions!
- render your file to produce a markdown version that you can see!
- save your work often
- commit it via git!
- push updates to github
Overview
This practice reviews the Relationships among numerical variabless lecuture.
Example
Following the iris dataset from class, let’s investigate the relationships between sepal and petal length. First, we can plot the data.
A linear relationship appears logical here. We can build the model and check assumptions.
<- lm(Sepal.Length ~ Petal.Length, iris)
iris_regression plot(iris_regression)
Visual checks of assumptions appear to be met, so we can determine if the slope differs from 0.
library(car)
Warning: package 'car' was built under R version 4.4.1
Loading required package: carData
Anova(iris_regression, type = "III")
Anova Table (Type III tests)
Response: Sepal.Length
Sum Sq Df F value Pr(>F)
(Intercept) 500.16 1 3018.28 < 2.2e-16 ***
Petal.Length 77.64 1 468.55 < 2.2e-16 ***
Residuals 24.53 148
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_regression)
Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.24675 -0.29657 -0.01515 0.27676 1.00269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.30660 0.07839 54.94 <2e-16 ***
Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
We find that F1,148=468.5, p < 0.01, so we reject the null hypothesis that the slope is equal to 0. The estimate indicates a slope of 0.408, so sepal length increases with petal length. An R2 value of 0.76 indicates petal length explains about 75% of the variation in sepal length.
Alternatively, we could consider if the association between the two variables is equal to 0 (or not).
cor.test(~ Sepal.Length + Petal.Length, data = iris)
Pearson's product-moment correlation
data: Sepal.Length and Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
We find the same p-value (using a t-distribution), and see that the estimated linear correlation coefficient, .87, is the square of our R2 value.
If we prefer a rank-based test, we can update the code:
cor.test(~ Sepal.Length + Petal.Length, data = iris,
method="spearman")
Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
exact p-value with ties
Spearman's rank correlation rho
data: Sepal.Length and Petal.Length
S = 66429, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8818981
Which again leads us to reject the null hypothesis. Finally, a bootstrap options can be produced
<- Boot(iris_regression) bootstrap_iris
Loading required namespace: boot
Confint(bootstrap_iris)
Bootstrap bca confidence intervals
Estimate 2.5 % 97.5 %
(Intercept) 4.3066034 4.1581328 4.451639
Petal.Length 0.4089223 0.3720202 0.446677
where estimates indicated the slope does not contain 0.
Swirl lesson
Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.
THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN!
install the “swirl” package
run the following code once on the computer to install a new course
library(swirl) install_course_github("jsgosnell", "JSG_swirl_lessons")
start swirl!
swirl()
- swirl()
then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want
- Here we will focus on the Relationships among numerical variables lesson
TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by
exiting swirl using escape key or bye() function
bye()
uninstalling and reinstalling courses
uninstall_all_courses() install_course_github("jsgosnell", "JSG_swirl_lessons")
when you restart swirl with swirl(), you may need to select
- No. Let me start something new
Practice
1
- A professor carried out a long-term study to see how various factors impacted pulse rate before and after exercise. Data can be found at
http://www.statsci.org/data/oz/ms212.txt
With more info at
http://www.statsci.org/data/oz/ms212.html.
Is there evidence that age, height, or weight impact change in pulse rate for students who ran (Ran column = 1)? For each of these, how much variation in pulse rate do they explain?
<- read.table("http://www.statsci.org/data/oz/ms212.txt", header = T, stringsAsFactors = T)
pulse $change <- pulse$Pulse2 - pulse$Pulse1
pulse#need to make columns entered as numeral change to factor, although it doesn't
#really matter when only 2 groups (why?)
$Exercise <-as.factor(pulse$Exercise)
pulse$Gender <- as.factor(pulse$Gender)
pulse
#age
<- lm(change ~ Age, pulse[pulse$Ran == 1, ])
exercise par(mfrow =c (2,2))
plot(exercise)
require(car)
Anova(exercise, type = "III")
Anova Table (Type III tests)
Response: change
Sum Sq Df F value Pr(>F)
(Intercept) 3882.7 1 8.6317 0.005242 **
Age 222.7 1 0.4950 0.485395
Residuals 19792.3 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(exercise)
Call:
lm(formula = change ~ Age, data = pulse[pulse$Ran == 1, ])
Residuals:
Min 1Q Median 3Q Max
-41.512 -12.183 2.591 12.893 44.868
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.3759 22.9328 2.938 0.00524 **
Age -0.7932 1.1274 -0.704 0.48539
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.21 on 44 degrees of freedom
Multiple R-squared: 0.01113, Adjusted R-squared: -0.01135
F-statistic: 0.495 on 1 and 44 DF, p-value: 0.4854
First we need to make a column that shows change in pulse rate. We also should change Exercise and gender to factors.
For age we note the model meets assumptions (no patterns in residuals and residuals follow a normal distribution). We also find no evidence that age impacts change (F1,44 = .4950, p = 0.49). We fail to reject our null hypothesis that there is no relationship between age and change in pulse rate. We also note that age only explains 1.1% of the variation in change in pulse rate (likely due to chance!).
#weight
<- lm(change ~ Weight, pulse[pulse$Ran == 1, ])
exercise par(mfrow =c (2,2))
plot(exercise)
Anova(exercise, type = "III")
Anova Table (Type III tests)
Response: change
Sum Sq Df F value Pr(>F)
(Intercept) 3588.9 1 7.9618 0.007143 **
Weight 181.5 1 0.4027 0.528990
Residuals 19833.4 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(exercise)
Call:
lm(formula = change ~ Weight, data = pulse[pulse$Ran == 1, ])
Residuals:
Min 1Q Median 3Q Max
-43.173 -17.343 1.967 13.503 42.760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.1276 14.9300 2.822 0.00714 **
Weight 0.1381 0.2176 0.635 0.52899
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.23 on 44 degrees of freedom
Multiple R-squared: 0.009069, Adjusted R-squared: -0.01345
F-statistic: 0.4027 on 1 and 44 DF, p-value: 0.529
For weight we note the model meets assumptions. We also find no evidence that weight impacts change (F1,44 = .4027, p = 0.53). We fail to reject our null hypothesis that there is no relationship between weight and change in pulse rate. We also note that weight only explains 1% of the variation in change in pulse rate (likely due to chance!).
#height
<- lm(change ~ Height, pulse[pulse$Ran == 1, ])
exercise par(mfrow =c (2,2))
plot(exercise)
Anova(exercise, type = "III")
Anova Table (Type III tests)
Response: change
Sum Sq Df F value Pr(>F)
(Intercept) 243.9 1 0.5503 0.4621
Height 511.4 1 1.1536 0.2886
Residuals 19503.6 44
summary(exercise)
Call:
lm(formula = change ~ Height, data = pulse[pulse$Ran == 1, ])
Residuals:
Min 1Q Median 3Q Max
-42.798 -17.012 1.848 12.177 43.861
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.0688 28.4017 0.742 0.462
Height 0.1773 0.1650 1.074 0.289
Residual standard error: 21.05 on 44 degrees of freedom
Multiple R-squared: 0.02555, Adjusted R-squared: 0.003402
F-statistic: 1.154 on 1 and 44 DF, p-value: 0.2886
For height we note the model meets assumptions. We also find no evidence that weight impacts change (F1,44 = 1.15, p = 0.29). We fail to reject our null hypothesis that there is no relationship between height and change in pulse rate. We also note that age only explains 2.5% of the variation in change in pulse rate (likely due to chance!).
2
- (from OZDASL repository, http://www.statsci.org/data/general/stature.html; reference for more information)
When anthropologists analyze human skeletal remains, an important piece of information is living stature. Since skeletons are commonly based on statistical methods that utilize measurements on small bones. The following data was presented in a paper in the American Journal of Physical Anthropology to validate one such method. Data is available @
http://www.statsci.org/data/general/stature.txt
as a tab-delimted file (need to use read.table!) Is there evidence that metacarpal bone length is a good predictor of stature? If so, how much variation does it account for in the response variable?
<- read.table("http://www.statsci.org/data/general/stature.txt",
height header = T, stringsAsFactors = T)
head(height)
MetaCarp Stature
1 45 171
2 51 178
3 39 157
4 41 163
5 48 172
6 49 183
<- lm(Stature ~ MetaCarp, height)
metacarp_relationship plot(metacarp_relationship)
Anova(metacarp_relationship, type = "III")
Anova Table (Type III tests)
Response: Stature
Sum Sq Df F value Pr(>F)
(Intercept) 515.73 1 28.491 0.001078 **
MetaCarp 347.29 1 19.186 0.003234 **
Residuals 126.71 7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(metacarp_relationship)
Call:
lm(formula = Stature ~ MetaCarp, data = height)
Residuals:
Min 1Q Median 3Q Max
-4.0102 -3.1091 -1.1128 0.3891 7.4880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.428 17.691 5.338 0.00108 **
MetaCarp 1.700 0.388 4.380 0.00323 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.255 on 7 degrees of freedom
Multiple R-squared: 0.7327, Adjusted R-squared: 0.6945
F-statistic: 19.19 on 1 and 7 DF, p-value: 0.003234
To consider the relationship among these continuous variables, we used linear regression. Analysis of model assumptions suggest assumptions are met, although the dataset is small. Analysis suggests there is a significant positive relationship between metacarpal length and stature (F1,7 = 19.19, p = 0.003). The R2 value indicates that metacarpal length explains 73% of the variation in stature. Coefficients indicate that stature increases with increasing metacarpal length.
3
- Data on medals won by various countries in the 1992 and 1994 Olympics is available in a tab-delimited file at
http://www.statsci.org/data/oz/medals.txt
More information on the data can be found at:
http://www.statsci.org/data/oz/medals.html
Is there any relationship between a country’s population and the total number of medals they win?
<- read.table(header = T, "http://www.statsci.org/data/oz/medals.txt",
medals stringsAsFactors = T)
head(medals)
Country Summer Winter Population Latitude
1 UnifiedTeam 112 34 231.5 61
2 UnitesStates 108 13 260.7 38
3 Germany 82 24 81.1 51
4 China 54 3 1190.4 36
5 Cuba 31 0 11.1 22
6 Hungary 30 0 10.3 46
$total <- medals$Summer + medals$Winter
medals<- lm(total ~ Population, medals)
population_medals plot(population_medals)
summary(population_medals)
Call:
lm(formula = total ~ Population, data = medals)
Residuals:
Min 1Q Median 3Q Max
-36.470 -12.303 -9.525 4.379 118.141
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.01849 3.51123 3.423 0.00112 **
Population 0.06842 0.02117 3.233 0.00199 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.8 on 60 degrees of freedom
Multiple R-squared: 0.1483, Adjusted R-squared: 0.1341
F-statistic: 10.45 on 1 and 60 DF, p-value: 0.001994
Anova(population_medals, type = "III")
Anova Table (Type III tests)
Response: total
Sum Sq Df F value Pr(>F)
(Intercept) 7799 1 11.716 0.001122 **
Population 6957 1 10.450 0.001994 **
Residuals 39942 60
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cor.test(~total + Population, medals, method = "spearman")
Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
exact p-value with ties
Spearman's rank correlation rho
data: total and Population
S = 29456, p-value = 0.04271
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.2582412
There is a high leverage point in the dataset (row 4), but residuals appear to be fairly normally distributed and little structure exists in the graph of Residuals vs. Fitted Values. Analysis using linear regression suggests a significant ( F1,60 = 10.45, p = 0.002) positive relationship between population size and medal count that explains ~15% of the variation in the response variable. Rank- correlation analysis also indicated this relationship exists.
4
- Continuing with the Olympic data, is there a relationship between the latitude of a country and the number of medals won in summer or winter Olympics?
#still using medals
<- lm(Summer ~ Latitude, medals)
summer_medals plot(summer_medals)
Anova(summer_medals, type = "III")
Anova Table (Type III tests)
Response: Summer
Sum Sq Df F value Pr(>F)
(Intercept) 3.6 1 0.0075 0.93143
Latitude 2440.3 1 5.0389 0.02848 *
Residuals 29057.2 60
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(summer_medals)
Call:
lm(formula = Summer ~ Latitude, data = medals)
Residuals:
Min 1Q Median 3Q Max
-19.707 -10.856 -4.922 0.352 93.827
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5403 6.2531 0.086 0.9314
Latitude 0.3588 0.1598 2.245 0.0285 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.01 on 60 degrees of freedom
Multiple R-squared: 0.07747, Adjusted R-squared: 0.0621
F-statistic: 5.039 on 1 and 60 DF, p-value: 0.02848
<- lm(Winter ~ Latitude, medals)
winter_medals plot(winter_medals)
Anova(winter_medals, type = "III")
Anova Table (Type III tests)
Response: Winter
Sum Sq Df F value Pr(>F)
(Intercept) 90.07 1 2.2353 0.1401300
Latitude 502.29 1 12.4652 0.0008035 ***
Residuals 2417.71 60
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(winter_medals)
Call:
lm(formula = Winter ~ Latitude, data = medals)
Residuals:
Min 1Q Median 3Q Max
-6.906 -3.773 -1.383 1.395 26.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.6967 1.8037 -1.495 0.140130
Latitude 0.1628 0.0461 3.531 0.000803 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.348 on 60 degrees of freedom
Multiple R-squared: 0.172, Adjusted R-squared: 0.1582
F-statistic: 12.47 on 1 and 60 DF, p-value: 0.0008035
Visual analysis of residuals from both models show some structure in the residual and deviations from normality, but we continue on with linear regression given the small sample size. Both summer and winter medal counts are positively (surpisingly) and significantly (both p <.05) related to latitude, with latitude explaining ~17% of the variation in winter medal count and ~8% of the data in summer medal count.
5
- Data on FEV (forced expiratory volume), a measure of lung function, can be found at
http://www.statsci.org/data/general/fev.txt
More information on the dataset is available at
http://www.statsci.org/data/general/fev.html.
Is there evidence that FEV depends on age or height? If so, how do these factors impact FEV, and how much variance does each explain?
<- read.table("http://www.statsci.org/data/general/fev.txt", header = T,
fev stringsAsFactors = T)
head(fev)
ID Age FEV Height Sex Smoker
1 301 9 1.708 57.0 Female Non
2 451 8 1.724 67.5 Female Non
3 501 7 1.720 54.5 Female Non
4 642 9 1.558 53.0 Male Non
5 901 9 1.895 57.0 Male Non
6 1701 8 2.336 61.0 Female Non
<- lm(FEV ~ Height, fev)
fev_height plot(fev_height)
Anova(fev_height, type = "III")
Anova Table (Type III tests)
Response: FEV
Sum Sq Df F value Pr(>F)
(Intercept) 166.25 1 896.33 < 2.2e-16 ***
Height 369.99 1 1994.73 < 2.2e-16 ***
Residuals 120.93 652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fev_height)
Call:
lm(formula = FEV ~ Height, data = fev)
Residuals:
Min 1Q Median 3Q Max
-1.75167 -0.26619 -0.00401 0.24474 2.11936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.432679 0.181460 -29.94 <2e-16 ***
Height 0.131976 0.002955 44.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4307 on 652 degrees of freedom
Multiple R-squared: 0.7537, Adjusted R-squared: 0.7533
F-statistic: 1995 on 1 and 652 DF, p-value: < 2.2e-16
Model assumptions appear to be met. Height appears to have a positive relationship with FEV (F1,652 = 1995, p<.001).
<- lm(FEV ~ Age, fev)
fev_age plot(fev_age)
Anova(fev_age, type = "III")
Anova Table (Type III tests)
Response: FEV
Sum Sq Df F value Pr(>F)
(Intercept) 9.89 1 30.707 4.359e-08 ***
Age 280.92 1 872.184 < 2.2e-16 ***
Residuals 210.00 652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fev_age)
Call:
lm(formula = FEV ~ Age, data = fev)
Residuals:
Min 1Q Median 3Q Max
-1.57539 -0.34567 -0.04989 0.32124 2.12786
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.431648 0.077895 5.541 4.36e-08 ***
Age 0.222041 0.007518 29.533 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5675 on 652 degrees of freedom
Multiple R-squared: 0.5722, Adjusted R-squared: 0.5716
F-statistic: 872.2 on 1 and 652 DF, p-value: < 2.2e-16
Model assumptions appear to be met. Age appears to have a positive relationship with FEV (F1,652 = 872.2, p<.001).
6
- Continuing with the FEV data, produce plots that illustrate how height, age, and gender each impact FEV.
library(ggplot2)
#age plot####
ggplot(fev, aes(x=Age, y=FEV)) +
geom_point(size = 3) +
geom_smooth(method = "lm") +
ylab("FEV (L)")+ggtitle("FEV increases with age")+
theme(axis.title.x = element_text(face="bold", size=28),
axis.title.y = element_text(face="bold", size=28),
axis.text.y = element_text(size=20),
axis.text.x = element_text(size=20),
legend.text =element_text(size=20),
legend.title = element_text(size=20, face="bold"),
plot.title = element_text(hjust = 0.5, face="bold", size=32))
`geom_smooth()` using formula = 'y ~ x'
#height plot####
ggplot(fev, aes(x=Height, y=FEV)) +
geom_point(size = 3) +
geom_smooth(method = "lm") +
ylab("FEV (L)")+ggtitle("FEV increases with height")+
theme(axis.title.x = element_text(face="bold", size=28),
axis.title.y = element_text(face="bold", size=28),
axis.text.y = element_text(size=20),
axis.text.x = element_text(size=20),
legend.text =element_text(size=20),
legend.title = element_text(size=20, face="bold"),
plot.title = element_text(hjust = 0.5, face="bold", size=32))
`geom_smooth()` using formula = 'y ~ x'
#gender plot ####
#bar chart with error bars ####
library(Rmisc)
Loading required package: lattice
Loading required package: plyr
<- summarySE(fev, measurevar="FEV", groupvars =
function_output c("Sex"))
ggplot(function_output, aes(x=Sex, y=FEV)) +
geom_col(size = 3) +
ylab("FEV (L)") +
ggtitle("FEV is higher in males ")+
geom_errorbar(aes(ymin=FEV-ci, ymax=FEV+ci), size=1.5) +
theme(axis.title.x = element_text(face="bold", size=28),
axis.title.y = element_text(face="bold", size=28),
axis.text.y = element_text(size=20),
axis.text.x = element_text(size=20),
legend.text =element_text(size=20),
legend.title = element_text(size=20, face="bold"),
plot.title = element_text(hjust = 0.5, face="bold", size=32))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
7
Find an example of a regression or correlation from a paper that is related to your research or a field of interest. Make sure you understand the connections between the methods, results, and graphs. Briefly answer the following questions
- What was the dependent variable?
- What were the independent variables?
- Was there a relationship between them? If so, describe it (positive, negative).