Relationships among numerical variables

Remember you should

Overview

This practice reviews the Relationships among numerical variabless lecuture.

Example

Following the iris dataset from class, let’s investigate the relationships between sepal and petal length. First, we can plot the data.

library(ggplot2)
ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) +
  geom_point(size = 3) +
  ylab("Sepal Length")+ggtitle("Sepal length increases \n with petal length")+
  theme(axis.title.x = element_text(face="bold", size=28), 
        axis.title.y = element_text(face="bold", size=28), 
        axis.text.y  = element_text(size=20),
        axis.text.x  = element_text(size=20), 
        legend.text =element_text(size=20),
        legend.title = element_text(size=20, face="bold"),
        plot.title = element_text(hjust = 0.5, face="bold", size=32))+
  xlab("Petal length (cm)") +
  ylab("Sepal length (cm)")

A linear relationship appears logical here. We can build the model and check assumptions.

iris_regression <- lm(Sepal.Length ~ Petal.Length, iris)
plot(iris_regression)

Visual checks of assumptions appear to be met, so we can determine if the slope differs from 0.

library(car)
Warning: package 'car' was built under R version 4.4.1
Loading required package: carData
Anova(iris_regression, type = "III")
Anova Table (Type III tests)

Response: Sepal.Length
             Sum Sq  Df F value    Pr(>F)    
(Intercept)  500.16   1 3018.28 < 2.2e-16 ***
Petal.Length  77.64   1  468.55 < 2.2e-16 ***
Residuals     24.53 148                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_regression)

Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24675 -0.29657 -0.01515  0.27676  1.00269 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.30660    0.07839   54.94   <2e-16 ***
Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

We find that F1,148=468.5, p < 0.01, so we reject the null hypothesis that the slope is equal to 0. The estimate indicates a slope of 0.408, so sepal length increases with petal length. An R2 value of 0.76 indicates petal length explains about 75% of the variation in sepal length.

Alternatively, we could consider if the association between the two variables is equal to 0 (or not).

cor.test(~ Sepal.Length + Petal.Length, data = iris)

    Pearson's product-moment correlation

data:  Sepal.Length and Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 

We find the same p-value (using a t-distribution), and see that the estimated linear correlation coefficient, .87, is the square of our R2 value.

If we prefer a rank-based test, we can update the code:

cor.test(~ Sepal.Length + Petal.Length, data = iris,
         method="spearman")
Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
exact p-value with ties

    Spearman's rank correlation rho

data:  Sepal.Length and Petal.Length
S = 66429, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8818981 

Which again leads us to reject the null hypothesis. Finally, a bootstrap options can be produced

bootstrap_iris <- Boot(iris_regression)
Loading required namespace: boot
Confint(bootstrap_iris)
Bootstrap bca confidence intervals

              Estimate    2.5 %    97.5 %
(Intercept)  4.3066034 4.156400 4.4474298
Petal.Length 0.4089223 0.371636 0.4496046

where estimates indicated the slope does not contain 0.

Swirl lesson

Swirl is an R package that provides guided lessons to help you learn and review material. These lessons should serve as a bridge between all the code provided in the slides and background reading and the key functions and concepts from each lesson. A full course lesson (all lessons combined) can also be downloaded using the following instructions.

THIS IS ONE OF THE FEW TIMES I RECOMMEND WORKING DIRECTLY IN THE CONSOLE! THERE IS NO NEED TO DEVELOP A SCRIPT FOR THESE INTERACTIVE SESSIONS, THOUGH YOU CAN!

  • install the “swirl” package

  • run the following code once on the computer to install a new course

    library(swirl)
    install_course_github("jsgosnell", "JSG_swirl_lessons")
  • start swirl!

    swirl()
    • swirl()
  • then follow the on-screen prompts to select the JSG_swirl_lessons course and the lessons you want

    • Here we will focus on the Relationships among numerical variables lesson
  • TIP: If you are seeing duplicate courses (or odd versions of each), you can clear all courses and then re-download the courses by

    • exiting swirl using escape key or bye() function

      bye()
    • uninstalling and reinstalling courses

      uninstall_all_courses()
      install_course_github("jsgosnell", "JSG_swirl_lessons")
    • when you restart swirl with swirl(), you may need to select

      • No. Let me start something new

Practice

1

A professor carried out a long-term study to see how various factors impacted pulse rate before and after exercise. Data can be found at

http://www.statsci.org/data/oz/ms212.txt

With more info at

http://www.statsci.org/data/oz/ms212.html.

Is there evidence that age, height, or weight impact change in pulse rate for students who ran (Ran column = 1)? For each of these, how much variation in pulse rate do they explain?

2

(from OZDASL repository, http://www.statsci.org/data/general/stature.html; reference for more information)

When anthropologists analyze human skeletal remains, an important piece of information is living stature. Since skeletons are commonly based on statistical methods that utilize measurements on small bones. The following data was presented in a paper in the American Journal of Physical Anthropology to validate one such method. Data is available @

http://www.statsci.org/data/general/stature.txt

as a tab-delimted file (need to use read.table!) Is there evidence that metacarpal bone length is a good predictor of stature? If so, how much variation does it account for in the response variable?

3

Data on medals won by various countries in the 1992 and 1994 Olympics is available in a tab-delimited file at

http://www.statsci.org/data/oz/medals.txt

More information on the data can be found at:

http://www.statsci.org/data/oz/medals.html

Is there any relationship between a country’s population and the total number of medals they win?

4

Continuing with the Olympic data, is there a relationship between the latitude of a country and the number of medals won in summer or winter Olympics?

5

Data on FEV (forced expiratory volume), a measure of lung function, can be found at

http://www.statsci.org/data/general/fev.txt

More information on the dataset is available at

http://www.statsci.org/data/general/fev.html.

Is there evidence that FEV depends on age or height? If so, how do these factors impact FEV, and how much variance does each explain?

6

Continuing with the FEV data, produce plots that illustrate how height, age, and gender each impact FEV.

7

Find an example of a regression or correlation from a paper that is related to your research or a field of interest. Make sure you understand the connections between the methods, results, and graphs. Briefly answer the following questions

  • What was the dependent variable?
  • What were the independent variables?
  • Was there a relationship between them? If so, describe it (positive, negative).