Linear model extensions

When to trust it, and how to fix it when it’s broken

The past several chapters/lessons have focused on linear models. Here we explore options for analysis when linear model assumptions are not met. In doing so, we review several different approaches, many of which could be (and are) the focal topic of courses and books. The goal here is to

Citations to relevant articles papers and books are noted throughout for further exploration.

Sticking with the linear model

Linear models are useful for a number of reasons. They are a great way to unify most/many tests from classical statistics. In fact, most of the ranked tests we’ve developed can actually be run as linear models when n >15. For example, we can go back to our Wilcox-Mann Whitney U tests (for 2 populations) and Kruskal-Wallis (for 3+) from the comparing means among groups chapter and note the outcome from a wilcox.test

two_species_subset <- iris[iris$Species!="setosa",]
wilcox.test(Sepal.Length ~ Species, two_species_subset)

    Wilcoxon rank sum test with continuity correction

data:  Sepal.Length by Species
W = 526, p-value = 5.869e-07
alternative hypothesis: true location shift is not equal to 0

is very close to a linear model predicting the signed rank of the data

library(car)
signed_rank = function(x) sign(x) * rank(abs(x))
Anova(lm(signed_rank(Sepal.Length) ~ Species, two_species_subset), type="III")
Anova Table (Type III tests)

Response: signed_rank(Sepal.Length)
            Sum Sq Df F value    Pr(>F)    
(Intercept)  64872  1 102.378 < 2.2e-16 ***
Species      20967  1  33.089 1.003e-07 ***
Residuals    62098 98                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In fact, we could run simulations and show that p values from these 2 approaches are highly correlated (now you know what that means) with a \(\beta\) of almost 1 (from Lindelov (n.d.)).

Linear models are also extremely robust. Consider the basic assumptions of a linear model

\[ \epsilon \approx i.i.d.\ N(\mu,\sigma) \]

Although the residuals are meant to be homoscedastic (equal or constant across all groups), it turns out the model is robust of when the largest group variance is 4-10x larger than the smallest group variance and sample sizes are approximately equal (Blanca et al. 2018; Fox 2015; A. F. Zuur, Ieno, and Elphick 2010), though highly uneven group sizes begin to cause issues (Blanca et al. 2018).

Similarly, non-normal data is not an issue. This is partially because the assumptions are focused on residuals, but also because the procedure is highly robust (Blanca et al. 2017). This finding further supports the graphical consideration of assumptions , especially since many tests for normality are conservative (samples are almost never perfectly normal, and slight deviations are easier to pick up with larger sample sizes despite the fact the central limit theorem suggests this is when the issue is least important) (A. F. Zuur, Ieno, and Elphick 2010; Shatz 2023).

These issues have led some authors to argue violations of the linear model assumptions are less dangerous than trying new, and often less-tested, techniques that may inflate type I error rates (Knief and Forstmeier 2021; D. I. Warton et al. 2016). However, new techniques (or sometimes old techniques that are new to a given field) may be more appropriate when assumptions are clearly broken (D. Warton and Hui 2010; Reitan and Nielsen 2016; Geissinger et al. 2022). In this section we explore common-ish approaches for analyzing data when the assumptions of a linear model are broken. Our goal here is to introduce these methods. Each could be the focus of their own class, book, or much larger study. Fortunately most can be viewed as extensions to our existing knowledge, although many of the assumptions and techniques for them are less developed/still being argued about.

The various assumptions related to linear models may be prioritized on their relative importance. One such order is provided (roughly) by Gelman and Hill (2006).

  • Model validity

    • As noted in the multiple regression chapter, we only should investigate relationships we have a mechanism to explain
  • Linear relationship and additive effects of predictor variables

  • Errors are

    • independently distributed

    • identical (homoscedastic)

    • follow a normal distribution

Many datasets will violate multiple of these assumptions simultaneously, so addressing issues is often best resolved by understanding why this is happening.

Linear relationship is inappropriate

A central (and often overlooked assumption) of linear models is that the relationship between the predictors and the outcome variable is linear and additive. When the relationship is not linear, the resulting residuals are often not appropriately distributed as well.

What’s linear anyway?

To be clear, the linear model only focuses on the linear and additive relationship between the predictors and the outcome variable (this will become more important/obvious later in this section!). The model doesn’t know what the variables are. For that reason, we can add predictor variables to a model that are squares or cubes of predictors, or we can transform the outcome variable. We just need the \(Y = X\beta\) relationship to be additive and linear.

When non-linearity occurs, several options to address it exist. The best approach may depend on why the relationship is non-linear, as relationships among variables may be non-linear for a number of reasons. For example, the outcome may not actually be continuous (e.g. counts. proportions), and thus true linear relationships are not possible; outcomes may also be related to the predictors in different ways (e.g., logarithmic).

Transform the data (least advisable, sometimes)

One way to address non-linearity is to transform the data (typically focusing on the dependent variable) so that the resulting variable meets the linear model assumptions (and thus uses the strengths of linear models that we noted above). As shown above, our rank-based approaches are using a similar method (not technically the same, but it works for larger sample sizes). This approach was often used in the past (e.g., arc-sin transforms of proportion data D. Warton and Hui (2010)) and supported by various approaches. For example, the Box-Cox transformation helped researchers find the best way to transform data to reduce issues with the distribution of residuals; this method also tended to impact linearity and differences in variances.

Two things should be noted regarding this approach and transformations in general:

  • The Box-Cox approach requires a model - it still focused on transforming data so that residuals meet assumptions. Data should not be transformed before model assumptions are analyzed to ensure it is necessary. For example, highly skewed data may arise due to unequal sample sizes (which may pose their own problems, but not outright), but models using this data may meet assumptions.
set.seed(8)
example_skewed <- data.frame(Population= c(rep("a",40),
                                           rep("b", 30),
                                           rep(letters[3:4], each=15)), 
                             Growth = c(rnorm(40,1),
                                    rnorm(30, 4),
                                    rnorm(15, 7),
                                    rnorm(15,10)))
library(ggplot2)
ggplot(example_skewed,aes(x=Growth))+
  geom_histogram()+
  labs(y="Some value",
       title="Example of skewed data appearing due to unequal sample size")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(lm(Growth~Population, example_skewed))

Anova(lm(Growth~Population, example_skewed), type="III")
Anova Table (Type III tests)

Response: Growth
             Sum Sq Df F value    Pr(>F)    
(Intercept)   38.75  1  32.436 1.344e-07 ***
Population  1007.71  3 281.175 < 2.2e-16 ***
Residuals    114.69 96                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Similarly (and already noted!). non-normal data may lead to models with normal residuals. However, normal data (or data transformed to be more normal) does typically lead to normal residuals, so if residuals are not normal transformations may help.

  • The transformed variable is now linear in respect to the predictors. This highlights the actual assumption of the model. Similarly, higher-terms (squares, cubes, etc) may be added to a linear model. The model does not care what the data represent - it only focuses on if linear relationships exist among them.

  • Transformations can make model interpretation and prediction difficult.

If the decision is made to transform the data, several approaches exist. Some are driven by the distribution of the data, and all depend on it. For example, log and related root transformations are useful for right-skewed data, but some can only be carried out for non-negative (e.g., square root) or strictly positive (e.g., log) values. To address this for log transformations, a small value is often added to 0 measurements.

Let’s use data (not residuals) to show what different types of data look like and consider possible fixes (always fit a model first for real analysis!). For example, we can return to our right-skewed blue jay data from the summarizing data chapter(target=“_blank”)(idea from (Hunt, n.d.)) .

set.seed(19)
blue_jays <- round(rbeta(10000,2,8)*100+runif(10000,60,80),3)
ggplot(data.frame(blue_jays), 
       aes(x=blue_jays)) +
  geom_histogram( fill="blue", color="black") +
  labs(title="Weight of Westchester blue jays",
       x= "Weight (g)",
       y= "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note the right-skewed data shows a convex curve on the Q-Q plot.

qqnorm(blue_jays)
qqline(blue_jays)

To help you understand Q-Q plots, remember they are comparing the relative spread of data (quantiles) from a target and theoretical distribution. For right-skewed data, points are shifted right (both smallest and largest observations are larger than you would expect from a normal distribution).

Conversely, we could also return to our left-skewed cardinal data

set.seed(19)
cardinals <- round(rbeta(10000,10,2)*50+runif(10000,5,10),3)
ggplot(data.frame(cardinals), 
       aes(x=cardinals)) +
  geom_histogram( fill="red", color="black") +
  labs(title="Weight of Westchester cardinals",
       x= "Weight (g)",
       y= "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

and note a concave shape is seen in the Q-Q plot,as all points are shifted left.

qqnorm(cardinals)
qqline(cardinals)

For this type of data, power transformations (raising the variable to the 2, 3, or higher power) may be useful.

Meanwhile, our uniformly-distributed distributed robin data

set.seed(19)
rochester <- round(c(runif(1000,75,85)),3)
ggplot(data.frame(rochester), 
       aes(x=rochester)) +
  geom_histogram( fill="pink", color="black") +
  labs(title="Weight of Rochester robins",
       x= "Weight (g)",
       y= "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shows as a s-shape on the Q-Q plot

qqnorm(rochester)
qqline(rochester)

because it is under-dispersed (has no tails, and thus the lower points are larger than expected and the higher points are smaller than expected). Alternatively, data may be over-dispersed, like this (fake) finch data where lower points are lower than expected and higher points are higher.

set.seed(19)
library(VGAM)
finch <- rlaplace(1000,location=50, scale=4)
ggplot(data.frame(finch), 
       aes(x=finch)) +
  geom_histogram( fill="cyan", color="black") +
  labs(title="Weight of finches",
       x= "Weight (g)",
       y= "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(finch)
qqline(finch)

Over- and under-dispersed data may mean there’s a missing factor in your analysis. For example, our bi-modal woodpecker data

set.seed(19)
woodpeckers <- round(c(rnorm(100,60,4),rnorm(100,80,4)),3)
ggplot(data.frame(woodpeckers), 
       aes(x=woodpeckers)) +
  geom_histogram( fill="orange", color="black") +
  labs(title="Weight of  Westchester woodpeckers",
       x= "Weight (g)",
       y= "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

is under-dispersed due the shape of the distribution, but there also signs of other issues.

qqnorm(woodpeckers)
qqline(woodpeckers)

Under-dispersion could also happen if data are bounded (e.g., by practicality or due to a measurement issue). Over-dispersion can similarly occur if the model does not account for variability (e.g., missing factors, non-linear relationship) and/or outliers, which might be related to the underlying form of the data (Payne et al. 2017) (more to come on this). In this way, over- and under-dispersion relate to both the linear relationship.

Different data require a different approach: Introducing generalized linear models

The linear relationship may be inappropriate because our data doesn’t fit it! For example, if we are modeling proportions, estimates less than 0 or below 1 do not make sense, but a linear model doesn’t account for that. The same issues arise for binary outcomes (outcome must be 0 or 1) and count data (where outcomes can’t be non-integer). While we may be able to transform the response variable to make relationships linear, another option is to translate the link function.

Although we haven’t fully explained it yet, a linear model contains three components. There is always a random component that focuses on the distribution of the data. There is also a systematic component, where a number of covariates and data points produce an estimate. Finally, there is the link function, which connects that estimate to the data distribution (this maintains the linearity of the model).

Notice this means the link connects to the distribution of the data, not the data itself (which is what transforming the data is doing). So far we have focused on the mean of the data, and the estimate is the mean, so the link has been implied. We can generalize this setup to include other distributions in the exponential family (and now others (“Applied Regression Analysis and Generalized Linear Models” 2022, ch. 15)).

While we won’t develop all the math, exponential family distributions all have a canonical parameter (the mean for Gaussian, or normal, data) and a dispersion parameter (the variance for normal data). Different distributions have different relationships between these two parameters. For normal data, there is no relationship, so the variance is constant. This is not true for other families; for example, the variance in binomial data depends on the p parameter (as we noted in the [chapter introducing binomial data](Binomial.qmd){target=“_blank”))!

We can specify other families using generalized linear models (which are different than general linear models, which is another term used to describe our “regular” linear models). These models are also known as glm or glim (we’ll use the glm jargon here). GLM make different assumptions (though everyone does not agree on what they are!). While we still need independence of the residuals (or some extension, see below), they no longer need to be normally distributed (A. Zuur, Ieno, and Smith 2007, 86–87; A. Zuur et al. 2009, sec. 9.8.3) (though normality may hold at large sample sizes for Pearson (Agresti 2007, 87) or deviance (Montgomery, Peck, and Vining 2012, sec. 13.4.4) residuals, and some authors argue normality is required for testing (Feng, Li, and Sadeghpour 2020)) . As noted above, non-Gaussian families don’t assume a constant variance, so homogeneity assumptions would not be appropriate (Fox 2016). However, we need to ensure the correct family (and thus mean-variance link) is chosen. In general, plotting the residuals against (A. Zuur et al. 2009)

  • the predicted values

    • to check for model fit
  • each explanatory variable in model

    • to check for linearity
  • each explanatory not retained/used in model

    • to check for missing patterns
  • against time and space (if applicable)

    • to check for patterns and correlation

will be useful. Any patterns may suggest missing predictors or lack of independence/correlation among measurements. Patterns in the residuals may also indicate the incorrect family has been chosen (A. Zuur et al. 2009, sec. 9.8.4); for families with fixed dispersion parameters, inspection of the estimated dispersion parameter can also be a related diagnostic.

What are residuals for a GLM?

Residuals for a Gaussian/normal glm (what we’ve been doing) are easy to calculate. We simply subtract the estimate for each data point (which is the average for similar observations!) from the observed. However, though these raw or response residuals exist for GLM, they rarely make sense. Variance may increase with mean for some distributions, for example, or outcomes may be binary (so every residual is 0 or 1!).

To account for this, we could divide the residuals by the standard deviation of the observations. This normalizes the residuals and leads to Pearson residuals. Another option is based on something called deviance. Deviance is similar to sums of squares but based on likelihood (which we shortly see is what we use to test for significance in GLM); it can also be considered the generalized form of variance or residual sums of squares. It is calculated as the difference of log-likelihoods between the focal model and the saturated model (or a model that has a coefficient for every observation and thus will have perfect fit). Deviance residuals consider how important each point is to the overall deviance. You can also compare the deviance of a model to a perfect fit (no deviance/saturated) model and an intercept-only model (worst fit, DO) and generate a pseudo-R2 measure using the formula

\[ 1-\frac{D}{D_O} \]

This is known as McFadden’s pseudo-R2 value.

Once developed, GLMs can use most of the tools we developed for basic linear models. However, a few key differences exist. Estimation of parameters and significance for these models rely on maximum-likelihood methods (remember, the ratio of likelihood values follow a \(\chi^2\) distribution and can thus provide a p-value to compare possible models as noted in introducing G tests{target=“_blank”)).This may be represented as a Z test for single parameters (since a \(\chi^2\) distribution is a Z distribution squared - sometimes labelled a Wald test/statistic). F test reappear for distributions where we estimate the dispersion parameter (like the normal), but otherwise we can use the dispersion parameter to check that the right approach was taken.

It should be noted that solving for coefficients also requires iterative approaches due to non-linearity; common methods (not described here) include the Newton-Raphson or Fisher scoring methods. The key point here for practice is any noted issues with optimization are due to this process. Finally, it should be noted that coefficients and predictions are related to the estimate, which is no longer the actual mean. This means impacts and outcomes will need to be transformed (based on the link) to the actual scale.

There are numerous types of GLMs. Here we outline some of the more common ones. Each of these deserves a much fuller treatment - remember, the intent here is to get you to a place to where you can begin working with them. Harrison et al. (2018) (which also covers mixed models (coming up!)) provides an excellent review of these models, and fuller treatments are provided by A. Zuur et al. (2009) (chapters 8-11) and Fox (2016) (chapters 14 - 15). For each I provide the most relevant math, an example (or two), and some immediate extensions.

Logistic regression: Bernoulli and binomial outcomes

Logistic regression (D. Warton and Hui 2010) focuses on modelling yes/no or success/failure outcomes (Bernoulli or binomial data) using the logit link (thus logistic regression). The easy-to-understand issue here is that we want to focus on probability (\(\pi\)), but that can only vary from 0 to 1. Linear models are thus inappropriate, but the logit link can be used where the systematic component of the model predicts

\[ logit(\pi) = \ln(\frac{\pi}{1-\pi}), \text{where } \pi \text{ is the probability for a given set of outcomes} \]

This logit link is useful as it effectively connects the systematic component and real data (the purpose of the link). This link also allows coefficients to be interpreted as odds and odds ratios (introduced in the comparing proportions among groups- we’ll continue to see connections this chapter for logistic regression). Remember, the odds can be written as

\[ \frac{\pi}{1-\pi} \]

For a given coefficient \(\beta\), the logit link also means we can say a 1 unit increase in that single variable leads to a change in the odds ratio of \(e^{\beta}\). This also means no impact is shown as \(\beta = 0\), with positive and negative numbers implying increases or decreases in odds, which is convenient. Other link options (not discussed here) include the probit and complementary log-log approaches. For a final piece of math, note the dispersion parameter for binomial data is not estimated - it is assumed to be 1.

For example, Blais, Johnson, and Koprowski (2023) investigated how various factors impacted behavior in garter snakes (Thamnophis cyrtopsis).

Thamnophis cyrtopsis (Family: Natricidae). Common name: blackneck garter snake.

Juan Carlos Fonseca Mata, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

One of their analyses considered how snake movement (labelled 0 for no movement, 1 for movement) differed among age classes (adults and juveniles).

behavior <- read.csv("data/database_Thamnophis_crytopsis_study - Copy of THCY_MH_BEH.csv",
                     stringsAsFactors = T)
behavior_table <- table(behavior$movement, behavior$ageclass)
behavior_table
   
    adult juvenile
  0    29       29
  1    14       16

You may recognize this analyses could be carried out using \(\chi^2\) tests.

chisq.test(behavior_table)

    Pearson's Chi-squared test with Yates' continuity correction

data:  behavior_table
X-squared = 0.0051228, df = 1, p-value = 0.9429

which indicate no impact of age class on movement (p=.9429). You (hopefully) also remember we mentioned the Yate’s correction (noted in the output) was due to using a continuous distribution to fit counts. If we remove it,

chisq.test(behavior_table, correct = F)

    Pearson's Chi-squared test

data:  behavior_table
X-squared = 0.087924, df = 1, p-value = 0.7668

we find an answer that we can replicate using glm (since the outcome is a 0 or 1 here!). We do this using the glm function in base R, which is very similar to the lm function; the only difference is we now note the distribution we are exploring using the “family” argument. Note Bernoulli data is considered a subset of binomial data and thus classified as “binomial”.

as_glm <- glm(movement~ageclass, behavior, family = "binomial")

We can then use our normal approach (including summary, Anova, and glht functions) to consider the outcomes. Checking assumptions is a little trickier than for linear models (as noted above). One major check is typically ensuring the dispersion parameter is correct. We can typically check the parameter by dividing the residual deviance by its associated degrees of freedom; this information is provided by the summary function (here, 112.84/86);

summary(as_glm)

Call:
glm(formula = movement ~ ageclass, family = "binomial", data = behavior)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
(Intercept)       -0.7282     0.3254  -2.238   0.0252 *
ageclassjuvenile   0.1335     0.4504   0.296   0.7669  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 112.93  on 87  degrees of freedom
Residual deviance: 112.84  on 86  degrees of freedom
  (15 observations deleted due to missingness)
AIC: 116.84

Number of Fisher Scoring iterations: 4

For Bernoulli data like this, however, over-dispersion is not an issue (Molenberghs et al. 2012). For binomial and Poisson data, this should be approximately one. We can note the impact of given factors using Anova

Anova(as_glm, type="III")
Analysis of Deviance Table (Type III tests)

Response: movement
         LR Chisq Df Pr(>Chisq)
ageclass 0.087974  1     0.7668

which now uses likelihood-based tests by default. Similarly, the summary output notes details related to fitting the model (Number of Fisher scoring iterations). To interpret the coefficient from the glm, we need to consider the link function (here, the logit). If we backtransform the logit link function, we can actually say that juveniles had a \(e^\beta\) greater odds of movement than adults, which is equal to

exp(.1335)
[1] 1.142821

and very close to 1 (thus the high p-value given our small sample size).

While this example simply replicates the \(\chi^2\) tests we learned earlier (deviations may arise when counts are low or high for any group, and thus \(\hat{p}\) approaches extreme values and breaks the assumptions!), the power of these models comes from the ability to consider multiple variables (which we couldn’t do before). These models are sometimes called log-linear models.

To consider this approach and extend glm to binomial data, consider work by Bueno, Machado, and Leite (2020). The authors carried out a series of experiments to determine how various factors impacted colonization of a host algae, Sargassum filipendula, by a small amphipod, Cymadusa filosa.

Rob Young, Centre for Biodiversity Genomics - CC - BY - NC - S

For one experiment, they stocked portions of aquariums with pieces of algal. Half the algae pieces had adult amphipods present(4 total; the species builds tubes to live on algae) while the other half had no adult residents. Juveniles were initially placed on patches of natural substrate, artificial substrate, or bare areas(no substrate) in the aquarium. After 24 hours, the number of juveniles that colonized the focal algae were counted. Since the number of introduced juveniles was known, the number that did not colonize was also accounted for (raw data estimated from supplemental data given variation in sample size).

juvenile <- read.csv ("data/Supplemental_Data_Bueno_2020-S1-Presence of adults_lab.csv", 
                      stringsAsFactors = T)

since the adults column indicates the presence or absence of adults, let’s update it for clarity

library(plyr)
juvenile$adults_updated <- factor(revalue(as.character(juvenile$adults), c("0"= "absent", "4" ="present")))

This is an example of a factorial-design with the following hypotheses:

  • null hypothesis
    • HO: There is no impact on the habitat type juveniles start in on their likelihood to move
    • HO: There is no impact on adult presence in the new habitat on the likelihood of juveniles to move
    • HO: There is no interaction between the impacts of the habitat type juveniles start in and adult presence in the new habitat on the likelihood of juveniles to move
  • alternative hypothesis
    • HA: There is an impact on the habitat type juveniles start in on their likelihood to move
    • HA: There is an impact on adult presence in the new habitat on the likelihood of juveniles to move
    • HA: There is an interaction between the impacts of the habitat type juveniles start in and adult presence in the new habitat on the likelihood of juveniles to move

The outcome, however, is a proportion. Note we can fit a linear model to the data:

juvenile_fit_lm <- lm(Colonized ~ adults_updated*source.habitat, 
    juvenile)
plot(juvenile_fit_lm)

The assumptions don’t even look that bad, but note the data is over-dispersed (will come back to this). We also know the model will make predictions that aren’t logical (outcomes outside of the 0-1 range). For this reason we will use a generalized linear model (or a log linear model) to analyze the data. We can fit a glm using the binomial family.

juvenile_fit_glm <- glm(cbind(Colonized, Not_colonized) ~ adults*source.habitat, 
    juvenile, family = "binomial")

The summary again allows us to estimate the dispersion parameter by dividing the residual deviance by its associated degrees of freedom, and we can now consider the impact of multiple variables (including interactions, which are significant here!).

summary(juvenile_fit_glm)

Call:
glm(formula = cbind(Colonized, Not_colonized) ~ adults * source.habitat, 
    family = "binomial", data = juvenile)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   1.84583    0.31063   5.942 2.81e-09 ***
adults                       -0.28417    0.09507  -2.989 0.002799 ** 
source.habitatnatural        -3.88271    0.43669  -8.891  < 2e-16 ***
source.habitatnone            1.02207    0.55484   1.842 0.065459 .  
adults:source.habitatnatural  0.46383    0.13819   3.356 0.000789 ***
adults:source.habitatnone    -0.38722    0.15665  -2.472 0.013441 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 347.435  on 35  degrees of freedom
Residual deviance:  97.393  on 30  degrees of freedom
AIC: 186.64

Number of Fisher Scoring iterations: 5

We can also use our Anova function to investigate outcomes.

Anova(juvenile_fit_glm, type="III")
Analysis of Deviance Table (Type III tests)

Response: cbind(Colonized, Not_colonized)
                      LR Chisq Df Pr(>Chisq)    
adults                   9.712  1   0.001831 ** 
source.habitat         195.748  2  < 2.2e-16 ***
adults:source.habitat   34.738  2  2.862e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that there is a significant interaction between the presence of adults and source habitat on the proportion of amphipods that move to colonize algae. Given this, we can follow our typical approach to significant interactions - break the data into subsets.

However, we should first note that our dispersion estimate is 97.393/30 ~ 3; this should be closer to 1! What should we do? One option is to use an approach that estimates the dispersion parameter. Though not fully developed here, this approach uses a “quasi-binomial” family.

juvenile_fit_glm_quasi <- glm(cbind(Colonized, Not_colonized) ~ adults*source.habitat, 
    juvenile, family = "quasibinomial")

We can investigate this model using our established protocols

summary(juvenile_fit_glm_quasi)

Call:
glm(formula = cbind(Colonized, Not_colonized) ~ adults * source.habitat, 
    family = "quasibinomial", data = juvenile)

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.8458     0.5562   3.319  0.00238 ** 
adults                        -0.2842     0.1702  -1.669  0.10544    
source.habitatnatural         -3.8827     0.7819  -4.966 2.56e-05 ***
source.habitatnone             1.0221     0.9934   1.029  0.31177    
adults:source.habitatnatural   0.4638     0.2474   1.875  0.07060 .  
adults:source.habitatnone     -0.3872     0.2805  -1.381  0.17760    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 3.205612)

    Null deviance: 347.435  on 35  degrees of freedom
Residual deviance:  97.393  on 30  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
Anova(juvenile_fit_glm_quasi, type="III")
Analysis of Deviance Table (Type III tests)

Response: cbind(Colonized, Not_colonized)
                      LR Chisq Df Pr(>Chisq)    
adults                   3.030  1   0.081751 .  
source.habitat          61.064  2  5.496e-14 ***
adults:source.habitat   10.837  2   0.004435 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since we estimated our dispersion parameter, we can also use F tests here

Anova(juvenile_fit_glm_quasi, type="III", test = "F")
Analysis of Deviance Table (Type III tests)

Response: cbind(Colonized, Not_colonized)
Error estimate based on Pearson residuals 

                       Sum Sq Df F values    Pr(>F)    
adults                  9.712  1   3.0298  0.091998 .  
source.habitat        195.748  2  30.5326 5.842e-08 ***
adults:source.habitat  34.738  2   5.4184  0.009796 ** 
Residuals              96.167 30                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another option for proportion data when the outcome is bounded (does not include) between 0 and 1 is beta regression.

Beta regression

Beta regression (Geissinger et al. 2022) is similar to logistic regression, but it focuses on true proportion data where the outcome is between 0 and 1 (e.g. % cover, % nitrogen).

Poisson regression

Poisson regression focuses on count-based data.

Negative binomial regression

Similar to poisson, but an additional parameter allows the mean to not be equal to the variance.

Non-linear options

Data are not independent

In respect to predictors

A major issue for linear models is when predictors are co-linear. Mathematically speaking, perfect collinearity occurs when any column of the design (X) matrix can be derived by combining other columns. Perfect collinearity will lead to a message noting singularity issues, which is R’s way of telling you the matrix isn’t invertible (which it has to be to solve the equation).

Even partial collinearity will lead to an increase in Type II errors (A. F. Zuur, Ieno, and Elphick 2010). To put it simply, partitioning variance among related predictors is hard. For this reason, a few approaches may be used.

Check for issues

The first step is identifying issues. From the outset, relationships among predictor variables can be assessed using the pairs function in R. If two variables are highly correlated (r2 > .8 is a general limit), only one should be included in the model. Similarly, variance inflation factors (vif) can be assessed for the final and other models to consider this issues (all this is covered in the previous chapter that introduces multiple regression.

In respect to measured outcomes

When outcome variables are linked together, a few options exist. Note this issue may be obvious from checks of assumptions, but it also may be due to experimental design.

Consider this example. In order to investigate impacts of climate stress on oysters, specimens are placed in individual tanks and held at normal summer (calculated using recent data) temperature or at temperatures predicted under 2 IPCC — Intergovernmental Panel on Climate Change- scenarios. Oysters were also exposed to predator cues by dropping in water from tanks with 0, low (.25/m2), or high (2/m2) predators. After two months changes in oyster shell length (growth) was measured. Twenty-five oysters were measured for each treatment combination.

You hopefully recognize this as a factorial ANOVA experiment that you know how to analyze. If you need a reminder, see the chapter on ANOVA extensions. Experiments like this are odd, however, given the space they require. It is far more common to put lots of organisms in a single container given space and costs. However, this means our measurements are connected; remember blocking and paired tests)?

There are several ways to deal with this. Here we explore each for our oyster example.

Ignore it (don’t do this!)

First, let’s ignore the lack of independence. This is not an option, but it let’s you see the impact.

growth_lm <- lm(growth~predator_cue*temperature, experiment)
plot(growth_lm)

library(car)
Anova(growth_lm, type="III")
Anova Table (Type III tests)

Response: growth
                          Sum Sq  Df F value    Pr(>F)    
(Intercept)               15.576   1 16.3673 7.265e-05 ***
predator_cue              22.085   2 11.6031 1.635e-05 ***
temperature               70.061   2 36.8090 1.753e-14 ***
predator_cue:temperature  21.376   4  5.6154 0.0002558 ***
Residuals                205.563 216                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find significant main effects and interactions using this wrong approach.

Find average for each unit

One way of doing this focuses on the average for each unit

library(Rmisc)
averaged_experiment <- summarySE(experiment, measurevar = "growth",
                             groupvars = c("predator_cue", "temperature", "container"))
library(rmarkdown)
paged_table(averaged_experiment)

and use that for your analysis.

average_analysis <- lm(growth~predator_cue*temperature, averaged_experiment)
Anova(average_analysis, type = "III")
Error in Anova.lm(average_analysis, type = "III"): residual df = 0

but that leads to an issue! Since we only get 9 average outcomes and our model requires 10 degrees of freedom (consider why), we are left with no “noise” to make the denominator for our F ratio! Even when this doesn’t happen, you have reduced your data to a much smaller number of points and are not getting credit for all your work! This is a good example of why you should analyze simulated data before you run an experiment, but there are other options.

Blocking

The blocking approach we’ve already covered works might seem appropriate here.

blocking_analysis <- lm(growth~predator_cue*temperature+container, experiment)
Anova(blocking_analysis, type = "III")
Error in Anova.III.lm(mod, error, singular.ok = singular.ok, ...): there are aliased coefficients in the model

but its not? Why? This error means now our model matrix has collinearity issues. We can actually see where

alias(blocking_analysis)
Model :
growth ~ predator_cue * temperature + container

Complete :
                                                 (Intercept) predator_cuenone
containerf                                        0           0              
containerg                                       -1           1              
containerh                                        0           0              
containeri                                        1          -1              
predator_cuenone:temperatureelevated_scenario1    0           0              
predator_cuenormal:temperatureelevated_scenario1  0           0              
predator_cuenone:temperatureelevated_scenario2   -1           1              
predator_cuenormal:temperatureelevated_scenario2  0           0              
                                                 predator_cuenormal
containerf                                        0                
containerg                                        0                
containerh                                        1                
containeri                                       -1                
predator_cuenone:temperatureelevated_scenario1    0                
predator_cuenormal:temperatureelevated_scenario1  0                
predator_cuenone:temperatureelevated_scenario2    0                
predator_cuenormal:temperatureelevated_scenario2  1                
                                                 temperatureelevated_scenario1
containerf                                        1                           
containerg                                        1                           
containerh                                        0                           
containeri                                       -1                           
predator_cuenone:temperatureelevated_scenario1    0                           
predator_cuenormal:temperatureelevated_scenario1  0                           
predator_cuenone:temperatureelevated_scenario2    1                           
predator_cuenormal:temperatureelevated_scenario2  0                           
                                                 temperatureelevated_scenario2
containerf                                        0                           
containerg                                        1                           
containerh                                        0                           
containeri                                        0                           
predator_cuenone:temperatureelevated_scenario1    0                           
predator_cuenormal:temperatureelevated_scenario1  0                           
predator_cuenone:temperatureelevated_scenario2    1                           
predator_cuenormal:temperatureelevated_scenario2  0                           
                                                 containerb containerc
containerf                                        0          0        
containerg                                        1          1        
containerh                                       -1          0        
containeri                                        0         -1        
predator_cuenone:temperatureelevated_scenario1    0          0        
predator_cuenormal:temperatureelevated_scenario1  0          0        
predator_cuenone:temperatureelevated_scenario2    1          1        
predator_cuenormal:temperatureelevated_scenario2 -1          0        
                                                 containerd containere
containerf                                       -1         -1        
containerg                                       -1          0        
containerh                                        0         -1        
containeri                                        1          1        
predator_cuenone:temperatureelevated_scenario1    1          0        
predator_cuenormal:temperatureelevated_scenario1  0          1        
predator_cuenone:temperatureelevated_scenario2   -1          0        
predator_cuenormal:temperatureelevated_scenario2  0         -1        

though the output is confusing. In general, the issue here is each unit only contributes to one level of other traits..so if we know the average impact of ambient temperatures, for example, and the impacts in two of the treatments that were held at that temperature, we can predict the other. If instead each unit contributed to multiple levels, like in our feather experiment) this isn’t an issue.

Random effects

Our final option takes a new approach. It considers the units we measured as simply a sample from a larger population. Using that background, we use the information from the units to consider the distribution of sample effects we might see. The impact of unit is then considered a random-effect. For this to work, you probably want 5+ levels of the unit variable. This is because we are using the means to estimate variance (confusing?). For factors with <5 levels, random effects likely offer no benefit (Gomes 2022).

When models contain fixed (what we’ve done before) and random effects, we call them mixed-effects models. Two common packages for carrying out this analysis in R are the nlme and lme4 packages. We will focus on the lme4 package here. Random effects can be entered in the lmer (linear mixed-effects regression) function and specified as (1|Grouping Unit). One nice thing about lme4 is it will handle crossed and random effects on it’s own as long as you don’t repeat unit names. For example, we could note

mixed_analysis <- lmer(growth~predator_cue*temperature+(1|container), experiment)

Once built, we need to consider assumptions. The main assumption we add here is that the random effects are normally distributed. This should be checked at each level of grouping. The check_mixed_model function (provided below) offers an automated approach for one level (also known as one-way random effects).

check_mixed_model <- function (model, model_name = NULL) {
  #collection of things you might check for mixed model
  par(mfrow = c(2,3))
  if(length(names(ranef(model))<2)){
    qqnorm(ranef(model, drop = T)[[1]], pch = 19, las = 1, cex = 1.4, main= paste(model_name, 
                                                                                  "\n Random effects Q-Q plot"))
    qqline(ranef(model, drop = T)[[1]])
  }
  plot(fitted(model),residuals(model), main = paste(model_name, 
                                                    "\n residuals vs fitted"))
  qqnorm(residuals(model), main =paste(model_name, 
                                       "\nresiduals q-q plot"))
  qqline(residuals(model))
  hist(residuals(model), main = paste(model_name, 
                                      "\nresidual histogram"))
}
check_mixed_model(mixed_analysis)

Here we have only 9 levels of units, so the spread is not perfect. However, we also know each of these is itself an average,and averages should be normally-distributed under the central limit theorem, so we can plow ahead.

We can consider the outcome using our summary command - note the output denotes we have 225 observations and 9 grouping levels.

summary(mixed_analysis)
Linear mixed model fit by REML ['lmerMod']
Formula: growth ~ predator_cue * temperature + (1 | container)
   Data: experiment

REML criterion at convergence: 631.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.68605 -0.79915 -0.02646  0.70023  2.72393 

Random effects:
 Groups    Name        Variance Std.Dev.
 container (Intercept) 4.6171   2.1487  
 Residual              0.9517   0.9755  
Number of obs: 225, groups:  container, 9

Fixed effects:
                                                 Estimate Std. Error t value
(Intercept)                                       0.78934    2.15758   0.366
predator_cuenone                                  1.28982    3.05129   0.423
predator_cuenormal                                0.92308    3.05129   0.303
temperatureelevated_scenario1                     0.03073    3.05129   0.010
temperatureelevated_scenario2                     2.06547    3.05129   0.677
predator_cuenone:temperatureelevated_scenario1    0.46256    4.31517   0.107
predator_cuenormal:temperatureelevated_scenario1 -0.24377    4.31517  -0.056
predator_cuenone:temperatureelevated_scenario2   -1.22466    4.31517  -0.284
predator_cuenormal:temperatureelevated_scenario2 -0.63903    4.31517  -0.148

Correlation of Fixed Effects:
             (Intr) prdtr_cnn prdtr_cnr tmpr_1 tmpr_2 prdtr_cnn:_1 prdtr_cnr:_1
predatr_cnn  -0.707                                                            
prdtr_cnrml  -0.707  0.500                                                     
tmprtrlvt_1  -0.707  0.500     0.500                                           
tmprtrlvt_2  -0.707  0.500     0.500     0.500                                 
prdtr_cnn:_1  0.500 -0.707    -0.354    -0.707 -0.354                          
prdtr_cnr:_1  0.500 -0.354    -0.707    -0.707 -0.354  0.500                   
prdtr_cnn:_2  0.500 -0.707    -0.354    -0.354 -0.707  0.500        0.250      
prdtr_cnr:_2  0.500 -0.354    -0.707    -0.354 -0.707  0.250        0.500      
             prdtr_cnn:_2
predatr_cnn              
prdtr_cnrml              
tmprtrlvt_1              
tmprtrlvt_2              
prdtr_cnn:_1             
prdtr_cnr:_1             
prdtr_cnn:_2             
prdtr_cnr:_2  0.500      

We can also still use Anova to get p-values. However, these are now calculated by default using likelihood-associated \(\chi^2\) tests.

Anova(mixed_analysis, type = "III")
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: growth
                          Chisq Df Pr(>Chisq)
(Intercept)              0.1338  1     0.7145
predator_cue             0.1898  2     0.9095
temperature              0.6020  2     0.7401
predator_cue:temperature 0.1837  4     0.9960

You can also ask for F tests, but note the degrees of freedom associated with these tests is not clear. It’s somewhere between the “average” and “ignore” approach used above.

Anova(mixed_analysis, type = "III", test="F")
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: growth
                              F Df  Df.res Pr(>F)
(Intercept)              0.1338  1 3230137 0.7145
predator_cue             0.0949  2 3230137 0.9095
temperature              0.3010  2 3230137 0.7401
predator_cue:temperature 0.0459  4 3230137 0.9960

Note this approach suggests we do not have enough data to reject the null hypothesis. Ignoring the linkages among data led to very different results. This issue (pseudopreplication) has been noted in ecology and other fields (Hurlbert 1984; Heffner, Butler, and Reilly 1996; Lazic 2010).

Is this really new?

It should be noted that random effects match up to paired t-tests and blocking approaches for simple experiments.

two_way_anova_example <- lm(Color_index ~ Feather_type + Bird, data=feather)
Anova(two_way_anova_example, type= "III")
Anova Table (Type III tests)

Response: Color_index
              Sum Sq Df F value   Pr(>F)    
(Intercept)  0.24133  1 26.5059 0.000119 ***
Feather_type 0.15043  1 16.5214 0.001017 ** 
Bird         0.21950 15  1.6072 0.184180    
Residuals    0.13657 15                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(Color_index ~ Feather_type, data=feather, paired=TRUE)

    Paired t-test

data:  Color_index by Feather_type
t = -4.0647, df = 15, p-value = 0.001017
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.20903152 -0.06521848
sample estimates:
mean difference 
      -0.137125 
two_way_lme_example <- lmer(Color_index ~ Feather_type + (1|Bird), data=feather)
Anova(two_way_lme_example, type= "III", test="F")
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: Color_index
                  F Df Df.res   Pr(>F)    
(Intercept)  41.816  1 28.457 4.88e-07 ***
Feather_type 16.521  1 15.000 0.001017 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All same if we use least-squares approaches (not the default \(\chi^2\) tests that match likelihood-based methods).

Errors are not equal among groups

Another option is to use weighted-least squares regression - this approach specifically helps when residuals are not evenly distributed among groups (or when a funnel/cone appears when you plot the model to check assumptions). For example, we could take the model on below-ground biomass that we developed in the More Anovas chapter. As a reminder, Valdez et al. (2023) wanted to consider the impact of top-down (snail grazing) and bottom- up (nutrient availability) on marsh plant (Spartina alterniflora) growth. To do this, they assigned plots to one of 3 grazer treatments and one of 2 nitrogen treatments. We focused on the impact of these treatments on standing dead mass and noted a funnel-shape when reviewing the residuals.

valdez_2023 <- read.csv("data/Spartina_alterniflora_traits.csv", stringsAsFactors = T)
sdm_model <-lm(Standing.Dead..dry..m2.~Snail.Level * Nitrogen.level, valdez_2023[valdez_2023$Snail.Level != "uncaged",])
plot(sdm_model)

To address the, we can use weighted least squares. This approach assume you built the model and then noted an issue with heteroscedasticity. To use, we can calculate a weight for each residual that is based on its variance. Doing this requires an iterative process similar to those used for estimation for generalized linear models (we’ll get to these) - below makes a value that increases with low variance.

wt_sdm <- 1 / lm(abs(sdm_model$residuals) ~ sdm_model$fitted.values)$fitted.values^2

We can then add a new argument to the lm function to use these weights.

sdm_model_wls <-lm(Standing.Dead..dry..m2.~Snail.Level * Nitrogen.level, valdez_2023[valdez_2023$Snail.Level != "uncaged",], weights = wt_sdm)

We can then continue on our normal route:

plot(sdm_model_wls) 

Anova(sdm_model_wls, type="III")
Anova Table (Type III tests)

Response: Standing.Dead..dry..m2.
                            Sum Sq Df F value    Pr(>F)    
(Intercept)                203.228  1  76.442 1.497e-06 ***
Snail.Level                157.447  2  29.611 2.288e-05 ***
Nitrogen.level             130.416  1  49.054 1.427e-05 ***
Snail.Level:Nitrogen.level 143.013  2  26.896 3.681e-05 ***
Residuals                   31.903 12                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you compare the two models you notice slight differences - these are minimal here due to lack of differences in variance.

summary(sdm_model) 

Call:
lm(formula = Standing.Dead..dry..m2. ~ Snail.Level * Nitrogen.level, 
    data = valdez_2023[valdez_2023$Snail.Level != "uncaged", 
        ])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5800 -0.8350 -0.0317  0.6600  3.7400 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                       18.710      1.032  18.124
Snail.Levelremoval                               -14.150      1.460  -9.692
Snail.Levelsnail addition                         -7.567      1.460  -5.183
Nitrogen.levelwithout                            -15.240      1.460 -10.439
Snail.Levelremoval:Nitrogen.levelwithout          16.153      2.065   7.824
Snail.Levelsnail addition:Nitrogen.levelwithout   10.357      2.065   5.016
                                                Pr(>|t|)    
(Intercept)                                     4.39e-10 ***
Snail.Levelremoval                              5.02e-07 ***
Snail.Levelsnail addition                       0.000228 ***
Nitrogen.levelwithout                           2.25e-07 ***
Snail.Levelremoval:Nitrogen.levelwithout        4.72e-06 ***
Snail.Levelsnail addition:Nitrogen.levelwithout 0.000301 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.788 on 12 degrees of freedom
Multiple R-squared:  0.9284,    Adjusted R-squared:  0.8986 
F-statistic: 31.14 on 5 and 12 DF,  p-value: 1.792e-06
summary(sdm_model_wls)

Call:
lm(formula = Standing.Dead..dry..m2. ~ Snail.Level * Nitrogen.level, 
    data = valdez_2023[valdez_2023$Snail.Level != "uncaged", 
        ], weights = wt_sdm)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-2.31784 -0.89013 -0.03702  0.92266  2.46121 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                       18.710      2.140   8.743
Snail.Levelremoval                               -14.150      2.202  -6.426
Snail.Levelsnail addition                         -7.567      2.490  -3.039
Nitrogen.levelwithout                            -15.240      2.176  -7.004
Snail.Levelremoval:Nitrogen.levelwithout          16.153      2.322   6.956
Snail.Levelsnail addition:Nitrogen.levelwithout   10.357      2.620   3.953
                                                Pr(>|t|)    
(Intercept)                                     1.50e-06 ***
Snail.Levelremoval                              3.27e-05 ***
Snail.Levelsnail addition                        0.01030 *  
Nitrogen.levelwithout                           1.43e-05 ***
Snail.Levelremoval:Nitrogen.levelwithout        1.53e-05 ***
Snail.Levelsnail addition:Nitrogen.levelwithout  0.00192 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.631 on 12 degrees of freedom
Multiple R-squared:  0.8747,    Adjusted R-squared:  0.8225 
F-statistic: 16.75 on 5 and 12 DF,  p-value: 4.789e-05

Why not just always do this? Because weighted least squares implicitly assumes we know the weights. We are actually estimating them, so small datasets may lead to bad estimates and outcomes.

Residuals are not normally distributed

A very common concern regarding linear models is normality. I list it last here because this assumption is one of the least important (and the assumption is based on residuals, not data!). However, non-normal residuals are often (not always) connected to other issues, namely linearity, as noted above.

Combinining these

Next steps

These methods can be extended to other models that are used when linear model assumptions are not met, which is the focus of the next chapter.

References

Agresti, Alan. 2007. “Introduction to Categorical Data Analysis.”
“Applied Regression Analysis and Generalized Linear Models.” 2022. https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254.
Blais, Brian R., Samantha L. Johnson, and John L. Koprowski. 2023. “Effects of Disturbances and Environmental Changes on an Aridland Riparian Generalist.” PeerJ 11 (June): e15563. https://doi.org/10.7717/peerj.15563.
Blanca, María J., Rafael Alarcón, Jaume Arnau, Roser Bono, and Rebecca Bendayan. 2017. “Non-normal data: Is ANOVA still a valid option?” Psicothema 29 (4): 552–57. https://doi.org/10.7334/psicothema2016.383.
———. 2018. “Effect of Variance Ratio on ANOVA Robustness: Might 1.5 Be the Limit?” Behavior Research Methods 50 (3): 937–62. https://doi.org/10.3758/s13428-017-0918-2.
Bueno, Marilia, Glauco B. O. Machado, and Fosca P. P. Leite. 2020. “Colonization of Novel Algal Habitats by Juveniles of a Marine Tube-Dwelling Amphipod.” PeerJ 8 (October): e10188. https://doi.org/10.7717/peerj.10188.
Feng, Cindy, Longhai Li, and Alireza Sadeghpour. 2020. “A Comparison of Residual Diagnosis Tools for Diagnosing Regression Models for Count Data.” BMC Medical Research Methodology 20 (1): 175. https://doi.org/10.1186/s12874-020-01055-2.
Fox, John. 2015. Applied Regression Analysis and Generalized Linear Models. SAGE Publications.
———. 2016. Applied Regression Analysis and Generalized Linear Models. Third Edition. Los Angeles: SAGE.
Geissinger, Emilie A., Celyn L. L. Khoo, Isabella C. Richmond, Sally J. M. Faulkner, and David C. Schneider. 2022. “A Case for Beta Regression in the Natural Sciences.” Ecosphere 13 (2): e3940. https://doi.org/10.1002/ecs2.3940.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. 1st edition. Cambridge ; New York: Cambridge University Press.
Gomes, Dylan G. E. 2022. “Should I Use Fixed Effects or Random Effects When I Have Fewer Than Five Levels of a Grouping Factor in a Mixed-Effects Model?” PeerJ 10 (January): e12794. https://doi.org/10.7717/peerj.12794.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily E. D. Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. 2018. “A Brief Introduction to Mixed Effects Modelling and Multi-Model Inference in Ecology.” PeerJ 6 (May): e4794. https://doi.org/10.7717/peerj.4794.
Heffner, Robert A., Mark J. Butler, and Colleen Keelan Reilly. 1996. “Pseudoreplication Revisited.” Ecology 77 (8): 2558–62. https://doi.org/10.2307/2265754.
Hunt, Michael. n.d. “QQ-Plots.” RPubs. https://rpubs.com/mbh038/725314.
Hurlbert, Stuart H. 1984. “Pseudoreplication and the Design of Ecological Field Experiments.” Ecological Monographs 54 (2): 187–211. https://doi.org/10.2307/1942661.
Knief, Ulrich, and Wolfgang Forstmeier. 2021. “Violating the Normality Assumption May Be the Lesser of Two Evils.” Behavior Research Methods 53 (6): 2576–90. https://doi.org/10.3758/s13428-021-01587-5.
Lazic, Stanley E. 2010. “The Problem of Pseudoreplication in Neuroscientific Studies: Is It Affecting Your Analysis?” BMC Neuroscience 11 (1): 5. https://doi.org/10.1186/1471-2202-11-5.
Lindelov, J. K. n.d. Common Statistical Tests Are Linear Models (or: How to Teach Stats). https://lindeloev.github.io/tests-as-linear/.
Molenberghs, Geert, Geert Verbeke, Samuel Iddi, and Clarice G. B. Demétrio. 2012. “A Combined Beta and Normal Random-Effects Model for Repeated, Overdispersed Binary and Binomial Data.” Journal of Multivariate Analysis 111 (October): 94–109. https://doi.org/10.1016/j.jmva.2012.05.005.
Montgomery, Douglas C., Elizabeth A. Peck, and G. Geoffrey Vining. 2012. Introduction to Linear Regression Analysis. 5th edition. Hoboken, NJ: Wiley.
Payne, Elizabeth H., James W. Hardin, Leonard E. Egede, Viswanathan Ramakrishnan, Anbesaw Selassie, and Mulugeta Gebregziabher. 2017. “Approaches for dealing with various sources of overdispersion in modeling count data: Scale adjustment versus modeling.” Statistical Methods in Medical Research 26 (4): 1802–23. https://doi.org/10.1177/0962280215588569.
Reitan, Trond, and Anders Nielsen. 2016. “Do Not Divide Count Data with Count Data; A Story from Pollination Ecology with Implications Beyond.” PLOS ONE 11 (2): e0149129. https://doi.org/10.1371/journal.pone.0149129.
Shatz, Itamar. 2023. “Assumption-Checking Rather Than (Just) Testing: The Importance of Visualization and Effect Size in Statistical Diagnostics.” Behavior Research Methods, March. https://doi.org/10.3758/s13428-023-02072-x.
Valdez, Stephanie R., Pedro Daleo, David S. DeLaMater Iii, and Brian R. Silliman. 2023. “Variable Responses to Top-down and Bottom-up Control on Multiple Traits in the Foundational Plant, Spartina Alterniflora.” PLOS ONE 18 (5): e0286327. https://doi.org/10.1371/journal.pone.0286327.
Warton, David I., Mitchell Lyons, Jakub Stoklosa, and Anthony R. Ives. 2016. “Three Points to Consider When Choosing a LM or GLM Test for Count Data.” Methods in Ecology and Evolution 7 (8): 882–90. https://doi.org/10.1111/2041-210X.12552.
Warton, David, and Francis Hui. 2010. “The Arcsine Is Asinine: The Analysis of Proportions in Ecology.” Ecology 92 (1): 3–10. https://doi.org/10.1890/10-0340.1.
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14. https://doi.org/https://doi.org/10.1111/j.2041-210X.2009.00001.x.
Zuur, Alain, Elena N. Ieno, and Graham M. Smith. 2007. Analyzing Ecological Data. 2007th edition. New York ; London: Springer.
Zuur, Alain, Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with r. New York, NY: Springer.