When considering which statistical procedure is best suited for examining a given hypothesis, researchers are often unaware that there is rarely a one-to-one mapping between a research question and a statistical procedure. Each procedure comes with different advantages and disadvantages and can only answer questions within a specific, limited context. This unawareness regarding the limits of statistical procedures plays a major role in the replication crisis (see, e.g., Ioannidis, 2005; Yong, 2012) and contributes to what can be called the gap between research questions and statistical procedures (see, e.g., Scheel et al., 2021).
In this paper, we use a simulated data example to illustrate how the typical workflow in analyzing psychological data can fail to bridge the gap between research questions and statistical procedures. We focus on research questions that involve evaluating the effect of a focal categorical predictor X on an outcome variable Y while taking several covariates into account. A typical research question in this context may be “Is there a significant effect of X on Y on average?” or “Is the effect of X larger for males than for females?”.
The usual approach to analyzing this type of data is analysis of variance (ANOVA; Rutherford, 2001) in combination with null hypothesis significance testing (NHST). Technically, ANOVA is just a linear regression model with at least one categorical focal predictor. By the term NHST, we refer to the following typical setting. Under the null hypothesis, we use equality constraints. That is, we test whether one or more quantities (such as model parameters or means) are equal to zero (or another constant) or whether several quantities are equal to each other. Under the alternative hypothesis, we assume that these equalities do not all hold. An example could refer to regression coefficients, where $H_0\colon \beta_1 = \beta_2$ is tested against $H_1\colon \beta_1 \neq \beta_2$, or in other words, $H_0\colon \beta_1 - \beta_2 = 0$ is tested against $H_1\colon \beta_1 - \beta_2 \neq 0$.
ANOVA is one of the most frequently used statistical procedures in psychology. Unfortunately, using ANOVA in combination with NHST is often insufficient for two main reasons. First, ANOVA does not provide us with the quantities of interest, which are usually average or conditional effects. Second, it does not take into account the researcher's expectations about these effects, for example their order.
We will introduce two methodological approaches that may be more suitable. These are the EffectLiteR framework (Mayer et al., 2016; Mayer & Dietzfelbinger, 2019) and informative hypothesis testing (IHT; Hoijtink, 2012; Silvapulle & Sen, 2005). We will show how both approaches can be used simultaneously to narrow the gap between research questions and statistical procedures. After these demonstrations, we will present two empirical data examples, one in the context of linear regression and one in the context of the generalized linear model. We provide R (R Core Team, 2020) code as well as further supplemental materials on the OSF project site (see Keck et al., 2024).
We hope that, by reading this paper, applied researchers will become aware of the pitfalls of using ANOVA together with NHST. Furthermore, we hope that they will gain some familiarity with our proposed method, IHT in the EffectLiteR framework, and its advantages. To ease the transition for readers who currently use traditional ANOVA in combination with NHST, this paper only considers the frequentist framework. We acknowledge that, ultimately, a fully Bayesian approach may provide even more options and flexibility.
Simulated Data Example
The running example focuses on a clinical researcher who is interested in the effect of a new drug in combination with cognitive behavioral therapy (CBT) to reduce depression. The main hypothesis is that CBT in combination with either the old or the new drug is more effective than CBT only, and that the new drug with CBT is more effective than the old drug with CBT. The simulated data set can be found on the OSF project site under the name “runningExampleData.csv” (see Keck et al., 2024).
The researcher sets up a non-randomized experiment with three treatment groups: one group receiving CBT only (X = 0), one group receiving CBT together with the old drug (X = 1), and one group receiving CBT together with the new drug (X = 2). The total sample size is N = 1000. As covariates, the continuous variable depression pre-test (Z) and the dichotomous variable “treatment experience” (K) are considered. The latter indicates whether any treatment has been received before (K = 1) or not (K = 0). The outcome variable Y is the depression post-test. Note that both Z and Y will be treated as manifest variables and that higher scores denote better mental health. Furthermore, note that our running example is a simplification of a non-randomized experiment. In a real-world setting, more covariates would have to be taken into account.
The expectation of the researcher is that $\mu_2 > \mu_1 > \mu_0$, where $\mu_2$ and $\mu_1$ correspond to the treatment-experience and pre-test adjusted means of Y for the combination of CBT with the new and the old drug, respectively, and $\mu_0$ corresponds to the treatment-experience and pre-test adjusted mean of Y for CBT without drugs. However, as will be explained in more detail later on, there are different ways to adjust means that researchers may not always be aware of.
After the data collection, the researcher will typically first have a look at some descriptive statistics. Table 1 shows the estimates of various (conditional) expectations and adjusted means in the simulated example. In line with our data generation, we find no (significant) baseline differences in the depression pre-test Z between the levels of X and K. This is also reflected by a non-significant F-test, in which we compared a model for Z with main effects for X and K and their interaction (X:K) to the intercept-only model. Often, the researcher will also visualize the data, as in Figures 1 and 2. Figure 1 shows a boxplot of the post-test Y for the different treatment groups X. Figure 2 depicts the linear regression of the post-test Y on the pre-test Z for the different combinations of treatment group X and treatment experience K. We see that the slopes do not differ much between the grid elements. In other words, there seems to be no three-way interaction of X and K with the pre-test Z. Again, this can be confirmed by a non-significant F-test.
Table 1
Estimates of Various (Conditional) Expectations and Adjusted Means in the Simulated Example
| Group | Statistic | X = 0 | X = 1 | X = 2 | X = . |
|---|---|---|---|---|---|
| K = 0 | E(Y) | 0.008 | 0.386 | 0.410 | 0.257 |
| | E(Z) | 0.006 | −0.017 | −0.036 | −0.016 |
| | P | 0.200 | 0.150 | 0.200 | 0.550 |
| K = 1 | E(Y) | 0.091 | 0.596 | 0.913 | 0.695 |
| | E(Z) | −0.005 | 0.123 | −0.039 | −0.013 |
| | P | 0.100 | 0.050 | 0.300 | 0.450 |
| K = . | E(Y) | 0.036 | 0.438 | 0.712 | |
| | AdjM | 0.044 | 0.475 | 0.643 | |
| | E(Z) | 0.002 | 0.018 | −0.038 | |
| | P | 0.300 | 0.200 | 0.500 | |
Note. The last column of the table shows the expectations and probabilities where we only condition on K. The last row shows the expectations and probabilities where we only condition on X. The cell probabilities are fixed by design. The adjusted means (AdjM) are computed according to EffectLiteR, which will be explained in more detail later on.
Figure 1
Boxplot of Depression Post-Test Y Grouped by the Three Levels of X
Figure 2
Slopes of the Linear Regression of the Post-Test Y on the Pre-Test Z for the Different Combinations of Group X and Treatment Experience K
ANOVA Versus EffectLiteR
For the data generation as well as the data analysis, the following model is used1:

$$Y_i = \beta_0 + \beta_1\,\text{group1}_i + \beta_2\,\text{group2}_i + \beta_3\,\text{treatexp1}_i + \beta_4 Z_i + \beta_5\,\text{group1}_i\,\text{treatexp1}_i + \beta_6\,\text{group2}_i\,\text{treatexp1}_i + \beta_7\,\text{group1}_i Z_i + \beta_8\,\text{group2}_i Z_i + \beta_9\,\text{treatexp1}_i Z_i + \varepsilon_i \quad (1)$$

Note that this model does not include a three-way interaction term. The control group is X = 0. The variables group1_i and group2_i are dummy variables which indicate by a value of 1 whether a subject belongs to group X = 1 or X = 2, respectively, and are 0 otherwise. Similarly, the variable treatexp1_i is a dummy variable which indicates whether a subject has treatment experience (K = 1) or not (K = 0).
Then, the researcher will formulate the hypotheses of interest. As in our example, the focus is usually on the “main effect” of the treatment, and following the classical NHST approach, $H_0\colon \mu_0 = \mu_1 = \mu_2$ will be tested against $H_1\colon \neg(\mu_0 = \mu_1 = \mu_2)$. However, as mentioned before, the $\mu$'s may correspond to different types of adjusted means, which will be explained in the next section. Furthermore, the alternative hypothesis does not correspond to the initial expectation of the researcher, where the adjusted means are ordered: $\mu_2 > \mu_1 > \mu_0$. We will also argue that the researcher is actually interested in the so-called average effect of the treatment, as will be explained in more detail later on. After specifying the hypotheses, the researcher will fit the model. In what follows, we will first give a theoretical overview of ANOVA before coming back to our running example and the results of the fitted model.
ANOVA
ANOVA is one of the most popular statistical techniques in the social and behavioral sciences. It is a framework or collection of methods based on a linear regression model, where at least one predictor is categorical in nature. Usually, this categorical predictor describes the different conditions of an experiment, or the different (treatment) groups in an intervention study. The ANOVA framework (Edwards, 1993) includes one-way and multi-way ANOVA, univariate and multivariate (M)ANOVA, ANOVA using within-subjects and/or between-subjects factors, and AN(C)OVA where covariates are included in the model.
The problem with ANOVA is that we only obtain regression coefficients as well as main and interaction effects, which are often difficult to interpret. Especially the interpretation of a main effect in the presence of an interaction effect is far from trivial. One reason is that different sums of squares (SS) can be used in ANOVA for hypothesis testing. Depending on the SS, the main effect is defined in a different way and thus a different null hypothesis is tested (for an overview, see, e.g., Fox, 2016; Graefe et al., 2022; Maxwell et al., 2018). There are several types of SS, namely Type I, Type II and Type III.2 To understand the main differences between these types, consider a model of the form Y ~ A + B + A:B. An ANOVA table based on Type I SS will contain the following three model comparisons: Y ~ A versus Y ~ 1,3 Y ~ A + B versus Y ~ A, and Y ~ A + B + A:B versus Y ~ A + B. In other words, Type I SS corresponds to an incremental procedure, where single terms are added to the model one by one and each model is compared to the previous model without the new term. For Type II, the ANOVA table will contain the following model comparisons: Y ~ A + B versus Y ~ B, Y ~ A + B versus Y ~ A, and Y ~ A + B + A:B versus Y ~ A + B. The main characteristic of the Type II procedure is the principle of marginality: If a term is removed from the model, then all higher-order terms involving this term are removed too. Finally, the Type III procedure leads to the following set of model comparisons: Y ~ A + B + A:B versus Y ~ B + A:B, Y ~ A + B + A:B versus Y ~ A + A:B, and Y ~ A + B + A:B versus Y ~ A + B. It is always the full model versus a model where a single term is deleted.
Type III SS is used by default in many popular software programs like SPSS (IBM Corp, 2020). Thus, researchers will typically use Type III SS without further deliberation, even though there is considerable controversy in the literature about when to use which SS (e.g., Hector et al., 2010; Herr & Gaebelein, 1978; Macnaughton, 1998). Graefe et al. (2022) conducted simulation studies considering the different types of SS in balanced, proportional and non-orthogonal designs. They found that in balanced designs, any of the three SS types yields main effects that can be interpreted unambiguously. In proportional designs, however, this is only true when using Type I and II SS; in case of Type III SS, the main effect is biased if there are interactions. Finally, in non-orthogonal designs, the main effect is always biased when using Type I SS, and when there are interactions, Types II and III also yield biased main effects. Nevertheless, for the sake of illustration, we will use ANOVA to analyze our simulated dataset.
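To make the sequential logic behind Type I SS concrete, the following sketch (our own illustration, not part of the paper's analysis scripts; variable names are hypothetical) verifies on simulated data that the Type I SS for a term equals the drop in residual SS when that term is added to the model:

```r
# Illustration: Type I (sequential) SS arise from incremental model comparisons
set.seed(1)
d <- data.frame(A = factor(rep(1:2, each = 50)),
                B = factor(rep(1:2, times = 50)))
d$y <- rnorm(100) + as.numeric(d$A)

m0  <- lm(y ~ 1, data = d)      # intercept-only model
mA  <- lm(y ~ A, data = d)      # add A
mAB <- lm(y ~ A + B, data = d)  # add B after A

# Type I SS for A: reduction in residual SS when A is added to the null model
ssA_typeI <- deviance(m0) - deviance(mA)
# Type I SS for B: reduction in residual SS when B is added after A
ssB_typeI <- deviance(mA) - deviance(mAB)

# Both equal the corresponding rows of R's sequential ANOVA table
all.equal(c(ssA_typeI, ssB_typeI), anova(mAB)[c("A", "B"), "Sum Sq"])
```

The analogous comparisons with the other reduced models reproduce Type II and Type III SS.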
Using the centered version of the pre-test variable Z, the model for our running example is fitted as follows in R:
Listing 1
Specification of ANOVA Models Including All Two-Way Interactions
# this R code can be found under 01Anova.R on the OSF project site
# read in data
Data <- read.csv("runningExampleData.csv")
# center pre-test for better interpretation of regression coefficients
Data$pretest.cent <- Data$pretest - mean(Data$pretest)
# treat group and treatexp as categorical variables
Data$group <- factor(Data$group)
Data$treatexp <- factor(Data$treatexp)
# load packages
library(car) # for Anova () function
# fit regression model with all 2-way interaction effects
lmod.treat <- lm(posttest ~ (group + treatexp + pretest.cent)^2, data = Data)
summary(lmod.treat)
# Type I ANOVA table
anova(lmod.treat)
# Type II ANOVA table
Anova(lmod.treat, type = 2)
# Type III ANOVA table
# attention: we must use an orthogonal or a sum-to-zero coding scheme
options(contrasts = c("contr.sum", "contr.poly"))
lmod.sum <- lm(posttest ~ (group + treatexp + pretest.cent)^2, data = Data)
summary(lmod.sum)
Anova(lmod.sum, type = 3)

The anova() function is part of base R, whereas the Anova() function belongs to the car package (Fox et al., 2022). The former uses Type I SS, whereas the latter can handle Type II and III SS. Furthermore, R uses treatment coding by default, which has to be changed to a sum-to-zero coding scheme (for example sum coding) when using Type III SS. This prevents main and interaction effects from overlapping. For more information about coding schemes, see, for example, Cohen et al. (2003) and Hardy (2003). Note, however, that in some statistical packages (for example SAS), coding schemes are automatically taken care of. Table 2 shows the ANOVA results when using the three different SS. Table 3 shows the results of the linear model using treatment coding. Appendix Table A1 shows the results of the linear model when using sum coding.
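The difference between the two coding schemes can be inspected directly via their contrast matrices (a generic illustration for a three-level factor, independent of the running example):

```r
# treatment coding: each column compares one level to the first (reference) level
contr.treatment(3)
# sum-to-zero coding: each column sums to zero, so main effects
# represent deviations from the grand mean rather than from a reference level
contr.sum(3)
```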
Table 2
ANOVA Results for Type I, Type II and Type III Sum of Squares (SS)
| | Type I SS | | | | Type II SS | | | | Type III SS | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | Df | SS | F-value | p-value | Df | SS | F-value | p-value | Df | SS | F-value | p-value |
| Intercept | | | | | | | | | 1 | 115.03 | 113.457 | <0.001*** |
| Group | 2 | 85.85 | 42.340 | <0.001**** | 2 | 64.14 | 31.632 | <0.001*** | 2 | 66.02 | 32.558 | <0.001*** |
| Treatexp | 1 | 24.29 | 23.956 | <0.001*** | 1 | 24.02 | 23.694 | <0.001*** | 1 | 12.26 | 12.097 | <0.001*** |
| Pretest | 1 | 33.38 | 32.922 | <0.001*** | 1 | 33.59 | 33.128 | <0.001*** | 1 | 18.33 | 18.081 | <0.001*** |
| Group:treatexp | 2 | 8.41 | 4.148 | 0.016* | 2 | 8.31 | 4.100 | 0.017* | 2 | 8.31 | 4.100 | 0.017* |
| Group:pretest | 2 | 9.76 | 4.813 | 0.008** | 2 | 9.05 | 4.461 | 0.012* | 2 | 9.05 | 4.461 | 0.012* |
| Treatexp:pretest | 1 | 0.02 | 0.016 | 0.899 | 1 | 0.02 | 0.016 | 0.899 | 1 | 0.02 | 0.016 | 0.899 |
| Residuals | 990 | 1003.72 | | | 990 | 1003.72 | | | 990 | 1003.72 | | |
Note. In our computations, we used the same factor order as is shown in this table.
Table 3
Linear Regression Model Results Using Treatment Coding
| Source | Coefficient | Estimate (SE) | t-value | p-value |
|---|---|---|---|---|
| Intercept | $\beta_0$ | 0.007 (0.071) | 0.091 | 0.928 |
| Group (X = 1) | $\beta_1$ | 0.380 (0.109) | 3.490 | < 0.001*** |
| Group (X = 2) | $\beta_2$ | 0.410 (0.101) | 4.070 | < 0.001*** |
| Treatexp | $\beta_3$ | 0.084 (0.123) | 0.680 | 0.497 |
| Pre-test | $\beta_4$ | 0.073 (0.067) | 1.096 | 0.273 |
| Group:treatexp (X = 1) | $\beta_5$ | 0.113 (0.206) | 0.548 | 0.584 |
| Group:treatexp (X = 2) | $\beta_6$ | 0.420 (0.154) | 2.729 | 0.006** |
| Group:pre-test (X = 1) | $\beta_7$ | 0.031 (0.096) | 0.327 | 0.743 |
| Group:pre-test (X = 2) | $\beta_8$ | 0.217 (0.078) | 2.771 | 0.006** |
| Treatexp:pre-test | $\beta_9$ | −0.009 (0.070) | −0.127 | 0.899 |
Table 2 shows that the results differ depending on which SS is used. Notably, the results concerning the interaction terms are the same for Type II and III SS. Generally, the results of the highest-order terms (in our case the two-way interactions) are identical between Type II and III SS, but the results of the lower-order terms differ. Furthermore, the result of the last term entered when using Type I SS (in our case the treatexp:pretest interaction) corresponds to the result of this term when using Type II and III SS. This term is shown in the last line before the residuals in the results tables and is the only term with identical results across all three types of SS. Lastly, note that when using Type III SS, the p-values of the terms with Df = 1 correspond to the p-values of these terms in the linear model when using sum coding (see Table 2 and Appendix Table A1). This is because $F = t^2$ if Df = 1.
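The relation $F = t^2$ for single-df terms can be checked directly in R (a generic illustration using the built-in cars data, not the running example):

```r
# For a term with one degree of freedom, the ANOVA F statistic is the
# squared t statistic of the corresponding regression coefficient
m <- lm(dist ~ speed, data = cars)
t_val <- summary(m)$coefficients["speed", "t value"]
F_val <- anova(m)["speed", "F value"]
all.equal(t_val^2, F_val)  # TRUE
```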
Using treatment contrasts and assuming Z is mean centered, we can interpret the regression coefficients in Table 3. Recall that X = 0 is the control group. The intercept $\beta_0$ corresponds to the mean of Y in the control group given K = 0 and Z = 0 (see Appendix B for more details), while $\beta_1$ denotes the difference between the means of Y in the groups X = 1 and X = 0, given K = 0 and Z = 0:

$$\beta_1 = E(Y \mid X{=}1, K{=}0, Z{=}0) - E(Y \mid X{=}0, K{=}0, Z{=}0) \quad (2)$$

Similarly, $\beta_2$ denotes the difference between the means of Y in the groups X = 2 and X = 0, given K = 0 and Z = 0:

$$\beta_2 = E(Y \mid X{=}2, K{=}0, Z{=}0) - E(Y \mid X{=}0, K{=}0, Z{=}0) \quad (3)$$

The difference between the means of Y for K = 1 and K = 0 (given the control group and Z = 0) corresponds to $\beta_3$:

$$\beta_3 = E(Y \mid X{=}0, K{=}1, Z{=}0) - E(Y \mid X{=}0, K{=}0, Z{=}0) \quad (4)$$

And $\beta_4$ denotes the expected change in Y for a unit change in Z in the control group if K = 0:

$$\beta_4 = E(Y \mid X{=}0, K{=}0, Z{=}z{+}1) - E(Y \mid X{=}0, K{=}0, Z{=}z) \quad (5)$$
The change in the effect of X = 1 versus X = 0 ($\beta_1$) between K = 1 and K = 0, while keeping Z constant, is denoted by $\beta_5$. Similarly, $\beta_6$ describes the change in the effect of X = 2 versus X = 0 ($\beta_2$) between K = 1 and K = 0 while keeping Z constant. The change in the effect of X = 1 versus X = 0 for a unit change in Z when K = 0 is represented by $\beta_7$. Similarly, $\beta_8$ denotes the change in the effect of X = 2 versus X = 0 for a unit change in Z when K = 0. Finally, $\beta_9$ describes the change in the effect of K = 1 versus K = 0 for a unit change in Z when X = 0.
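These interpretations can be verified numerically. The following sketch (simulated data; names mirror the running example, but this is not the paper's dataset) checks that, with treatment coding, $\beta_1$ equals the model-implied mean difference between X = 1 and X = 0 at K = 0 and Z = 0:

```r
set.seed(42)
n <- 300
d <- data.frame(group        = factor(sample(0:2, n, replace = TRUE)),
                treatexp     = factor(sample(0:1, n, replace = TRUE)),
                pretest.cent = rnorm(n))
d$posttest <- 0.4 * (d$group == 1) + 0.6 * (d$group == 2) + rnorm(n)

fit <- lm(posttest ~ (group + treatexp + pretest.cent)^2, data = d)

# model-implied means for X = 0 and X = 1, both at K = 0 and Z = 0;
# all interaction terms vanish at these covariate values
nd <- data.frame(group        = factor(c(0, 1), levels = 0:2),
                 treatexp     = factor(0, levels = 0:1),
                 pretest.cent = 0)
diff10 <- diff(predict(fit, newdata = nd))

all.equal(unname(diff10), unname(coef(fit)["group1"]))  # TRUE
```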
Remember that in our example, the researcher assumed an ordering of adjusted means: $\mu_2 > \mu_1 > \mu_0$. After obtaining a significant main effect of the group variable, the researcher will usually use contrasts to compare the means in depth. Note that in the NHST setting, if no hypothesis about the adjusted means has been specified from the start, this is called post-hoc testing and should only be used in a descriptive or exploratory manner. Furthermore, we have to control the familywise error rate (Keselman et al., 2011), which can be done with the emmeans package (Lenth et al., 2022). For our simulated data example, we specify the contrasts in R as follows:
Listing 2
Specification of Contrasts
# this R code can be found under 02emmeans.R on the OSF project site
library(emmeans)
# post-hoc tests / contrasts
emmeans(lmod.treat, "group")
emmeans(lmod.treat, "group", weights = "proportional")
emmeans(lmod.treat, "group", contr = "trt.vs.ctrl")
emmeans(lmod.treat, "group", contr = "trt.vs.ctrl", weights = "proportional")
emmeans(lmod.treat, "group", contr = "eff")
emmeans(lmod.treat, "group", contr = "eff", weights = "proportional")

The contrasts are based on the marginal means of Y, which are averaged over the levels of K at the mean of Z. Table 4 shows the marginal means of Y when using equal and when using proportional weights. Using proportional weights implies that the marginal means of Y are averaged over the marginal distribution of K at the mean of Z. In our example, this leads to slightly different results compared to using equal weights.4
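The two weighting schemes can be written out explicitly. With $\bar z$ denoting the mean of Z, the marginal mean of group $x$ is (a sketch of the underlying computation, not emmeans' exact internals):

```latex
MM_{\text{equal}}(x) = \frac{1}{2}\sum_{k=0}^{1} \hat{E}(Y \mid X{=}x, K{=}k, Z{=}\bar{z}),
\qquad
MM_{\text{prop}}(x) = \sum_{k=0}^{1} \hat{P}(K{=}k)\, \hat{E}(Y \mid X{=}x, K{=}k, Z{=}\bar{z})
```

For intuition, plugging in the cell means for X = 2 from Table 1 gives roughly (0.410 + 0.913)/2 ≈ 0.66 with equal weights and 0.55 · 0.410 + 0.45 · 0.913 ≈ 0.64 with proportional weights, close to the corresponding entries in Table 4 (the remaining difference stems from the model-based adjustment at the mean of Z).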
Table 4
Marginal Means (MMs) of the Depression Post-Test Y With Standard Errors in Parentheses and Confidence Intervals
| | Equal Weights | | | Proportional Weights | | |
|---|---|---|---|---|---|---|
| Group | MM (SE) | Lower CL | Upper CL | MM (SE) | Lower CL | Upper CL |
| X = 0 | 0.048 (0.062) | −0.073 | 0.169 | 0.044 (0.060) | −0.073 | 0.162 |
| X = 1 | 0.485 (0.083) | 0.323 | 0.646 | 0.475 (0.079) | 0.320 | 0.629 |
| X = 2 | 0.668 (0.046) | 0.578 | 0.758 | 0.643 (0.047) | 0.551 | 0.735 |
The first set of contrasts (see Table 5) compares the treatment groups with the control group, based on the estimated differences in the marginal means of Y between the levels of X. The second set of contrasts (see Table 6) gives us effect contrasts, where each marginal mean of Y is compared with the (equally weighted) grand mean of Y across the levels of X. Both sets of contrasts have also been computed using proportional weights.
Table 5
Treatment Versus Control Contrasts Between Marginal Means of Y
| | Equal Weights | | | | Proportional Weights | | |
|---|---|---|---|---|---|---|---|
| Contrast | Estimate (SE) | Df | t-ratio | p-value | Estimate (SE) | t-ratio | p-value |
| X = 1 − X = 0 | 0.436 (0.103) | 990 | 4.235 | < .001*** | 0.430 (0.099) | 4.355 | < .001*** |
| X = 2 − X = 0 | 0.620 (0.077) | 990 | 8.059 | < .001*** | 0.599 (0.076) | 7.859 | < .001*** |
Table 6
Effect Contrasts Between Marginal Means of Y
| | Equal Weights | | | | Proportional Weights | | |
|---|---|---|---|---|---|---|---|
| Contrast | Estimate (SE) | Df | t-ratio | p-value | Estimate (SE) | t-ratio | p-value |
| X = 0 − grand mean | −0.352 (0.052) | 990 | −6.798 | < .001*** | −0.343 (0.050) | −6.824 | < .001*** |
| X = 1 − grand mean | 0.084 (0.061) | 990 | 1.387 | .166 | 0.087 (0.058) | 1.500 | .134 |
| X = 2 − grand mean | 0.268 (0.046) | 990 | 5.822 | < .001*** | 0.256 (0.046) | 5.620 | < .001*** |
Considering the results, it would seem that the expectations of the researcher are satisfied in the data. That is, the marginal mean of group X = 2 is higher than the marginal mean of group X = 1, and the marginal mean of group X = 1 is higher than the marginal mean of group X = 0 (see Table 4). Therefore, it seems that there is a greater improvement in the depression post-test Y for CBT with the new drug compared to CBT with the old drug, and a greater improvement for CBT with the old drug compared to CBT only. The contrasts agree with this observation. The treatment contrast between group X = 2 and group X = 0 is larger than the treatment contrast between group X = 1 and group X = 0 (see Table 5). Furthermore, the effect contrast is largest for group X = 2 and smallest for group X = 0 (see Table 6).
EffectLiteR
Given the different choices of SS when using ANOVA, it becomes clear that main effects are not defined precisely and unambiguously there. In the causal inference literature, by contrast, effects are precisely defined (see, e.g., Angrist et al., 1974; Neyman, 1990; Pearl, 2009; Rubin, 1974, 2005; Steyer et al., 2000). Common types of effects are so-called average effects (for example, the effect of a treatment averaged over treatment experience), conditional effects (for example, the effect of a treatment for those without treatment experience only), effects on the treated, effects on the untreated, and so forth. Unfortunately, although the mathematical definitions of these effects are well understood, they can be quite complicated and tedious to compute.
EffectLiteR is a framework and an R package built upon the clear definitions of effects in the causal inference literature (Imbens & Rubin, 2015; Steyer et al., 2014). Researchers can use EffectLiteR to estimate various effects of interest as well as their standard errors when the treatment variable is categorical and the outcome variable is continuous. In addition, Wald or F-tests are used to test different hypotheses, for example that (all) average effects are equal to zero, or that (all) conditional effects are equal to zero in the population. For an in-depth introduction to EffectLiteR, see Mayer et al. (2016). In contrast to ANOVA, EffectLiteR directly provides the kinds of effects that most applied researchers are interested in.
An average effect is defined as the unconditional expectation of the difference between the expected outcomes under treatment and under control. It corresponds to the average causal effect if there are no further unobserved confounding variables and the regression of the post-test Y on the pre-test Z, given a combination of treatment experience K and treatment group X, is in fact linear. Furthermore, an average effect is the difference between two adjusted means. Considering our running example and the groups X = 1 and X = 0, this is (Mayer et al., 2016):

$$E(g_1) = E[E(Y \mid X{=}1, K, Z) - E(Y \mid X{=}0, K, Z)] = \mu_1 - \mu_0$$

where the outer expectation is taken over the joint distribution of K and Z.
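The averaging in this definition can be sketched via prediction-based standardization (“g-computation”): predict every observation once as if it were in the treatment group and once as if it were in the control group, then average the difference. The following is our own illustration on simulated data (names mirror the running example; this is not EffectLiteR's internal code):

```r
set.seed(7)
n <- 500
d <- data.frame(group        = factor(sample(0:2, n, replace = TRUE)),
                treatexp     = factor(sample(0:1, n, replace = TRUE)),
                pretest.cent = rnorm(n))
d$posttest <- 0.5 * (d$group == 1) + 0.8 * (d$group == 2) +
  0.3 * d$pretest.cent + rnorm(n)

fit <- lm(posttest ~ (group + treatexp + pretest.cent)^2, data = d)

# predict all subjects under X = 1 and under X = 0, then average the difference
d1 <- transform(d, group = factor(1, levels = 0:2))
d0 <- transform(d, group = factor(0, levels = 0:2))
avg_effect_10 <- mean(predict(fit, newdata = d1) - predict(fit, newdata = d0))
avg_effect_10  # close to the simulated effect of 0.5
```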
After obtaining the descriptive statistics, the researcher can first fit the EffectLiteR model to compute the adjusted means and the effects of interest, while disregarding the test statistics and p-values for the moment. For our running example, this can be done as follows in R:
Listing 3
Specification of EffectLiteR Model
# this R code can be found under 03EffectLite.R on the OSF project site
library(EffectLiteR)
elrmod <- effectLite(
y = "posttest",
x = "group",
k = "treatexp",
z = "pretest.cent",
interactions = "2-way",
method = "lm",
data = Data
)
print(elrmod)

The full output is shown in Appendix C. The average effects and adjusted means can be found on lines 91–103 of the output and are shown in Table 7. Note that the average effects correspond to the differences in adjusted means, and the subscripts refer to the groups that are compared:

$$E(g_1) = \mu_1 - \mu_0 \quad (9)$$

$$E(g_2) = \mu_2 - \mu_0 \quad (10)$$
Table 7
Average Effects and Adjusted Means, as Estimated by EffectLiteR
| Group | Adjusted Mean (SE) | Average Effect (SE) |
|---|---|---|
| X = 0 | 0.044 (0.060) | |
| X = 1 | 0.475 (0.079) | 0.430 (0.099) |
| X = 2 | 0.643 (0.047) | 0.599 (0.076) |
Furthermore, the adjusted means computed by EffectLiteR are very close, but not identical, to the marginal means with proportional weights computed by emmeans. This is because slightly different computations are used.5
While ANOVA tests whether main effects are significantly different from zero, EffectLiteR tests whether average effects are significantly different from zero. The number of average effects depends on the number of levels of the group variable. In our example, there is one control and two treatment groups, resulting in two average effects. In short, the EffectLiteR approach is more suitable than the ANOVA approach for testing research questions that focus on average effects. There are just a few rare cases, such as balanced designs, where the results are identical. For further applications of the EffectLiteR approach, see, for example, Flunger et al. (2019), Mayer et al. (2020), Mueller et al. (2015), Rek et al. (2022), and Sadovich (2020).
Informative Hypothesis Testing (in the EffectLiteR Framework)
As mentioned before, the researcher may have specific expectations about the data. In our example, the researcher expects that the average effect of CBT with the new drug is larger than the average effect of CBT with the old drug, which in turn is larger than zero:
$$E(g_2) > E(g_1) > 0 \quad (11)$$
After plugging in the definitions of $E(g_1)$ and $E(g_2)$ (see Equations 9 and 10), this corresponds to:
$$\mu_2 - \mu_0 > \mu_1 - \mu_0 > 0 \quad (12)$$
By adding $\mu_0$ to each part of the inequality, this can be simplified to:
$$\mu_2 > \mu_1 > \mu_0 \quad (13)$$
In other words, the researcher expects a complete ordering of the adjusted means of the treatment groups: It is assumed that the adjusted mean of group X = 2 is larger than the adjusted mean of group X = 1, and that the adjusted mean of group X = 1 is larger than the adjusted mean of group X = 0.
Hypotheses that reflect the expectations of the researcher are known as informative hypotheses, and both Bayesian and frequentist procedures for IHT are available (Barlow et al., 1972; Hoijtink, 2012; Robertson et al., 1988; Silvapulle & Sen, 2005; Vanbrabant, 2020). In this paper, we focus on the frequentist approach. IHT is not yet widely adopted by applied researchers in the social sciences. This is unfortunate for multiple reasons. First, compared to classical NHST, IHT allows researchers to formulate hypotheses in a way that is closer to typical research questions. These questions often concern specific directions or orderings of regression coefficients, group means or effects of interest. Using NHST, it is not possible to directly test hypotheses about such orders or directions when there is more than one constraint, whereas IHT allows for it. Second, compared to NHST, IHT can lead to a substantial gain in power (up to 50%; see Vanbrabant et al., 2015). This is because the parameter space is restricted according to the directions and orders defined in the hypothesis.
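In general terms, an informative hypothesis with inequality constraints can be written using a restriction matrix, a standard formulation in the IHT literature (e.g., Silvapulle & Sen, 2005). For our example with $\theta = (\mu_0, \mu_1, \mu_2)^\top$, the ordering constraint corresponds to:

```latex
H_1\colon R\theta > 0, \qquad
R = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{pmatrix}
```

where the first row encodes $\mu_1 - \mu_0 > 0$ and the second row encodes $\mu_2 - \mu_1 > 0$.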
If the researcher would like to implement IHT for the adjusted means, a precise way to formulate the hypotheses of interest is:
$$H_0\colon \mu_0 = \mu_1 = \mu_2 \quad (14)$$
$$H_1\colon \mu_2 > \mu_1 > \mu_0 \quad (15)$$
Regarding our running example, we use the following R syntax to compute the informative Wald test:
Listing 4
Informative Hypothesis Testing in EffectLiteR
# this R Code can be found under 04IHT.R on the OSF project site
elrmod <- effectLite(
y = "posttest",
x = "group",
k = "treatexp",
z = "pretest.cent",
interactions = "2-way",
method = "sem", # must be "sem" for effectLite_iht()
fixed.cell = TRUE , fixed.z = TRUE,
homoscedasticity = TRUE,
data = Data
)
effectLite_iht(
object = elrmod,
constraints = "adjmean2 > adjmean1; adjmean1 > adjmean0"
)
# $test.stat
# [1] "Fbar"
# $Wald.info
# [1] 62.87448
# $pvalue
# [1] 1.054712e-14

For technical reasons, we must use the argument method = "sem" in order to use the effectLite_iht() function. In addition, we have specified the arguments fixed.cell = TRUE, fixed.z = TRUE, and homoscedasticity = TRUE in order to obtain results similar to those obtained with method = "lm". The effectLite_iht() function contains a constraints = argument that can be used to specify the informative hypothesis. Here, the keyword adjmean0 refers to the adjusted mean of the X = 0 group, while adjmean1 and adjmean2 refer to the adjusted means of the X = 1 and X = 2 groups, respectively. The constraints = argument corresponds to the alternative hypothesis as presented in Equation (15). The function returns the value of the informative Wald test and a p-value. For our simulated dataset, we obtain $\bar F = 62.874$ with $p < .001$, which allows us to discard $H_0$ in favor of the ordered hypothesis $H_1\colon \mu_2 > \mu_1 > \mu_0$. This indeed coincides with the expectations of the researcher.
Instead of formulating the informative hypothesis in terms of the adjusted means, we could as well formulate the hypothesis in terms of the average effects, as in Equation (13):
Listing 5
Informative Hypothesis Testing Using Average Effects
# this R Code can be found under 04IHT.R on the OSF project site
effectLite_iht(object = elrmod, constraints = "Eg2 > Eg1; Eg1 > 0")
# $test.stat
# [1] "Fbar"
# $Wald.info
# [1] 62.87448
# $pvalue
# [1] 1.054712e-14

Here, the keyword Eg1 represents the average effect of X = 1 compared to X = 0, denoted by $E(g_1)$ in Equation (9). Similarly, the keyword Eg2 represents the average effect of X = 2 compared to X = 0, denoted by $E(g_2)$ in Equation (10). The result is identical: $\bar F = 62.874$, $p < .001$.
Type A and Type B Hypotheses
In IHT, a distinction is often made between two types of hypotheses, called Type A and Type B hypotheses. The null and alternative hypotheses in our example are Type A hypotheses, which are usually of main interest. When testing Type A hypotheses, $H_0$ (see Equation 14) states that all restrictions are equality restrictions, whereas the alternative hypothesis (see Equation 15) states that at least one inequality restriction is strictly true. Here, the researcher would typically like to obtain a significant result, as this indicates that at least some of the constraints are not equality constraints and thus must be inequality constraints. In contrast, the Type B null hypothesis states that all inequality restrictions hold, whereas the Type B alternative hypothesis states that at least one inequality restriction is violated. When testing Type B hypotheses, the researcher would typically like to obtain a non-significant result, because that would imply that we cannot reject the null hypothesis (and thus the expectations of the researcher) based on the data.
If the researcher observes that the expected constraints are satisfied in the data, testing Type A hypotheses suffices. However, if one or more of the assumed constraints are violated in the data, but only to a very small extent (which might be due to sampling variability), the researcher should conduct a Type B hypothesis test before conducting a Type A hypothesis test and correct for multiple testing. We recommend pre-registering this approach.
In our example, Type B hypotheses should be tested if at least one of the two constraints, either $\mu_2 > \mu_1$ or $\mu_1 > \mu_0$, is violated to a small extent in the data. For example, instead of obtaining estimates that satisfy the constraints ($\hat\mu_0 = 0.044$, $\hat\mu_1 = 0.475$ and $\hat\mu_2 = 0.643$; see Table 7), suppose we would obtain an estimate $\hat\mu_2$ slightly smaller than $\hat\mu_1$, meaning that the constraint $\mu_2 > \mu_1$ would be violated to a very small extent. In that case, the researcher should start with testing the Type B hypotheses:
$H_0\colon \mu_2 \leq \mu_1 \leq \mu_0$ (16)
$H_1\colon \text{not } H_0$, i.e., at least one inequality restriction is violated (17)
where $\mu_0$, $\mu_1$, and $\mu_2$ denote the adjusted means of the three groups.
If the detected violation of a constraint is small, the Type B hypothesis test might still be non-significant, in which case the researcher can proceed to test the Type A hypothesis. If the Type B hypothesis test is significant, then the data clearly contradict the hypothesized order, and there is no need to test the Type A hypothesis.
The following syntax illustrates how we can test this Type B hypothesis for our running example:
Listing 6
Type B Informative Hypothesis Test
# this R Code can be found under 04IHT.R on the OSF project site
effectLite_iht(
object = elrmod,
constraints = "adjmean2 < adjmean1; adjmean1 < adjmean0"
)
# $test.stat
# [1] "Fbar"
# $Wald.info
# [1] 4.309257e-12
# $pvalue
# [1] 0.6304712
We formulated the Type B hypothesis in terms of the adjusted means, but we could as well have used the average effects. Unsurprisingly, when testing $H_0\colon E(g_2) \leq E(g_1) \leq 0$ against its complement, we again obtain $p \approx 0.63$. We are unable to reject the null hypothesis, and this allows us to proceed and test the Type A hypothesis. Lastly, note that testing a Type B hypothesis before testing a Type A hypothesis rarely seems to alter the conclusions, as can be seen from the simulations conducted by Kuiper et al. (2015).
Further Types of Informative Hypotheses
It is also possible to formulate informative hypotheses using other types of constraints (see e.g. Hoijtink, 2012). For example, effect sizes can be incorporated as in:
18
19
20
Then, Equation 18 corresponds to:
21
From a substantive point of view, this means that the researcher assumes that the difference between the conditional effect of receiving CBT together with the new drug and the conditional effect of receiving CBT together with the old drug is greater than a given number of standard deviations. This may give some indication of the relevance of the difference between the two effects.
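As an illustration (in our own notation, with an arbitrary threshold of $0.2$ that is not necessarily the value used in Equations 18–21), such a constraint on standardized conditional effects could take the form:

```latex
% difference between two conditional effects, standardized by the
% outcome standard deviation, assumed to exceed 0.2
\frac{E(g_2 \mid K = 1) - E(g_1 \mid K = 1)}{\sigma_Y} > 0.2
```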
“About equality” constraints can be used to test informative hypotheses such as:
22
23
Finally, range constraints are a generalization of “about equality” constraints. They can be used to test informative hypotheses like:
24
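For concreteness, sketches of both forms in our own notation ($\delta$, $\delta_L$, and $\delta_U$ are researcher-chosen constants, not values from the equations above): an "about equality" constraint states that two quantities differ by less than a small margin, while a range constraint bounds a quantity from both sides:

```latex
% about equality: two adjusted means differ by less than a small delta
|\mu_1 - \mu_0| < \delta

% range constraint: the average effect lies between a lower and an upper bound
\delta_L < E(g_1) < \delta_U
```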
Comparison With Equivalence Testing
Equivalence testing (Schuirmann, 1987; Seaman & Serlin, 1998; Wellek, 2010) is a special case of IHT. This is because hypotheses in equivalence testing are formulated using effect sizes, which is also an option in IHT. More specifically, hypotheses in equivalence testing are based on “smallest effect sizes of interest” (SESOIs), which are used to define a range of effect sizes that are of practical interest to the researcher. Equivalence testing became popular in reaction to the replication crisis and is often used in replication studies (see, e.g., Anderson & Maxwell, 2016; Lakens, 2017; Simonsohn, 2015). Here, researchers aim to show that an observed effect is small enough to conclude that its replication was unsuccessful. Generally, equivalence testing can be conducted within the framework of IHT, but IHT allows for a broader range of hypotheses that can be tested.
Regarding our running example, equivalence testing may be applied as follows. Let us assume that our running example is a replication study. We are interested in showing that the difference between the raw means of Y in groups 1 and 2 is small enough to conclude that it is not of practical relevance. In other words, we want to test whether the raw means of Y in groups 1 and 2 are equivalent (hence the term equivalence testing).6 Furthermore, let us assume that the original study had the same sample size as our running example.
To determine a SESOI, we can use one of several approaches (see, e.g., Lakens et al., 2018). For a discussion of when to use which approach, see Baguley (2009). Here, we will use the popular small telescopes approach (Simonsohn, 2015). It defines the SESOI as the effect size that would give a certain power (say 33%) to the original study. Thus, it indicates the extent to which the replication results are consistent with an effect size large enough to have been detected in the original study (Simonsohn, 2015). We use 33% power, which is typical for the approach. In our running example, this leads to equivalence bounds of $-0.115$ and $0.115$.
We can then use the TOST (two one-sided tests) procedure (Goertzen & Cribbie, 2010; Lakens et al., 2018; Meyners, 2012; Quertemont, 2011; Rogers et al., 1993), which is implemented in the TOSTER package (Lakens & Caldwell, 2022). The procedure tests the effect estimate, in our case the difference between the raw means, against the lower and the upper equivalence bounds. The computations are implemented in R as follows:
Listing 7
Equivalence Testing and IHT
# this R code can be found under 05TOST.R on the OSF project site
library(TOSTER)
library(pwr)
# frequency table for group
table(Data$group)
n1 <- 200 # sample size group '1'
n2 <- 500 # sample size group '2'
# determining the equivalence bounds via the small telescopes approach
d.33 <- (pwr.t.test(
n = (n1 + n2) / 2,
d = NULL,
sig.level = 0.05,
power = 0.33,
type = "two.sample",
alternative = "two.sided"
))$d
# d.33 = 0.1150074
m1 <- mean(Data$posttest[Data$group == 1])
# 0.438377
m2 <- mean(Data$posttest[Data$group == 2])
# 1.017934
sd1 <- sd(Data$posttest[Data$group == 1])
# 0.7120272
sd2 <- sd(Data$posttest[Data$group == 2])
# 1.047107
m2 - m1
# 0.5795566
# using the raw means and sd
tsum_TOST(
m1 = m1, m2 = m2, sd1 = sd1, sd2 = sd2,
n1 = n1, n2 = n2, eqb = d.33, alpha = 0.05, var.equal = FALSE
)
# partial output:
# TOST Results
# t df p.value
# t-test -8.429 533.1 < 0.001
# TOST Lower -6.756 533.1 1
# TOST Upper -10.101 533.1 < 0.001
The difference between the raw means is about $0.58$, and the equivalence bounds are set to $-0.115$ and $0.115$. The TOST procedure consists of two Welch $t$-tests. With $\mu_1$ and $\mu_2$ denoting the raw means of groups 1 and 2, the test against the upper equivalence bound tests $H_0\colon \mu_1 - \mu_2 \geq 0.115$ against $H_1\colon \mu_1 - \mu_2 < 0.115$, and the test against the lower equivalence bound tests $H_0\colon \mu_1 - \mu_2 \leq -0.115$ against $H_1\colon \mu_1 - \mu_2 > -0.115$.
In our example, the test against the upper equivalence bound is significant, $t(533.1) = -10.10$, $p < .001$, whereas the test against the lower equivalence bound is non-significant, $t(533.1) = -6.76$, $p = 1$. Since the conclusion of equivalence can only be drawn if both tests are significant, we cannot reject the presence of a difference in raw means between groups 1 and 2. The classical test from NHST is significant, $t(533.1) = -8.43$, $p < .001$, indicating that the two groups statistically differ with respect to their raw means.
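As a plausibility check, the three $t$ statistics can be reproduced from the summary statistics. This is a sketch assuming group means of $0.438$ and $1.018$ and standard deviations of $0.712$ and $1.047$ (the values that reproduce the reported $t$ statistics and Welch degrees of freedom), using the Welch standard error:

```latex
SE = \sqrt{\frac{0.712^2}{200} + \frac{1.047^2}{500}} \approx 0.0688

t = \frac{0.438 - 1.018}{0.0688} \approx -8.43, \qquad
t_{\text{lower}} = \frac{(0.438 - 1.018) + 0.115}{0.0688} \approx -6.76, \qquad
t_{\text{upper}} = \frac{(0.438 - 1.018) - 0.115}{0.0688} \approx -10.10
```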
In the following sections, we present two empirical data examples that serve to further demonstrate IHT in the EffectLiteR framework. The first example is in the context of linear regression and the second example is in the context of the generalized linear model.
Empirical Example on Linear Regression
The empirical example in this section is based on Keck et al. (2022). We used the “ACTG175” data set (Hammer et al., 1996), which comes with the R package speff2trial (Juraska, 2022) and originates from a randomized trial. For the sake of our illustration, we have replaced the variable names with names that are more common in psychology. More precisely, let us assume that the treatment groups correspond to a group receiving an old, established vocational training program ($X=0$) and a group receiving a promising, novel vocational training program ($X=1$). The outcome variable Y is a measure of job satisfaction and, being measured by a freehand continuous line scale, ranges from 0 to 787. As covariates, we consider a categorical variable indicating whether or not a subject has already completed vocational trainings in the past, and a continuous variable describing job satisfaction at baseline. We analyse a subset of the data from subjects currently holding a full-time job and exclude all other subjects as well as cases with missing data, which leads to a total sample size of . Note that our approach has not been fully tested to handle missing data, which is why we exclude the incomplete cases from the data set. The full R code for this example (including renaming the variables) can be found on the OSF website (see Keck et al., 2024). Below, we only show the most relevant parts for the sake of illustration. To fit this model using EffectLiteR, we can use the following R code:
Listing 8
Fitting the Model Using EffectLiteR
# This R code can be found under 06ex1.R on the OSF project site
elrmod <- effectLite(
y = "jobsatisfaction",
x = "treatment",
k = "past.training",
z = "baseline",
method = "sem",
fixed.cell = TRUE, fixed.z = TRUE,
homoscedasticity = TRUE,
data = Data
)
elrmod@results@adjmeans
# Estimate SE Est./SE
# adjmean0 259.4614 12.83423 20.21636
# adjmean1 299.8482 12.18761 24.60271
We again used method = "sem", as this is needed for the effectLite_iht() function. We can observe that the adjusted means for the control group ($X=0$) and the treatment group ($X=1$) are 259.46 and 299.85, respectively.
Our first hypothesis of interest is that the adjusted mean of the treatment group ($X=1$) is larger than the adjusted mean of the control group ($X=0$). This is a Type A hypothesis, and we have observed that our constraint is indeed satisfied in the data. Therefore, we can test our hypothesis of interest right away, without testing a Type B hypothesis first. We test
$H_0\colon \mu_1 = \mu_0$ (25)
against
$H_1\colon \mu_1 > \mu_0$ (26)
using IHT, and against
$H_1\colon \mu_1 \neq \mu_0$ (27)
using NHST, where $\mu_0$ and $\mu_1$ denote the adjusted means.
Listing 9
Informative Hypothesis Test Using Adjusted Means
# This R code can be found under 06ex1.R on the OSF project site
effectLite_iht(object = elrmod, constraints = "adjmean1 > adjmean0")
# $test.stat
# [1] "Fbar"
#
# $Wald.info
# [1] 5.206926
#
# $pvalue
# [1] 0.01170985
The hypothesis is expressed in terms of adjusted means (for $X=0$ and $X=1$), but because the average effect $E(g_1)$ is simply the difference between these two adjusted means, we can also formulate our hypothesis test as follows:
Listing 10
Informative Hypothesis Test Using the Average Effect
# This R code can be found under 06ex1.R on the OSF project site
effectLite_iht(object = elrmod, constraints = "Eg1 > 0")
# $test.stat
# [1] "Fbar"
#
# $Wald.info
# [1] 5.206926
#
# $pvalue
# [1] 0.01170985
Here, as before, the keyword Eg1 represents the average effect $E(g_1)$. In both cases, we obtain $\bar{F} = 5.21$, $p = .012$, allowing us to reject the null hypothesis in favor of the alternative. Note that if we ignored the order, the resulting (non-informative) Wald statistic would still be 5.21 (because the constraints are satisfied in the data), but the $p$-value would be twice as large ($p = .023$ in this case). This demonstrates the greater power that is typically obtained when using IHT compared to NHST.
The second hypothesis of interest is that the difference in adjusted means between the treatment ($X=1$) and control group ($X=0$) for subjects who have already completed vocational trainings in the past ($K=1$) is larger than zero. Again, this is a Type A hypothesis, but this time regarding a conditional effect as defined in Mayer et al. (2016).
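A sketch of this definition in our own notation (the symbols may differ from the original display): the $(K,Z)$-conditional effect function is the difference in conditional expectations of the outcome, and the conditional effect given $K=k$ is its average within that stratum:

```latex
g_1(K, Z) = E(Y \mid X = 1, K, Z) - E(Y \mid X = 0, K, Z)

E(g_1 \mid K = k) = E\left[\, g_1(K, Z) \mid K = k \,\right]
```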
To obtain the adjusted means for the combinations of the levels of X and K, the following R code can be used:
Listing 11
Adjusted Means for the Combinations of X and K
# This R code can be found under 06ex1.R on the OSF project site
elrmod@results@adjmeansgk
# Estimate SE Est./SE
# adjmean0gk0 210.3477 45.02987 4.671293
# adjmean1gk0 287.6858 34.86036 8.252521
# adjmean0gk1 265.0214 13.34676 19.856606
# adjmean1gk1 301.2251 12.98068 23.205644
We can observe that our constraint is satisfied in the data: adjmean1gk1 (301.23) is larger than adjmean0gk1 (265.02). Therefore, we do not need to test a Type B hypothesis before testing our hypothesis of interest. We test
$H_0\colon E(g_1 \mid K = 1) = 0$ (30)
against
$H_1\colon E(g_1 \mid K = 1) > 0$ (31)
using IHT, and against
$H_1\colon E(g_1 \mid K = 1) \neq 0$ (32)
using NHST.
Listing 12
Informative Hypothesis Test Regarding a Conditional Effect
# This R code can be found under 06ex1.R on the OSF project site
# in terms of adjusted means
effectLite_iht(object = elrmod, constraints = "adjmean1gk1 > adjmean0gk1")
# $test.stat
# [1] "Fbar"
#
# $Wald.info
# [1] 3.781235
#
# $pvalue
# [1] 0.02653014
# in terms of conditional effects
effectLite_iht(object = elrmod, constraints = "Eg1gk1 > 0")
# $test.stat
# [1] "Fbar"
#
# $Wald.info
# [1] 3.781235
#
# $pvalue
# [1] 0.02653014
Using both IHT and NHST, we obtain a test statistic of 3.78. Furthermore, we obtain $p = .027$ when using IHT and $p = .053$ when using NHST. In this case, the greater power of IHT compared to NHST does make a difference concerning the significance of the results.
Empirical Example on the Generalized Linear Model
The empirical example in this section is based on Keck et al. (2023). We used the “ProblemDrinking” data set, which is available on the OSF project site (“problemDrinking.sav”) (see Keck et al., 2024). It stems from a randomized study investigating the effectiveness of mobile messaging interventions on problematic drinking behavior (Muench et al., 2017). We consider three groups, namely a control group ($X=0$), which receives weekly self-tracking texts, a group obtaining static tailored texts ($X=1$), and a group obtaining adaptive, that is, individually tailored texts ($X=2$). The outcome variable Y is the reduction of the sum of weekly drinks. We treat Y as a count variable. As covariates, we consider the sum of weekly drinks at baseline, age, and gender.
The full R script of this example, “07ex2.R”, can be found in the project’s OSF repository (see Keck et al., 2024). Note that the computation of the average and conditional effects in this section is based on Poisson regression and thus differs from the computation in linear regression. At the time of writing, the EffectLiteR package does not yet include out-of-the-box support for Poisson outcome variables. Instead, we provide a script, “effectLite_pois.R”, that handles the computations for this particular example and is used in the “07ex2.R” script.
We start with using glm() to fit the Poisson model:
Listing 13
Using glm() to Fit a Poisson Model
# This R code can be found under 07ex2.R on the OSF project site
fit.glm <- glm(formula = drinksum_post ~ treat + drinksum_pre + age + gender +
treat:drinksum_pre + treat:age + treat:gender +
drinksum_pre:age + drinksum_pre:gender + age:gender,
family = poisson(link = "log"), data = Data)
Our first hypothesis of interest is that the average effect of receiving adaptive tailored texts ($X=2$) is larger than the average effect of receiving static tailored texts ($X=1$). We first compute the adjusted means:
Listing 14
Adjusted Means for the Three Groups
# This R code can be found under 07ex2.R on the OSF project site
get_adjmeans(fit.glm)
# adjmean0 adjmean1 adjmean2
# 22.02197 17.24442 15.26413
From these adjusted means, we can compute the average effects. For $X=1$, the average effect (in terms of reduction) is $22.02 - 17.24 = 4.78$ (the difference between adjmean0 and adjmean1), while for $X=2$, the average effect is $22.02 - 15.26 = 6.76$ (the difference between adjmean0 and adjmean2). This is in line with our expectations, and we can proceed with a Type A hypothesis test. We test
$H_0\colon E(g_2) = E(g_1)$ (33)
against
$H_1\colon E(g_2) > E(g_1)$ (34)
using IHT, and against
$H_1\colon E(g_2) \neq E(g_1)$ (35)
using NHST.
Listing 15
Informative Wald Statistic for This Hypothesis
# This R code can be found under 07ex2.R on the OSF project site
Wald.reg.ave <- getStat(fit.glm, type = "regular", effect = "average")
Wald.reg.ave[1]
# 27.93786
# p-value regular Wald
1 - pchisq(Wald.reg.ave[1], df = 1)
# 1.252745e-07
Wald.info.ave <- getStat(fit.glm, type = "informative", effect = "average")
Wald.info.ave[1]
# 27.93786
# informative p-value (warning: takes about 14-18 hours)
# pvalue <- get_informative_pvalue(object = fit.glm, data = Data, R = 1000,
# effect = "average",
# Wald.orig = Wald.info.ave [1])
# pvalue
# 0
The informative Wald statistic equals 27.94. Because the constraint is satisfied in the data, the regular (non-informative) Wald statistic is the same (27.94). The $p$-value for the regular Wald test is easy to compute and is very small ($p \approx 1.25 \times 10^{-7}$). The computation of the $p$-value for the informative test takes a long time (about 14–18 hours), but again results in a very small $p$-value ($p \approx 0$).
The second hypothesis of interest is that the difference in adjusted means between the group receiving individually tailored texts ($X=2$) and the control group ($X=0$) is larger for females than for males. This is a Type A hypothesis concerning conditional effects as defined in Mayer et al. (2016).
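A sketch of the analogous definition for this example in our own notation (the symbols may differ from the original display; the conditional expectations are obtained from the fitted Poisson model, i.e., via the exponentiated linear predictor, and the sign is chosen so that positive values correspond to a larger reduction relative to the control group):

```latex
g_2(K, Z) = E(Y \mid X = 0, K, Z) - E(Y \mid X = 2, K, Z)

E(g_2 \mid K = k) = E\left[\, g_2(K, Z) \mid K = k \,\right]
```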
We can compute the adjusted means for the various combinations of X and K as follows:
Listing 16
Adjusted Means for the Different X and K Levels
# This R code can be found under 07ex2.R on the OSF project site
get_adjmeansgk(fit.glm)
# adjmean0gk0 adjmean1gk0 adjmean2gk0 adjmean0gk1 adjmean1gk1 adjmean2gk1
# 22.78305 17.89236 14.79266 22.21254 17.24841 18.87397
We observe that our constraint is again satisfied in the data, since the conditional effect for $K=0$ (adjmean0gk0 − adjmean2gk0 ≈ 7.99) is larger than the conditional effect for $K=1$ (adjmean0gk1 − adjmean2gk1 ≈ 3.34). Therefore, we directly test:
$H_0\colon E(g_2 \mid K = 0) = E(g_2 \mid K = 1)$ (40)
against
$H_1\colon E(g_2 \mid K = 0) > E(g_2 \mid K = 1)$ (41)
using IHT and against
$H_1\colon E(g_2 \mid K = 0) \neq E(g_2 \mid K = 1)$ (42)
using NHST. To compute the informative and regular Wald statistics, we can use the following code:
Listing 17
Informative and Regular Wald Test for Conditional Effects
# This R code can be found under 07ex2.R on the OSF project site
Wald.reg.cond <- getStat(fit.glm, type = "regular", effect = "conditional")
Wald.reg.cond[1]
# 1.75101
# p-value regular Wald
1 - pchisq(Wald.reg.cond[1], df = 1)
# 0.1857499
Wald.info.cond <- getStat(fit.glm, type = "informative", effect = "conditional")
Wald.info.cond[1]
# 1.75101
# informative p-value (warning: takes about 4–5 hours)
# pvalue <- get_informative_pvalue(object = fit.glm, data = Data, R = 1000,
# effect = "conditional",
# Wald.orig = Wald.info.cond[1])
# pvalue
# 0.104
The Wald statistics are both 1.75. The $p$-value for the regular Wald statistic is $p = .186$, and the $p$-value for the informative Wald statistic is $p = .104$. Again, computing the latter takes a long time (about 4–5 hours). In both cases, we cannot reject the null hypothesis.
Discussion
This paper provided a condensed outline of the theoretical motivation for using IHT in the EffectLiteR framework, as well as practical instructions on how to apply this method in the context of linear regression and the generalized linear model. We hope that this paper will stimulate researchers to question the common practice of using ANOVA in combination with NHST to compare groups. Our critique of this procedure focuses on two aspects. The first point of criticism concerns the unclear definition of effects that results from the different possible choices of sums of squares (SS) in ANOVA. In contrast, when using our proposed method, effects of interest are defined in a precise and unambiguous way. The second point of criticism concerns the expected order of the effects, which is ignored when using NHST. In contrast, when using our proposed method, the order of the effects can be incorporated directly in the hypotheses.
Snippets of R code were shown in the various code listings included in the paper to illustrate how the EffectLiteR package can be used to test informative hypotheses about adjusted means, average effects, and conditional effects. The full R code for all examples is available on the OSF project site (see Keck et al., 2024). Only for the generalized linear model example did we provide custom R code that needs to be adapted by the user. In future work, we plan to create easy-to-use functions (within EffectLiteR) that can handle IHT in the context of generalized linear models.
Together with our past work (Keck et al., 2021, 2022, 2023), we have provided thorough technical explanations as well as useful practical information and instructions for applied researchers who wish to use IHT in the EffectLiteR framework. We have built a solid foundation for our method in the context of regression models and would like to expand it to structural equation modeling (SEM) in the future. Some of the groundwork for this has already been done in Keck et al. (2021), where we used SEM for parameter estimation when considering stochastic group weights (Mayer & Thoemmes, 2019). Implementing our method in SEM will be especially useful because many variables of interest in the social and behavioral sciences, such as “quality of life” or “socio-economic status”, are latent in nature and should not be treated as manifest.
Another potential area for further development is extending the presented approach to a Bayesian framework. In this manuscript, we focused on the frequentist approach: the EffectLiteR model is estimated using either OLS or ML in the example with the continuous dependent variable, and IWLS in the example with the count dependent variable, and informative test statistics are used. Both aspects have Bayesian counterparts. The regression models can be estimated using Bayesian techniques; for an example of a Bayesian EffectLiteR application using blavaan (Merkle et al., 2021), see Mayer et al. (2017). Furthermore, informative hypotheses can be evaluated in a Bayesian framework using Bayes factors (e.g., Hoijtink, 2012; Van Lissa et al., 2020). Combining Bayesian EffectLiteR and Bayesian informative hypothesis testing is promising and may provide even more flexibility, in particular when specific prior information is available that can be incorporated into the analysis.
This is an open access article distributed under the terms of the Creative Commons Attribution License.