This note sketches two computational shortcuts for estimating multidimensional item response theory (IRT) models (van der Linden & Hambleton, 1997). We focus on estimation with the expectation-maximization (EM) algorithm using numerical integration. In particular, we consider a more efficient implementation of the expectation step (E-step) of the EM algorithm. The computational cost is estimated by counting the number of required operations. Additionally, computational gains observed in a pilot study are reported. Our primary focus is to propose two algorithmic shortcuts that avoid redundant operations. Such approaches can also be found for structural equation models (Pritikin, Hunter, von Oertzen, Brick, & Boker, 2017; von Oertzen & Brick, 2014).
This paper is structured as follows. First, the EM algorithm in IRT models is introduced, and its computational cost is analyzed. Second, we show how computational gains are realized by using so-called pseudo-items instead of the original items. Third, we suggest a computational shortcut for multidimensional IRT models whose items possess a simple loading structure (i.e., between-item dimensionality). Finally, we conclude with a discussion.
EM Algorithm in IRT Models
For simplicity, we focus on dichotomous item responses in this paper, where $n = 1, \ldots, N$ denotes persons and $i = 1, \ldots, I$ denotes items. Let $\boldsymbol{x}_n = (x_{n1}, \ldots, x_{nI})$ denote the vector of observed data for person $n$. In the IRT model, it is assumed that the item response function (IRF) of item $i$ depends on item parameters $\boldsymbol{\gamma}_i$, and the multidimensional distribution of the latent variable $\boldsymbol{\theta}$ depends on some parameter $\boldsymbol{\delta}$. Let $P_i(\boldsymbol{\theta}) = P(X_i = 1 \mid \boldsymbol{\theta})$ denote the item response probability of item $i$. The total log-likelihood is then given as
$$ l(\boldsymbol{\gamma}, \boldsymbol{\delta}) = \sum_{n=1}^{N} \log \int \prod_{i=1}^{I} P_i(\boldsymbol{\theta})^{x_{ni}} \left(1 - P_i(\boldsymbol{\theta})\right)^{1 - x_{ni}} f(\boldsymbol{\theta}; \boldsymbol{\delta}) \, d\boldsymbol{\theta} \qquad (1) $$
where $f(\boldsymbol{\theta}; \boldsymbol{\delta})$ denotes the density of the latent variable $\boldsymbol{\theta}$. In the following, we approximate the integral in Equation 1 by a finite summation and use a set of fixed quadrature points $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_T$. The numerically approximated log-likelihood is then given as
$$ l(\boldsymbol{\gamma}, \boldsymbol{\delta}) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \prod_{i=1}^{I} P_i(\boldsymbol{\theta}_t)^{x_{ni}} \left(1 - P_i(\boldsymbol{\theta}_t)\right)^{1 - x_{ni}} f(\boldsymbol{\theta}_t; \boldsymbol{\delta}) \qquad (2) $$
where $f(\boldsymbol{\theta}_t; \boldsymbol{\delta})$ is a discrete density distribution for $\boldsymbol{\theta}$. The latent variable $\boldsymbol{\theta}$ cannot be directly observed. For this reason, the EM algorithm (see Aitkin, 2016; Fu, 2019; Glas, 2016, for overviews) is often used for maximizing $l$ in Equation 2. The EM algorithm alternates between two steps (see Baker & Kim, 2004, for details). In the E-step, the expected value of $l$ in Equation 2 is computed by integrating out $\boldsymbol{\theta}$ using individual posterior distributions. In the M-step, the expected log-likelihood is maximized based on the expected counts computed in the E-step. Note that the update of the item parameters in the M-step amounts to fitting a logistic regression model with case weights.
In the E-step, individual posterior distributions are computed as
$$ P(\boldsymbol{\theta}_t \mid \boldsymbol{x}_n) = \frac{f(\boldsymbol{\theta}_t; \boldsymbol{\delta}) \prod_{i=1}^{I} P_i(\boldsymbol{\theta}_t)^{x_{ni}} \left(1 - P_i(\boldsymbol{\theta}_t)\right)^{1 - x_{ni}}}{\sum_{u=1}^{T} f(\boldsymbol{\theta}_u; \boldsymbol{\delta}) \prod_{i=1}^{I} P_i(\boldsymbol{\theta}_u)^{x_{ni}} \left(1 - P_i(\boldsymbol{\theta}_u)\right)^{1 - x_{ni}}} \qquad (3) $$
For all items, expected counts are computed as
$$ r_{it} = \sum_{n=1}^{N} \mathbb{1}(x_{ni} = 1) \, P(\boldsymbol{\theta}_t \mid \boldsymbol{x}_n) \qquad (4) $$
where $\mathbb{1}$ denotes the indicator function.
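As an illustration, the E-step computations described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation evaluated later in this note; the function name and array layout are our own choices.

```python
import numpy as np

def e_step(X, P, prior):
    """One E-step for a dichotomous IRT model on a fixed quadrature grid.

    X     : (N, I) 0/1 item responses
    P     : (T, I) item response probabilities evaluated at the grid points
    prior : (T,)   discrete density of the latent variable at the grid points
    Returns posteriors (N, T), expected correct counts (I, T),
    and expected numbers of persons per grid point (T,).
    """
    # likelihood of each response vector at each grid point (Equation 3 numerator):
    # L[n, t] = prod_i P[t, i]**x_ni * (1 - P[t, i])**(1 - x_ni)
    L = np.prod(np.where(X[:, None, :] == 1, P[None, :, :], 1.0 - P[None, :, :]),
                axis=2)
    post = L * prior[None, :]
    post = post / post.sum(axis=1, keepdims=True)   # normalize (Equation 3)
    n_t = post.sum(axis=0)                          # expected persons at each grid point
    r_it = X.T @ post                               # expected correct counts (Equation 4)
    return post, r_it, n_t
```

The matrix product `X.T @ post` implements the indicator function implicitly: only the posterior probabilities of subjects with a correct response contribute to `r_it`.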
We can now compute the total number of operations (multiplications or additions) needed in the E-step. The multiplications of probabilities in the numerator of the fraction in Equation 3 are typically substituted by additions of logarithmized probabilities. The following identity is used:
$$ \log \prod_{i=1}^{I} p_i = \sum_{i=1}^{I} \log p_i $$
Notably, additions are typically faster than multiplications. Moreover, calculations on the logarithmic metric prevent numerical underflow (see, e.g., Vermunt, 2003, for an application in multilevel latent class models). First, for the computation of individual posterior distributions (see Equation 3), $N \cdot T \cdot I$ operations are needed. In addition, $N \cdot T$ operations are needed for normalizing the individual posterior distributions. However, we ignore this number of operations in our calculations because we are particularly interested in situations in which $I$ is large, say larger than 50.
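The log-scale computation can be sketched as follows. The helper below is hypothetical (not code from this note); it illustrates the standard stabilization of subtracting the per-person maximum before exponentiating, so that products of many small probabilities never underflow.

```python
import numpy as np

def posterior_log_domain(X, P, prior):
    """Individual posterior distributions computed on the log scale.

    Replaces the product of I probabilities by a sum of I logarithmized
    probabilities, then exponentiates after subtracting the row maximum.
    X: (N, I) 0/1 responses; P: (T, I) item probabilities; prior: (T,).
    """
    logP, log1mP = np.log(P), np.log1p(-P)              # (T, I)
    # log-likelihood at each grid point as a sum of log-probabilities
    logL = X @ logP.T + (1 - X) @ log1mP.T              # (N, T)
    logpost = logL + np.log(prior)[None, :]
    logpost = logpost - logpost.max(axis=1, keepdims=True)  # underflow guard
    post = np.exp(logpost)
    return post / post.sum(axis=1, keepdims=True)
```

With, say, 2000 items of probability 0.5, the direct product $0.5^{2000}$ underflows to zero in double precision, whereas the log-scale version remains finite.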
Second, $N \cdot T \cdot I$ operations are required for computing expected counts in Equation 4. Note that in software implementations, the multiplication with the indicator function in Equation 4 need not be carried out in the summation for the expected counts because one only has to add posterior probabilities for subjects with $x_{ni} = 1$. In practice, one uses an index vector for storing the necessary information. In total, the number of operations needed in the E-step can be determined as
$$ H_0 = 2 \cdot N \cdot T \cdot I \qquad (6) $$
Computationally Efficient EM Algorithms
One strategy for improving computational efficiency is to use a parallelized version of an EM algorithm (López-de Teruel, García, & Acacio, 1999). This approach distributes the computational work in the E-step and the M-step among multiple cores. There are several applications of this strategy in mixture modeling (Chen, Ostrouchov, Pugmire, Prabhat, & Wehner, 2013; Kwedlo, 2014; Lee, Leemaqz, & McLachlan, 2016; Lee, McLachlan, & Leemaqz, 2020; Yin, Zhang, & Gao, 2018). von Davier (2016) proposed a parallel EM algorithm for multidimensional IRT models and found considerable performance gains, in particular, for large datasets (see also Khorramdel, Shin, & von Davier, 2019).
A different strand of literature relies on simplifying the model structure in multidimensional IRT models (see Rijmen, Jeon, von Davier, & Rabe-Hesketh, 2014, for an overview). In this case, the computation of posterior distributions can be simplified if some trait distributions are (conditionally) independent (Rijmen, 2009). Examples include the bi-factor model (a.k.a. testlet models, Gibbons & Hedeker, 1992) and the two-tier model (Cai, 2010).
Usage of Pseudo-Items
In the following, the concept of pseudo-items is introduced. The main idea is to perform the computationally intensive likelihood computation on pseudo-items instead of original items to reduce computation time. A pseudo-item is a polytomous item composed of $k$ original items. Assume that $I = S \cdot k$ for an integer $S$; there are then $S$ pseudo-items. The number $k$ is also referred to as the size of the pseudo-items. More formally, let $(X_1, \ldots, X_k)$ denote a vector of polytomous items, each possessing $C$ categories. A pseudo-item is then defined as
$$ W = \sum_{j=1}^{k} C^{\,j-1} X_j $$
The pseudo-item $W$ takes integer values $0, 1, \ldots, C^k - 1$.
In the sequel, we only consider the case of dichotomous items (i.e., $C = 2$). For example, $k$ original dichotomous items are recoded as a pseudo-item that takes integer values $0, 1, \ldots, 2^k - 1$. The original data consisting of $I$ dichotomous item responses for $N$ subjects is recoded into data consisting of $S = I/k$ polytomous pseudo-items.
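The recoding step can be sketched as a bit-packing operation. The grouping of consecutive items and the function name are our own conventions; the first item of each group is treated as the least significant bit.

```python
import numpy as np

def to_pseudo_items(X, k):
    """Recode an (N, I) matrix of 0/1 responses into (N, I // k) pseudo-items.

    Each group of k consecutive dichotomous items is packed into one integer
    in 0 .. 2**k - 1 (first item of the group = least significant bit).
    """
    N, I = X.shape
    assert I % k == 0, "number of items must be a multiple of the pseudo-item size"
    weights = 2 ** np.arange(k)               # binary weights 1, 2, 4, ...
    return X.reshape(N, I // k, k) @ weights  # (N, S) with S = I / k
```

The recoding is lossless: shifting and masking the pseudo-item values recovers the original responses.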
Using Pseudo-Items in the EM Algorithm
A computationally more efficient EM algorithm is obtained by evaluating the posterior distributions based on these pseudo-items. Instead of $I$ original items, only $S = I/k$ pseudo-items are involved in the posterior, and the number of multiplications needed in the posterior is reduced. However, the number of probabilities that have to be additionally precomputed is given by $T \cdot 2^k \cdot I/k$. Due to local independence, the item response probabilities of pseudo-item $W_s$ are given as
$$ P(W_s = w \mid \boldsymbol{\theta}_t) = \prod_{j=1}^{k} P_{sj}(\boldsymbol{\theta}_t)^{w_j} \left(1 - P_{sj}(\boldsymbol{\theta}_t)\right)^{1 - w_j} \qquad (9) $$
where $(w_1, \ldots, w_k)$ is the binary representation of the value $w$, and $P_{sj}$ denotes the IRF of the $j$th item within pseudo-item $s$.
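Under local independence, the $2^k$ pattern probabilities of one pseudo-item can be precomputed as products of the constituent item probabilities. A hypothetical NumPy sketch (the first item of the group is treated as the least significant bit of the pattern value):

```python
import numpy as np

def pseudo_item_probs(P_group):
    """Probabilities of all 2**k response patterns of one pseudo-item.

    P_group : (T, k) probabilities of the k constituent items at each grid point.
    Returns (T, 2**k); column w holds the probability of pattern w, obtained
    as the product of the k item probabilities (local independence).
    """
    T, k = P_group.shape
    # (2**k, k) matrix of binary digits of the pattern values 0 .. 2**k - 1
    patterns = (np.arange(2 ** k)[:, None] >> np.arange(k)) & 1
    # probs[t, w] = prod_j P[t, j]**bit_j * (1 - P[t, j])**(1 - bit_j)
    probs = np.prod(np.where(patterns[None, :, :] == 1,
                             P_group[:, None, :], 1.0 - P_group[:, None, :]),
                    axis=2)
    return probs
```

For each grid point, the $2^k$ pattern probabilities sum to one, which provides a simple correctness check.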
The number of operations needed in the posterior computation based on the pseudo-item probabilities in Equation 9 is $N \cdot T \cdot I/k$.
Expected counts can now be computed for pseudo-items $W_s$ ($s = 1, \ldots, S$):
$$ \tilde{r}_{swt} = \sum_{n=1}^{N} \mathbb{1}(w_{ns} = w) \, P(\boldsymbol{\theta}_t \mid \boldsymbol{x}_n) $$
Therefore, for each pseudo-item $s$ and each trait grid point $\boldsymbol{\theta}_t$, $2^k$ expected counts are computed. Based on these counts $\tilde{r}_{swt}$, expected counts for item $i$ can be computed by summing over the appropriate counts of pseudo-items. The total number of operations needed for computing the expected counts is then given as $N \cdot T \cdot I/k$. Hence, the total number of operations for computing individual posterior distributions and expected counts for $I$ items based on the approach using pseudo-items is given as
$$ H_1 = 2 \cdot N \cdot T \cdot I/k + T \cdot I \cdot 2^k \qquad (11) $$
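The recombination of pseudo-item counts into item-level counts can be sketched as follows; the array layout and function name are our own assumptions. For each item, one sums the counts of all patterns in which that item was answered correctly.

```python
import numpy as np

def item_counts_from_pseudo(r_pseudo, k):
    """Recover item-level expected correct counts from pseudo-item counts.

    r_pseudo : (S, 2**k, T) expected counts for pseudo-item s, pattern w,
               and grid point t.
    Returns (S * k, T): the count for item j inside a pseudo-item is the sum
    of the counts of all patterns whose j-th bit equals 1.
    """
    S, W, T = r_pseudo.shape
    bits = (np.arange(W)[:, None] >> np.arange(k)) & 1      # (2**k, k)
    # r_item[s*k + j, t] = sum_w bits[w, j] * r_pseudo[s, w, t]
    return np.einsum('swt,wj->sjt', r_pseudo, bits).reshape(S * k, T)
```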
We can now define the computational gain $G(k)$ that describes the factor of decrease in computation time by using pseudo-items:
$$ G(k) = \frac{H_0}{H_1} = \frac{2 \cdot N \cdot T \cdot I}{2 \cdot N \cdot T \cdot I/k + T \cdot I \cdot 2^k} = \frac{2N}{2N/k + 2^k} \qquad (12) $$
Hence, for a fixed pseudo-item size $k$, the maximum gain is $k$ (attained as $N \to \infty$). It should be emphasized that the gain in the E-step depends on neither the number of items $I$ nor the number of integration points $T$.
In Figure 1, the computational gain $G(k)$ is shown as a function of the pseudo-item size $k$ for a sample size of $N = 1000$. It can be seen that there is a maximum at $k = 6$ with a corresponding gain of about 5. This means that using pseudo-items that are defined by grouping 6 original items results in the highest efficiency gain. Notably, using a too large value of $k$ results in a very inefficient computation.
Due to $\lim_{k \to \infty} 2^k/k = \infty$, it holds that $\lim_{k \to \infty} G(k) = 0$. Hence, the size of the pseudo-items has to be chosen carefully, and using a too large value of $k$ can lead to an inefficient algorithm. Also, note that for a fixed $k$, it holds that $\lim_{N \to \infty} G(k) = k$.
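These limiting properties can be checked numerically. The sketch below assumes the gain formula $G(k) = 2N/(2N/k + 2^k)$ from Equation 12 (the ratio of $2NTI$ original operations to $2NTI/k + TI \cdot 2^k$ operations with pseudo-items); under this assumption, it reproduces the maximum gains reported in Table 1.

```python
def gain(k, N):
    """Computational gain of pseudo-items of size k for sample size N
    (Equation 12): G(k) = 2N / (2N/k + 2**k)."""
    return 2 * N / (2 * N / k + 2 ** k)

def best_pseudo_item_size(N, k_max=40):
    """Integer pseudo-item size with the largest gain."""
    return max(range(1, k_max + 1), key=lambda k: gain(k, N))
```

For example, `best_pseudo_item_size(10**6)` returns 14 with a gain of about 12.6, in line with Table 1, and the gain collapses toward zero for very large $k$.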
Computation of an Optimal Pseudo-Item Size
We now determine the optimal pseudo-item size $k$ that maximizes the gain $G$ defined in Equation 12. For simplicity, the pseudo-item size $k$ is now regarded as a continuous variable. Define $f(k) = 2N/k + 2^k$. For a fixed sample size $N$, a maximum of $G$ corresponds to a minimum of $f$. The derivative of $f$ is given by $f'(k) = -2N/k^2 + \ln(2) \cdot 2^k$. Hence, the optimal value $k_{\mathrm{opt}}$ fulfills
$$ 2N = \ln(2) \cdot k^2 \cdot 2^k \qquad (13) $$
By taking the logarithm in Equation 13, it can be seen that the optimal $k$ fulfills
$$ \log_2(2N) - \log_2(\ln 2) = k + 2 \log_2 k $$
Because the term $2^k$ increases faster than the term $k^2$, the optimal $k$ is primarily a function of $\log_2 N$; i.e., $k_{\mathrm{opt}}$ grows approximately like $\log_2 N$ for large $N$.
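The optimality condition can also be solved numerically. A small sketch by bisection, assuming the condition $2N = \ln(2) \cdot k^2 \cdot 2^k$ from Equation 13:

```python
import math

def optimal_k_continuous(N, lo=1.0, hi=60.0, iters=80):
    """Solve 2N = ln(2) * k**2 * 2**k for k by bisection (Equation 13).

    The right-hand side is strictly increasing in k, so the root is unique
    on the bracket [lo, hi] whenever it is contained in it.
    """
    f = lambda k: math.log(2) * k * k * 2 ** k - 2 * N
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(mid) > 0 else (mid, hi)
    return (lo + hi) / 2
```

For $N = 10^6$ the continuous solution is close to 14, and for $N = 10^7$ it is close to 17, consistent with the integer optima reported in Table 1.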
We now illustrate the computational gains of the more efficient E-step implementation for a range of sample sizes. In Table 1, the computational gain of using pseudo-items in the E-step is shown as a function of the sample size $N$ using Equation 12. It can be seen that the maximum computational gain increases with increasing sample size. Increases in computational efficiency are indeed non-negligible, at least for applications in international large-scale assessment studies like PISA (OECD, 2017) that involve student sample sizes of about half a million.
| Sample size $N$ | $k_{\mathrm{opt}}$ | $G_{\max}$ |
|---|---|---|
| 1 000 000 | 14 | 12.6 |
| 10 000 000 | 17 | 15.3 |

Note. $k_{\mathrm{opt}}$ = pseudo-item size that results in the maximum computational gain; $G_{\max}$ = maximum computational gain.
We now demonstrate how the proposed shortcut using pseudo-items performs in an experimental implementation using the Rcpp package (Eddelbuettel, 2013; Eddelbuettel & François, 2011) in the R software (R Core Team, 2020). The E-step consists of three parts in the computation. First, the logarithms of the item response probabilities are added to the logarithm of the prior distribution. In this first part, an incomplete version of the unnormalized likelihood is computed on the logarithmic scale. Afterward, the $\exp$ operation has to be applied to obtain the complete unnormalized likelihood. Second, the unnormalized likelihood on the logarithmic scale is further processed by taking the $\exp$ operation and normalizing the likelihood contributions for each subject to obtain individual posterior distributions. Third, expected counts are computed based on the individual posterior distributions.
The computational gains were investigated using sample sizes ranging up to 100 000 subjects (including 1000, 10 000, and 100 000). The number of items was fixed at 50, 100, or 200. In the EM algorithm, the number of quadrature points was fixed at 21. The R package microbenchmark (Mersmann, Beleites, Hurling, Friedman, & Ulrich, 2019) was used for timing comparisons. In each condition, 500 replications were utilized in the comparison.
In Table 2, computational gains are displayed for different pseudo-item sizes. It can be seen that efficiency gains were only about 2 or 3 when the complete E-step consisting of all three parts was used in the comparison. Notably, the exponentiation step and the normalization of the likelihood were not quantified in our analytically derived computational gains. Interestingly, it turned out that the predicted computational gains were attained for the computation of the unnormalized likelihood (without taking $\exp$; first part) and the computation of expected counts (third part). The second part (exponentiation and normalization of the likelihood terms) is unrelated to the use of pseudo-items. When considering only the first part of the E-step, computational gains ranged between 3 and 10, meaning that this part is three to ten times faster than a previous implementation. For the third part, similar gains were obtained. Very likely, this finding demonstrates that applying the exponential function is much more computationally demanding than an addition or a multiplication.
By comparing the computational gains for the unnormalized likelihood with the predictions of Equation 12 in Table 2, we observe moderate agreement of the maximum computational gains for sample sizes of at least 1 000. Differences between timings obtained from implementations and analytical predictions could also emerge from the computation time required for memory allocation or for accessing appropriate matrix entries.
Table 2: Computational gains for the unnormalized likelihood (without the $\exp$ operation) and computational gains predicted from Equation 12, by sample size, number of items, and pseudo-item size $k$.
Note. Optimal computational gains are printed in bold. Computational gains computed by Equation 12 do not involve the number of items.
We also investigated computational gains using 61 quadrature points, which is the default for unidimensional IRT models in the popular R package mirt (Chalmers, 2012). The findings for computational gains were similar, albeit not identical (i.e., gains varied by at most 1 in each condition).
It has to be noted that the optimal pseudo-item size from Equation 12 does not coincide with the empirically obtained optimal pseudo-item size. While the computational gains in the computation of the unnormalized likelihood approximately match the analytical predictions (see Table 2), the optimal pseudo-item size turned out to be smaller in practical implementations. Because results depend on the specific software implementation, we recommend always checking the potential computational gains as a function of the sample size, the number of items, and the number of quadrature points when determining the optimal pseudo-item size.
Overall, it has to be admitted that the proposed shortcut might not result in considerable gains in the E-step of the EM algorithm. However, gains of a similar size would be obtained in parallel computing only if three or four cores were used.
Dimension-Wise Likelihood Computation in Multidimensional IRT Models with Between-Item Dimensionality
Assume now that in a multidimensional IRT model, each item loads on exactly one dimension. This property is labeled between-item dimensionality (Adams, Wilson, & Wang, 1997). It is assumed that for each dimension $d = 1, \ldots, D$, there are $I_d$ items. The item response probability in the case of between-item dimensionality now only depends on one dimension. Hence, the total likelihood contribution for subject $n$ can be factored as
$$ P(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \prod_{d=1}^{D} \prod_{i=1}^{I_d} P_{di}(\theta_d)^{x_{ndi}} \left(1 - P_{di}(\theta_d)\right)^{1 - x_{ndi}} \qquad (16) $$
where $x_{ndi}$ denotes the item response of subject $n$ for the $i$th item on dimension $d$. From Equation 16, it is evident that the likelihood part in individual posterior distributions can be independently computed for all dimensions. In a multidimensional IRT model, one often uses the same grid values across dimensions. For example, let $\theta_1, \ldots, \theta_T$ again be the grid of fixed quadrature points for each dimension; the multidimensional grid is then obtained by defining $\boldsymbol{\theta}_{\boldsymbol{t}} = (\theta_{t_1}, \ldots, \theta_{t_D})$ for $\boldsymbol{t} = (t_1, \ldots, t_D)$. In this way, $T^D$ grid values are defined, and each component of such a vector takes values in $\{\theta_1, \ldots, \theta_T\}$. Using Equation 6, on the order of $N \cdot T^D \cdot I$ operations would be needed for computing the posterior distributions. It turns out that in the computation of the posterior, the product term over all items does not have to be calculated on the full multidimensional grid. In contrast, one can rely on the factorization
$$ \prod_{d=1}^{D} \prod_{i=1}^{I_d} P_{di}(\theta_{t_d})^{x_{ndi}} \left(1 - P_{di}(\theta_{t_d})\right)^{1 - x_{ndi}} = \prod_{d=1}^{D} L_{nd}(\theta_{t_d}), $$
in which the dimension-wise likelihood terms $L_{nd}$ only have to be evaluated at the $T$ one-dimensional grid points.
The expected counts for an item only have to be evaluated for a single dimension due to between-item dimensionality. Hence, one can integrate over all dimensions except the $d$th dimension. Using the notation $\sum_{-d}$ for a summation over all dimensions except $d$, one can compute dimension-wise individual marginal posterior distributions $P(\theta_{t_d} \mid \boldsymbol{x}_n)$:
$$ P(\theta_{t_d} \mid \boldsymbol{x}_n) \propto \sum_{-d} f(\boldsymbol{\theta}_{\boldsymbol{t}}; \boldsymbol{\delta}) \prod_{e=1}^{D} L_{ne}(\theta_{t_e}), $$
where $L_{ne}(\theta_{t_e})$ denotes the likelihood contribution of subject $n$ on dimension $e$. Based on these marginal distributions, the computational demand of the expected counts is substantially reduced, as one only has to sum over individual probabilities related to exactly one dimension for each item. By employing these computational shortcuts, the number of required operations can be reduced considerably.
This less demanding computation provides a computational gain. For a large number of items $I$, the computational gain approaches an upper bound; note that the computational gain is independent of the sample size $N$.
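The dimension-wise marginalization can be sketched in NumPy. The sketch below is our own illustration: it assumes precomputed dimension-wise likelihoods (one $(N, T)$ array per dimension) and, for clarity, still materializes the full $T^D$ grid once; it therefore illustrates the factorization itself rather than every operation-count saving discussed above.

```python
import numpy as np

def marginal_posteriors(L_dims, prior_grid):
    """Dimension-wise marginal posteriors under between-item dimensionality.

    L_dims     : list of D arrays of shape (N, T); L_dims[d][n, t] holds the
                 likelihood of person n's responses on dimension d at theta_t.
    prior_grid : array of shape (T,) * D, discrete prior on the full grid.
    Returns a list of D arrays (N, T) with normalized marginal posteriors.
    """
    D = len(L_dims)
    N, T = L_dims[0].shape
    # unnormalized joint posterior on the full grid: prior * prod_d L_d
    post = np.broadcast_to(prior_grid, (N,) + (T,) * D).astype(float)
    for d, L in enumerate(L_dims):
        shape = [N] + [1] * D
        shape[1 + d] = T            # broadcast L_d along its own axis only
        post = post * L.reshape(shape)
    post = post / post.reshape(N, -1).sum(axis=1).reshape((N,) + (1,) * D)
    # sum out all trait dimensions except d for each dimension-wise marginal
    return [post.sum(axis=tuple(ax for ax in range(1, D + 1) if ax != d + 1))
            for d in range(D)]
```

Each dimension-wise likelihood is evaluated only on its own $T$ grid points, which is the source of the savings: item-level probabilities never have to be touched on the full $T^D$ grid.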
Combination of Usage of Pseudo-Items and Dimension-Wise Likelihood Computation
Next, we combine the usage of pseudo-items with the dimension-wise likelihood computation. For simplicity, we assume an equal number of items per dimension, i.e., $I_d = I/D$. The total number of operations is then obtained by the same computations as in Equation 11, and the corresponding computational gain (Equation 23) follows analogously.
We now illustrate the computational gains obtained by applying our proposed computational shortcuts. We consider a multidimensional IRT model with between-item dimensionality and a fixed number of quadrature points per dimension. Sample sizes are varied, and the number of dimensions is chosen as 2, 3, or 4. Two computational shortcuts are compared. First, we use a strategy that only relies on the dimension-wise computation of the likelihood (Method DW). Second, we combine the dimension-wise computation with the pseudo-item approach, selecting the optimal pseudo-item size (Method DP).
In Table 3, computational gains based on Equation 23 are shown for the two methods as a function of the sample size and the number of dimensions. It can be seen that the gain of the dimension-wise method (Method DW) does not depend on the sample size but is substantial, especially for larger numbers of dimensions. For $D = 2$ dimensions, the additional usage of pseudo-items (Method DP) provides further relevant computational gains, while using pseudo-items for more dimensions ($D = 3$ or $D = 4$) is less effective. Given that the E-step in numerical integration for multidimensional IRT models is particularly computationally demanding, we think that our proposed computational shortcuts could be beneficial for multidimensional models with between-item dimensionality. However, it remains to be shown that the proposed shortcuts for multidimensional IRT models yield comparable computational gains in software implementations.
| Sample size | DW ($D=2$) | DP ($D=2$) | DW ($D=3$) | DP ($D=3$) | DW ($D=4$) | DP ($D=4$) |
|---|---|---|---|---|---|---|
| 1 000 000 | 10.3 | 28.3 | 22.5 | 24.8 | 19.9 | 20.0 |
| 10 000 000 | 10.3 | 29.1 | 22.5 | 24.8 | 19.9 | 20.0 |
Note. $D$ = number of dimensions; DW = computational gain by using dimension-wise computation; DP = computational gain by using dimension-wise computation and the pseudo-item size with maximum computational gain.
This note shows that by using two computational shortcuts, the number of operations required in the E-step for unidimensional or multidimensional models can be substantially reduced. First, the evaluation of individual posterior distributions can be implemented more efficiently by using pseudo-items that are composed of a set of original items. When pursuing this strategy, additional item response probabilities, evaluated at the quadrature grid, have to be computed for these pseudo-items. However, as a consequence, fewer multiplications have to be conducted in the computation of individual posterior distributions. It was shown that the optimal number of original items that are grouped into a pseudo-item is a function of the sample size. Second, for multidimensional IRT models with between-item dimensionality, EM algorithms that rely on numerical integration can be implemented more efficiently if the joint individual likelihood function is factored as a product of dimension-wise likelihood functions. This strategy avoids the evaluation of multiplications of likelihood contributions on the full multidimensional grid of trait values. The usage of pseudo-items can additionally be applied in multidimensional IRT models, which results in further computational gains.
In our evaluation, we only considered the case of dichotomous item responses. In principle, the same strategy can be applied to datasets containing polytomous item responses. However, using pseudo-items for polytomous items with many categories will typically not result in computational gains as substantial as those for dichotomous items because a larger number of probabilities has to be computed. The handling of ignorable missing item responses is relatively simple because one could treat them as a special case of polytomous item responses and set the corresponding probability of a missing category to one. In this strategy, missing item responses are essentially omitted from the computation. For planned missingness designs (Frey, Hartig, & Rupp, 2009; Rhemtulla & Little, 2012), more thoughtful approaches that skip non-administered items in the computations seem reasonable. Moreover, computational gains from using pseudo-items directly transfer to latent class or mixture models.
Our proposed computational shortcuts can be combined with parallel computing (von Davier, 2016) or acceleration techniques (Beisemann, Wartlick, & Doebler, 2020) in the EM algorithm. This would lead to additional computational gains, which are particularly important for applications involving large numbers of items. This note is primarily meant as a theoretical contribution that analyzes the number of required computational operations. The preliminary evaluation of performance gains obtained from implementations in statistical software led to slightly smaller computational gains in practice. However, these gains are nevertheless noticeable.