A Note on a Computationally Efficient Implementation of the EM Algorithm in Item Response Models

This note sketches two computational shortcuts for estimating unidimensional item response models and multidimensional item response models with between-item dimensionality utilizing an expectation-maximization (EM) algorithm that relies on numerical integration with fixed quadrature points. It is shown that the number of operations required in the E-step can be reduced in situations of many cases and many items by appropriate shortcuts. Consequently, software implementations of a modified E-step in the EM algorithm could benefit from gains in computation time.

Gains are realized by using so-called pseudo-items instead of original items. Third, we suggest a computational shortcut for multidimensional IRT models whose items possess a simple loading structure (i.e., between-item dimensionality). Finally, we conclude with a discussion.

EM Algorithm in IRT Models
For simplicity, we focus on dichotomous item responses x_{ni} in this paper, where n = 1, …, N denotes persons and i = 1, …, I denotes items. Let x_n = (x_{n1}, …, x_{nI}) denote the vector of observed data for person n. In the IRT model, it is assumed that the item response function (IRF) of item i depends on item parameters γ_i, and the multidimensional distribution of the latent variable θ depends on some parameter σ. Let p_i(x | θ; γ_i) denote the item response probability P(X_i = x | θ) of item i (i = 1, …, I). The total log-likelihood is then given as

l(\gamma, \sigma) = \sum_{n=1}^{N} \log \int \prod_{i=1}^{I} p_i(x_{ni} \mid \theta; \gamma_i) \, f(\theta; \sigma) \, d\theta ,   (1)

where f denotes the density of the latent variable θ. In the following, we approximate the integral in Equation 1 by a finite summation and use a set of fixed quadrature points θ_t (t = 1, …, T). The numerically approximated log-likelihood is then given as

l(\gamma, \sigma) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \prod_{i=1}^{I} p_i(x_{ni} \mid \theta_t; \gamma_i) \, w_\theta(\theta_t; \sigma) ,   (2)

where w_θ is a discrete density for θ. The latent variable θ cannot be directly observed. For this reason, the EM algorithm (see Aitkin, 2016; Fu, 2019; Glas, 2016, for overviews) is often used for maximizing l in Equation 2. The EM algorithm alternates between two steps (see Baker & Kim, 2004, for details). In the E-step, the expected value of l in Equation 2 is computed by integrating θ out using individual posterior distributions. In the M-step, the expected log-likelihood is maximized based on the expected counts computed in the E-step. Note that the update of the item parameters γ_i in the M-step amounts to a logistic regression with case weights.
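As an illustration, the approximated log-likelihood of Equation 2 can be evaluated on a fixed quadrature grid as in the following Python sketch (the paper's implementations use R/Rcpp; the 2PL item response function and all parameter values here are hypothetical choices for illustration only):

```python
import math

def irf(x, theta, a, b):
    """Hypothetical 2PL item response probability p_i(x | theta)."""
    p1 = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p1 if x == 1 else 1.0 - p1

def log_likelihood(X, items, grid, weights):
    """Equation 2: X is an N x I binary response matrix, items a list of
    (a, b) parameters, grid/weights a fixed quadrature rule."""
    total = 0.0
    for x_n in X:
        marginal = 0.0
        for theta_t, w_t in zip(grid, weights):
            # product over items done in log space, then exponentiated
            # (avoids numerical underflow for many items)
            log_lik = sum(math.log(irf(x, theta_t, a, b))
                          for x, (a, b) in zip(x_n, items))
            marginal += math.exp(log_lik) * w_t
        total += math.log(marginal)
    return total

# Toy usage with an equally weighted grid
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
weights = [0.2] * 5
X = [[1, 0, 1], [0, 0, 1]]
print(log_likelihood(X, items, grid, weights))
```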
In the E-step, individual posterior distributions h_n(θ_t | x_n) are computed as

h_n(\theta_t \mid x_n) = \frac{\prod_{i=1}^{I} p_i(x_{ni} \mid \theta_t; \gamma_i) \, w_\theta(\theta_t; \sigma)}{\sum_{u=1}^{T} \prod_{i=1}^{I} p_i(x_{ni} \mid \theta_u; \gamma_i) \, w_\theta(\theta_u; \sigma)} .   (3)

Computationally Efficient EM Algorithm
For all items, expected counts are computed as

e_i(x, \theta_t) = \sum_{n=1}^{N} h_n(\theta_t \mid x_n) \, 1\{x_{ni} = x\} , \quad x = 0, 1,   (4)

where 1 denotes the indicator function.
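The two E-step quantities of Equations 3 and 4 can be sketched as follows (Python illustration with hypothetical 2PL parameters; any IRF with the same signature would do):

```python
import math

def irf(x, theta, a, b):
    """Hypothetical 2PL item response probability."""
    p1 = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p1 if x == 1 else 1.0 - p1

def posteriors(X, items, grid, weights):
    """h_n(theta_t | x_n) for all persons n and grid points t (Equation 3)."""
    H = []
    for x_n in X:
        unnorm = [w_t * math.prod(irf(x, t, a, b)
                                  for x, (a, b) in zip(x_n, items))
                  for t, w_t in zip(grid, weights)]
        norm = sum(unnorm)
        H.append([u / norm for u in unnorm])
    return H

def expected_counts(X, H, i, x, t):
    """e_i(x, theta_t): add posterior mass of persons with x_ni = x
    (Equation 4); only subjects matching x contribute."""
    return sum(h_n[t] for x_n, h_n in zip(X, H) if x_n[i] == x)

items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
grid = [-1.0, 0.0, 1.0]
weights = [1 / 3] * 3
X = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
H = posteriors(X, items, grid, weights)
# Each individual posterior distribution sums to one
print([round(sum(h_n), 10) for h_n in H])
```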
We can now compute the total number of operations (multiplications or additions) needed in the E-step. The multiplications of probabilities in the numerator of the fraction in Equation 3 are typically replaced by additions of logarithmized probabilities, using the identity

\log \prod_{i=1}^{I} p_i(x_{ni} \mid \theta_t; \gamma_i) = \sum_{i=1}^{I} \log p_i(x_{ni} \mid \theta_t; \gamma_i) .   (5)

Notably, additions are typically faster than multiplications. Moreover, calculations on the logarithmic metric prevent numerical underflow (see, e.g., Vermunt, 2003, for an application in multilevel latent class models). First, N T I operations are needed for the computation of the individual posterior distributions (see Equation 3). In addition, N T operations are needed for normalizing the individual posterior distributions. However, we ignore this amount of operations in our calculations because we are particularly interested in situations in which I is large, say, larger than 50. Second, N T I operations are required for computing the expected counts in Equation 4. Note that in software implementations, the multiplication with the indicator function in Equation 4 need not be carried out explicitly, because one only has to add the posterior probabilities of subjects n with x_{ni} = x; in practice, one uses an index vector for storing the necessary information. In total, the number of operations O_1 needed in the E-step can be determined as

O_1 = 2 N T I .   (6)
A different strand of literature relies on simplifying the model structure in multidimensional IRT models (see Rijmen, Jeon, von Davier, & Rabe-Hesketh, 2014, for an overview). In this case, the computation of posterior distributions can be simplified if some trait distributions are (conditionally) independent (Rijmen, 2009). Examples include the bi-factor model (a.k.a. testlet models, Gibbons & Hedeker, 1992) and the two-tier model (Cai, 2010).

Pseudo-Items
In the following, the concept of pseudo-items is introduced. The main idea is to perform the computationally intensive likelihood computation on pseudo-items instead of original items in order to reduce computation time. A pseudo-item is composed of k original polytomous items. Assume that I = I_k · k for an integer k, so that there are I_k pseudo-items. The number k shall also be denoted as the size of the pseudo-items. More formally, let (x_1, …, x_k) denote a vector of k polytomous items, each possessing H categories. A pseudo-item y_j is then defined as

y_j = \sum_{h=1}^{k} H^{h-1} x_h .   (7)

The pseudo-item y_j takes the H^k integer values 0, 1, …, H^k − 1.
In the sequel, we only consider the case of dichotomous items (i.e., H = 2). For example, k = 3 original dichotomous items are recoded into a pseudo-item y that takes integer values 0, 1, …, 7. The original data consisting of I dichotomous item responses x_n = (x_{n1}, …, x_{nI}) for subjects n = 1, …, N are recoded into data y_n = (y_{n1}, …, y_{n,I_k}) consisting of polytomous pseudo-items.
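The recoding of k dichotomous items into one pseudo-item taking values 0, …, 2^k − 1 can be sketched as follows (Python illustration; the grouping into consecutive blocks of k items is an illustrative choice):

```python
def to_pseudo_items(x_n, k):
    """Recode a response vector of I = I_k * k dichotomous items into
    I_k pseudo-items, each encoding k original responses as one integer."""
    assert len(x_n) % k == 0
    pseudo = []
    for j in range(0, len(x_n), k):
        block = x_n[j:j + k]
        # binary encoding: y = sum over positions of 2^(h-1) * x_h, h = 1..k
        pseudo.append(sum(x * 2 ** h for h, x in enumerate(block)))
    return pseudo

x_n = [1, 0, 1, 0, 1, 1]          # I = 6 dichotomous responses
print(to_pseudo_items(x_n, 3))    # k = 3 -> two pseudo-items in 0..7: [5, 6]
```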

Using Pseudo-Items in the EM Algorithm
A computationally more efficient EM algorithm is obtained by evaluating the posterior distributions based on these pseudo-items. Instead of the I original items, only I_k pseudo-items enter the posterior, and the number of multiplications needed in the posterior is reduced. However, the item response probabilities of the pseudo-items have to be additionally precomputed, which requires I_k · 2^k · k · T = 2^k I T operations. Due to local independence, the item response probabilities of pseudo-item y_j are given as

\tilde{p}_j(y \mid \theta; \gamma) = \prod_{i \in I_j} p_i(x_i \mid \theta; \gamma_i) ,   (8)

where I_j denotes the set of original items composing pseudo-item j, and (x_1, …, x_k) is the response pattern corresponding to the value y.

The computation of the individual posterior distributions is now given as (inserting Equation 8 into Equation 3)

h_n(\theta_t \mid x_n) = h_n(\theta_t \mid y_n) = \frac{\prod_{j=1}^{I_k} \tilde{p}_j(y_{nj} \mid \theta_t; \gamma) \, w_\theta(\theta_t; \sigma)}{\sum_{u=1}^{T} \prod_{j=1}^{I_k} \tilde{p}_j(y_{nj} \mid \theta_u; \gamma) \, w_\theta(\theta_u; \sigma)} .   (9)

The number of operations needed in the posterior computation in Equation 9 is N T I_k = N T I / k. Expected counts can now be computed for the pseudo-items y_j:

\tilde{e}_j(y, \theta_t) = \sum_{n=1}^{N} h_n(\theta_t \mid y_n) \, 1\{y_{nj} = y\} , \quad y = 0, 1, …, 2^k − 1 .   (10)
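That the posterior based on precomputed pseudo-item probabilities (Equations 8 and 9) reproduces the original-item posterior of Equation 3 can be checked numerically, as in this Python sketch (hypothetical 2PL parameters; pseudo-item size k = 2):

```python
import math
from itertools import product

def irf(x, theta, a, b):
    """Hypothetical 2PL item response probability."""
    p1 = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p1 if x == 1 else 1.0 - p1

def normalize(unnorm):
    s = sum(unnorm)
    return [u / s for u in unnorm]

items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5), (1.1, 1.0)]
grid = [-1.0, 0.0, 1.0]
w = [1 / 3] * 3
k = 2
x_n = [1, 0, 0, 1]

# Posterior from the original items (Equation 3)
h_orig = normalize([w_t * math.prod(irf(x, t, a, b)
                                    for x, (a, b) in zip(x_n, items))
                    for t, w_t in zip(grid, w)])

# Precompute pseudo-item probabilities: for each block of k items and each
# grid point, one probability per response pattern (local independence)
blocks = [items[j:j + k] for j in range(0, len(items), k)]
ptilde = [{(t_idx, pat): math.prod(irf(x, t, a, b)
                                   for x, (a, b) in zip(pat, block))
           for t_idx, t in enumerate(grid)
           for pat in product((0, 1), repeat=k)}
          for block in blocks]

# Posterior from the pseudo-items (Equation 9)
y_n = [tuple(x_n[j:j + k]) for j in range(0, len(x_n), k)]
h_pseudo = normalize([w[t] * math.prod(ptilde[j][(t, pat)]
                                       for j, pat in enumerate(y_n))
                      for t in range(len(grid))])

print(max(abs(a - b) for a, b in zip(h_orig, h_pseudo)))  # numerically ~0
```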
Therefore, for each pseudo-item j and each trait grid point θ_t, 2^k expected counts are computed. Based on these counts \tilde{e}_j, the expected counts for item i can be computed by summing over the appropriate counts of the pseudo-items. The total number of operations needed for computing the expected counts is then given as N T I / k + 2^k I T, where the second term accounts for the precomputation of the pseudo-item probabilities. Hence, the total number of operations O_2(k) for computing individual posterior distributions and expected counts for items based on the approach using I_k pseudo-items is given as

O_2(k) = 2 N T I / k + 2^k I T .   (11)

We can now define the computational gain G(N, k) that describes the factor of decrease in computation time obtained by using pseudo-items:

G(N, k) = \frac{O_1}{O_2(k)} = \frac{2 N T I}{2 N T I / k + 2^k I T} = \frac{N}{N / k + 2^{k-1}} .   (12)

Hence, for a fixed pseudo-item size k, the maximum gain G(N, k) is k. It should be emphasized that the gain in the E-step depends neither on the number of items I nor on the number of integration points T. In Figure 1, the computational gain is shown as a function of the pseudo-item size k for a sample size of N = 1000. It can be seen that there is a maximum at k = 6 with a corresponding gain of about 5. This means that using pseudo-items defined by grouping 6 original items results in the highest efficiency gain. Notably, choosing k too large results in a very inefficient computation.

Figure 1. Computational Gain as a Function of Pseudo-Item Size k for a Sample Size of N = 1000
Due to lim_{k→∞} (N/k + 2^{k−1}) = ∞, it holds that lim_{k→∞} G(N, k) = 0. Hence, the size of the pseudo-items has to be optimally chosen, and choosing k too large can lead to an inefficient algorithm. Also, note that for fixed k, it holds that lim_{N→∞} G(N, k) = k.

Computation of an Optimal Pseudo-Item Size
We now determine the optimal pseudo-item size k_0 that maximizes the gain G(N, k) defined in Equation 12. For simplicity, the pseudo-item size k is now regarded as a continuous variable. Define g(k) = N k^{−1} + 2^{k−1}. For a fixed sample size N, a maximum of G(N, k) corresponds to a minimum of g(k). The derivative of g is given by g′(k) = −N k^{−2} + (\log 2) \, 2^{k−1}. Hence, the optimal value k_0 fulfills

N = (\log 2) \, k_0^2 \, 2^{k_0 − 1} .   (13)

Inserting Equation 13 into Equation 12 provides the maximum gain G_0 that can be achieved:

G_0 = G(N, k_0) = \frac{k_0}{1 + 1 / ((\log 2) \, k_0)} .   (14)

By taking the logarithm in Equation 13, it can be seen that the optimal k_0 fulfills

2 \log k_0 + k_0 \log 2 = \log N + \log 2 − \log \log 2 .   (15)
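The gain function of Equation 12 and its integer-valued maximizer can be evaluated directly, as in this short Python sketch (for N = 1000, the maximum should lie near k = 6 with a gain of about 5, matching Figure 1):

```python
def gain(N, k):
    """Computational gain G(N, k) = N / (N/k + 2^(k-1)) from Equation 12."""
    return N / (N / k + 2 ** (k - 1))

def best_k(N, k_max=30):
    """Integer pseudo-item size maximizing the gain for sample size N."""
    return max(range(1, k_max + 1), key=lambda k: gain(N, k))

N = 1000
k0 = best_k(N)
print(k0, round(gain(N, k0), 2))
```

As the analytical result k_0 = O(log N) suggests, `best_k` grows only slowly with the sample size.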
Because k increases faster than log k, the optimal k_0 is primarily a function of log N, i.e., k_0 = O(log N).

Analytical Illustration
We now illustrate the computational gains of the more efficient E-step implementation for sample sizes ranging between N = 100 and N = 10 000 000 subjects. In Table 1, the computational gain of using pseudo-items in the E-step is shown as a function of the sample size N using Equation 12. It can be seen that the maximum computational gain G_0 increases with increasing sample size. The increases in computational efficiency are indeed non-negligible, at least for applications in international large-scale assessment studies like PISA (OECD, 2017) that involve student sample sizes of about half a million.

Note. k_max = pseudo-item size that results in the maximum computational gain; G_0 = maximum computational gain.

Computational Illustration
We now demonstrate how the proposed shortcut using pseudo-items performs in an experimental implementation using the Rcpp package (Eddelbuettel, 2013; Eddelbuettel & François, 2011) in the R software (R Core Team, 2020). The E-step consists of three computational parts. First, the logarithms of the item response probabilities are added to the logarithm of the prior distribution; this yields the unnormalized likelihood on the logarithmic scale. Second, the exp operation is applied and the likelihood contributions are normalized for each subject to obtain the individual posterior distributions. Third, expected counts are computed based on the individual posterior distributions. The computational gains were investigated using sample sizes N = 100, 1000, 10000, and 100000. The number of items was fixed at 50, 100, and 200. In the EM algorithm, the number of quadrature points was fixed at 21. The R package microbenchmark (Mersmann, Beleites, Hurling, Friedman, & Ulrich, 2019) was used for timing comparisons. In each condition, 500 replications were utilized in the comparison.
In Table 2, computational gains are displayed for different pseudo-item sizes. It can be seen that efficiency gains were only about 2 or 3 when the complete E-step consisting of all three parts is used in the comparison. Notably, the exponentiation step and the normalization of the likelihood were not quantified in our analytically derived computational gains. Interestingly, it turned out that the predictions of computational gains were fulfilled for the computation of the unnormalized likelihood (without taking exp; first part) and the computation of expected counts (third part). The second part (exponentiation and normalization of the likelihood terms) is unrelated to the use of pseudo-items. When considering only the first part of the E-step, computational gains range between 3 and 10, meaning that this part is three to ten times faster than the previous implementation. For the third part, similar gains were obtained. Very likely, this finding demonstrates that applying the exponential function is computationally much more demanding than an addition or a multiplication.
By comparing the computational gains for the unnormalized likelihood with the prediction of Equation 12 in Table 2, we observe a moderate agreement of the maximum computational gains for sample sizes of at least 1 000. Differences between the timings obtained from implementations and the analytical predictions could also emerge from the computation time required for memory allocation or for accessing appropriate matrix entries in functions. We also investigated computational gains using 61 quadrature points, which is the default for unidimensional IRT models in the popular R package mirt (Chalmers, 2012). The findings for computational gains were similar, albeit not identical (i.e., some variation in gains of at most 1 in each condition). It has to be noted that the optimal pseudo-item size from Equation 12 does not coincide with the empirically obtained optimal pseudo-item size. While the computational gains in the computation of the unnormalized likelihood approximately match the analytical predictions (see Table 2), the optimal pseudo-item size became smaller in practical implementations. Depending on the specific software implementation, it is recommended to always check the potential computational gains as a function of the sample size, the number of items, and the number of quadrature points in order to determine the optimal pseudo-item size.

Table 2. Computational Gain in an Rcpp Implementation Using Pseudo-Items in the E-step as a Function of Pseudo-Item Size, Sample Size, and Number of Items
Overall, it has to be admitted that the proposed shortcut might not result in considerable gains in the E-step of the EM algorithm. However, the gains are comparable to those that would be obtained from parallel computing with only three or four cores.

Dimension-Wise Likelihood Computation in Multidimensional IRT Models with Between-Item Dimensionality
Assume now that in a multidimensional IRT model, each item loads on exactly one dimension of θ. This property is labeled between-item dimensionality (Adams, Wilson, & Wang, 1997). It is assumed that for each dimension d (d = 1, …, D), there are I_d items. In the case of between-item dimensionality, the item response probability depends on only one dimension. Hence, the total likelihood contribution of subject n can be factored as

L_n(\theta_t; x_n) = \prod_{d=1}^{D} L_{nd}(\theta_{td}; x_{nd}) ,

where x_{nd} denotes the responses of subject n to the items loading on dimension d, and L_{nd}(\theta_{td}; x_{nd}) = \prod_{i \in I_d} p_i(x_{ni} \mid \theta_{td}; \gamma_i). With Q fixed quadrature points per dimension, the full grid comprises T = Q^D points, so that N Q^D I operations are needed for computing the posterior distributions in the standard implementation. It turns out that in the computation of the posterior h_n(\theta_t \mid x_n) ∝ L_n(\theta_t; x_n) \, w_\theta(\theta_t; \sigma), the product term does not have to be evaluated on the full grid for all dimensions. Instead, one can rely on the factorization above and compute the dimension-wise likelihoods L_{nd} first.
Due to between-item dimensionality, the expected counts for an item i only have to be evaluated with respect to its dimension d. Hence, one can integrate over all dimensions except the d-th dimension. Using the notation θ_t = (θ_{td}, θ_{t,−d}), one can compute dimension-wise individual marginal posterior distributions h*_{nd}:

h^*_{nd}(\theta_{td} \mid x_n) = \sum_{-d} h_n(\theta_t \mid x_n) ,

where \sum_{-d} denotes a summation over all dimensions except d. Based on these marginal distributions, the computational demand of the expected counts is substantially reduced because, for each item, one only has to sum over individual probabilities related to exactly one dimension. By employing these computational shortcuts, the number of required operations can be reduced to approximately

O_{DW} = 2 N Q I + (D + 1) N Q^D .

For a large number of items I, the computational gain O_1 / O_{DW} approaches Q^{D−1}. Note that this computational gain is independent of the sample size N.
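The shortcut can be illustrated numerically for D = 2: the marginal posterior over one dimension computed from the full Q^D grid coincides with the one obtained from the dimension-wise likelihoods alone, because the likelihood of the other dimension cancels after normalization. A Python sketch (hypothetical 2PL parameters, an independent uniform prior per dimension):

```python
import math
from itertools import product

def irf(x, theta, a, b):
    """Hypothetical 2PL item response probability."""
    p1 = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p1 if x == 1 else 1.0 - p1

grid = [-1.0, 0.0, 1.0]                   # Q = 3 points per dimension
w = [1 / 3] * 3                           # uniform prior weights per dimension
items_d = [[(1.0, -0.5), (1.2, 0.0)],     # items loading on dimension 1
           [(0.8, 0.5), (1.1, 1.0)]]      # items loading on dimension 2
x_nd = [[1, 0], [0, 1]]                   # responses per dimension

# Dimension-wise likelihoods L_nd(theta_td; x_nd)
L = [[math.prod(irf(x, t, a, b) for x, (a, b) in zip(x_nd[d], items_d[d]))
      for t in grid] for d in range(2)]

# Full-grid posterior on the Q^2 grid, then marginalize over dimension 2
joint = {(t1, t2): L[0][t1] * L[1][t2] * w[t1] * w[t2]
         for t1, t2 in product(range(3), repeat=2)}
norm = sum(joint.values())
h_star_full = [sum(joint[(t1, t2)] for t2 in range(3)) / norm
               for t1 in range(3)]

# Shortcut: the marginal over dimension 1 only needs L_n1 and the prior,
# because sum_t2 L_n2(t2) w_t2 is a constant factor that cancels
unnorm = [L[0][t1] * w[t1] for t1 in range(3)]
h_star_fact = [u / sum(unnorm) for u in unnorm]

print(max(abs(a - b) for a, b in zip(h_star_full, h_star_fact)))  # ~0
```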

Combination of Usage of Pseudo-Items and Dimension-Wise Likelihood Computation
Next, we combine the usage of pseudo-items and the dimension-wise likelihood computation. For simplicity, we assume an equal number of items per dimension, i.e., I_d = I/D. By the same computations as in Equation 11, we then obtain

O_2(k) = 2 N Q I / k + (D + 1) N Q^D + 2^k I Q .   (23)
In the minimization of O_2 with respect to k, the same optimal value k_0 is obtained as in Equation 13. Inserting Equation 13 into Equation 23 leads to an optimal gain of

G_0 = G(N, k_0) = k_0 \, Q^{D−1} \, \frac{1}{1 + k_0 (D + 1) Q^{D−1} / (2I) + 1 / ((\log 2) \, k_0)} .   (24)
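The operation counts of Equation 23 can be compared against the baseline of the standard E-step, as in this Python sketch (the baseline O_1 = 2NQ^D I and the setting N = 100000, Q = 15, D = 2, I = 50 follow the analytical illustration below; treating k = 1 as the dimension-wise method without pseudo-items is an approximation, since it retains a negligible precomputation term):

```python
def ops_baseline(N, Q, D, I):
    """Standard E-step on the full Q^D grid: O_1 = 2 N Q^D I operations."""
    return 2 * N * Q ** D * I

def ops_combined(N, Q, D, I, k):
    """Equation 23: dimension-wise computation with pseudo-items of size k."""
    return 2 * N * Q * I / k + (D + 1) * N * Q ** D + 2 ** k * I * Q

def gain(N, Q, D, I, k):
    return ops_baseline(N, Q, D, I) / ops_combined(N, Q, D, I, k)

N, Q, D, I = 100000, 15, 2, 50
gain_dw = gain(N, Q, D, I, 1)                             # dimension-wise only
gain_dp = max(gain(N, Q, D, I, k) for k in range(1, 25))  # plus pseudo-items
print(round(gain_dw, 2), round(gain_dp, 2))
```

For large I, `gain_dw` approaches the bound Q^(D−1), and it is nearly unaffected by the sample size.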

Analytical Illustration
We now illustrate the computational gains obtained by applying our proposed computational shortcuts. We consider a multidimensional IRT model with between-item dimensionality with I = 50 items and Q = 15 fixed quadrature points per dimension. Sample sizes are varied between N = 100 and N = 10 000 000, and the number of dimensions D is chosen as 2, 3, or 4. Two computational shortcuts are compared. First, we use a strategy that only relies on the dimension-wise computation of the likelihood (Method DW). Second, we combine the dimension-wise computation with the pseudo-item approach that selects the optimal pseudo-item size (Method DP).
In Table 3, the computational gains of the two methods, based on Equation 23, are shown as a function of the sample size and the number of dimensions. It can be seen that the gain of the dimension-wise method (Method DW) is independent of the sample size but becomes substantial for larger numbers of dimensions. For D = 2 dimensions, the additional usage of pseudo-items (Method DP) provides relevant further computational gains, while using pseudo-items for more dimensions (D = 3 or D = 4) is not as effective. Given that the E-step of numerical integration for multidimensional IRT models is particularly computationally demanding, we think that our proposed computational shortcuts could be beneficial for multidimensional models with between-item dimensionality. However, it remains to be shown that the proposed shortcuts for multidimensional IRT models yield comparable computational gains in software implementations.

Note. D = number of dimensions; DW = computational gain by using dimension-wise computation; DP = computational gain by using dimension-wise computation and optimal pseudo-item size.

Discussion
This note shows that, by using two computational shortcuts, the number of operations required in the E-step for unidimensional or multidimensional models can be substantially reduced. First, the evaluation of individual posterior distributions can be implemented more efficiently by using pseudo-items that are composed of a set of original items. By pursuing this strategy, additional item response probabilities (evaluated at the θ grid) have to be computed for these pseudo-items. However, as a consequence, fewer multiplications have to be conducted in the computation of the individual posterior distributions. It was shown that the optimal number of original items grouped into a pseudo-item is a function of the sample size. Second, for multidimensional IRT models with between-item dimensionality, EM algorithms that rely on numerical integration can be implemented more efficiently if the joint individual likelihood function is factored into a product of dimension-wise likelihood functions. This strategy avoids evaluating products of likelihood contributions on the full multidimensional grid of trait values. The usage of pseudo-items can additionally be applied in multidimensional IRT models, which results in further computational gains.

In our evaluation, we only considered the case of dichotomous item responses. In principle, the same strategy can be applied to datasets containing polytomous item responses. However, using pseudo-items for polytomous items with many categories will typically not result in computational gains as substantial as for dichotomous items because a larger number of probabilities has to be computed. The handling of ignorable missing item responses is relatively simple because one could treat them as a special case of polytomous item responses and set the probability of a missing category to one. With this strategy, missing item responses are essentially omitted from the computation.
For planned missingness designs (Frey, Hartig, & Rupp, 2009; Rhemtulla & Little, 2012), more thoughtful approaches that skip items that were not administered seem reasonable. Moreover, the computational gains obtained by using pseudo-items transfer directly to latent class and mixture models.
Our proposed computational shortcuts can be combined with parallel computing (von Davier, 2016) or acceleration techniques (Beisemann, Wartlick, & Doebler, 2020) in the EM algorithm. This would lead to additional computational gains, which are particularly important for applications involving large numbers of items. This contribution is primarily meant as a theoretical analysis of the number of required computational operations. The preliminary evaluation of the performance of implementations in statistical software led to slightly smaller computational gains in practice. However, these gains are nevertheless noticeable.

Funding:
The author has no funding to report.