Chain Graph Reduction Into Power Chain Graphs

Reduction of graphs is a class of procedures used to decrease the dimensionality of a given graph in which the properties of the reduced graph are to be induced from the properties of the larger original graph. This paper introduces both a new method for reducing chain graphs to simpler directed acyclic graphs (DAGs), that we call power chain graphs (PCG), as well as a procedure for structure learning of this new type of graph from correlational data of a Gaussian graphical model. A definition for PCGs is given, directly followed by the reduction method. The structure learning procedure is a two-step approach: first, the correlation matrix is used to cluster the variables; and then, the averaged correlation matrix is used to discover the DAGs using the PC-stable algorithm. The results of simulations are provided to illustrate the theoretical proposal, which demonstrate initial evidence for the validity of our procedure to recover the structure of power chain graphs. The paper ends with a discussion regarding suggestions for future studies as well as some practical implications.

A graph is an ordered pair, G = (V, E), consisting of V, a set of nodes (also called vertices), and E, a set of edges, which are formally defined as pairs of nodes.Reduction, or simplification, of graphs is a class of methods used to reduce the dimensionality (defined as the total number of nodes and edges) of a graph.For a smaller graph to be considered a reduction of a larger graph, the properties of the reduced graph have to be induced from the properties of the larger original graph (Aref et al., 2014).In areas such as statistical learning, neuroscience, distributed control of networked dynamical systems, and others that use high-dimensional datasets (Xu et al., 2011), and in which the use of proxy variables is common, these procedures are of particular interest.
It appears from the literature on statistical and machine learning, as well as from the literature on probabilistic graphical models, that motif discovery (Hu et al., 2005), community detection (Javed et al., 2018) and other methods (e.g., Chao et al., 2019;Shah et al., 2017) are used as the main techniques to identify ways of combining variables or simplifying the graphical representation of a high-dimensional set of variables when the true graph structure is unknown.Graph reduction methods are used to reduce directed graphs or undirected graphs.For a third category of graphs, chain graphs, which utilize both directed and undirected edges, there is still no procedure to reduce them effectively.
The main objective of this research is to present a novel type of graph, called a power chain graph (PCG), which is a directed acyclic graph (DAG) derived from the reduction of a chain graph (CG; Cox & Wermuth, 1993), in which all nodes and undirected edges are contracted into a set of power nodes (i.e., nodes that represent groups of nodes), and all the original directed edges are contracted into a set of power edges (i.e., edges that represent groups of edges).Moreover, we propose a PCG structure learning procedure that can be used when the observed undirected graph is a Gaussian graphical model (GGM; Lauritzen, 1996;Uhler, 2017) and it is assumed that the true graph is a chain graph with unknown structure.Lastly, we evaluate the effectiveness of our PCG structure learning procedure by performing three simulation studies and an empirical example with real data.
The remainder of this paper is organized as follows.The next section presents a definition of PCGs and how to reduce CGs into PCGs.In the third section, we present a method for empirically learning the structures of PCGs.Following that, we conduct three simulation studies to test how well our structure learning procedure can recover the structures of simulated PCGs.Next, an empirical example is presented to illustrate how the PCG structure learning procedure can be used to derive causal hypotheses from the data.The paper concludes with a discussion and a few closing remarks.

Probabilistic Graphical Models, Graph Reduction and Power Chain Graphs
Probabilistic graphical models (Koller et al., 2009) are graphical representation of multi variate distributions that satisfy a set of conditional independence relations.In cases where the set of edges V and the set of nodes E are high dimensional, it may be of interest to represent the graph G in a reduced form.A graph H is called a graph minor (i.e., a reduction or an induced minor) of graph G if H can be formed from G by deleting edges and nodes, and by contracting edges (Lovász, 2006).The operation of edge contraction consists in removing an edge from a graph while simultaneously merging the two nodes that it previously connected (Asano & Hirata, 1983).
If G is an undirected graph (i.e., all edges in E are symmetric relations), then a power graph (Royer et al., 2008) PG = (V', E') is a reduction of a graph G, with edges contracted from motifs.Motifs are recurrent or statistically significant sub-graphs or patterns in a graph (Holland & Leinhardt, 1976).For PGs, V' is a set of power nodes (i.e., a set of contracted nodes), and E' is a set of power edges (i.e., a set of contracted edges).PGs do not require all edges to be contracted, and the PG derived from an undirected graph is also an undirected graph.
A similar approach is also used for contraction of directed graphs (i.e., graphs where all edges in E are directional, or causal, relations; Harary et al., 1965).A directed graph is a graph DG = (V, A), where A is a set of directed edges (also called arrows; Scutari & Denis, 2021).A node i is called a parent (or a cause) of j if the arrow from i to j is an element of A and a child (or an effect) of j if the arrow from j to i is an element of A. The reduction of a directed graph by contracting all of the nodes in each strongly connected component (i.e., a subgraph such as every node is reachable from every other node) always results in a directed acyclic graph (DAG; Lauritzen, 1996).A DAG is a type of directed graph such that there is no way to start at any node i and follow a consistently-directed sequence of edges that eventually goes back to i again.
A special type of graph known as chain graph (Lauritzen, 1996), defined as CG = (V, A, E), includes both a set of undirected edges E and a set of directed edges A. In the probabilistic graphical model literature, chain graphs are also known as mixed graphs with no semi-directed cycles and no bidirectional edges (Højsgaard et al., 2012), being thus a generalization of both undirected graphs and DAGs.CGs, like DAGs, can be used to make causal interpretations between variables connected by directed edges.However, the undirected edges in CGs can be interpreted in several different ways, depending on the assumed or real conditional dependencies and the presence or absence of latent variables.For instance, Lauritzen and Richardson (2002) used the example presented in Figure 1 to argue that undirected edges in CGs should not be interpreted as causal relations, unless extra information is presented.At the top of Figure 1   The DAGs and the DG on the bottom are commonly used to interpret, or statistically model, the undirected edge on the CG.However, none of these graphs result in the same conditional distribution, as the joint density of DAG1, DAG2, and DG are, respectively, represented by where ¬⫫ represents dependence.These relations show that CGs can, and should, be used when correlations have no clear causal interpretation.Despite this fact, it is not rare for causal models to be assumed, for instance, in psychometrics and fMRI studies, where the correlated data (e.g., test responses and voxels) are assumed to have a common latent cause (e.g., psychological construct and original neural signal).In these research areas, statistical modeling usually requires additional assumptions about how observed data relates to the latent causes (e.g., Mair, 2018;Roberts & Everson, 2001).
In the literature on CGs, when the structure is high dimensional, chain graphs can be divided in a path of blocks of variables, known as dependence chains (Wermuth & Lauritzen, 1990).Dependence chains are used to illustrate and properly condition causal effects in CGs because within blocks the relations are undirected, but between blocks the relations are directed.The interpretation of these dependence chains relies on what interpretation of CGs is assumed in the given study.There are three main interpretations of CGs, each with its own peculiarities (Javidian & Valtorta, 2018): (i) Lauritzen-Wer muth-Frydenberg (LWF) CGs; (ii) Andersson-Madigan-Perlman (AMP) CGs; and (iii) multivariate regression (MVR) CGs.
The first two, AMP and LWF CGs, are characterized by having directed and undir ected edges, but differ on how they establish independence relations between parents of vertices that are linked by an undirected edge.MVR CGs, on the other hand, use directed edges together with bidirectional edges.MVR CGs have more in common with AMP CGs, as each node depends only on its own parents according to both interpreta tions.Both interpretations can also be modelled with linear equations, being the main difference between MVR CGs and AMP CGs the modelling of noise (Sonntag, 2016).The bidirected edges in MVR CGs are used to model latent (or hidden) variables (Javidian & Valtorta, 2021), being related, then, to DAG1 represented in Figure 1.LWF CGs, on the other hand, assume that each node is affected by the parents of the whole block and, therefore, the causal effects propagate bidirectionally between undirected edges, being related, then, to DG represented in Figure 1.
Despite the specific CG interpretation, if all nodes in a subgraph (i.e., a group of variables) have directed relations with all nodes of another subgraph, the nodes within each subgraph can be considered as part of the same block, even if the block represents a disconnected subgraph (i.e., a graph which not all nodes are directly connected).On the other hand, if the blocks are defined by clusters of variables, and if it is possible to test or assume the disjointness condition (i.e., that it is possible to define the direction of the relations between blocks), then we have a special type of CG which can be reduced to a PCG, as illustrated in Figure 2. Despite the fact that the factorization of both the CG and the PCG in Figure 2 are identical, the PCG representation allows further inference about the clustering process of the predictor variables and the causal chains as a whole, without having to assume, a priori, a specific CG interpretation.
The undirected part of the chain graphs can be reduced by graph minor operations, while the directed part can be reduced by the contraction of an undirected equivalent of the strongly connected component to achieve a power chain graph (PCG).The PCG can be defined as the graph P = (V', A'), where all original nodes and undirected edges are contracted into a set of power nodes, V', and all the original directed edges are contrac ted into a set of directed power edges, A'. Figure 3 depicts this reduction process, where it is possible to see that a PCG is equivalent to a DAG that implies an underlying CG.MVR CGs can, at this point, again be contrasted with PCGs, in the sense that they also operate a type of graph reduction.However, MVR CGs assume that strongly connected subgraphs result from a latent common cause, which have to be marginalized from the original CG before causal discovery can occur (Javidian & Valtorta, 2018;Sonntag & Peña, 2015).PCG does not require this additional assumption.Note.X' is the power node representing the cluster (or community) of nodes x 1 , x 2 , and x 3 .Similar interpretation is given for power nodes Z' and Y'.
In the case of CGs, it has been shown that the structure of CGs, just like DAGs, can be, under the correct circumstances and assumptions, recovered from observational data (Peña & Gómez-Olmedo, 2016).A possible method for CGs is a PC-like algorithm, based on the Peter-Clark (PC) algorithm (Spirtes et al., 2000), which uses partial correlations Power Chain Graphs 6 and theorems of causality theory (Pearl, 2009) to infer a causal structure from the data.However, this PC-like approach for CGs, just as for DAGs, will recover, in most cases, only a CG which is in the same equivalent class as the graph that generated the data (Peña & Gómez-Olmedo, 2016).On top of that, as CGs are more complex and general than DAGs, these equivalence classes tend to contain more options than for DAGs, so it would be in interest of a researcher if some CG structures could be reduced to DAG structures by PCG reduction.
If the CG is given or assumed, then reducing it to a PCG requires only three steps.First, one must define or choose a measurement function for assessing the conditional dependencies between the nodes.Then, the strongly connected undirected subgraphs are theoretically defined or empirically estimated from data.In the final step, a method for ensembling the weights of the directed edges of similar nodes in A into directed power edges in A' is applied to the graph.However, more usually than not, the CG is not given nor should be assumed to be known.In this more realistic scenario, structure learning procedure (also known as causal discovery algorithm) can be used to first estimate the PCG and then the relations in the CG can be implied from the PCG.

Structure Learning of Power Chain Graphs
We propose a structure learning procedure for PCGs where the dependencies between variables are assumed to be linear and, therefore, the variables are assumed to follow a multivariate normal distribution.Gaussian graphical models (GGMs; Lauritzen, 1996;Uhler, 2017) are a usual decision for modeling network models in this scenario (e.g., Drton & Perlman, 2008;Epskamp et al., 2018;Yin & Li, 2011).GGMs are mathematical ly convenient for modeling linear dependencies and conditional linear dependencies between variables, as GGMs present maximal entropy configuration and the possibility for intuitive maximum likelihood estimation.
The first step of our structure learning procedure for learning the structure of PCGs is the identification of the groups of similar nodes by some data-driven procedure.In a chain graph that can be represented by a GGM, any node v a is more similar to any node v b than to any node v c if and only if λ ρ a , ρ b > λ ρ a , ρ c .The measure λ ρ a , ρ b assesses the similarity between the vectors of dependencies ρ a and ρ b .The vector of dependencies ρ a has as its elements the dependencies involving the v a node.The vector of dependencies ρ b is interpreted similarly to ρ a in respect to v b , and λ ρ a , ρ c is interpreted similarly to λ ρ a , ρ b in respect to ρ a and ρ c .With GGMs, Pearson correlation coefficient and regularized partial correlation are the most commonly used measures of dependency between nodes (Epskamp et al., 2018).Clusters (i.e., groups or communities) can then be found by conventional methods, such as exploratory factor analysis with parallel analysis (Horn, 1965), exploratory graph analysis (Golino & Epskamp, 2017), and motif discovery.
In the second step of our structure learning procedure, after finding the clusters of variables, the directions of the relations between power nodes can be estimated using causal discovery algorithms (Malinsky & Danks, 2018).More specifically, we propose that correlations between different clusters can be averaged and the resulting averaged correlation matrix can be fed to a causal discovery algorithm, resulting in the PCG.Because the steps in our structure learning procedure are complex and dependent on a series of intricate assumptions, we summarize the steps as well as the main assumptions in Table 1.This assumption is necessary so the weights of the edges estimated to be directed can be contracted by taking the arithmetic mean of the correlations between clusters.

Causal Discovery Causal Markov condition
A variable x is independent of every other variable (except x's effects) conditional on all of its direct causes.This assumption means that, in the absence of measurement error, all data generated from the same process will result in the same set of conditional dependencies.

Faithfulness
A graph is faithful to the data generating process if no conditional independence relations other than the ones entailed by the Markov condition are present.This assumption means that correlations are not spurious and lack of correlations are not due to suppression effects, but rather to true independence.

Causal sufficiency
The dataset includes all of the common causes of pairs of dependent variables in the dataset.This assumption means that there are no confounders (i.e., latent causes) that explain away the correlations between pairs of variables.

Multivariate
Gaussianity of variables Same as above.But in this case, this assumption makes it possible to test the significance of the conditional dependencies of triplets of variables.
Power Chain Graphs

Finding Similar Nodes and Contracting Nodes
The clustering procedure that we propose departs from three main elements: (i) our defi nition of similar nodes; (ii) the fact proved by Rao (1979) that correlations estimated by maximum likelihood from the same sample asymptotically follow a multivariate normal distribution; and (iii) the use of model-based clustering as an extension of block modeling (Breiger et al., 1975).From these elements, we propose that model-based clustering, parameterized with finite Gaussian mixture models, can be used to find the clusters of similar nodes from the correlation matrix estimated from data.In our application of mod el-based block modeling, which we call correlation Gaussian mixture model (CGMM), the estimated correlations p = p 1 , …, p k are assumed to be generated by a mixture model with C clusters: where f l p i θ l is the normal probability distribution with parameter θ l , and τ l is the probability of belonging to the l th cluster.The parameters of the model can be estimated by maximum likelihood using the Expectation-Maximization algorithm (Dempster et al., 1977), with model comparison methods used to find the value of C which better fits the estimated correlation matrix.In multivariate settings, the volume, shape, and orientation of the covariances can be constrained to be equal or to vary across clusters, resulting in 14 possible models with different geometric characteristics.A modified version of the Bayesian information criterion can be used to choose the best solution (Fraley & Raftery, 2007).
The clusters identified by our procedure are the strongly connected undirected subgraphs of the CG.By our definition of a PCG, nodes from different clusters can be connected only by directed edges and, if a node from a cluster is connected to a node in another cluster, all nodes in the first cluster are connected to all nodes in the second cluster, with the edges directed in the same direction.Based on the normality assumption, contraction of edges, resulting in the weighted power edges E′ lℎgj , can be conducted by calculating where N l and N ℎ represents the number of nodes included in the clusters l and h, respectively, z gj is the Fisher-transformed correlation of the gth variable of a cluster l with the jth variable of a cluster h, and r() is equal to Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383 where z' is equal to N ℎ z gj .This procedure will generate an C × C aver aged correlation matrix that, we propose, can be used to learn the direction of the directed power edges of a PCG using a causal discovery algorithm.

Finding the Direction of the Relations Between Power Nodes
After calculating the weights of the directed power edges (i.e., after contracting what was estimated to be directed relations), one can use causal discovery methods to estimate causal relations from correlational data (Pearl, 2009).Causal discovery from correlational data is based on the fact that different causal paths (i.e., causal data generating processes) result in different conditional distributions on a set of variables.This can be shown by the factorization of the three fundamental connections, represented in Figure 4: serial connection; divergent connection; and convergent connection (also known as collider or v-structure).The factorization of the probability distributions of these connections can be expressed, respectively, as serial connection Pr X i X j Pr X j Pr X k X j ; (9) and v-structure Pr X i Pr X j X i , X k Pr X k ; (10) it being easy to see that Equation 8 is equivalent to Equation 9.The fact that v-structures have a different factorization from the other types of connections implies that-assuming non-confounders and no cycles (i.e., mutual causa tion)-it is possible to identify causal effects even if the data is correlational, as long as v-structures, as expressed in Figure 4, can be found in the graph (Rohrer, 2018).However, it is important to note that the direction of some relations may not be identifiable due to the fact that some data generating processes entail the same dependence structurei.e., are part of the same equivalence class-and, therefore, result in the same set of conditional dependencies.For instance, serial and divergent connections are part of the same equivalence class and, therefore, it is impossible to distinguish these types of dependencies from correlational data alone.Therefore, causal discovery algorithms will usually return complete partially DAGs instead of DAGs, given that a complete partially Power Chain Graphs DAG express an equivalence class consisting of the same directed and undirected edges (Verma & Pearl, 1990).

Figure 4
Three Fundamental Connections Between Three Variables Note.X i , X j , and X k represent any three nodes, or power nodes, and the data generating processes, in terms of causal relations.
There exist three classes of approaches for learning causal graphs (i.e., structure learning) from data (Scutari & Denis, 2021): constraint-based; score-based; and hybrid.One of the first constraint-based algorithms to be implemented was the PC algorithm, proposed by Spirtes et al., (2000), which we use to learn the directions of the edges in the PCG.The output of the original PC algorithm depends on the order in which the possible v-structures are tested.A simple modification proposed by Colombo and Maathuis (2014), called PC-stable, yields order-independent adjacencies in the skeleton, which is the partial correlations generalized from a DAG.This means that PC-stable finds a partial correlations graph before trying to direct the edges, reducing the computational expense of testing the possible v-structures due to the size of the graph.
Score-based algorithms, on the other hand, apply heuristic optimization techniques to the problem of structure learning (Russell & Norvig, 2009).This class of algorithms assigns network scores (i.e., a goodness-of-fit index) to possible structures, which are then selected based on how well they fit the data.The greedy equivalence search (GES; Chickering, 2003) and the hill climbing (HC; Daly & Shen, 2007) algorithms use locally optimal choices at several iterations, until a solution is found.They have shown good performance when compared to the PC algorithm or others score-based algorithms (e.g., Russell & Norvig, 2009).However, they do not evaluate the existence of v-structures and, therefore, may be less conservative than constraint-based algorithms.
Hybrid learning algorithms, as the name may suggest, combine both constraint-based and score-based algorithms to trying to overcome the limitations of single class algo rithms (Friedman et al., 1999;Tsamardinos et al., 2006).One simple approach for a hybrid learning procedure would be, for example, to learn the skeleton of a DAG by means of any constraint-based algorithm (what is called a restriction phase), and then to direct the edges accordingly to a score-based algorithm (called maximization phase).The Max-Min Hill Climbing (MMHC; Tsamardinos et al., 2003) algorithm performs this exact procedure by combining both the PC-stable and the HC algorithms.
Despite the fact that the PC-stable is the most adequate algorithm to learn the directions of the edges in a PCG, as it can be applied using the averaged correlation matrix estimated from the first step of our structure learning procedure, the other algorithms can be used for "fine-tuning" the CG implied by the PCG.This means that other algorithms, or even the PC-stable itself, can be used, after learning the structure of PCG and generating the implied CG, to remove directed edges that may not hold true causal relations between nodes from different power nodes.Therefore, this step of "fine-tuning", despite not essential for learning the structure of a PCG, can shed a light on how to improve structure learning of CGs (Peña, 2018).
Finally, it should be noted that current procedures used for learning the structure of CGs, as well as the structure of DAGs, can only recover an equivalence class of directed, or partially directed, graphs.This fact will hold true even when the underlying distribution (i.e., the data generating process) is known, as any two CGs (or DAGs) that belong to the same equivalence class are causally indistinguishable (Ma et al., 2008;Peña & Gómez-Olmedo, 2016).This characteristic is not a limitation of these procedures, but a fundamental restriction due to the nature of the relations between causality and observed correlations (Pearl, 2009).Therefore, our procedure will also return complete partially DAGs whenever the underlying distribution results in a causally indistinguishable set of conditional dependences.

Method Data Generating Process (DGP)
The DGP demanded three characteristics to properly address our aim: variables should be normally distributed in groups of similar nodes; different groups should be causally related; and the simulated PCG should be the only element on its equivalence class (so it is causally distinguishable).The first two characteristics are simply characteristics of PCGs.The third characteristic guarantees that, in the absence of both unmeasured confounders and measurement error, there is only one best causal model to explain the dependencies in the data (Verma & Pearl, 1988).
The DGPs were based on three different GGM chain graphs, with their corresponding PCGs represented in Figure 5. GGMs are usually difficult to simulate as they require the correlation matrix to be positive definite (Williams & Mulder, 2020).To simplify the simulation of these DGPs, we used a factor analytical model (i.e., an MVR CG).For doing this, for each l th independent power node (V'1, V'2, V'4, and V'6), a vector f l of size n was randomly drawn from a normal distribution with mean 0 and standard deviation 1.The dependent power nodes (V'3, V'5, and V'7) were created by simply summing the vectors of the parents of the given power node (e.g., f 3 = f 1 + f 2 ).Observations x ij for the variables for each power node were created using a linear factorial model: where u j is the intercept for variable j (which was drawn from a normal distribution with mean 0 and standard deviation 1), λ jl is the correlation of variable j with the latent variable l (and these correlations were randomly drawn from a uniform distribution with lower bound .4 and upper bound .8),and ε ij is the random component for simulated individual i on variable j, which was also drawn from a normal distribution with mean 0 and standard deviation 1. DGP1 was chosen because it represents a v-structure, the simplest possible model to be unique in its equivalence class.Both DGP2 and DGP3 were built upon DGP1, increasing the number of nodes, but keeping the same nodes and directed edges, as well as the characteristic of being unique in its equivalence class.For instance, V'2 causes V'3 in all the DGPs.On the other hand, V'2 only causes V'5 in DGP2 and DGP3.This strategy reflects an important aspect of DAGs (Pearl, 2009): as long as confounders do not change the dependence relations in the graph, they can be estimated with efficiency by causal Franco, Barros, Wiberg, & Laros discovery algorithms.Therefore, even in the absence of some nodes, the algorithm applied to DGP1 should perform at least as well as when compared to its applications to DGP2 and DGP3.Two other conditions were used: the number of variables per power node (or cluster size; v = 5 or v = 10) and the sample size (n = 100, n = 250 or n = 500).Each triplet-composed by a DGP, a cluster size, and a sample size-of these conditions was simulated 1,000 times.

Performance Indices
Our proposed clustering procedure was compared to two other procedures: Exploratory Factor Analysis (EFA; Goretzko et al., 2021) with Parallel Analysis (PA; Horn, 1965), hereafter called PA-EFA; and the Exploratory Graph Analysis (EGA; Golino & Epskamp, 2017).The PA-EFA is used in many fields, such as test development in psychology and dimension reduction in machine learning.EGA is a promising new procedure that com bines two procedures from the probabilistic graphical models' framework for identifying the correct number of graph communities (i.e., clusters of variables): the regularized partial correlations GGM, and the walktrap algorithm for community detection.
For assessing the quality of the clustering procedures, we used cluster performance indices (Hennig & Liao, 2013).For comparing the structure learning algorithms, we used accuracy performance indices (Glaros & Kline, 1988).The accuracy indices were accuracy (Acc), positive predictive value (PPV), and true positive and true negative rates (TPR and TNR, respectively).The cluster indices were the graph adjusted Rand index (GARI; Terada & von Luxburg, 2014), variation of information (VI; Meilă, 2007), normalized mutual information (NMI; Haghighat et al., 2011), and whether the clustering procedure identified the correct number of dimensions (HitND; also known as cluster accuracy).The formulas for calculating these indices, their numerical ranges, and theoretical inter pretations are presented in Table 2.

Implementation of the Simulation
All simulations and data analyses were conducted in R version 4.2.0 (R Core Team, 2021).The simulations start by choosing a DGP scenario (i.e., a 3 × 2 × 3 factorial design with three possible DGPs, two possible cluster sizes, and three possible sample sizes, in a total of 18 conditions) and generating data accordingly.With these data, first the nodes/variables are clustered using one of the three clustering procedures (PA-EFA, EGA, or CGMM).Then, the performance of each procedure is evaluated by all the performance indices.Finally, using the true clustering (which is known by design) of the simulated observational data, the averaged correlation matrix is used in combination with the PCalgorithm to discover the direction of the edges.The performance of the PC-algorithm is then evaluated by all the accuracy-related performance indices.
For the cluster detection part of the simulation, the PA used was the one implemented in the paran package version 1.5.2 (Dinno, 2018) and the EFA, with oblimin rotation (to allow the latent variables to be correlated), implemented in the psych package version 2.2.5 (Revelle, 2019).EGA was estimated using the implementation in the EGAnet pack age, version 1.2.3 (Golino et al., 2022).For CGMM, the Gaussian mixture model for clus tering analysis was the one implemented in the mclust R package version 5.4.10 (Scrucca et al., 2016).The PC-algorithm used was the one implemented on the pcalg package version 2.7-6 (Kalisch et al., 2021).VI and NMI were calculated by the implementation in the fpc package version 2.2-9 (Hennig, 2018) and GARI by the implementation in the loe package version 1.1 (Terada & von Luxburg, 2016).All the other computations were made with authors' own code.The R codes with the full simulation are available at Franco et al. (2022).
The Results section is divided in three subsections.In the first subsection, we investi gate the performance of clustering algorithms for finding clusters (i.e., the subgroups of undirected edges) in a PCG setting.In the second subsection, we test the performance of algorithms in finding the directed edges of the PCG.We start with the assumption that the clusters are all known so the PC-stable performance is evaluated independently from the performance of the clustering procedure.The last table in the second subsection presents the results when we also consider an experimental setting in which the clusters are not assumed to be known and were first estimated with the clustering algorithms.At last, the application of fine-tuning is illustrated to evaluate the average change of density (i.e., the quantity of arrows kept) after the application of each of the alternative algorithms.

Comparison Between Clustering Procedures
Table 3 presents the results for all the 18 conditions.Bold values represent the method that had the best performance in the specific index and condition.It is possible to see

Power Chain Graphs 16
Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383 that EGA performed the best in 6 conditions, while CGMM performed the best in the other 12.The PA-EFA procedure was never the best procedure, showing performance indices considerably worse than the other two methods.Table 4 presents the results for the comparison between the clustering procedures aver aged over the three different DGPs' conditions represented in Figure 5.We used bold values to emphasize the better performing procedure at each condition.It is possible to see that, overall, CGMM performed better in the most complex DGPs in almost all indices, but EGA performed better in the simplest DGP.It is also possible to see that PA-EFA and CGMM had somewhat stable performances across the different DGPs, while the performance of EGA seemed to decay with the complexity of the DGP.This pattern is also shown in Figure 6, with the size of the points proportional to the standard errors.Table 5 shows the effect of the sample size (n, the number of simulated cases) on the per formance of the clustering procedures.Again, CGMM had the best performance over all conditions and over all indices.PA-EFA is again the worst performer over all conditions and indices.There also seems to be an influence of the sample size on the performance of the procedures: increasing from 100 cases to 250 cases will increase the performance notably.Increasing the sample from 250 to 500 also increases the performance, but by a lesser magnitude.These results are also shown in Figure 7.

Power Chain Graphs 20
Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383For evaluating the influence of the clusters' sizes (v, the number of nodes per cluster), results in Table 6 show, again, that overall CGMM outperforms the other procedures.But this time, EGA has the largest power (HitND) when the cluster size is equal to 10 variables per cluster.PA-EFA is, once more, the worst performing procedure in all the cases.Finally, increasing the cluster size will, in general, increase the performance of EGA and CGMM, but will decrease the performance of PA-EFA when measured by NMI.These results are also represented in Figure 8.

Power Chain Graphs 22
Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383 Finally, analyses of variance were conducted in order to assess how each condition affect the performance of the methods.We report only the partial eta squared effect size in Table 7, as our aim with this analysis is only to verify the tendency of effect of the conditions on the performance of each procedure.As the independent variables, we included the sample size (SS), the cluster size (CS), the data generating process (DGP), and all the higher order interactions.The results show that, in general, the conditions had only very small effects on the methods' performance.Only the DGP affected the performance in some of the indices, and only for the EGA and the CGMM.

Comparison Between Structure Learning Algorithms
Using the PC-stable algorithm with the averaged correlation matrix of the clusters and assuming the true clusters are known we generated the results shown in Table 8.The overall results, averaged over all conditions and shown in Table 9, show that the PC-sta ble algorithm will recover most of the arrows correctly.This means that all edges that should be identified as directed was identified as such, as well as their correct directions were also identified.We see that the DGP affects the efficiency of the algorithm, as the results are almost perfect when there are only three clusters of variables with a simple v-structure between them; i.e., DGP 1.  Regarding the effects of the sample size, the results are somewhat ambiguous: both the conditions with the largest sample and the condition with the smallest sample outperform the other in two indices.We also found that increasing the cluster size will improve the average performance of the algorithm.Analyses of variance were conducted in order to assess how each condition affect the performance of the PC-stable algorithm when the true clusters are known.Once again, we report only the partial eta squared effect size in Table 10, and the independent variables were the sample size (SS), the cluster size (CS), the data generating process (DGP), and all the higher order interactions.The results show that, again, the DGP affected the performance of the procedure, with larger effect sizes.The sample size and the two-way interaction between the sample size and the DGP also affected the performances, but less consistently.We performed the same simulation again, but this time without assuming the true clusters are known.After applying the PC-stable algorithm, we recovered the adjacency matrix for the induced chain graphs.From those, we calculated the accuracy for all the conditions, as shown in Table 11.Because PA-EFA had the worst performance in the previous analyses, it was not included in this analysis.As it was expected due to the results shown in Table 4, EGA performed better than CGMM in all the DGP1 conditions.However, CGMM performed better in all the other conditions.The results shown in Table 12 demonstrate that, in average, the CGMM performs better than the EGA procedure.Increasing both the cluster (averaged over sample sizes) and the sample (averaged over cluster sizes) size seems to improve the performance of both algorithms.However, caution should be taken as the cluster size increases, given that the sparsity of the chain graph also increases, which can result in overestimated effects.We conducted the final analyses of variance in order to assess how each condition affect the performance of the PC-stable algorithm when the true clusters are unknown.As usual, we report only the partial eta squared effect size in Table 13, and the independent variables were the sample size (SS), the cluster size (CS), the data generating process (DGP), and all the higher order interactions.The results show that the CGMM is strongly affected by the sample size.The EGA, on the other hand, seems to be affect by the sample size, the cluster size, and the DGP.The effects of the higher order interactions are also somewhat relevant, but comparably less meaningful.

Network Density After Fine-Tuning
The final analysis was used to evaluate how applying "fine-tuning" causal discovery algorithms would change the total number of arrows in the CG implied by the PCG.Table 14 shows our findings in terms of the sparsity of the CGs discovered by the algorithms, where sparsity is simply the false negative rates (i.e., 1 -TPR).The first algorithm column, named CG, is simply the CG implied by the PCG.Because of how our simulations were conducted, this column should always be the one with the lowest sparsity and, therefore, it is used as a benchmark for the other procedures.It is possible to see that the HC algorithm is always the one that keeps most of the arrows, followed by the PC-stable algorithm and finally, the MMHC algorithm.No best-worst comparison is adequate in this case.However, these results show that further tuning of the CG implied by the PCG will be considerably sparser and, therefore, causal analysis of CGs estimated by these procedures will need to be sensible to causal assumptions.

Empirical Example
For illustration purposes, the present empirical example will apply PCG to derive causal hypotheses related to the empathy construct.Davis (1980) proposed a multidimensional measurement of empathy, operationalized in a questionnaire composed by four seven-item subscales: perspective-taking; fantasy; empathic concern; and personal distress.Fantasy is the tendency to get involved in the actions of fictional characters from diverse media.Perspective-taking is the tendency to comprehend others' point of view.Empathic concern is the tendency of feeling concern and sympathy for people in distress.Personal distress is the tendency of feeling unease in difficult, tense or emotional situations.As usual in the psychological literature, the psychometric properties of the question naire were evaluated by Davis (1980) by means of exploratory factor analysis, without explicit assumptions regarding causal relations between the latent factors.This analysis resulted in what is usually considered as good evidence of validity (i.e., items theorized to be together were indeed clustered by the same latent factor) and reliability (i.e., Cronbach's alpha above 0.70).Briganti et al. (2018) applied the EGA procedure and found the same structure as the original study by Davis (1980).In the present example, we will apply our PCG procedure to the open dataset, originally from Braun et al. (2015), provided by Briganti et al. (2018).

Method Participants
The dataset is composed of 1,973 French-speaking students from different colleges, from a diverse set of courses.The age of the participants ranged from 17 to 25 years (M = 19.6 years, SD = 1.6 years), with 57% of these participants identifying themselves as females and the other 43% as males.From the original sample of 1,973 individuals, only 1,270 answered all the items of the questionnaire.Missing data was imputed by Briganti et al. (2018) using a maximum likelihood method based on the Gaussian graphical model that they fitted to the data.The items of the empathy measurement are displayed in Table 15, as well as the factors identified by the EGA procedure.

Analysis
Because the structure of the instrument was validated extensively in the literature, we proceeded by assuming that the true clusters are known.In the next step, the averaged correlation matrix was calculated and then the PC-stable algorithm was applied for discovering the directions of the relations in the PCG.Finally, the tuning algorithms were applied to decrease the density of the CG implied by the PCG, with the results displayed both graphically and with the percentage of removed relations textually described.The R code for doing these analyses is also included in the Supplementary Material.

Results
We calculated the averaged correlation matrix based on the clusters identified by Briganti et al. (2018).Then, we applied the PC-stable algorithm to this averaged correla tion matrix, resulting in the PCG shown to the left of Figure 9.It is possible to see that Note.V'1 = First power node, Personal distress.V'2 = Second power node, Fantasy.V'3 = Third power node, Perspective-taking.V'4 = Fourth power node, Empathic concern.V01 to V28 are the 28 variables according to Table 15.
For the final analysis, we proceeded to the fine-tuning procedure with the CG implied by the PCG.For doing that, first, we fixed the direction of the edges according to the estimated PCG.We also fixed the absence of relations between power nodes (e.g., no direct relations between the variables in V'3 and V'2).After that, the PC-stable, the HC, and the MMHC algorithms were applied.Fine-tuning with the PC-stable algorithm resulted in 74.49% of the edges between clusters removed.With the HC algorithm, 65.31% of the edges between clusters were removed.With the MMHC algorithm, 86.22% of the edges between clusters were removed.Each corresponding unweighted CG is displayed in Figure 10.From this point on, an empathy researcher could test any of the fine-tuned models by means of path analysis, a strategy we are not approaching here as it is beyond our scope.

Discussion
The main aim of the present study was to propose a procedure of structure learning of PCGs.The secondary aim of the study was to compare clustering algorithms and causal discovery algorithms that can be used for learning the structure of the PCGs.
To achieve these aims we used simulated data.To evaluate the practical value of our procedure, we did a full analysis of an empirical example.Our simulations showed that our procedure can properly recover both the clusters of variables, by means of the CGMM procedure, and the correct direction of the underlying causal relations, by means of averaged correlation matrix and the PC-stable algorithm.Our simulations have also illustrated how PCGs can be used to further investigate the structure of CGs, through tuning of the CG implied by the PCG using three different classes of causal discovery algorithms.To assess the performance of the CGMM procedure in finding the correct number of clusters, we used a number of accuracy related indices and cluster comparison indices.We also compared the CGMM performance with two other more traditional procedures in statistical analysis: PA-EFA and EGA.Our results show that, in general, CGMM out performed both procedures in a number of different conditions of sample sizes, number of variables per cluster, and the total number of clusters.These results are somewhat unexpected.Previous studies (e.g., Golino et al, 2020) have shown high accuracy for both PA-EFA and EGA procedures.However, PA-EFA performed very poorly and EGA performed poorer than expected (despite the fact that it performed better than the CGMM in some conditions).One possible reason for this result is the differences in design between our study and the previous studies.In our study, the simulation has an underlying causal structure that forces independence between some power nodes.Previous studies used correlated underlying structures.This difference causes data to be factorized in a different manner (Lauritzen, 1996), probably resulting in the differences of performance found in our study.This design difference also means that more studies are necessary to evaluate how well CGMM will perform when there are dependencies between all power nodes.
The PC-stable algorithm performed well in a number of different conditions, even when the clustering was not known a priori.Changing the sample sizes, number of variables per cluster and the total number of clusters had little to no clear effect in the accuracy related indices.The second part of the causal discovery analysis, which compared different tuning algorithms, has some implications for structure learning pro cedures of CGs (e.g., Drton & Perlman, 2008;Ma et al., 2008;Peña et al., 2014).Most of these procedures try to find the best fitting CG by brute force (Peña, 2018): fitting a large number of different models and choosing the best fitting one.Our results indicate that learning the structure using the full PCG procedure and then using its implied CG as an initial guess for the estimated true CG can enhance CG's structure learning procedures.
Regarding the empirical example, although it was used more as an illustration of our procedure than a true investigation of the causal structures of empathy as proposed by Davis (1980), our findings look promising.In sum, the results indicate that it is reasonable to infer that our own feelings of uneasiness in difficult situations and our capacity to comprehend others' point of view causes how we feel for people in distress.Also, how we feel for people in distress and our own feelings of uneasiness in difficult situations causes how we feel for fictional characters in distress.These inferences may have implications that might be of interest for replication and extension from researchers of empathy.For instance, longitudinal (e.g., Zhou et al., 2002) studies could be used to test if the causal structure holds in a developmental setting.Neurological (e.g., Singer et al., 2004) studies, on the other hand, could help to validate the estimated causal structure by identifying which areas of the brain are related to which factor, as well as if neural relations reflect the causal structure found in the present study.For researchers from other areas of study can also benefit from these results as they provide insight that maybe constructs first hypothesized to be only associational can actually present some identifiable underlying causal structure.
There are two main limitations of the present study.The first is the CGMM proce dure.Despite its performance in the simulation study, the results may be limited to the particular setting of the data generating process.Theoretical studies are necessary to better understand in what conditions CGMM may perform the best so the present results regarding CGMM can be properly generalized.However, this has little impact on the PCG procedure as a whole, as any clustering procedure can be used in place of the CGMM, as long as it respects the "similarity" condition for clustering the variables.The second main limitation is related to the averaging of the correlation matrix.Despite its relation with the definitions of a power edge and similarity between nodes, there is neither logical implication nor model constraint that requires for the power edges to be estimated like this.Therefore, it is also necessary for future studies to evaluate which procedures for estimating or calculating power edges can improve the performance of the PCG procedure.
Future studies on PCG can focus on both theoretical aspects of CGMM and how causal discovery algorithms can be extended for learning the causal structure from the averaged correlation matrix.For instance, what would happen if instead of correlations, measures of dependence (Paul & Shill, 2017;Zhao et al., 2016), such as the maximal information coefficient or distance correlations, were used in CGMM?What would be the impact of using partial distance correlations' tests (Székely & Rizzo, 2014) instead of partial correlations' tests in the PC-stable algorithm?Is it possible to incorporate a clustering procedure such as CGMM directly as a step in the score-based algorithms?These questions help to set the environment for prolific research about PCGs and all its constituent elements: clustering; causal discovery; and CG structure learning.
we present an observed CG, with the following factorization of the joint density of variables a, b, c, and d: a ⫫ b, a ⫫ d b, c , b ⫫ c a, d .(1)where ⫫ represents independence.

Figure 1 A
Figure 1 A CG (top) and Three Different DAGs That Have Different Factorizations

Figure 2
Figure 2Comparison Between Dependencies RepresentedWith PG, CG, and PCG

Figure 3 An
Figure 3 An Example of CG (to the left) and the PCG (to the right) Attained by Contracting the CG's Edges, Arrows, and Nodes Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383

Figure 5
Figure 5DGPs of the PCGs in the Present Study that the clusters being compared have completely different structures and the value of 1 means that they have exactly the same structure VI 2H ϑ, ϑ − H ϑ − H ϑ VI ≤ 2log(K*) Smaller values are preferred as they mean all variables that should be clustered together were properly clustered together Note.TP = the number of correctly identified edges that existed in the true graph.TN = the number of correctly identified lack of edges that existed in the true graph.FP = the number of incorrectly identified edges that existed in the true graph.TN = the number of incorrectly identified lack of edges that existed in the true graph.I = the mutual information function.ϑ = the vector of the estimated clusters for each node.ϑ = the vector of the true clusters for each node.H = the entropy function.R = the number of simulation iterations.Cor i = a binary indicator for each i th simulation indicating if the number of identified clusters was correct.M j = the number of nodes estimated to be clustered together.D = the hypergeometric function.K* = the maximum number of clusters.
Note.DGP = data generating process; PA-EFA = parallel analysis with exploratory factor analysis; EGA = exploratory graph analysis; CGMM = correlation Gaussian mixture model; GARI = graph adjusted Rand index; VI = variation of information; NMI = normalized mutual information; HitND = simulation rate of hits of number of clusters.

Figure 6
Figure 6Performance Comparison Between Different DGPs

Figure 7
Figure 7Performance Comparison Between Different Sample Sizes Note.PA-EFA = parallel analysis with exploratory factor analysis; EGA = exploratory graph analysis; CGMM = correlation Gaussian mixture model; GARI = graph adjusted Rand index; VI = variation of information; NMI = normalized mutual information; HitND = simulation rate of hits of number of clusters.

Figure 9
Figure 9Estimated Power Chain Graph (on the Left) and the Implied Chain Graph (on the Right)

Figure 10 The
Figure 10The Original CG Implied by the Estimated PCG (Top-Left), a CG Tuned by the PC-Stable Algorithm (Top-Right), a CG Tuned by the HC Algorithm (Bottom-Left) and a CG Tuned by the MMHC Algorithm(Bottom-Right) Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383

Table 1
Assumptions of Our Structure Learning Procedure for Power Chain Graphs

Table 2
Performance Indices Used in This Study for Comparing Algorithms Effectiveness

Table 3
Performance Comparison Over All Possible Combinations of Experimental Conditions Quantitative and Computational Methods inBehavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383Note.DGP = data generating process; PA-EFA = parallel analysis with exploratory factor analysis; EGA = exploratory graph analysis; CGMM = correlation Gaussian mixture model; GARI = graph adjusted Rand index; VI = variation of information; NMI = normalized mutual information; HitND = simulation rate of hits of number of clusters.Bold values indicate the best value per row.

Table 5
Performance Comparison Between Different Sample Sizes Notes: PA-EFA = parallel analysis with exploratory factor analysis; EGA = exploratory graph analysis; CGMM = correlation Gaussian mixture model; GARI = graph adjusted Rand index; VI = variation of information; NMI = normalized mutual information; HitND = simulation rate of hits of number of clusters.

Table 6
Performance Comparison Between Different Cluster Sizes

Table 7
ANOVA's Partial Eta Squared Effect Sizes for Clustering Procedures

Table 8
Performance Comparison Over All Possible Combinations of Experimental Conditions, When the Cluster Are Assumed to Be Known Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383

Table 9
Performance of the PC-Stable Algorithm Applied to Learning the Power Arrows of the PCG

Table 10
ANOVA's Partial Eta Squared Effect Sizes for Causal Discovering With Known True Clusters

Table 11
Performance Comparison Over All Possible Combinations of Experimental Conditions, When the Cluster Are Estimated

Table 12
Accuracy of the PC-Stable Algorithm Applied to Learning the Power Arrows of the PCG When the True ClustersAre notKnown, but Estimated With CGMM or EGA Quantitative and Computational Methods in Behavioral Sciences 2022, Article e8383, https://doi.org/10.5964/qcmb.8383