Structural equation modeling
From Wikipedia, the free encyclopedia
This article or section has multiple issues. Please help improve the article or discuss these issues on the talk page.

This article includes a list of references or external links, but its sources remain unclear because it lacks inline citations. Please improve this article by introducing more precise citations where appropriate. (April 2009) 
Structural equation modeling (SEM) is a statistical technique for testing and estimating causal relationships using a combination of statistical data and qualitative causal assumptions. This view of SEM was articulated by the geneticist Sewall Wright (1921), the economists Trygve Haavelmo (1943) and Herbert Simon (1953), and formally defined by Judea Pearl (2000) using a calculus of counterfactuals.
Structural Equation Models (SEM) encourages confirmatory rather than exploratory modeling; thus, it is suited to theory testing rather than theory development. It usually starts with a hypothesis, represents it as a model, operationalises the constructs of interest with a measurement instrument, and tests the model. The causal assumptions embedded in the model often have falsifiable implications which can be tested against the data. With an accepted theory or otherwise confirmed model, SEM can also be used inductively by specifying the model and using data to estimate the values of free parameters. Often the initial hypothesis requires adjustment in light of model evidence, but SEM is rarely used purely for exploration.
Among its strengths is the ability to model constructs as latent variables (variables which are not measured directly, but are estimated in the model from measured variables which are assumed to 'tap into' the latent variables). This allows the modeler to explicitly capture the unreliability of measurement in the model, which in theory allows the structural relations between latent variables to be accurately estimated. Factor analysis, path analysis and regression all represent special cases of SEM.
In SEM, the qualitative causal assumptions are represented by the missing variables in each equation, as well as vanishing covariances among some error terms. These assumptions are testable in experimental studies and must be confirmed judgmentally in observational studies.
An alternative technique for specifying Structural Equation Models using partial least squares path modeling has been implemented in software such as LVPLS (Latent Variable Partial Least Square), PLSGraph, SmartPLS (Ringle et al. 2005) and XLSTAT (Addinsoft, 2008). Some feel this is better suited to data exploration. More ambitiously, The TETRAD project aims to develop a way to automate the search for possible causal models from data.
Contents 
[edit] Steps in performing SEM analysis
[edit] Model specification
Since SEM is a confirmatory technique, the model must be specified correctly based on the type of analysis that the modeller is attempting to confirm. When building the correct model, the modeller uses two different kind of variables, namely exogenous and endogenous variables. The distinction between these two types of variables is whether the variable regresses on another variable or not. Like in a linear regression the dependent variable (DV) regresses on the independent variable (IV), meaning that the DV is being predicted by the IV. Within SEM modelling this means that the exogenous variable is the variable that another variable regresses on. Exogenous variables can be recognized in a graphical version of the model, as the variables sending out arrowheads, denoting which variable it is predicting. A variable that regresses on a variable is always an endogenous variable even if this same variable is used as an variable to be regressed on. Endogenous variables are recognized as the receivers of a arrowhead in the model. The fact that a variable can play a double role in a SEM model (independent as well dependent), makes that SEM is more useful than the linear regression, since instead of performing two regressions one SEM model will do. There are usually two main parts to SEM: the structural model showing potential causal dependencies between endogenous and exogenous variables, and the measurement model showing the relations between the latent variables and their indicators. Confirmatory factor analysis models, for example, contain only the measurement part; while path diagrams (to be distinct from linear regression) can be viewed as an SEM that only has the structural part. Specifying the model delineates causal (in fact counterfactual) relationships between variables that are thought to be possible (and therefore want to be 'free' to vary) and those relationships between variables that already have an estimated relationship, which can be gathered from previous studies (these relationships are 'fixed' in the model).
A modeller will often specify a set of theoretically plausible models in order to assess whether the model proposed is the best of the set. Not only must the modeller account for the theoretical reasons for building the model as it is, but the modeller must also take into account the number of data points and the number of parameters that the model must estimate to identify the model. An identified model is a model where a specific parameter value uniquely identifies the model, and no other equivalent formulation can be given by a different parameter value. A data point is a variable with observed scores, like a variable containing the scores on a question or the number of times respondents buy a car. The parameter is the value of interest, which might be a regression coefficient between the exogenous and the endogenous variable or the factor loading (regression coefficient between a indicator and its factor). If the number of data points is smaller than the number of estimated parameters an unidentified model is the result, since there are too few reference points to account for all the variance in the model. The solution is to constrain one of the paths to zero, which means that it is no longer part of the model.
[edit] Estimation of free parameters
Parameter estimation is done comparing the actual covariance matrices representing the relationships between variables and the estimated covariance matrices of the best fitting model. This is obtained through numerical maximization of a fit criterion as provided by maximum likelihood, weighted least squares or asymptotically distributionfree methods.
This is often accomplished by using a specialized SEM analysis program, such as SPSS' AMOS, EQS, LISREL, Mplus, Mx, the sem package in R, or SAS PROC CALIS (more information on SAS PROC CALIS: see UCLA or UCR).
[edit] Assessment of fit
Using a SEM analysis program, one can compare the estimated matrices representing the relationships between variables in the model to the actual matrices. Formal statistical tests and fit indices have been developed for these purposes. Individual parameters of the model can also be examined within the estimated model in order to see how well the proposed model fits the driving theory. Most, though not all, estimation methods make such tests of the model possible.
However, the model tests are only correct provided that the model is correct. Although this problem exists in all statistical hypothesis tests, its existence in SEM has led to a large body of discussion among SEM experts, leading to a large variety of different recommendations on the precise application of the various fit indices and hypothesis tests.
For each measure of fit, rules of thumb have evolved regarding what represents good fit between model and data. These rules of thumb often need to be updated based on contextual factors such as the sample size, the ratio of indicators to factors, and the overall size of the model. Measures of fit differ in several ways. Some of them reward more parsimonious models (i.e., those with more constrained parameters). Because different measures of fit capture different elements of the fit of the model, it is appropriate to report a selection of different fit measures.
Some of the more commonly used measures of fit include:
 ChiSquare: A fundamental measure of fit used in the calculation of many other fit measures. Conceptually it is a function of the sample size and the difference between the observed covariance matrix and the model covariance matrix.
 Root Mean Square Error of Approximation (RMSEA)
 Standardized Root Mean Residual (SRMR)
 Comparative Fit Index (CFI)
[edit] Model modification
The model may need to be modified in order to improve the fit, thereby estimating the most likely relationships between variables. Many programs provide modification indices which report the improvement in fit that results from adding an additional path to the model. Modifications that improve model fit are then flagged as potential changes that can be made to the model. In addition to improvements in model fit, it is important that the modifications also make theoretical sense.
[edit] Interpretation and communication
The model is then interpreted and claims about the constructs are made based on the best fitting model.
Caution should always be taken when making claims of causality even when experimentation or timeordered studies have been done. The term causal model must be understood to mean: "a model that conveys causal assumptions," not necessarily a model that produces validated causal conclusions. Collecting data at multiple time points and using an experimental or quasiexperimental design can help rule out certain rival hypotheses but even a randomized experiment cannot rule out all such threats to causal inference. Good fit by a model consistent with one causal hypothesis does not rule out equally good fit by another model consistent with a different causal hypothesis. However careful research design can help distinguish such rival hypotheses.
[edit] Replication and revalidation
All model modifications should be replicated and revalidated before interpreting and communicating the results.
[edit] Comparison to other methods
In machine learning, SEM may be viewed as a generalization of LinearGaussian Bayesian networks which drops the acyclicality constraint, i.e. which allows causal cycles.
[edit] Advanced uses
 Invariance
 Multiple group comparison: This is a technique for assessing whether certain aspects of a Structural Equation Model or Confirmatory Factor Analysis is the same across groups (e.g., gender, different cultures, test forms written in different languages, etc).
 Latent growth modeling
 Relations to other types of advanced models (hierarchical/multilevel models; item response theory models)
 Mixture model (latent class) SEM
 Alternative estimation and testing techniques
 Robust inference
 Interface with survey estimation
 MultiMethod MultiTrait Models
[edit] See also
 List of publications in statistics
 List of statistical topics
 List of statisticians
 Multivariate statistics
 Misuse of statistics
 Regression analysis
[edit] References
 Books
 Bartholomew, D J, and Knott, M (1999) Latent Variable Models and Factor Analysis Kendall's Library of Statistics, vol. 7. Arnold publishers, ISBN 034069243X
 Bollen, K A (1989). Structural Equations with Latent Variables. Wiley, ISBN 0471011711
 Bollen, K A, and Long, S J (1993) Testing Structural Equation Models. SAGE Focus Edition, vol. 154, ISBN 0803945078
 Byrne, B. M. (2001) Structural Equation Modeling with AMOS  Basic Concepts, Applications, and Programming.LEA, ISBN 0805841040
 Haavelmo, T. (1943) "The statistical implications of a system of simultaneous equations," Econometrica 11:12. Reprinted in D.F. Hendry and M.S. Morgan (Eds.), The Foundations of Econometric Analysis, Cambridge University Press, 477490, 1995.
 Hoyle, R H (ed) (1995) Structural Equation Modeling: Concepts, Issues, and Applications. SAGE, ISBN 0803953186
 Kaplan, D (2000) Structural Equation Modeling: Foundations and Extensions. SAGE, Advanced Quantitative Techniques in the Social Sciences series, vol. 10, ISBN 0761914072
 Kline, R. B. (2005) Principles and Practice of Structural Equation Modeling. The Guilford Press, ISBN 1572306904
 Pearl, Judea (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. ISBN 0 521 77362 8.
 Simon, Herbert (1953), "Causal ordering and identifiability", in Hood, W.C.; Koopmans, T.C., Studies in Econometric Method, New York: Wiley, pp. 49–74
 Wright, Sewall S. (1921). "Correlation of causation". Journal of Agricultural Research 20: 557–85.
 Software
 Addinsoft, Inc., (2008), "XLSTAT  PLSPM", Paris,
 Ringle, C. M./Wende, S./Will, A. (2005) SmartPLS 2.0 Beta. Hamburg,
[edit] Further reading
 Joreskog, K. and F. Yang (1996). Nonlinear structural equation models: The KennyJudd model with interaction effects. In G. Marcoulides and R. Schumacker, (eds.), Advanced structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage Publications.
[edit] External links
 EQS homepage at Multivariate Software
 Mx homepage of the crossplatform Mx software for Structural Equation Modeling.
 sem package for R
 SEMNET, the main mailing list
 Structural equation modeling page under David Garson's StatNotes, NCSU
 student version of AMOS software for Structural Equation Modeling
 Issues and Opinion on Structural Equation Modeling, SEM in IS Research
 Academic SEMWorkshops in Europe, Freie Universität Berlin, Germany
 The causal interpretation of structural equations (or SEM survival kit) by Judea Pearl 2000.