Abstract
In this article, the authors define a methodological framework for analyzing the relationship between state sequences and covariates. Inspired by the principles of analysis of variance, this approach looks at how the covariates explain the discrepancy of the sequences. The authors use the pairwise dissimilarities between sequences to determine the discrepancy, which makes it possible to develop a series of statistical significance–based analysis tools. They introduce generalized simple and multifactor discrepancy-based methods to test for differences between groups, a pseudo-R² for measuring the strength of sequence-covariate associations, and a generalized Levene statistic for testing differences in the within-group discrepancies. They also provide tools and plots for studying how these differences evolve along the time frame, and a regression tree method for discovering the most significant discriminant covariates and their interactions. In addition, the authors extend all methods to account for case weights. The scope of the proposed methodological framework is illustrated using a real-world sequence data set.
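The core idea, measuring discrepancy from pairwise dissimilarities and asking how much of it a grouping covariate explains, can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the definitions here are assumptions: the sum of squares of a set of n objects is taken as the sum of their pairwise dissimilarities divided by n (the usual distance-based ANOVA identity, with the dissimilarity playing the role of a squared distance), the pseudo-R² is the between-group share of the total, and significance is assessed by permuting group labels. The function names, the Hamming dissimilarity, and the toy sequences are hypothetical and are not taken from the article or from any software package.

```python
import random

def sum_of_squares(d, idx):
    """Assumed definition: (1/n) * sum of d[i][j] over all pairs i < j in idx."""
    n = len(idx)
    if n == 0:
        return 0.0
    total = sum(d[idx[a]][idx[b]] for a in range(n) for b in range(a + 1, n))
    return total / n

def discrepancy_anova(d, groups):
    """Return (pseudo_r2, pseudo_f) for one grouping covariate."""
    n = len(groups)
    labels = sorted(set(groups))
    m = len(labels)
    ss_total = sum_of_squares(d, list(range(n)))
    ss_within = sum(
        sum_of_squares(d, [i for i in range(n) if groups[i] == g]) for g in labels
    )
    ss_between = ss_total - ss_within
    pseudo_r2 = ss_between / ss_total
    pseudo_f = (ss_between / (m - 1)) / (ss_within / (n - m))
    return pseudo_r2, pseudo_f

def permutation_p_value(d, groups, n_perm=999, seed=0):
    """Share of label permutations whose pseudo-F reaches the observed one."""
    rng = random.Random(seed)
    _, f_obs = discrepancy_anova(d, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, f_perm = discrepancy_anova(d, shuffled)
        if f_perm >= f_obs:
            hits += 1
    # Count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)

def hamming(a, b):
    """Toy dissimilarity: number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy data: six state sequences and a two-level covariate.
seqs = ["AAAA", "AAAB", "AABA", "BBBB", "BBBA", "BBAB"]
groups = ["x", "x", "x", "y", "y", "y"]
d = [[hamming(a, b) for b in seqs] for a in seqs]

r2, f = discrepancy_anova(d, groups)
p = permutation_p_value(d, groups)
print(f"pseudo-R2 = {r2:.3f}, pseudo-F = {f:.3f}, permutation p = {p:.3f}")
```

In this toy case the covariate accounts for a bit more than half of the total discrepancy. Any dissimilarity measure between sequences could replace `hamming`, since the computations only use the matrix `d`.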
