School Leavers Study for Latent Code Identification Replication

Published: 18 November 2022 | Version 1 | DOI: 10.17632/gzhfdtmhcm.1
Manuel Gonzalez Canche


Essay database for replication purposes for the study: Latent Code Identification [LACOID]: A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias. Manuel S. González Canché. Accepted version available here:

Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. Addressing this challenge involves a multifaceted methodological conundrum: human reasoning in classification leads to deeper and more nuanced understandings, yet this same manual human classification comes with well-documented increases in classification inconsistencies and errors, particularly when dealing with vast numbers of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants' meanings or voices for two main reasons: (a) these classifications typically aggregate all text inputs into a single topic or code, and (b) the words configuring these texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool that address the following question: How can we classify vast amounts of qualitative evidence effectively and efficiently, without losing context or the original voices of our research participants, while leveraging the nuances that human reasoning brings to the qualitative and mixed methods analytic table?
This framework mirrors the line-by-line coding employed in human/manual code identification but relies on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and help recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the textual database required to replicate all the analyses. We hope this opportunity to become familiar with the analytic framework and software may result in expanded access to data science tools for analyzing qualitative evidence.

Replication steps and outcomes (pages 12 and 13 in the paper): first, download and extract the data from this repository.


Steps to reproduce

Software access: Mac users can download LACOID here; PC users here. Download and extract the data, then load it into the software.

1. Select the type of text decomposition to apply when uploading documents. The options are sentences or paragraphs.
2. Upload documents. The text decomposition is automatically applied based on the selection in step 1.
3. Execute text normalization and cleaning (i.e., text mining or natural language processing; see appendix).
4. Decide whether to remove or retain common words.
5. Select parameters for machine learning. Recommended values are 500 samples to establish baseline parameters (also known as burn-in samples) and a learning process based on 5,000 iterations (Raftery & Lewis, 1991); see appendix.
6. Execute metrics assessments to find the optimal number of codes. See appendix and applied example.
7. Select the optimal number of latent codes based on the results of the metrics executed in step 6. See applied example.
8. Execute the classification, which will render the number of codes selected in step 7. See applied example.

LACOID outputs

Before elaborating further on these steps, let us discuss the set of outputs that LACOID makes automatically available.

1. After step 6 is executed, LACOID generates a plot to ease the detection of the optimal number of codes.

After step 8 is executed, all the following outputs are generated:

2. An interactive distribution of the words configuring each latent code.
3. Two databases containing (a) all codes with the original and cleaned texts and (b) up to the top 20 most representative texts per latent code.
4. A statistical test of group-to-code association (see the Hypothesis test section below).
5. An interactive network visualization of text-to-code association and relevance (related to the statistical test mentioned in point 4).
6. A database measuring file-to-code strength that may be merged for posterior quantitative modeling.
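The machine learning parameters in step 5 (burn-in samples, then learning iterations) and the code assignment in step 8 are characteristic of Gibbs-sampled topic models in the LDA family. The paper does not disclose LACOID's internal implementation, so the following is only a minimal, illustrative collapsed Gibbs sampler showing what "burn-in" and "iterations" mean in this kind of model; all function names are hypothetical, and the tiny parameter values are for demonstration (the paper recommends 500 burn-in samples and 5,000 iterations).

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_codes, burn_in=20, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style latent-code model.

    `burn_in` draws let the chain stabilize before the `n_iter`
    learning iterations; both are the step-5 parameters in the text.
    """
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    # Random initial code assignment for every token.
    z = [[rng.randrange(n_codes) for _ in d] for d in docs]
    ndk = [[0] * n_codes for _ in docs]               # document-by-code counts
    nkw = [defaultdict(int) for _ in range(n_codes)]  # code-by-word counts
    nk = [0] * n_codes                                # tokens assigned per code
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(burn_in + n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional P(code = t | everything else).
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_codes)]
                k = rng.choices(range(n_codes), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Four tiny cleaned text units, two latent codes.
docs = [["school", "leave", "early"], ["school", "exam", "leave"],
        ["work", "job", "early"], ["work", "job", "exam"]]
ndk, nkw = gibbs_lda(docs, n_codes=2)
```

The returned document-by-code counts (`ndk`) play the role of the file-to-code strength measures listed among the outputs, and the code-by-word counts (`nkw`) correspond to the distribution of words configuring each latent code.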
In addition to these outputs, all texts that did not meet the inclusion criterion (by being too vague to convey meaning, as explained below) are also available for download and analysis. Although this database is added for transparency, it is not strictly an output of LACOID but a byproduct of the steps required to conduct LACOID.
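The exact inclusion criterion is explained later in the paper; one simple way such a screen can work is by setting aside text units with too few content words to convey meaning. The sketch below is an assumed illustration (the three-word threshold is hypothetical, not LACOID's actual rule), showing how excluded units can be retained for transparency rather than discarded.

```python
# Hypothetical inclusion screen over cleaned text units (lists of tokens).
# The threshold is illustrative; LACOID's actual criterion is in the paper.
MIN_CONTENT_WORDS = 3

def screen_units(units):
    """Split text units into included and excluded (too-vague) sets,
    keeping the excluded set available for download and inspection."""
    included = [u for u in units if len(u) >= MIN_CONTENT_WORDS]
    excluded = [u for u in units if len(u) < MIN_CONTENT_WORDS]
    return included, excluded

included, excluded = screen_units([["left", "school", "early"], ["yes"], []])
```

Keeping `excluded` as its own downloadable database mirrors the transparency byproduct described above: nothing is silently dropped from the analysis.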


University of Pennsylvania


Data Science, Machine Learning, Software Development, Applied Computer Science, Democratization, Textual Database, Qualitative Methodology, Text Mining, Mixed Research Method Design


Sage Foundation

Spencer Foundation

National Academy of Education