Structured Data on AI & Data Science Degree Programs from 21 Universities Across Five U.S. States
Description
This dataset is a normalized collection of information about Artificial Intelligence (AI) and Data Science (DS) programs offered by universities in the United States. It covers 21 universities across multiple states, both public and private, and was collected from the Integrated Postsecondary Education Data System (IPEDS) 2023 data cycle.
The 21 universities were selected because they explicitly list a Data Science major in IPEDS. Some recently established programs were not captured because they had not yet been recorded in the 2023 cycle. Well-known institutions such as Harvard and MIT also do not appear, not because they lack data science programs, but because they classify them under different department names or CIP codes, which places them outside a direct search for a Data Science major.
The dataset is organized into four main tables: University, Degree, Admission, and Graduation_Rate. The University table contains general information such as school name, location, and institution type. The Degree table covers program details, including CIP description, credits, and credential level. The Admission table includes application and enrollment counts, SAT score ranges, and in-state and out-of-state tuition. The Graduation_Rate table tracks student outcomes by cohort, including overall and gender-based graduation rates. One institution is an all-women school, so its male graduation rate is recorded as not applicable rather than missing.
The database is normalized to Third Normal Form to reduce redundancy and maintain consistency. The tables are linked through keys, so they can be joined for querying and analysis. The dataset can be used to compare universities, explore trends in AI and Data Science education, and examine differences in admissions, cost, and student outcomes across institutions.
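As a rough illustration of how the keyed tables might be queried, the sketch below joins a University table to an Admission table and compares average in-state tuition by institution type. The key name UniversityID, the column names, and all values are placeholders assumed for the example; they are not taken from the dataset.

```python
import pandas as pd

# Placeholder rows; the real tables and column names may differ.
university = pd.DataFrame({
    "UniversityID": [1, 2, 3],            # assumed key name
    "SchoolName": ["Example A", "Example B", "Example C"],
    "InstitutionType": ["Public", "Private", "Public"],
})
admission = pd.DataFrame({
    "UniversityID": [1, 2, 3],
    "InStateTuition": [10000, 40000, 12000],
    "OutOfStateTuition": [25000, 40000, 30000],
})

# Join on the shared key; validate raises if the key is duplicated on either side.
joined = university.merge(admission, on="UniversityID", validate="one_to_one")

# Average in-state tuition per institution type.
avg_by_type = joined.groupby("InstitutionType")["InStateTuition"].mean()
print(avg_by_type)
```

The same pattern would extend to the Degree and Graduation_Rate tables, assuming they carry the same key.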
Steps to reproduce
This project collected institutional data from the Integrated Postsecondary Education Data System (IPEDS), a national database maintained by the National Center for Education Statistics (NCES). Three files were used, all from the 2023 data cycle: the Admissions and Test Scores file (ADM2023_RV.csv), the Academic Year Tuition file (IC2023_AY.csv), and the Graduation Rates file (GR2023_RV.csv).
The 21 universities were selected because IPEDS records them as offering a Data Science major. During selection, some institutions with recently established programs did not appear in the records. Notable universities such as Harvard and MIT were also absent, not because they lack data science programs, but because they list them under different classification codes or department names, so they do not appear in a direct search for a Data Science major. The final sample therefore represents institutions that explicitly list Data Science under that classification in IPEDS.
Data merging was first attempted with Microsoft Excel's XLOOKUP function, which searches a range for a value and returns the corresponding result from another range using a shared key, in this case the UnitID. However, XLOOKUP repeatedly returned #N/A errors because Excel imported the CSV files with inconsistent data types, treating the same UnitID values as text in some files and as numbers in others. Several fixes were tried, including TEXT(), VALUE(), MATCH() diagnostics, and Text to Columns, but the issue persisted.
The process was then moved to Python with the pandas library, where column data types can be set explicitly and kept consistent across files. Each IPEDS file was filtered with the isin() method to keep only the 21 target institutions. Admissions columns were mapped to TotalApplicants, TotalEnrolled, SAT_EBRW_25th, and SAT_EBRW_75th. Tuition columns TUITION2 and TUITION3 were mapped to in-state and out-of-state tuition.
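The filtering and column mapping described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the in-memory CSV text stands in for ADM2023_RV.csv and IC2023_AY.csv, the UnitIDs and values are placeholders, the raw admissions column names (APPLCN, ENRLT, SATVR25, SATVR75) and the output names InStateTuition and OutOfStateTuition are assumptions, and load_filtered is a hypothetical helper. Reading UNITID as a string in every file is one way to avoid the text-versus-number mismatch that broke the Excel lookup.

```python
import io
import pandas as pd

# Placeholder UnitIDs standing in for the 21 target institutions.
TARGET_IDS = ["111111", "333333"]

# Tiny stand-ins for the IPEDS files; all values are made up.
adm_csv = io.StringIO(
    "UNITID,APPLCN,ENRLT,SATVR25,SATVR75\n"
    "111111,5000,600,600,700\n"
    "222222,9000,2000,500,600\n"
    "333333,3000,400,550,650\n"
)
tuition_csv = io.StringIO(
    "UNITID,TUITION2,TUITION3\n"
    "111111,12000,30000\n"
    "222222,10000,25000\n"
    "333333,45000,45000\n"
)

def load_filtered(csv_source, rename_map):
    """Read one file, keep the target institutions, and rename columns."""
    # Force the key to text so it has the same type in every file.
    df = pd.read_csv(csv_source, dtype={"UNITID": str})
    df = df[df["UNITID"].isin(TARGET_IDS)]
    return df[["UNITID", *rename_map]].rename(columns=rename_map)

adm = load_filtered(adm_csv, {
    "APPLCN": "TotalApplicants",      # assumed raw column names
    "ENRLT": "TotalEnrolled",
    "SATVR25": "SAT_EBRW_25th",
    "SATVR75": "SAT_EBRW_75th",
})
tuition = load_filtered(tuition_csv, {
    "TUITION2": "InStateTuition",     # output names are assumptions
    "TUITION3": "OutOfStateTuition",
})

# Merge on the shared key, the pandas counterpart of the XLOOKUP step.
merged = adm.merge(tuition, on="UNITID", how="left")
print(merged)
```

With real files, the io.StringIO objects would be replaced by the CSV paths, and TARGET_IDS by the list of 21 UnitIDs.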
Graduation rates required additional filtering by GRTYPE and CHRTSTAT to isolate the correct cohort and completer rows. Each rate was then calculated as a percentage by dividing the number of completers by the cohort size. One notable finding was that UnitID 166939 had no male graduation rate data. This institution is an all-women school, so the value is not applicable rather than missing. It was left blank in the final dataset to reflect this accurately.
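A minimal sketch of the graduation-rate calculation is shown below, assuming a long-format file with one row per institution and cohort status. The GRTYPE and CHRTSTAT code values (2/12 for the cohort, 3/13 for completers), the count columns GRTOTLT, GRTOTLM, and GRTOTLW, the output column names, and all numbers are assumptions for illustration and should be checked against the IPEDS data dictionary. The second placeholder institution has no men in its cohort, so its male rate comes out blank (NaN) rather than zero.

```python
import pandas as pd

# Made-up rows: one cohort row and one completer row per institution.
gr = pd.DataFrame({
    "UNITID":   ["111111", "111111", "222222", "222222"],
    "GRTYPE":   [2, 3, 2, 3],           # assumed codes
    "CHRTSTAT": [12, 13, 12, 13],       # assumed codes
    "GRTOTLT":  [200, 150, 100, 80],    # total
    "GRTOTLM":  [100, 70, 0, 0],        # men
    "GRTOTLW":  [100, 80, 100, 80],     # women
})
count_cols = ["GRTOTLT", "GRTOTLM", "GRTOTLW"]

# Isolate cohort and completer rows, indexed by institution.
cohort = gr[(gr["GRTYPE"] == 2) & (gr["CHRTSTAT"] == 12)].set_index("UNITID")[count_cols]
completers = gr[(gr["GRTYPE"] == 3) & (gr["CHRTSTAT"] == 13)].set_index("UNITID")[count_cols]

# Blank out empty cohorts so the rate is NaN (not applicable) instead of 0/0.
cohort = cohort.where(cohort > 0)

# Completers divided by cohort size, expressed as a percentage.
rates = (completers / cohort * 100).round(1)
rates.columns = ["GradRate_Total", "GradRate_Men", "GradRate_Women"]
print(rates)
```

Masking empty cohorts before dividing keeps "not applicable" distinct from a true rate of zero, which matches how the blank male rate was handled in the final dataset.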
Institutions
- Wentworth Institute of Technology, Boston, Massachusetts