Artificial Intelligence in the Public Sector Database
Description
This Final_Deliverable.xlsx dataset contains the metadata of a sample of N=1,923 academic texts related to the use of Artificial Intelligence in the Public Sector. Its contents are split across the tabs below:

Datamap: provides a complete list of metadata variables, including their codes, definitions, and sources (i.e., OpenAlex, BERTopic, or manually created).

Full_Results: contains the full metadata for all N=1,923 academic texts.

Topic_Counts_Framework_Assign: contains variables that document how we renamed the BERTopic-generated topics and classified/integrated them into our functional framework categories (based on identifying representative documents and on qualitative review).

Intertopic_map: corresponds to Table 2 in the manuscript.

Cosine_Similarity_Matrix: shows pairwise cosine similarity scores for all 22 topics generated by BERTopic. A variation of this matrix appears as Figure 6 in the manuscript.

Hyperparameter_Combos: records the 720 hyperparameter combinations we tested while fitting our BERTopic model, sorted in descending order by the variable 'coherence_score'. Our final hyperparameter combination is in Record 2.
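For readers who want to recompute the Cosine_Similarity_Matrix tab, the sketch below shows one common way to derive pairwise topic similarities from a fitted BERTopic model. It is a minimal sketch, assuming a fitted `bertopic.BERTopic` instance named `topic_model` is already in scope; the variable names and output path are illustrative and not taken from the project scripts.

```python
# Minimal sketch: pairwise cosine similarity between topic embeddings,
# assuming `topic_model` is an already-fitted bertopic.BERTopic instance.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# topic_embeddings_ holds one embedding per topic; when outliers are present,
# row 0 belongs to the -1 outlier topic, so it is dropped before comparing.
embeddings = topic_model.topic_embeddings_[1:]
labels = [topic_model.topic_labels_[t]
          for t in sorted(topic_model.get_topics()) if t != -1]

sim = cosine_similarity(embeddings)  # (n_topics, n_topics) values in [-1, 1]
sim_df = pd.DataFrame(sim, index=labels, columns=labels)
sim_df.to_excel("cosine_similarity_matrix.xlsx")  # illustrative output path
```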
Steps to reproduce
To reproduce this research, follow the steps below and reference the scripts in this GitHub repository: https://github.com/Zandermintz/AI_in_public_sector

The repository has two parts: (1) Python scripts and (2) Excel and CSV files to reference or use for troubleshooting. **Download the contents of both onto your local machine and combine them into a single working directory before proceeding.**

1. Open and download final_results_corpus_building.ipynb. Be sure to also install the dependencies listed in requirements.txt (e.g., pip install -r requirements.txt). For an illustration of the kind of OpenAlex calls this notebook makes, see the first sketch after this list.
2. Download and save all related Excel files in the same directory you will be working out of, in case the script fails to produce these outputs: Base_Corpus_QCed_N99.xlsx, all_citations_final.csv, all_citations_works_74_75.csv, final_citations_en_abstracts_dedupe.xlsx, final_citations_en_abstracts_dedupe_clean.xlsx, full_corpus_filtered_LLM_v9.xlsx, final_modeling_data.xlsx, merged_final_corpus.xlsx.
3. Follow the instructions within the .ipynb up to step IV, Filtering via Llama3, as you will need access to a remote server to complete that step.
4. After setting up your remote server (we used the Texas Advanced Computing Center's Chameleon Cloud), follow the steps outlined in final_results_corpus_building.ipynb for this section by running the process_abstracts_v9.py script.
5. Once you have your output (expect roughly 8-10 hours of runtime), return to final_results_corpus_building.ipynb and follow the instructions there.
6. To reproduce how we arrived at our final hyperparameter combination, access a remote server and run BERTopic_gridsearch.ipynb before executing the 'Topic Modeling with BERT' code section. A simplified sketch of this grid search also follows this list.
7. Return to final_results_corpus_building.ipynb and run the 'BERTopic_gridsearch.ipynb' chunk, making sure to read the instructions.
8. The final output should be final_topics_to_be_classified_outliers_integrated.xlsx, which serves as the foundation of the Full_Results sheet within the Final_Deliverable.xlsx file, i.e., our database.
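Step 1 builds the corpus from OpenAlex metadata. As a hedged illustration of the kind of call involved (the exact queries, seed works, and pagination live in final_results_corpus_building.ipynb), the sketch below fetches works that cite a placeholder seed paper and reconstructs each abstract from OpenAlex's inverted index; `seed_id` and the helper name are hypothetical.

```python
# Hedged sketch of one OpenAlex request: works citing a seed paper, with
# abstracts rebuilt from the API's inverted index. seed_id is a placeholder.
import requests

def rebuild_abstract(inverted_index):
    """OpenAlex ships abstracts as {word: [positions]}; flatten back to text."""
    if not inverted_index:
        return ""
    positions = [(pos, word)
                 for word, posns in inverted_index.items() for pos in posns]
    return " ".join(word for _, word in sorted(positions))

seed_id = "W0000000000"  # placeholder OpenAlex work ID
resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": f"cites:{seed_id}", "per-page": 200},
    timeout=30,
)
resp.raise_for_status()
for work in resp.json()["results"]:
    print(work["id"], work["title"])
    print(rebuild_abstract(work.get("abstract_inverted_index")))
```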
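Steps 6-7 refer to the hyperparameter search whose 720 combinations are recorded in the Hyperparameter_Combos tab. The sketch below shows the general pattern: fit BERTopic under each combination, score the resulting topics with a coherence metric, and rank combinations in descending order of 'coherence_score'. The two-parameter grid and the c_v coherence choice here are assumptions for illustration only; the project's actual grid is defined in BERTopic_gridsearch.ipynb.

```python
# Simplified grid-search sketch; assumes `docs` (a list of abstract strings)
# is already loaded. The grid and the c_v coherence metric are illustrative.
import itertools
import pandas as pd
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

grid = {
    "n_neighbors": [10, 15, 25],       # UMAP neighborhood size
    "min_cluster_size": [20, 40, 60],  # HDBSCAN minimum cluster size
}

tokenized = [doc.lower().split() for doc in docs]
dictionary = Dictionary(tokenized)

results = []
for n_neighbors, min_cluster_size in itertools.product(*grid.values()):
    topic_model = BERTopic(
        umap_model=UMAP(n_neighbors=n_neighbors, random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size),
    )
    topic_model.fit_transform(docs)

    # Top words per non-outlier topic feed the coherence computation.
    topic_words = [[word for word, _ in topic_model.get_topic(t)]
                   for t in topic_model.get_topics() if t != -1]
    coherence = CoherenceModel(topics=topic_words, texts=tokenized,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    results.append({"n_neighbors": n_neighbors,
                    "min_cluster_size": min_cluster_size,
                    "coherence_score": coherence})

# Mirror the tab's layout: combinations sorted descending by coherence_score.
ranking = pd.DataFrame(results).sort_values("coherence_score", ascending=False)
```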