Calculating Hazard Indices for PFAS in California's Drinking Water
Description
Per- and polyfluoroalkyl substances (PFAS) are a group of synthetic chemicals widely used for their water- and grease-resistant properties. PFAS compounds have strong chemical bonds, consisting of fluorine, carbon, oxygen, and hydrogen, which vary in chain lengths. Their persistence and potential adverse health effects have made them significant environmental pollutants. This data investigates the extent and impact of PFAS contamination across California by calculating the Hazard Index from Geotracker, California State Water Resources Control Board's data management system. The dataset was processed to generate an extensive dataset with latitude and hazard index within California consisting of 1) data sorting, 2) splitting data, 3) Location Specific Manipulations, 4) Hazard Index Calculations, and 5) Aggregated Hazard Index Calculations. Python was used with the Pandas library to analyze, manipulate, and process the data. The open source library in Python, xlswriter was then utilized to read and write the data into Excel files.
Files
Steps to reproduce
First, the original Excel file is loaded in the pandas DataFrame using the pd.read_excel( ) function. Later, the DataFrame is sorted by "Latitude," "Global ID," "Chemical Abbreviation," "Location ID," & "Date" to ensure a systematic organization. After sorting of the data within the DataFrame it was exported to a different Excel File, named, “Sorted_ALL_PFAS_CHEMICALS_rc.xlsx”. Then, the data was to processed by chemical type in column, “Chemical Abbreviation” within the newly generated Excel file, “Sorted_ALL_PFAS_CHEMICALS_rc.xlsx” individually. Hence, for each unique abbreviation in the column “Chemical Abbreviation,” a separate sheet is created in a new Excel file named Chemical_Specific_Data.xlsx. The “Chemical Abbreviation” column consists of six unique chemicals that include HFPA-DA, PFBSA, PFHXSA, PFNA, PFOA, and PFOS. The next step involves independently computing location-specific statistical analysis for each sheet in Chemical_Specific_Data.xlsx, i.e., for each chemical abbreviation. The following statistical metrics were computed for the “Value” column for each “Location ID,” which includes average and standard deviation. The computed value provides a measurement of the dispersion of the data points around the average. The MIN and MAX columns are added in each of the sheets in Chemical_Specific_Data.xlsx besides the Average and STD DEV, which represent the smallest and largest "Value" in the dataset for each "Location ID," respectively. In each of the sheets, the “Hazard Index (Internal Calc)” column is added. This column represents a potential risk for each of the “Location ID” which is computed using the below equation: Hazard Index=(Computed Mean Value of each PFAS)/(Health-based Value of each PFAS) Where, Average represents the mean concentration of the chemical at a location, and the "Standard" is the regulatory or health-based standard for that chemical. All the statistical measurement and hazard index columns with all the existing columns are moved to the new Excel file named Chemical_Specific_Data_with_Statistics.xlsx. In the next step, the data is further processed by adding all the hazard indexes for all the “Chemical Abbreviation” for each “Location ID.” Therefore, this newly added column, “Aggregated Hazard Index” where all the hazard indices are summed to create an overall hazard assessment. This aggregated data which is all the computed columns, average, STD DEV, MIN, Max, Hazard Index and Aggregated Hazard Index and other columns are copied to new Excel file, “Chemical_Specific_Data_with_Statistics_Final.xlsx”. However, in this new Excel file using Python, each “LOCATION ID” has multiple rows with the same data. Therefore, in the final step, duplicate rows were removed from each sheet to ensure data accuracy, and the cleaned data was saved as Chemical_Specific_Data_with_Statistics_Final_Aggregated.xlsx.