Kentucky Drug and Sex Crimes
Three crime data sources were collected and merged for this study. Each source either reports only on the U.S. state of Kentucky (KOOL and Louisville Open Data) or was filtered to contain only Kentucky results (FBI). Each data source has its own features, such as crime classifications, and its own challenges in collection and cleaning.
The United States Federal Bureau of Investigation (FBI) publishes a variety of queryable crime data on its website. The data is sourced from law enforcement agencies across the U.S. through the National Incident-Based Reporting System (NIBRS) and follows its standards. The goal of gathering, standardizing, and providing this information is to facilitate research into crime and law enforcement patterns. The information is provided as a collection of CSV files with instructions and code for importing them into a SQL database. For this research, we used the crime databases for the years 2017, 2018, and 2019, which contain a total of 1,939,990 unique incidents. The NIBRS_code property denotes the type of crime as assigned by the reporting agency. The codes used to identify human trafficking related incidents are 40A (Prostitution), 40B (Assisting or Promoting Prostitution), and 370 (Pornography/Obscene Material). Drug incidents were identified using codes 35A (Drug/Narcotic Violations) and 35B (Drug Equipment Violations).
The Kentucky Department of Corrections, as a service to the public, provides an online lookup of people currently in its custody, called Kentucky Offender Online Lookup (KOOL). This web application lets users search for sets of inmates by features such as name, crime date, crime name, race, and gender. The data that KOOL searches covers only people who are currently under the supervision of the state of Kentucky (or who should be under supervision, in the case of escape).
The Louisville Open Data Initiative (LOD) is a program of the city of Louisville, Kentucky, U.S.A., intended to increase the transparency of the city government and promote technological innovation. As part of LOD, a dataset of crime reports is made available online. The records in the LOD dataset represent any call for police service for which a police incident report was generated. This does not necessarily mean a crime was committed, as an incident report can be generated before an investigation has taken place.
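The NIBRS code selection described for the FBI data can be sketched as a simple filter. This is only an illustration under stated assumptions: the `NIBRS_code` column name comes from the description, while the `incident_id` column, the sample rows, and the function names are hypothetical and stand in for the real FBI CSV files.

```python
import csv
import io

# NIBRS offense codes named in the description.
HUMAN_TRAFFICKING_CODES = {"40A", "40B", "370"}
DRUG_CODES = {"35A", "35B"}


def categorize(nibrs_code):
    """Return the study category for a NIBRS code, or None if out of scope."""
    code = nibrs_code.strip().upper()
    if code in HUMAN_TRAFFICKING_CODES:
        return "human_trafficking"
    if code in DRUG_CODES:
        return "drug"
    return None


def select_incidents(rows):
    """Keep only in-scope rows, adding a 'category' key to each kept row."""
    selected = []
    for row in rows:
        category = categorize(row["NIBRS_code"])
        if category is not None:
            selected.append({**row, "category": category})
    return selected


# Made-up sample standing in for one FBI CSV file (assumption, not real data).
SAMPLE_CSV = """incident_id,NIBRS_code
1,35A
2,13A
3,40B
4,370
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = select_incidents(rows)
for row in selected:
    print(row["incident_id"], row["NIBRS_code"], row["category"])
```

Keeping the two code sets separate makes it straightforward to route drug and human trafficking related incidents into their own de-duplication paths later.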
Steps to reproduce
The merging strategy for the three datasets is described below. Data was collected from the Federal Bureau of Investigation (FBI), Louisville Open Data (LOD), and Kentucky Offender Online Lookup (KOOL). Some fields from the original data are not included in the final dataset for anonymization purposes: Race, Gender, Person Identifier (PID), and Height were removed. These fields were, however, collected and used for de-duplication. For all records without county data, the county name was set to the string "None". Premise Type from LOD, which corresponds to Location in FBI, was normalized using a mapping. A similar mapping process was applied to all Race and Gender variables across the tables.
LOD and KOOL De-Duplication for Drug Crimes
We started by grouping crime incidents in the KOOL and LOD tables into a subset if they shared both incident date and county. Next, the crime descriptions for all incidents in the subset were compared against each other. If two records matched on incident date, county, and crime description, they were combined into one record. Crime descriptions were matched using the token_set_ratio method from Python's FuzzyWuzzy fuzzy string matching library. The similarity between the two strings is computed from their Levenshtein edit distance, with the match threshold set to 0.94.
LOD and FBI De-Duplication for Drug Crimes
All records present in the LOD and FBI datasets were first blocked together if their incident date, county, premise type, and NIBRS code values matched exactly. Crime descriptions in the blocked set were then compared using a two-layer method. The first layer uses the token_set_ratio method with a threshold of 0.7. All pairs of records passing the first layer are passed to a Phrase2Vec embedding layer, trained on our crime description data, with a threshold of 0.5 on the cosine similarity metric.
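The blocking and two-layer comparison described above can be sketched roughly as follows. This is not the study's code. To stay self-contained it replaces FuzzyWuzzy's token_set_ratio with a standard-library approximation built on `difflib.SequenceMatcher`, and it replaces the trained Phrase2Vec model with a toy letter-count embedding. FuzzyWuzzy itself returns scores from 0 to 100, so the thresholds 0.7 and 0.5 are assumed here to apply to scores scaled to the range 0 to 1. All field names (`id`, `date`, `county`, `premise`, `nibrs`, `description`) and the sample records are hypothetical.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher


def _ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


def token_set_similarity(a, b):
    """Stdlib approximation of a token-set ratio, scaled to [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(_ratio(common, combined_a),
               _ratio(common, combined_b),
               _ratio(combined_a, combined_b))


def candidate_pairs(left, right, keys):
    """Blocking: yield only pairs whose blocking keys match exactly."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[tuple(rec[k] for k in keys)].append(rec)
    for rec in left:
        for other in blocks.get(tuple(rec[k] for k in keys), []):
            yield rec, other


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def toy_embed(text):
    """Placeholder for a trained phrase embedding: letter counts only."""
    text = text.lower()
    return [text.count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


def find_duplicates(left, right, keys, embed,
                    fuzzy_threshold=0.7, cosine_threshold=0.5):
    """Layer 1: token-set similarity. Layer 2: embedding cosine similarity."""
    matches = []
    for a, b in candidate_pairs(left, right, keys):
        if token_set_similarity(a["description"], b["description"]) < fuzzy_threshold:
            continue
        if cosine(embed(a["description"]), embed(b["description"])) >= cosine_threshold:
            matches.append((a["id"], b["id"]))
    return matches


# Hypothetical records, for illustration only.
lod = [
    {"id": "L1", "date": "2018-03-01", "county": "JEFFERSON",
     "premise": "RESIDENCE", "nibrs": "35A", "description": "POSSESSION OF MARIJUANA"},
    {"id": "L2", "date": "2018-03-01", "county": "JEFFERSON",
     "premise": "RESIDENCE", "nibrs": "35B", "description": "DRUG PARAPHERNALIA - BUY/POSSESS"},
]
fbi = [
    {"id": "F1", "date": "2018-03-01", "county": "JEFFERSON",
     "premise": "RESIDENCE", "nibrs": "35A", "description": "MARIJUANA POSSESSION"},
    {"id": "F2", "date": "2018-03-02", "county": "JEFFERSON",
     "premise": "RESIDENCE", "nibrs": "35A", "description": "POSSESSION OF MARIJUANA"},
]

BLOCK_KEYS = ("date", "county", "premise", "nibrs")
duplicates = find_duplicates(lod, fbi, BLOCK_KEYS, toy_embed)
print(duplicates)
```

Blocking first keeps the number of description comparisons small, since only records that already agree on the exact-match fields are ever scored. The same function could be reused with other blocking keys, for example incident date and county alone.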
KOOL and FBI De-Duplication for Drug Crimes
Records were first blocked together if they matched exactly on incident date, county, sex, and race. The crime descriptions for the blocked records were then passed through the same two-layer method described above.
De-Duplication for Human Trafficking Related Crimes
Because the set of human trafficking related crimes in the KOOL table is smaller, de-duplication of any records that occurred in the KOOL tables used a human-assisted approach. Crime descriptions for sex trafficking related crimes in the FBI tables were not used in the de-duplication process. Apart from these modifications, the de-duplication process for this subset was identical to the process for drug trafficking related crimes detailed previously.
Feature Generation
Twenty-one drug-class features were generated. All crime descriptions were run against every rule in the table. Some of the drug classes were combined to create larger sets. The drug class for each drug record was one-hot encoded and appended to the dataset.
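The feature generation step can be illustrated with a small rule-based encoder. The rule table used in the study is not reproduced here, so the three rules below, the class names, the `drug_class_` column prefix, and the sample descriptions are all hypothetical. The sketch only shows the general idea of running each description against every rule and appending indicator columns.

```python
import re

# Hypothetical rules: the study's actual rule table is not available here.
DRUG_CLASS_RULES = {
    "marijuana": re.compile(r"\b(marijuana|cannabis)\b", re.IGNORECASE),
    "opioid": re.compile(r"\b(heroin|fentanyl|opiates?)\b", re.IGNORECASE),
    "stimulant": re.compile(r"\b(cocaine|methamphetamine)\b", re.IGNORECASE),
}


def encode_drug_classes(description):
    """Run one description against every rule; return 0/1 indicator columns."""
    return {
        f"drug_class_{name}": int(bool(pattern.search(description)))
        for name, pattern in DRUG_CLASS_RULES.items()
    }


def add_drug_features(records):
    """Append the indicator columns to a copy of each record."""
    return [{**rec, **encode_drug_classes(rec["description"])} for rec in records]


# Made-up descriptions, for illustration only.
records = [
    {"id": 1, "description": "TRAFFICKING IN MARIJUANA, LESS THAN 8 OZ"},
    {"id": 2, "description": "POSSESSION OF CONTROLLED SUBSTANCE - HEROIN"},
    {"id": 3, "description": "DRUG PARAPHERNALIA - BUY/POSSESS"},
]
encoded = add_drug_features(records)
for row in encoded:
    print(row)
```

A description that matches no rule gets zeros in every column, and a description that names several drugs would set several indicators. Whether the study allowed more than one class per record is not stated, so that behavior is an assumption of this sketch.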