Notre Dame fire text processing - word libraries

Published: 17-09-2020 | Version 1 | DOI: 10.17632/ktzx8rvv9h.1
Lingyao Li


This dataset presents the word libraries that we used to process tweet data for the Notre Dame fire investigation project. It includes three major word libraries: 1) fire cause filtering seeds, 2) text-removal libraries, and 3) lexicon-based libraries.

1. The downloaded tweets matching the search phrase “Notre Dame” were written in 62 different languages, of which the 13 most commonly used are English, French, Spanish, Portuguese, German, Italian, Japanese, Polish, Turkish, Dutch, Swedish, Catalan, and Thai. This library presents the filter seeds translated into these 13 languages with the aid of the Google Translation API.
2. The phrase “Notre Dame” in a tweet in the present dataset can refer to the University of Notre Dame in the United States, so a filtering word library containing university-related words was developed. Another word library was developed to filter out records that do not indicate the fire causes of Notre Dame Cathedral (“false” fire cause filters).
3. The research team randomly selected a sample of n=1,500 unique tweets from the dataset and manually identified and collected the cues used to detect negation in a tweet.

The full libraries of these word patterns are attached. This file lists the word patterns that we input into Python to filter the tweets and build the opinion detection model.
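As a rough illustration of how word libraries like these might be applied in Python, the sketch below keeps tweets that match at least one seed term and no removal term, then flags negation cues in what remains. It is not the project's actual code: the function names, the case-insensitive substring matching, and every term in the lists are placeholders assumed for this example and are not drawn from the attached libraries.

```python
def contains_any(text, terms):
    """Return True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.casefold()
    return any(term.casefold() in lowered for term in terms)


def filter_tweets(tweets, seeds, removal_terms):
    """Keep tweets that match at least one seed term and no removal term."""
    return [
        t for t in tweets
        if contains_any(t, seeds) and not contains_any(t, removal_terms)
    ]


def has_negation(text, cues):
    """Flag a tweet that contains at least one negation cue."""
    return contains_any(text, cues)


# Placeholder terms for illustration only; NOT taken from the published libraries.
seeds = ["cause", "cigarette", "short circuit"]
removal_terms = ["university", "football"]
negation_cues = ["no evidence", "never"]

tweets = [
    "Investigators say a short circuit may be the cause of the Notre Dame fire",
    "Notre Dame university football loss: the cause was poor defense",
    "There is no evidence that a cigarette started the Notre Dame fire",
]

kept = filter_tweets(tweets, seeds, removal_terms)
flags = [has_negation(t, negation_cues) for t in kept]
```

Substring matching is used here because it needs no tokenizer and so behaves the same way for languages written without spaces between words, such as Japanese and Thai. The trade-off is that short terms can match inside longer words, so a real pipeline may need stricter matching for some languages.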