FairDataset for Personalized Information Retrieval (PIR) Systems
# Dataset Description

## Overview

This dataset contains search results collected from Personalized Information Retrieval (PIR) systems, reflecting individual user search preferences. It is designed for investigating diversity and bias in information access.

## Dataset Composition

The dataset includes search result data collected from Google News and Google Search based on various queries and then refined by applying personalization factors. After removing duplicates, the consolidated Fair Dataset is provided in CSV format.

- PIR: Google News, Google Search
- Query: Abortion, Covid 19 pandemic, Russia Ukraine conflict, Ukraine war, US China trade war, World Cup 2022, election results, recreational marijuana
- Personalization Factor: geo-information, accept_language, user_agent
- Data Collection Period: January 1, 2022, to December 31, 2022

## Data Format

Each data file is in CSV (Comma-Separated Values) format and contains the following columns:

- titles: The title of the search result.
- contents: The content visible directly on the webpage.
- detail_content: Detailed content collected by additionally accessing the URL.
- urls: The original URL of the search result.

## Data Size

- Total File Size: 4.33 GB.
- Individual File Size: 10 MB to 40 MB per file.
- Number of Files: 338 files, plus an additional consolidated Fair Dataset.
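
The column layout listed under Data Format can be read with Python's standard `csv` module. The sketch below is only an illustration: the helper names `read_results` and `drop_duplicate_urls` are hypothetical, the sample rows are made up, and matching on identical `urls` values is an assumed duplicate criterion, not necessarily the rule used to build the consolidated Fair Dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["titles", "contents", "detail_content", "urls"]


def read_results(handle):
    """Read search results from an open CSV file handle into a list of dicts."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def drop_duplicate_urls(rows):
    """Keep the first row seen for each URL (an assumed duplicate criterion)."""
    seen = set()
    unique = []
    for row in rows:
        url = row["urls"].strip()
        if url in seen:
            continue
        seen.add(url)
        unique.append(row)
    return unique


# Tiny in-memory stand-in for one of the CSV files (made-up rows).
sample = io.StringIO(
    "titles,contents,detail_content,urls\n"
    "Title A,Snippet A,Full text A,https://example.com/a\n"
    "Title B,Snippet B,Full text B,https://example.com/b\n"
    "Title A again,Snippet A,Full text A,https://example.com/a\n"
)
rows = read_results(sample)
unique_rows = drop_duplicate_urls(rows)
print(len(rows), len(unique_rows))
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` (the encoding is an assumption). Because `detail_content` can hold long page text, the default field size limit of the `csv` module may need to be raised with `csv.field_size_limit`.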