Loay & Safa Dataset

Published: 9 December 2016 | Version 2 | DOI: 10.17632/ggh75fd25f.2


The goal of this project is to make a large body of textual data available in electronic form for testing the performance of systems such as retrieval systems, search engines, and plagiarism-checking systems. We collected data from papers, books, and articles, with the possibility of recurring file content. File sizes range from 1 KB to 20,374 KB, and the overall size of the "Raw Dataset" is 4.64 GB.

This project also provides a "Modified Dataset" produced by lexical analysis: each input file is analyzed and only words consisting solely of English alphabet characters are extracted. Because the files were taken from many sources, each letter of each word is checked to handle cases such as (we 're → we are, don't → do not, bi-cycle → bicycle, B.S. → BS, and up/down → up down). Despite our best efforts to clean this dataset, it still contains a very small percentage of non-English words and non-words; no automatic spelling correction was performed. After filtering out empty files, the resulting data is 4.27 GB.

Note: the Raw Dataset contains text files that may include English alphabet characters, other symbols, non-printable characters, and numbers.

Last update: November 2016.
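The word-extraction rules above can be sketched as follows. This is a minimal illustration of the described transformations (contraction expansion, hyphen joining, abbreviation dots, slash splitting, alphabetic-only filtering), not the authors' actual implementation; the function name and exact rule order are assumptions.

```python
import re

def clean_words(text: str) -> list[str]:
    """Sketch of the lexical analysis described above (assumed rules,
    not the dataset authors' exact code)."""
    text = re.sub(r"\s*'re\b", " are", text)     # we 're  -> we are
    text = re.sub(r"n't\b", " not", text)        # don't   -> do not
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)   # bi-cycle -> bicycle
    text = re.sub(r"(?<=[A-Za-z])\.", "", text)  # B.S.    -> BS
    text = text.replace("/", " ")                # up/down -> up down
    # Keep only words made solely of English alphabet characters.
    return [w for w in text.split() if w.isalpha()]
```

For example, `clean_words("We 're sure they don't ride a bi-cycle.")` yields only purely alphabetic tokens, matching the Modified Dataset's stated content.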


Steps to reproduce

Process the dataset in your project

This section describes how to process the downloaded dataset in your project.

1. Download the dataset (either the Raw Dataset or the Modified Dataset) and decompress the downloaded file.
2. Start your project using any programming language.
3. (Optional) In your project, click Browse in the dialog box and select the dataset or a specific file (possibly a single file) from the dataset. Alternatively, simply specify the location and name of the dataset in the source code of your program.
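The alternative in step 3 (specifying the dataset location in source code) might look like the sketch below, which also applies the non-empty-file filter mentioned in the description. The directory layout and `*.txt` glob are assumptions; point `root` at wherever you decompressed the archive.

```python
from pathlib import Path

def iter_nonempty_files(root: str):
    """Yield (filename, text) pairs for every non-empty text file in
    the decompressed dataset directory (assumed flat *.txt layout)."""
    for path in sorted(Path(root).glob("*.txt")):
        # Ignore undecodable bytes: raw files may contain non-printable
        # characters and other symbols, as noted in the description.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if text.strip():  # pass only non-empty files
            yield path.name, text
```

A retrieval or plagiarism-checking experiment can then consume the pairs directly, e.g. `for name, text in iter_nonempty_files("Modified_Dataset"): ...`.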


University of Baghdad


Text Editing, Big Data, Text Extraction, File Searching, Text Processing, Text Mining