Normative Documents Interactive Question Answering Dataset (NDIQAD)

Published: 20 December 2021 | Version 2 | DOI: 10.17632/b64ggb36ht.2


This dataset consists of questions and answers based on selected normative documents. It includes 12 normative documents from different universities and banks, e.g. the Study Rules of Mendel University in Brno, the MIT Term Regulations and Examination Policies, or the Terms and Conditions for Personal Line of Credit (PLOC) of HSBC bank. These documents were manually annotated with 1767 questions by 15 annotators. The average document length is 14 pages. On average, the dataset contains 12.8 questions per page and 1.1 questions per paragraph, covering 33% of paragraphs. Each question is a single sentence, and each answer is an exact span of the document text. The search scope of each question is always one whole document.

Each question-answer pair is also accompanied by a path leading over the headings of the document from the document's root to the section containing the answer. This structural information enables testing of interactive question answering, where the QA system asks supplementary questions to limit the number of answers by disambiguating the document section.

The dataset contains the original documents in PDF together with questions and answers in CSV files. The CSV files use a semicolon [;] as the separator and, optionally, double quotes ["] to escape strings; a double quote inside a string is doubled [""]. Each row has four attributes, as in the following example of a dataset item:

Document: Study Rules of Mendel University in Brno
Path: Study Rules / 2 Study in Bachelor’s and Master’s Degree Programs / 2.11 Study interruption
Question: How can I interrupt my study?
Answer: Student’s study may be interrupted at the student’s request or ex officio.

Each question appears twice in the dataset. The first version is the original question as written by the annotator. The second version, marked "optimized", contains questions in which the first person ("I", "me", "my") has been replaced by the actor of the document (a student or a client).
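The exact rewriting rules are not specified here; as a rough sketch, assuming a simple word-level substitution (the pronoun-to-actor mapping below is an assumption, not the dataset's actual pipeline), the rewriting might look like:

```python
import re

def optimize_question(question: str, actor: str = 'student') -> str:
    # Hypothetical first-person substitution: replace "I", "me", "my"
    # with the document's actor. Case handling and verb agreement are
    # deliberately ignored in this sketch.
    mapping = {'I': f'the {actor}', 'me': f'the {actor}', 'my': f"the {actor}'s"}
    return re.sub(r'\b(I|me|my)\b', lambda m: mapping[m.group(0)], question)

print(optimize_question('How can I interrupt my study?'))
# -> How can the student interrupt the student's study?
```

A production version would also need to handle capitalization ("My") and grammatical agreement, which a plain substitution cannot guarantee.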
This replacement is done automatically and can improve QA performance. This dataset is unique mainly due to the type of the documents. Our focus is on normative documents with strict, consistent formatting, numbered headings and paragraphs, low use of pronouns, etc. Unlike the majority of other research in this field, we do not focus on national laws; rather, we focus on documents used to support internal processes in large organisations. We found that current QA methods do not perform well on this kind of document. Hence, we have introduced this dataset to address these issues, and we have also presented a new approach to QA on normative documents; see the paper cited under Steps to reproduce.
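The CSV conventions described above (semicolon separator, optional double-quote escaping with doubled quotes inside strings) match the defaults of Python's standard csv module apart from the delimiter. A minimal reading sketch, using a sample row reconstructed from the example item (the real files may differ in details such as a header row):

```python
import csv
import io

# Sample row reconstructed from the example dataset item above.
sample = (
    '"Study Rules of Mendel University in Brno";'
    '"Study Rules / 2 Study in Bachelor\'s and Master\'s Degree Programs / 2.11 Study interruption";'
    '"How can I interrupt my study?";'
    '"Student\'s study may be interrupted at the student\'s request or ex officio."\n'
)

# Semicolon separator and double-quote escaping; the csv module's default
# doublequote=True handles the doubled-quote ("") escape.
reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"')
for document, path, question, answer in reader:
    # The path lists headings from the document root down to the
    # section containing the answer.
    headings = path.split(' / ')
```

When reading the actual files, replace `io.StringIO(sample)` with an open file handle.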


Steps to reproduce

The dataset was collected by 15 annotators. They were given random pieces of text from random documents and asked to write a few questions related to the given piece of text from the point of view of the document's actor (e.g., a student or a client). It was ensured that no annotator was given the same piece of text twice; however, the same piece of text may have been annotated by different annotators to increase the diversity of questions. Similar, nonsensical, and non-atomic questions were removed manually afterwards. Spelling errors were corrected manually, and contractions (e.g., he's) were rewritten to their full form. The methodology of the dataset creation and a QA system using this dataset are described in our paper Preprocessing of Normative Documents for Interactive Question Answering, see:


Mendel University in Brno, Faculty of Business and Economics


Natural Language Processing, Machine Learning