HQA-data: A historical Question Answer Generation dataset From previous multi perspective conversation

Published: 15 December 2022| Version 1 | DOI: 10.17632/p85z3v45xk.1
Sabbir Hosen


This is a question answering dataset built from user chat logs. The source is "The Ubuntu Dialogue Corpus", a dataset of text conversations between two or more people. We analyzed the users' chats by dialogueID, which identifies a unique chat room. For each dialogueID, we merged the chats into a single context and derived questions and answers from that context. We then determined the starting and ending positions of each answer within its context. The dataset is available in two formats: 1. Comma Separated Values (CSV), 2. JSON. In each format, the Train file contains 7323 contexts and 29150 QA pairs, and the Test file contains 2041 contexts and 7288 QA pairs. In total, the dataset contains 9364 contexts and 36438 QA pairs.
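The answer positions mentioned above can be illustrated with a minimal sketch. It assumes character offsets, an exclusive end position, and that the answer appears verbatim in the context; the dataset's actual offset convention is not stated here, and the function name find_answer_span is hypothetical.

```python
from typing import Optional, Tuple


def find_answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of
    `answer` in `context`, or None if the answer is not found verbatim.

    Assumption: `end` is exclusive, so context[start:end] == answer.
    """
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Illustrative strings only, not taken from the dataset.
example_context = "run sudo apt update first"
example_span = find_answer_span(example_context, "sudo apt update")
```

If the dataset stores an inclusive end position instead, the second value would be one smaller than the one computed in this sketch.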


Steps to reproduce

First, we treat all conversations from The Ubuntu Dialogue Corpus as raw data. We build each context by merging the raw conversations that share a dialogueID. Each context is then passed to T5, a pre-trained model used here for question-answer generation, which produces questions and answers from the context. Finally, we locate each answer's starting and ending positions within its context.
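The merging step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each raw row is a mapping with a "dialogueID" key and a "text" key holding one utterance, and that utterances are joined with a single space in their original order. The key name "text", the join rule, and the function name build_contexts are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def build_contexts(rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Group utterances by dialogueID and join each group into one context.

    Assumption: rows arrive in chronological order, so appending preserves
    the order of the conversation.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        grouped[row["dialogueID"]].append(row["text"].strip())
    return {dialogue_id: " ".join(parts) for dialogue_id, parts in grouped.items()}


# Illustrative rows only, not taken from the corpus.
sample_rows = [
    {"dialogueID": "1.tsv", "text": "my wifi stopped working "},
    {"dialogueID": "2.tsv", "text": "how do I install git?"},
    {"dialogueID": "1.tsv", "text": "try restarting network-manager"},
]
sample_contexts = build_contexts(sample_rows)
```

Each resulting context would then be given to the question-answer generation model, which this sketch does not cover.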


University of Asia Pacific


Natural Language Processing, Chat, Text Extraction, Questionnaire, Deep Learning