BanSum: A Dataset for Bangla Abstractive Article Summarization with Multiple Sentences

Published: 17 September 2024 | Version 2 | DOI: 10.17632/rxhj7g6y2k.2

Description

Our research aim is to evaluate the effectiveness of different Bangla text summarization methods (sum1, sum2, sum3) relative to the original text ('main').

The data shows that:
- The average length of the main text is 2482.72 characters.
- The average summary lengths are:
  - sum1: 293.75 characters
  - sum2: 506.10 characters
  - sum3: 688.50 characters

The mean compression ratio of each summary method (summary length divided by main length) is:
- sum1: 0.14
- sum2: 0.24
- sum3: 0.33

Notable findings:
- sum1 is the shortest summary on average and therefore the most heavily compressed (lowest ratio).
- sum2 produces summaries of medium length, while sum3 tends to generate the longest summaries.

Data Gathering and Interpretation: The data can be used to assess which method produces the most concise yet meaningful summaries. Researchers can use these statistics to evaluate the trade-off between summary length and completeness of the information conveyed.
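The length and compression statistics above could be recomputed along the following lines. This is a minimal sketch, not the authors' script: it assumes each record is a dictionary with the keys main, sum1, sum2 and sum3, that "characters" means Python string length (Unicode code points), and that the reported compression ratio is the mean of per-article ratios. The function name `length_statistics` and the toy records are hypothetical.

```python
from statistics import mean

# Column names taken from the dataset description.
SUMMARY_COLUMNS = ("sum1", "sum2", "sum3")


def length_statistics(records):
    """Return mean character lengths and mean per-article compression ratios.

    Assumption: records with an empty or missing 'main' are skipped, and a
    summary column only contributes rows where that summary is present.
    """
    usable = [r for r in records if r.get("main")]
    stats = {"main_mean_len": mean(len(r["main"]) for r in usable)}
    for col in SUMMARY_COLUMNS:
        rows = [r for r in usable if r.get(col)]
        stats[f"{col}_mean_len"] = mean(len(r[col]) for r in rows)
        # Compression ratio = summary length / main length, averaged per article.
        stats[f"{col}_mean_ratio"] = mean(len(r[col]) / len(r["main"]) for r in rows)
    return stats


# Toy records with placeholder text (hypothetical, not taken from the dataset).
toy = [
    {"main": "a" * 100, "sum1": "a" * 10, "sum2": "a" * 20, "sum3": "a" * 30},
    {"main": "b" * 200, "sum1": "b" * 40, "sum2": "b" * 60, "sum3": "b" * 80},
]
stats = length_statistics(toy)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Because the ratio is averaged per article, it need not equal the mean summary length divided by the mean main length, which may explain why 293.75 / 2482.72 is not exactly 0.14.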

Steps to reproduce

The data was gathered by scraping Bangla news articles from publicly available sources using the BeautifulSoup and requests libraries in Python. Each record in the dataset represents one article (main) and its corresponding summaries (sum1, sum2, sum3). The dataset underwent several preprocessing steps: text cleaning (removal of HTML tags and special characters), normalization, and handling of missing values. The final dataset was saved in CSV format, and metadata was generated to document key statistics about text lengths and missing data. All preprocessing and summarization scripts were written in Python using the pandas, NumPy, and BeautifulSoup libraries.
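The cleaning steps described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: it uses only the Python standard library (html.parser in place of BeautifulSoup) so that it runs without third-party packages, it assumes NFC as the normalization form, it assumes a particular set of characters to keep, and it handles missing values by dropping incomplete records. The names `clean_text` and `clean_records` are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser

FIELDS = ("main", "sum1", "sum2", "sum3")

# Assumption: keep word characters, whitespace, the Bengali block
# (U+0980 to U+09FF), the danda (U+0964), zero-width joiners used in
# Bangla conjuncts, and basic punctuation; replace everything else.
_UNWANTED = re.compile(r"[^\w\s\u0980-\u09FF\u0964\u200c\u200d.,!?;:'\"()\-]")


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw):
    """Strip HTML tags, normalize Unicode, and drop special characters."""
    if raw is None:
        return ""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    text = unicodedata.normalize("NFC", text)  # assumed normalization form
    text = _UNWANTED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_records(records):
    """Clean every field and drop records with any empty field (assumed policy)."""
    cleaned = []
    for record in records:
        row = {field: clean_text(record.get(field)) for field in FIELDS}
        if all(row.values()):
            cleaned.append(row)
    return cleaned


# Toy input (hypothetical): the second record has a missing summary.
raw_records = [
    {"main": "<p>বাংলা &amp; খবর</p>", "sum1": "ক", "sum2": "ক", "sum3": "ক"},
    {"main": "<p>খবর</p>", "sum1": None, "sum2": "ক", "sum3": "ক"},
]
cleaned_records = clean_records(raw_records)
print(len(cleaned_records), cleaned_records[0]["main"])
```

Dropping incomplete records is only one way to handle missing values; the description does not say whether the authors dropped, imputed, or flagged them.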

Institutions

University of Dhaka Faculty of Engineering and Technology

Categories

Natural Language Processing, Natural Language Generation

Licence