Onneshon: A Hybrid Bengali Resume Dataset for Summarization and Classification

Published: 26 May 2026 | Version 1 | DOI: 10.17632/4md7bx6fd7.1

Description

Onneshon is a hybrid Bengali resume corpus developed for NLP tasks including multi-class text classification, information extraction, and automatic summarization. It supports research on resume summarization and classification, recruitment analysis, and HR process optimization. The corpus contains resume texts at sentence and segment level, each labelled with a standard resume section to help NLP models learn the structure of Bangla resumes.

Dataset Composition:
Total text segments: 1,739
Language: Bangla
Labels: Experience, Skill, Education, Objective
Format: CSV (.csv)
Columns: text (Bangla resume text); label (resume section label)

Data Sources: Texts were collected in two ways: written manually, and generated with GPT-4, Claude, and Gemini using carefully designed prompts. The dataset is built from 100 resumes: 50 synthetic and 50 manually created.

Annotation Process: Each segment is assigned one of four labels: Objective (100), Experience (823), Education (370), Skill (446). Labels were assigned according to the meaning and purpose of the text, not just the presence of specific words. All labels were checked multiple times for correctness and consistency.

Data Preprocessing: Sentences and segments were extracted from full resumes. Personal information was removed for privacy. Text was cleaned and normalized, with unnecessary symbols and repetition removed. Consistent labelling rules were applied throughout.

File Structure: A CSV file containing 1,739 Bangla resume text segments with section labels (Objective, Experience, Skill, Education) in a clean, easy-to-use format.

Research Applications:
Bangla resume section classification and summarization
Automatic resume parsing for Bangla
Training ML and deep learning models
Comparing AI-generated and human-written resumes
Testing Bangla NLP models on real resume data

Out-of-Scope Use:
Non-technical recruitment (the dataset focuses on software engineer and data analyst roles)
Personalized hiring decisions without further validation
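
The two-column layout and per-label counts described above can be checked after download with a short script. The following is a minimal sketch, not an official loader: it assumes the CSV has a header row naming the columns exactly text and label, that label values are spelled as in the description, and that the file is UTF-8 encoded. The function names are illustrative only.

```python
import csv
from collections import Counter

# Per-label segment counts as stated in the dataset description (total 1,739).
EXPECTED_COUNTS = {"Objective": 100, "Experience": 823, "Education": 370, "Skill": 446}


def read_segments(lines):
    """Parse CSV lines (any iterable of strings) into (text, label) pairs.

    Assumes a header row with the columns 'text' and 'label'.
    """
    reader = csv.DictReader(lines)
    return [(row["text"].strip(), row["label"].strip()) for row in reader]


def load_segments(path):
    """Read (text, label) pairs from a CSV file on disk.

    The utf-8-sig encoding is an assumption; it also accepts plain UTF-8.
    """
    with open(path, encoding="utf-8-sig", newline="") as f:
        return read_segments(f)


def label_counts(segments):
    """Count how many segments carry each section label."""
    return Counter(label for _, label in segments)


def count_mismatches(segments):
    """Return {label: (found, expected)} for labels whose counts differ
    from the figures given in the description."""
    found = label_counts(segments)
    labels = set(found) | set(EXPECTED_COUNTS)
    return {
        label: (found.get(label, 0), EXPECTED_COUNTS.get(label, 0))
        for label in labels
        if found.get(label, 0) != EXPECTED_COUNTS.get(label, 0)
    }
```

Calling load_segments on the downloaded file (its actual filename is not given here) and passing the result to count_mismatches should return an empty dictionary if the file matches the description. Because Experience segments outnumber Objective segments by roughly eight to one, a stratified split is worth considering when training classifiers on this data.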

Categories

Artificial Intelligence, Information Retrieval, Natural Language Processing, Bengali Language
