CareerCorpus: A Comprehensive Dataset of Annotated Resumes
Description
CareerCorpus is a comprehensive dataset of 302 annotated resumes spanning six occupational categories, designed for natural language processing research in automated recruitment and human resource analytics.

DATASET COMPOSITION:
- Total resumes: 302
- Categories: Teacher (50), Finance (50), Apparel (50), Accountant (51), Banking (50), Research Assistant (51)
- Format: Single Excel file (.xlsx) containing all six categories
- Annotation: Dual expert annotations preserved for all resumes

DATA SOURCES:
Resumes were collected from (1) a Kaggle dataset (professionally crafted resumes from LiveCareer.com) for five categories, and (2) LinkedIn public profiles for the Research Assistant category. HTML-formatted resumes were processed via ChatGPT (GPT-5) for text extraction and standardization.

EXPERT ANNOTATION:
Each resume was independently annotated by two domain experts:
- Financial categories (Finance, Accountant, Banking): Certified accountants with 5+ years of experience and ICMAB certifications
- Apparel: Textile/fashion industry practitioners
- Academic categories (Teacher, Research Assistant): University lecturers with teaching and research experience
Dual annotations are preserved to support soft-label training, annotation confidence modeling, and disagreement-aware evaluation metrics.
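As an illustration of how the preserved dual annotations might feed soft-label training and confidence modeling, the sketch below derives a soft label, a disagreement value, and a simple confidence weight from one pair of expert scores. The names `DualAnnotation`, `score_a`, `score_b`, and the `scale` parameter are hypothetical, and the averaging and weighting rules are assumptions chosen for illustration, not a procedure described by the dataset authors.

```python
from dataclasses import dataclass


@dataclass
class DualAnnotation:
    """Two independent expert scores for one resume (hypothetical field names)."""
    score_a: float
    score_b: float


def soft_label(ann: DualAnnotation) -> float:
    """Soft training target: the mean of the two expert scores (an assumption)."""
    return (ann.score_a + ann.score_b) / 2


def disagreement(ann: DualAnnotation) -> float:
    """Absolute difference between the two expert scores."""
    return abs(ann.score_a - ann.score_b)


def confidence_weight(ann: DualAnnotation, scale: float = 1.0) -> float:
    """Weight in [0, 1]: 1.0 when the experts agree exactly, lower as they diverge.

    `scale` is the assumed width of the scoring range; adjust it to the
    actual range used in the workbook.
    """
    return max(0.0, 1.0 - disagreement(ann) / scale)
```

A weight like this could, for example, down-weight resumes with strong expert disagreement in a training loss; whether that is appropriate depends on the task.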
DATA PREPROCESSING:
- HTML-to-text conversion via AI-assisted summarization
- PII removal and anonymization (names, emails, and phone numbers replaced with placeholders)
- Text normalization and standardization
- Duplicate elimination
- Format standardization across all categories

FILE STRUCTURE:
Single Excel workbook containing:
- All 302 resumes across six occupational categories
- Anonymized resume text
- Dual annotation scores from independent experts
- Category labels
- Resume metadata
- Organized in tabular format for easy access and analysis

INTER-ANNOTATOR AGREEMENT:
Pearson correlations range from 0.35 to 0.89 across categories (Finance: 0.68, Banking: 0.38, Accountant: 0.35, Apparel: 0.89, Teacher: 0.56, Research Assistant: 0.67). Overall mean correlation: 0.59; mean MAE: 0.106, indicating moderate agreement with low scoring error.

RESEARCH APPLICATIONS:
- Resume classification and categorization models
- Automated recruitment system development
- Skill extraction algorithms
- Job-candidate matching systems
- NLP benchmark evaluation
- Recruitment bias and fairness research
- Annotation quality and human-AI collaboration studies

ASSOCIATED PUBLICATION:
This dataset supports the Data in Brief article "CareerCorpus: A Comprehensive Dataset of Annotated Resumes" by Md Sagor Chowdhury, Adiba Fairooz Chowdhury, Ayesha Banu, and Riad Hossain (2025).

LICENSE:
Released under CC-BY-4.0 for open research use with appropriate citation.

CONTACT:
For questions: riad.h@eastdelta.edu.bd
Institution: Department of Computer Science and Engineering, East Delta University, Chattogram, Bangladesh
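The per-category agreement figures in the description (Pearson correlation and MAE between the two experts) could be recomputed along the following lines. This is a minimal sketch using only the standard library; it assumes the workbook rows have already been read into `(category, score_a, score_b)` tuples, and those field names and the function `agreement_by_category` are hypothetical, since the actual column names are not stated in the description.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one annotator's scores are constant
    return cov / (sd_x * sd_y)


def mae(xs, ys):
    """Mean absolute difference between the two annotators' scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def agreement_by_category(rows):
    """Group (category, score_a, score_b) rows and compute both metrics per category."""
    grouped = defaultdict(lambda: ([], []))
    for category, score_a, score_b in rows:
        grouped[category][0].append(score_a)
        grouped[category][1].append(score_b)
    return {
        category: {"pearson": pearson(a, b), "mae": mae(a, b)}
        for category, (a, b) in grouped.items()
    }
```

The reported overall mean correlation of 0.59 is consistent with an unweighted mean of the six per-category values listed in the description (3.53 / 6 ≈ 0.59).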
Files
Institutions
- Chittagong University of Engineering and Technology
- East Delta University