Hausa-English Code-Switched Dataset

Name: Hausa-English Code-Switched Dataset
Creator: Umar Baba Umar
Published: 2024-07-19T18:13:05.429Z
Keywords: Machine Translation

doi:10.17632/3xjyjsf4sb.1

Published: 19 July 2024 | Version 1 | DOI: 10.17632/3xjyjsf4sb.1

Contributor:

Umar Baba Umar

Description

Overview

The Hausa-English Code-Switched Dataset contains comments collected from Facebook, Instagram, YouTube, and Twitter. These comments exhibit code-switching between Hausa and English, providing a rich resource for linguistic research, natural language processing (NLP), and machine translation.

Features

- Platform Support: Includes comments from Facebook, Instagram, YouTube, and Twitter.
- Multilingual Data: Captures code-switching between Hausa and English, reflecting real-world multilingual usage.
- Customizable: Adaptable for other language combinations and specific data collection needs.

Data Collection Process

The dataset was collected using a custom scraper designed to gather code-switched comments from social media platforms. Here’s a brief overview of the process:

- Platform Integration: Configured to work with Facebook, Instagram, YouTube, and Twitter APIs.
- Multilingual Data Capture: Identified comments with code-switching between Hausa and English.
- Configuration: Set up API keys and platform-specific settings.
- Execution: Ran the scraper on each platform, collecting and aggregating comments.

Applications

The dataset supports various research and application domains:

- Linguistic Analysis: Study code-switching patterns between Hausa and English.
- NLP: Train and evaluate models for tasks like language identification and part-of-speech tagging.
- Machine Translation: Provides parallel data for training translation systems.
- Sociolinguistic Studies: Explore social and cultural factors influencing code-switching on social media.

Dataset Structure

The dataset is organized into a CSV file with the following columns:

- Platform: The social media platform (Facebook, Instagram, YouTube, Twitter).
- Date: The date the comment was posted.
- Time: The time the comment was posted.
- User ID: A unique identifier for the user.
- Comment: The code-switched comment containing Hausa and English text.
- English Translation: The correct English translation of the code-switched comment.

Example Entries

Platform  | Date       | Time     | User ID | Comment                                     | English Translation
Facebook  | 2023-06-15 | 14:23:45 | user123 | Ina son wannan song, it's really great!     | I love this song, it's really great!
Twitter   | 2023-06-15 | 14:23:45 | user124 | Yau ne zamu je gidan abinci, can't wait!    | Today we are going to the restaurant, can't wait!
Instagram | 2023-06-15 | 14:23:45 | user125 | Kai, wannan video is so funny!              | Wow, this video is so funny!
YouTube   | 2023-06-15 | 14:23:45 | user126 | Na gode for sharing this, very informative! | Thank you for sharing this, very informative!

Conclusion

The Hausa-English Code-Switched Dataset is a valuable resource for researchers and practitioners in linguistics, NLP, and machine translation. It provides real-world examples of code-switching, supporting the development of robust models and tools for handling multilingual text in diverse contexts. Explore the dataset and contribute to its ongoing development and application.
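As an illustration of the column layout described under Dataset Structure, the following is a minimal sketch of reading such a file with Python's standard csv module and pulling out (comment, translation) pairs for machine translation work. The four sample rows are the example entries from this description; the header spelling is assumed to match the column names listed above, and the real file name is not stated here, so the data is inlined rather than read from disk. The helper names load_rows and parallel_pairs are hypothetical.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the Example Entries in the description.
# Assumption: the real CSV header uses exactly these column names.
SAMPLE_CSV = """Platform,Date,Time,User ID,Comment,English Translation
Facebook,2023-06-15,14:23:45,user123,"Ina son wannan song, it's really great!","I love this song, it's really great!"
Twitter,2023-06-15,14:23:45,user124,"Yau ne zamu je gidan abinci, can't wait!","Today we are going to the restaurant, can't wait!"
Instagram,2023-06-15,14:23:45,user125,"Kai, wannan video is so funny!","Wow, this video is so funny!"
YouTube,2023-06-15,14:23:45,user126,"Na gode for sharing this, very informative!","Thank you for sharing this, very informative!"
"""


def load_rows(handle):
    """Read all rows into dicts keyed by the header's column names."""
    return list(csv.DictReader(handle))


def parallel_pairs(rows):
    """Return (code-switched comment, English translation) pairs."""
    return [(row["Comment"], row["English Translation"]) for row in rows]


rows = load_rows(io.StringIO(SAMPLE_CSV))
pairs = parallel_pairs(rows)
per_platform = Counter(row["Platform"] for row in rows)

print(len(rows), "rows")
print(pairs[0])
print(dict(per_platform))
```

To use this with the downloaded file, replace the io.StringIO(...) argument with a file opened using newline="" and encoding="utf-8", which is what the csv module documentation recommends for files containing quoted fields.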

Steps to reproduce

To reproduce a dataset similar to the one described in the repository:

1. Clone the repository from GitHub and navigate to its directory.
2. Ensure Python 3 is installed, then install the packages listed in requirements.txt using pip.
3. Obtain API keys for Facebook, Instagram, YouTube, and Twitter, and set them in the repository's configuration files.
4. Run the provided Jupyter notebook for each social media platform: FacebookExtractor.ipynb, Instagramcommenextractor.ipynb, youtubeextractor.ipynb, and twitter_replies_scraper_.ipynb.
5. Extract the scraped comments from these notebooks, remove duplicates and irrelevant entries, and normalize the text as needed.
6. For a multilingual dataset, translate the comments into the target languages and align each translation with its source comment.
7. Save the cleaned and aligned data as a CSV file.
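The cleaning step (removing duplicates and normalizing text) is not specified in detail here, so the following is only a sketch of one plausible approach using the Python standard library. It assumes that normalization means Unicode NFC plus whitespace collapsing, and that comments differing only in letter case count as duplicates; both choices, and the function names normalize_text and clean_comments, are illustrative assumptions rather than the repository's actual procedure.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC and collapse whitespace runs (assumed normalization)."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_comments(comments):
    """Normalize comments, drop empty ones, and remove duplicates.

    The first occurrence of each comment is kept, in original order.
    """
    seen = set()
    cleaned = []
    for comment in comments:
        norm = normalize_text(comment)
        key = norm.casefold()  # assumption: case-only variants are duplicates
        if not norm or key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned


raw = [
    "Na gode   for sharing this, very informative!",
    "na gode for sharing this, very informative!",
    "   ",
    "Kai, wannan video is so funny!",
]
cleaned = clean_comments(raw)
print(cleaned)
```

Filtering "irrelevant entries" (for example, comments with no Hausa or no English content) would need a language-identification step that this sketch deliberately leaves out.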

Institutions

Abubakar Tafawa Balewa University
