English-Pashto Language Dataset (EPLD)

Published: 15 January 2025 | Version 1 | DOI: 10.17632/vmgv4s6vrn.1
Contributors:
Rabia Khan

Description

Introduction: The English-Pashto Language Dataset (EPLD) is a comprehensive resource intended to provide linguistic insight into the Pashto language. It covers the basics of communication in Pashto: counting, the alphabet, pronouns, and basic sentences used in everyday life. Every entry is translated from English to Pashto for clarity. The data has been proofread and verified by native speakers and language experts. Pashto has multiple variations and accents that depend on geography. This dataset addresses key differences in Pashto words and sounds, which may resemble or differ from English depending on gender, the tense of the statement, the relationship of the speaker, and similar factors. It is designed to support language learning, natural language processing (NLP) research, and computational linguistic studies focusing on Pashto.

Dataset Format: The dataset consists of four .xml files. Each XML file is structured with tags for easy parsing and integration into computational systems. The data within each file is organized so that it can be extracted and manipulated for research or application purposes.

Dataset Structure: Each of the four .xml files focuses on a specific aspect of the Pashto language.

1. Counting.xml:
• Contains the numbers from 0 to 100.
• Defines what each number is called in Pashto (for example, "10" is called "Ten" in English and "Lass" in Pashto).
• Each number is given in English and then translated into Pashto.

2. Alphabets.xml:
• Contains the Pashto alphabet together with the sound of each letter.
• Includes letters that sound similar to English letters as well as letters that sound different.

3. Pronouns.xml:
• Shows how pronouns in Pashto vary by gender (masculine and feminine).
• The pronouns also vary by person (1st, 2nd, and 3rd person).

4. Sentences.xml:
• Contains 104 basic sentences.
• The sentences are diverse in nature.
• The English sentences are translated into Pashto following the rules and grammar of the Pashto language.
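
Because the files are tag-structured XML, they can in principle be read with any standard XML parser. The following is a minimal sketch in Python using the standard library. The tag names in the sample (`counting`, `entry`, `number`, `english`, `pashto`) are assumptions made for illustration only, since the actual schema of the files is not described here; the loader itself does not depend on them and simply collects whatever sub-tags each record contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real tag names used in Counting.xml are not
# documented in this description, so <counting>, <entry>, <number>,
# <english>, and <pashto> are assumptions for illustration only.
SAMPLE_XML = """<counting>
  <entry>
    <number>10</number>
    <english>Ten</english>
    <pashto>Lass</pashto>
  </entry>
</counting>"""


def load_records(xml_text):
    """Return one dict per child of the root, mapping sub-tag names to text."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root:
        # Collect every sub-element of the record without assuming its name.
        record = {field.tag: (field.text or "").strip() for field in item}
        records.append(record)
    return records


records = load_records(SAMPLE_XML)
for record in records:
    print(record)
```

To read one of the downloaded files instead of an inline string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call. Inspecting the first few records this way is a reasonable first step for confirming the real tag names before writing any schema-specific code.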

Files

Categories

Computer Vision, Optical Character Recognition, Natural Language Processing, Machine Learning, Gene Translation, Asian Language

Licence