Turkish Sentence Dataset for Unsupervised Morphological Disambiguation
Published: 20 August 2024 | Version 1 | DOI: 10.17632/hfs83tpvm2.1
Contributor:
Hayri Volkan Agun
Description
The dataset contains at most 500 unambiguous samples for each distinct tag sequence of length 5 found in the morphological parses of Turkish words.
Steps to reproduce
The dataset contains one sentence per line. To reproduce the grouping, parse each sentence with the Zemberek morphological analyzer and group the sentences by distinct morpheme sequence. Each unambiguous morpheme sequence is represented by at most 500 sentences.
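The grouping step could be sketched roughly as follows. This is a minimal illustration, not the contributor's actual pipeline: `group_unambiguous` and the `analyze` callable are hypothetical names, and `analyze` stands in for a wrapper around Zemberek that returns, for each word, its list of candidate analyses as tuples of morpheme tags. The sketch also assumes that a sentence counts as unambiguous when every word has exactly one candidate analysis, and that the grouping key is the full sequence of per-word tag tuples; the dataset's exact definitions may differ.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Parse = Tuple[str, ...]                         # morpheme tags of one analysis
Analyzer = Callable[[str], List[List[Parse]]]   # sentence -> candidates per word


def group_unambiguous(
    sentences: Iterable[str],
    analyze: Analyzer,          # hypothetical wrapper around a morphological analyzer
    max_per_group: int = 500,   # cap stated in the dataset description
) -> Dict[Tuple[Parse, ...], List[str]]:
    """Group unambiguous sentences by their morpheme-tag sequence."""
    groups: Dict[Tuple[Parse, ...], List[str]] = defaultdict(list)
    for line in sentences:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        candidates = analyze(sentence)
        # Assumption: unambiguous means exactly one analysis for every word.
        if not candidates or any(len(c) != 1 for c in candidates):
            continue
        key = tuple(c[0] for c in candidates)
        # Keep at most max_per_group sentences per distinct sequence.
        if len(groups[key]) < max_per_group:
            groups[key].append(sentence)
    return dict(groups)
```

Because the input is one sentence per line, an open text file (opened with `encoding="utf-8"`) can be passed directly as `sentences`.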
Institutions
Bursa Teknik Universitesi
Categories
Natural Language Processing, Morphological Analysis, Text Mining