Turkish Sentence Dataset for Unsupervised Morphological Disambiguation

Published: 20 August 2024 | Version 1 | DOI: 10.17632/hfs83tpvm2.1
Contributor:
Hayri Volkan Agun

Description

The dataset contains at most 500 unambiguous samples for each distinct tag sequence of length 5 found in the morphological parses of Turkish words.

Steps to reproduce

The dataset contains one sentence per line. To reproduce it, parse each sentence with the Zemberek morphological analyzer and group the sentences by their distinct morpheme sequences. Each unambiguous morpheme sequence is represented by at most 500 sentences.
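
The grouping and capping step can be sketched as follows. This is a minimal illustration and not the contributor's actual pipeline: the function name `group_sentences`, the `tag_sequence_of` callback, and the convention that the callback returns `None` for ambiguous or unparsable sentences are all assumptions made for this example. No Zemberek call is shown; the callback is a placeholder for whatever code runs the analyzer and extracts a tag sequence.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Cap stated in the dataset description: at most 500 sentences per sequence.
MAX_PER_GROUP = 500

TagSequence = Tuple[str, ...]


def group_sentences(
    sentences: Iterable[str],
    tag_sequence_of: Callable[[str], Optional[TagSequence]],
    max_per_group: int = MAX_PER_GROUP,
) -> Dict[TagSequence, List[str]]:
    """Group sentences by tag sequence, keeping at most max_per_group each.

    tag_sequence_of is a hypothetical stand-in for the morphological
    analysis step. It is assumed to return None when a sentence is
    ambiguous or cannot be parsed, in which case the sentence is skipped.
    """
    groups: Dict[TagSequence, List[str]] = {}
    for line in sentences:
        sentence = line.strip()
        if not sentence:
            continue  # ignore blank lines in the one-sentence-per-line file
        key = tag_sequence_of(sentence)
        if key is None:
            continue  # assumed marker for "ambiguous or unparsed"
        bucket = groups.setdefault(key, [])
        if len(bucket) < max_per_group:
            bucket.append(sentence)
    return groups


def toy_key(sentence: str) -> Optional[TagSequence]:
    """Toy key for demonstration only: groups by word count, not morphology."""
    words = sentence.split()
    return (str(len(words)),) if words else None
```

In practice the sentences would be read from the dataset file (for example by iterating over an open text file, which yields one line at a time), and `toy_key` would be replaced by a function that calls the analyzer and returns the morpheme or tag sequence used as the grouping key.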

Institutions

Bursa Teknik Universitesi

Categories

Natural Language Processing, Morphological Analysis, Text Mining

Licence