Published: 23 May 2022 | Version 1 | DOI: 10.17632/g4tnnf683m.1
Pejman Gholami Dastgerdi, Mehrdad Ranjbar-Khadivi, Elnaz Zafarani Moattar


1. Introduction
In computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, these items can be phonemes, syllables, letters, words, etc. In this dataset, word-level n-grams (more precisely, in language-processing terms, token-level n-grams) have been calculated for Persian texts from the Hamshahri corpus and from Telegram messages. The texts were tokenized with the tokenizer developed in the Computerized Intelligence Systems Laboratory of the University of Tabriz, which also performs sentence separation.

1.1. Hamshahri corpus
The Hamshahri corpus consists of texts published in the Hamshahri online newspaper. It is distributed as XML files, one per day, in which the news text, its categories, and the submission date are marked up with XML tags. The corpus is available in two versions. Sep_Ngram_Tel-Ham01 uses the first version of Hamshahri, which includes more than 80,000 articles from July 23, 1996 to June 20, 2003.

1.2. Telegram corpus
The main problem with the Hamshahri corpus is that, because of its age, it lacks newer words (such as Telegram, Barjam, etc.). For this reason, Sep_Ngram_Tel-Ham01 uses the newer Telegram corpus in addition to the Hamshahri corpus. The Telegram corpus consists of posts published in reputable and well-known Persian groups and channels from March 15, 2017 to December 31, 2017, more than 1,800,000 posts in total, collected with the Telegram API in the Computerized Intelligence Systems Laboratory of the University of Tabriz. One month of this corpus has been manually labeled by topic and is available [1].
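The token-level n-gram counting described in the introduction can be illustrated with a minimal Python sketch. It is not the code used to build this dataset. The `tokenize` function is a hypothetical stand-in (plain whitespace splitting) for the laboratory tokenizer, the two sample sentences are invented for illustration, and counting n-grams within sentence boundaries is an assumption rather than a documented property of the dataset.

```python
from collections import Counter


def tokenize(sentence):
    # Hypothetical stand-in for the laboratory tokenizer (not reproduced here):
    # split on whitespace only.
    return sentence.split()


def count_ngrams(sentences, n):
    """Count token n-grams over already-separated sentences.

    Assumption: n-grams do not cross sentence boundaries.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Invented sample sentences, used only to exercise the function.
sentences = ["من به مدرسه رفتم", "او به مدرسه رفت"]
bigrams = count_ngrams(sentences, 2)
```

Each key in `bigrams` is a tuple of n tokens and each value is its frequency across the input sentences. Calling `count_ngrams(sentences, 1)` or `count_ngrams(sentences, 3)` gives unigram or trigram counts in the same form.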



University of Tabriz


Natural Language Processing