Published: 23 May 2022 | Version 1 | DOI: 10.17632/tn22s9kvrt.1
Pejman Gholami Dastgerdi, Mohammad Reza Feizi Derakhshi, Mehrdad Ranjbar-Khadivi


1. Introduction
Most natural language processing tasks, especially text processing, require meaningful expressions to be extracted from the body of the text. Examples include event recognition, topic recognition, trend and sentiment analysis, named entity recognition, text generation, and question answering systems. In general, any system that is meant to understand textual content needs to recognize the meaningful and widely used phrases of a language. This has been implemented with statistical processing methods such as n-grams and graph-based systems. Statistical systems, however, can only be relied on when the data under study are large enough. For this reason, the Sep_Anchor-Title_Fawiki01 corpus was built with the help of the Wikipedia corpus (accessible from Wikimedia), which includes a collection of Persian Wikipedia pages until May 3, 2016 (May 24, 2020) and contains more than 3 million articles. In the Sep_Anchor-Title_Fawiki01 corpus, the focus is on the titles of the articles and the texts of the links on the Wikipedia pages, which are described below.

2. Database
The Wikipedia corpus is used as the input data for the Sep_Anchor-Title_Fawiki01 corpus. The Wikipedia corpus is an XML file in which pages, categories, titles, and texts are marked with XML tags that enclose the related content. The main body of text consists of articles in hypertext format, in which links and some internal page structures are specified. The Sep_Anchor-Title_Fawiki01 corpus is compiled as a SQL database, and its processing and preparation focus on the titles of the articles and the texts of the links on the Wikipedia pages. The links are stored in the Anchors table. To speed up data retrieval, all words are first mapped to codes, and database operations are then carried out on the word codes. This mapping is stored in a table called Dictionary.
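
As a rough illustration of the word-to-code mapping, the following Python sketch assigns an integer code to each distinct word and encodes phrases as lists of codes. It is not the corpus's actual schema: the column names `code` and `word`, the helper names `build_dictionary` and `encode`, the whitespace tokenization, and the use of an in-memory SQLite database are all assumptions made for this example. Only the table name Dictionary comes from the description.

```python
import sqlite3


def build_dictionary(conn, phrases):
    """Give every distinct word an integer code and return a word -> code map.

    The table layout (code, word) is hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Dictionary ("
        "code INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL)"
    )
    for phrase in phrases:
        for word in phrase.split():
            # INSERT OR IGNORE leaves an already known word with its first code.
            conn.execute(
                "INSERT OR IGNORE INTO Dictionary (word) VALUES (?)", (word,)
            )
    conn.commit()
    rows = conn.execute("SELECT code, word FROM Dictionary")
    return {word: code for code, word in rows}


def encode(phrase, word_to_code):
    """Replace each word of a phrase by its integer code."""
    return [word_to_code[word] for word in phrase.split()]


conn = sqlite3.connect(":memory:")
# Two sample phrases: "University of Tabriz" and "University of Tehran".
phrases = ["دانشگاه تبریز", "دانشگاه تهران"]
word_to_code = build_dictionary(conn, phrases)
encoded = [encode(phrase, word_to_code) for phrase in phrases]
```

Because both phrases share their first word, their encoded forms share the first code, so later comparisons and joins can work on integers instead of strings.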
In addition, all the processed expressions are stored token by token in a table called AllContent, as a provision for running other processes faster on the Wikipedia corpus in the future. All of these tables are explained below. It should be noted that all processing is done after normalization and deletion of keywords.
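
To make the input side of this section more concrete, the sketch below pulls article titles and link texts out of a small hand-written XML fragment shaped like a Wikipedia dump. It is only an illustration under several assumptions: the sample fragment, the function name `extract_titles_and_anchors`, and the regular expression are invented for this example; real dump files declare an XML namespace that is omitted here; and the normalization step mentioned above is not reproduced.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample; real dumps are far larger and use an XML namespace.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>تبریز</title>
    <revision>
      <text>[[ایران|کشور ایران]] و [[دانشگاه تبریز]]</text>
    </revision>
  </page>
</mediawiki>"""

# Matches [[target]] and [[target|link text]] in wiki markup.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_titles_and_anchors(xml_text):
    """Return (titles, anchors) found in a dump-like XML string."""
    root = ET.fromstring(xml_text)
    titles, anchors = [], []
    for page in root.iter("page"):
        titles.append(page.findtext("title"))
        text = page.findtext("revision/text") or ""
        for target, label in LINK_RE.findall(text):
            # When no link text is given, the target itself is displayed.
            anchors.append((label or target).strip())
    return titles, anchors


titles, anchors = extract_titles_and_anchors(SAMPLE_DUMP)
```

Each extracted title or link text could then be split into tokens and stored one token per row, which is the kind of layout the AllContent description suggests.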



University of Tabriz


Natural Language Processing