Russian dataset for the reply recovery

Published: 12 June 2023 | Version 1 | DOI: 10.17632/xm86yszck2.1
Contributor:
Igor Buyanov

Description

This dataset was constructed from several Telegram chats to train a model that predicts whether one message can be a reply to another. **Note:** pairs in which the message is an actual reply are labeled with **zero**. The positive examples were acquired from Telegram's native `reply_to` markup. The negative examples were acquired by random sampling, which surprisingly often produces pairs that could plausibly be replies, making the negative examples noisy.

The dataset covers several chats:

* balichat_woman - chat of women from Bali
* borussia_chat - football chat
* chat_suicidnikov - chat dedicated to the suicide game "Siniy kit" ("Blue Whale")
* cotedazurchat - chat of immigrants in France
* easypeasycodechat - chat of programmers
* openwrt_ru - chat dedicated to OpenWrt
* orange_sosedi - chat of neighbors
* sling38 - chat of young moms
* terrariaphone - chat of Terraria gamers

The `test_data` was validated by crowdsourcing with Toloka.ai. The final validation was done by the authors, so it is considered a gold test set.
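The pair construction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the message fields (`id`, `text`, `reply_to`), the function name `build_pairs`, and the use of label 1 for negatives are assumptions made for illustration. Only the convention that actual replies are labeled zero comes from the description.

```python
import random


def build_pairs(messages, seed=0):
    """Build (candidate_parent_text, message_text, label) triples.

    `messages` is a list of dicts with hypothetical keys 'id', 'text',
    and an optional 'reply_to' holding the id of the parent message.
    """
    rng = random.Random(seed)
    by_id = {m["id"]: m for m in messages}
    pairs = []
    for m in messages:
        parent = by_id.get(m.get("reply_to"))
        if parent is None:
            continue  # not a reply, or the parent is missing from the chat
        # Actual reply taken from the reply_to link: label 0 (per the dataset note).
        pairs.append((parent["text"], m["text"], 0))
        # Negative example: a random message that is neither the reply
        # itself nor its true parent. Label 1 is an assumption.
        candidates = [c for c in messages if c["id"] not in (m["id"], parent["id"])]
        if candidates:
            negative = rng.choice(candidates)
            pairs.append((negative["text"], m["text"], 1))
    return pairs


# Toy chat with one real reply link.
messages = [
    {"id": 1, "text": "Which router runs OpenWrt well?"},
    {"id": 2, "text": "Mine works fine with it.", "reply_to": 1},
    {"id": 3, "text": "Good morning, everyone."},
]
pairs = build_pairs(messages)
```

Because the negative is drawn at random, nothing prevents it from being a message that would also make sense as a parent of the reply, which is the source of the label noise mentioned in the description.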

Files

Steps to reproduce

See the paper for more details.

Categories

Natural Language Processing, Russian Language, Natural-Language Understanding

Licence