Published: 23 September 2022 | Version 1 | DOI: 10.17632/fj5pbdf95t.1
Antonius Rachmat C


This dataset contains post-comment pairs collected from 13 selected Indonesian public figures (artists) with public accounts that have more than 15 million followers and are categorized as famous artists. The pairs were collected from Instagram using an online tool and Selenium. Two people labeled every pair as experts, for a total of 72,874 records. The data consists of Unicode text (UTF-8), including emojis, scraped from posts and comments; it includes no account profile information. It contains the following fields:
-igid: Account ID
-comment: Comment on a post
-post: Post from an ID
-emoji: Whether the data contains emojis or not (1 or 0)
-spam: Whether the data is spam or not (1 or 0)
-lengthcomment: The character length of the comment
-lengthpost: The character length of the post
-countemojicomment: Number of emoji symbol characters in the comment
-countemojicommentuniq: Number of unique emoji symbol characters in the comment
-countemojipost: Number of emoji symbol characters in the post
-countemojipostuniq: Number of unique emoji symbol characters in the post
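As an illustration of how the fields listed above might be read and sanity-checked, here is a minimal sketch using only the Python standard library. The column names come from the field list; the column order, the two sample rows, the `acc01` ID, and the helper names `parse_rows` and `length_mismatches` are assumptions made for this example, and the real files may differ (for instance, the length fields may not be counted in Unicode code points as Python's `len` does).

```python
import csv
import io

# Numeric columns, named as in the field list above.
INT_FIELDS = (
    "emoji", "spam", "lengthcomment", "lengthpost",
    "countemojicomment", "countemojicommentuniq",
    "countemojipost", "countemojipostuniq",
)

# Invented sample rows (not taken from the dataset); column order is assumed.
SAMPLE_CSV = (
    "igid,comment,post,emoji,spam,lengthcomment,lengthpost,"
    "countemojicomment,countemojicommentuniq,countemojipost,countemojipostuniq\n"
    "acc01,Keren banget \U0001F60D,Liburan di Bali,1,0,14,15,1,1,0,0\n"
    "acc01,Promo pulsa murah,Liburan di Bali,0,1,17,15,0,0,0,0\n"
)


def parse_rows(handle):
    """Read CSV rows into dicts and convert the numeric columns to int."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for name in INT_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


def length_mismatches(rows):
    """Return indexes of rows whose stored lengths differ from len() of the text."""
    return [
        i for i, row in enumerate(rows)
        if len(row["comment"]) != row["lengthcomment"]
        or len(row["post"]) != row["lengthpost"]
    ]


rows = parse_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows;", sum(r["spam"] for r in rows), "labeled spam")
print("rows with length mismatches:", length_mismatches(rows))
```

To run the same check on a real file, `io.StringIO(SAMPLE_CSV)` could be replaced with `open(path, encoding="utf-8", newline="")`, where `path` points to one of the CSV files in the analyzed folder.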


Steps to reproduce

The data was collected from Indonesian-language posts and comments. We used the dataset to detect spam comments based on the relevance between a comment and its post. The text also contains many emoji symbols, as is typical of social media content. Pre-processing, normalization, machine learning, and deep learning methods can be applied to train models on this dataset and then classify comments as spam or not spam. The dataset contains two folders: a raw folder, in CSV and XLSX format, that contains only the comment and post pairs, and an analyzed folder, also in CSV and XLSX format, that contains more attributes.
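The idea of judging a comment by its relevance to the post can be sketched with a very simple word-overlap score. This is not the method used to build or evaluate the dataset; it is a hypothetical baseline, and the function names, the Jaccard measure, the `0.1` threshold, and the sample sentences are all assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_relevance(comment, post):
    """Jaccard overlap between comment and post tokens, from 0.0 to 1.0."""
    comment_tokens = tokenize(comment)
    post_tokens = tokenize(post)
    if not comment_tokens or not post_tokens:
        return 0.0
    shared = comment_tokens & post_tokens
    combined = comment_tokens | post_tokens
    return len(shared) / len(combined)


def looks_like_spam(comment, post, threshold=0.1):
    """Flag a comment as likely spam when its relevance is below the threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return jaccard_relevance(comment, post) < threshold


post = "Liburan di Bali bersama keluarga"
for comment in ("Bali memang indah", "Promo pulsa murah cek bio"):
    print(comment, "->", looks_like_spam(comment, post))
```

A score like this treats emoji-only comments as irrelevant, because `\w+` matches no emoji characters. In practice the emoji count fields and the expert `spam` labels would be better inputs for a trained classifier than a fixed threshold.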


Universitas Gadjah Mada, Universitas Kristen Duta Wacana Fakultas Teknologi Informasi


Natural Language Processing, Applied Computer Science, Text Mining