Dataset and IndoBERTweet-CRF Model for Indonesian Cryptocurrency Named Entity Recognition

Published: 1 June 2026 | Version 1 | DOI: 10.17632/rtw4d5c6hj.1
Contributor:
Jonathan Davin

Description

This dataset contains annotated Indonesian tweets for Named Entity Recognition (NER) in the cryptocurrency domain. It includes a Gold Standard set of 1,461 tweets manually annotated with the BIO scheme and a Silver Standard set of 27,226 tweets annotated through distant supervision using a Local Knowledge Base (LKB) derived from CoinGecko. The repository also includes the compiled LKB files (ambiguous and unambiguous) and the fine-tuned IndoBERTweet-CRF model weights, which achieved an F1-score of 70.76% at detecting cryptocurrency entities in informal social media text.
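
To illustrate how distant supervision with a knowledge base can produce BIO tags, here is a minimal sketch in Python. It is not the pipeline used to build the Silver Standard set: the function name `bio_tag`, the label name `CRYPTO`, whitespace tokenisation, greedy longest-match lookup, and the three lexicon entries are all assumptions made for illustration, and the handling of ambiguous LKB entries is not modelled.

```python
def bio_tag(tokens, lexicon, label="CRYPTO", max_len=3):
    """Assign BIO tags by greedy longest-match lookup in a lexicon.

    `lexicon` holds lowercase entity names; multi-word names are
    stored with single spaces between words.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            tags[i] = f"B-{label}"          # first token of the entity
            for j in range(i + 1, i + matched):
                tags[j] = f"I-{label}"      # continuation tokens
            i += matched
        else:
            i += 1                          # token stays outside any entity
    return tags


# Hypothetical lexicon entries standing in for unambiguous LKB terms.
lexicon = {"bitcoin", "btc", "shiba inu"}
tokens = "harga shiba inu naik tapi btc turun".split()
tags = bio_tag(tokens, lexicon)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each output line pairs a token with its tag: a `B-` tag opens an entity, `I-` tags continue it, and `O` marks tokens outside any entity. Plain dictionary matching of this kind cannot tell when a coin name is also an everyday word, which is presumably why the LKB is distributed as separate ambiguous and unambiguous files.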

Categories

Computer Science, Artificial Intelligence, Natural Language Processing

Licence