IDTheftCase-JudgmentCorpus: Indonesian Theft Case Judgment Corpus - Levels of Court

Published: 9 April 2025 | Version 3 | DOI: 10.17632/48x9xm7rkf.3

Description

IDTheftCase-JudgmentCorpus: Indonesian Theft Case Judgment Corpus - Levels of Court is a dataset containing the full text of written judgments handed down by Indonesian courts in criminal theft cases at three levels: the court of first instance, the appellate court, and the cassation court. The dataset was created to support research and development in information extraction and Natural Language Processing (NLP), specifically the processing and understanding of legal texts and court documents. All annotated entity names have been standardized and translated into English, making the dataset more suitable for international NLP research and for developing multilingual or cross-lingual models.

Available Annotated Files:
• 1-first-instance.csv – Contains tokenized and BIO-tagged court decisions from the district courts (Pengadilan Negeri).
• 2-appellate.csv – Contains tokenized and BIO-tagged decisions from the appellate courts (Pengadilan Tinggi).
• 3-cassation.csv – Contains tokenized and BIO-tagged decisions from the cassation level (Mahkamah Agung).
• metadata.csv – Contains contextual and hierarchical information about the judgment documents, structured into the following columns:
  o decision_id
  o original_id
  o court_level
  o court_name
  o year
  o verdict_type
  o cross-referenced case identifiers (first_id, appellate_id, cassation_id)

Entity Annotations:
The dataset is annotated using a BIO tagging format, identifying more than 56 types of legal entities that appear in court documents. All entity labels are expressed in English, covering information such as:
• Parties and roles: Defendant, Lawyer, Prosecutor, Witness, PresidingJudge
• Legal process: ProsecutionDate, DecisionDate, ArrestDate, CassationReason
• Legal references: ChargeArticles, ProsecutionArticles, CourtRuling, DecisionCosts
• Case identifiers and metadata: DecisionNumber, ChargeType, CaseLevel, IncidentLocation

All documents in this dataset were obtained from public records on the official website of the Supreme Court of the Republic of Indonesia (https://putusan3.mahkamahagung.go.id/). As such, the dataset represents real-world cases and reflects the form that Indonesian court documents take in practice.

IDTheftCase-JudgmentCorpus supports research in named entity recognition and extraction, analysis of sentencing patterns, and automatic document classification in the Indonesian legal context. The dataset is also useful for developers and researchers who aim to build machine learning models that extract, group, and analyze judgment documents at different court levels.
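As a rough illustration of how BIO-tagged tokens can be turned back into entity mentions, the sketch below collapses parallel token and tag sequences into (label, text) spans. It is not taken from the dataset's own tooling. The column names `token` and `tag`, the `B-Label` / `I-Label` tag spelling, and the sample rows are all assumptions made for this example; check the actual CSV headers before adapting it.

```python
import csv
import io
from typing import Iterable, List, Tuple


def bio_to_spans(tokens: Iterable[str], tags: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse parallel token/BIO-tag sequences into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label = None   # label of the entity currently being built, if any
    words: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            # Close the open entity (if any) before deciding what comes next.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
            # "B-X" starts an entity; a stray "I-X" is leniently treated as a start.
            if prefix in ("B", "I") and name:
                label = name
        if label is not None:
            words.append(token)
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Invented sample rows; the column names "token" and "tag" are hypothetical.
SAMPLE = """token,tag
Terdakwa,O
Budi,B-Defendant
Santoso,I-Defendant
diputus,O
pada,O
12,B-DecisionDate
Mei,I-DecisionDate
2020,I-DecisionDate
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
spans = bio_to_spans((r["token"] for r in rows), (r["tag"] for r in rows))
print(spans)
```

To run this against one of the annotated files, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust the two column names to match the real header row.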

Categories

Artificial Intelligence, Text Extraction, Text Processing, Corpus Analysis

Funding

Direktorat Riset Dan Pengabdian Kepada Masyarakat

0459/E5/PG.02.00/2024

Licence