NER dataset related to legal texts

Published: 16 September 2024 | Version 1 | DOI: 10.17632/scpttyz6t5.1
Contributor:
Tianyue Huang

Description

This dataset supports Named Entity Recognition (NER) on legal judgment documents related to the crime of assisting in information network crimes. It contains 4,236 samples in total, covering both training and validation data, annotated with 8 labels. The file train1.json holds the raw data in JSON format and is not divided into training and validation sets. The ner_data folder holds the processed data as .txt files, split into training and validation sets at a ratio of 5:1, and also includes the full list of label names. The processed data in ner_data is what the model is trained on.
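As a rough illustration of the 5:1 split described above, the following sketch shuffles a list of samples and holds out one sixth for validation. The function name `split_samples`, the fixed seed, and the use of random shuffling are assumptions made for this example; the description does not say how the published split was actually produced.

```python
import random


def split_samples(samples, ratio=(5, 1), seed=42):
    """Shuffle samples and split them into (train, validation) at the given ratio.

    Hypothetical helper: the dataset description only states the 5:1 ratio,
    not the procedure used to create the published split.
    """
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = len(shuffled) * ratio[1] // sum(ratio)
    return shuffled[n_val:], shuffled[:n_val]


# Integer placeholders stand in for the 4,236 annotated samples.
train, val = split_samples(range(4236))
print(len(train), len(val))  # 3530 706
```

With 4,236 samples, a 5:1 ratio gives 3,530 training samples and 706 validation samples.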

Steps to reproduce

The source documents were downloaded from the Peking University Law Database and consist of legal judgment documents from recent years related to the crime of assisting in information network crimes. Preprocessing these documents yielded a total of 4,236 data samples. The important entity types in the text were then identified, and the samples were manually annotated with the text annotation tool Doccano.
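To show how span annotations of the kind Doccano produces can be turned into token-level NER tags, here is a minimal sketch that converts character-offset spans into per-character BIO tags. Everything specific in it is an assumption: the `[start, end, label]` span layout, the BIO scheme, the label names `PER` and `CRIME`, and the sample sentence are illustrative and are not taken from train1.json or the ner_data folder.

```python
def spans_to_bio(text, spans):
    """Convert [start, end, label] character spans into per-character BIO tags.

    Assumes end offsets are exclusive. Invalid or overlapping spans are skipped.
    """
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans):
        if start < 0 or end > len(text) or start >= end:
            continue  # span falls outside the text or is empty
        if any(tag != "O" for tag in tags[start:end]):
            continue  # keep the earlier span when two spans overlap
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Made-up sentence and hypothetical labels, for illustration only.
text = "被告人张三犯帮助信息网络犯罪活动罪"
spans = [[3, 5, "PER"], [6, 17, "CRIME"]]
tags = spans_to_bio(text, spans)
for char, tag in zip(text, tags):
    print(char, tag)
```

Tagging per character avoids relying on a word segmenter, which is a common choice for Chinese text; the processed .txt files in the dataset may use a different scheme.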

Institutions

Shandong University of Science and Technology

Categories

Law, Natural Language Processing, Recognition

Licence