Published: 5 February 2021 | Version 2 | DOI: 10.17632/yzyxtpcnxm.2
Bruno Leme


Datasets of economic-featured documents for legal defense of merger cases. The target label for each document is the legal procedural rite that the process must follow. To support k-fold cross-validation with k = 5, we randomly generated the train-valid index combinations in advance; for example, valid_index_final_split_1 is the validation set paired with the training set train_index_final_split_1. The documents were collected from the website of The Administrative Council of Economic Defense (CADE).


Steps to reproduce

All files are Python pickle files. X_train_final and X_test_final are the input training and test datasets, each holding documents as lists of word ids. Y_train_final and Y_test_final are the corresponding target labels (legal procedural rite). The dictionary of ids and words is in the file word_index_final.

The 5 combinations of train-valid index splits (k-fold cross-validation) are named train_index_final_split_1 and valid_index_final_split_1 (for the 1st split), train_index_final_split_2 and valid_index_final_split_2 (for the 2nd split), and so on. Each pair of train and valid indexes is applied to X_train_final and Y_train_final to split that data into training and validation sets.

The following example shows how to load the files in Python.

...
import pickle

with open('X_train_final', 'rb') as file:
    X_train = pickle.load(file)
with open('Y_train_final', 'rb') as file:
    Y_train = pickle.load(file)
with open('X_test_final', 'rb') as file:
    X_test = pickle.load(file)
with open('Y_test_final', 'rb') as file:
    Y_test = pickle.load(file)
with open('word_index_final', 'rb') as file:
    word_index = pickle.load(file)
with open('train_index_final_split_1', 'rb') as file:
    train_index = pickle.load(file)
with open('valid_index_final_split_1', 'rb') as file:
    valid_index = pickle.load(file)

# splitting train data between train and valid datasets
# (take the validation rows first, while X_train and Y_train still hold the full data)
X_valid = X_train[valid_index]
Y_valid = Y_train[valid_index]
X_train = X_train[train_index]
Y_train = Y_train[train_index]
...
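To run all five predefined splits rather than only the first, the loading steps can be wrapped in a small loop. The following is a minimal sketch, not part of the published dataset: the helper names load_pickle, take and iter_folds and the data_dir parameter are hypothetical, and it assumes the pickle files sit together in one directory under the file names listed above. Because the container type of the pickled objects is not stated here, take handles both array-style objects and plain Python lists.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickle file from the dataset directory."""
    with open(path, "rb") as file:
        return pickle.load(file)


def take(data, indexes):
    """Select items by position from an array-style object or a plain list."""
    if hasattr(data, "shape"):
        # Array-style indexing, as used in the original loading example.
        return data[indexes]
    # Fallback for plain lists (assumption: the indexes are integer positions).
    return [data[int(i)] for i in indexes]


def iter_folds(data_dir=".", k=5):
    """Yield (fold, X_tr, Y_tr, X_val, Y_val) for each predefined split."""
    data_dir = Path(data_dir)
    X_full = load_pickle(data_dir / "X_train_final")
    Y_full = load_pickle(data_dir / "Y_train_final")
    for fold in range(1, k + 1):
        train_index = load_pickle(data_dir / f"train_index_final_split_{fold}")
        valid_index = load_pickle(data_dir / f"valid_index_final_split_{fold}")
        # Always index the full training data so no fold overwrites another.
        yield (
            fold,
            take(X_full, train_index),
            take(Y_full, train_index),
            take(X_full, valid_index),
            take(Y_full, valid_index),
        )
```

A typical use would be `for fold, X_tr, Y_tr, X_val, Y_val in iter_folds("path/to/dataset"):` followed by fitting a model on the training part and scoring it on the validation part, where "path/to/dataset" is a placeholder for the download location. Keeping X_test_final and Y_test_final out of the loop reserves the test set for a single final evaluation.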


Universidade de Sao Paulo


Natural Language Processing