MLPA-400

Published: 20 April 2019 | Version 1 | DOI: 10.17632/mvkcpkx9ww.1
Contributors:
Dainis Boumber

Description

We consider a realistic problem of multi-label authorship attribution (AA) in the realm of scientific publications by creating a publicly available dataset consisting of 400 Machine Learning papers, Machine Learning Papers' Authorship 400 (MLPA-400). To the best of our knowledge, multi-label AA of scientific publications has not received much attention. It deserves more, because automatic resolution of authorship issues in papers has a variety of downstream applications in intellectual property management, citation analysis, archival systems, and author disambiguation. The task is challenging: papers have many authors whose writing style can evolve or be influenced by colleagues; they contain direct quotes from other works; each author's contribution to a paper, in terms of the amount of text written, is unknown; and the number of papers and authors is large.

Considerations

Many approaches to creating a suitable corpus exist. For example, papers can be chosen across domains. However, even within one domain the stylistic differences between venues are significant enough to make individual style hard to detect. A random sample of authors could be taken, but it would yield few multi-labeled documents. Another possibility is taking the transitive closure of the set of co-authors and extracting at least k papers per author; however, for any reasonable k this results in a very large transitive set.

Design

Using Google Scholar as a source, we created a list of the top 20 authors in Machine Learning, ranked by the number of citations. We ensured a reasonable number of papers had an overlap of authors (i.e., we also included papers jointly authored by authors in the set). For each author, 20 papers were downloaded, for a total of 400 publications in the dataset. Each work is assigned 20 binary labels indicating which of the 20 authors contributed to the paper's creation. 100 of the 400 papers have more than one author from the 20 listed.
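The labeling scheme described above can be sketched as follows. The author names and paper assignments here are hypothetical placeholders, not taken from the dataset, which ships its own labels:

```python
# Each paper carries a 20-dimensional binary label vector, one slot per author.
# Names below are placeholders; the real dataset defines its own 20 authors.
AUTHORS = [f"author_{i:02d}" for i in range(20)]
AUTHOR_INDEX = {name: i for i, name in enumerate(AUTHORS)}

def make_label(contributors):
    """Return the 20-bit label vector marking which listed authors wrote the paper."""
    label = [0] * len(AUTHORS)
    for name in contributors:
        label[AUTHOR_INDEX[name]] = 1
    return label

# A single-author paper and a jointly authored one (hypothetical assignments):
solo = make_label(["author_03"])
joint = make_label(["author_03", "author_07", "author_12"])
```

A paper is multi-labeled exactly when its vector sums to more than one, which holds for 100 of the 400 papers.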
The number of listed authors per paper ranged from 1 to 3, with an average of 1.2925. The text was extracted from the PDF files using pdfminer (Shinyama, 2017) and pre-processed. The title, authorship information, and bibliography were removed from each paper so that the classifier abides by the rules of blind review instead of simply reading the author list while learning authorship. Formulas, table captions, and figure captions were retained, as they may contain valuable author-specific style and topic information.

The dataset is available here: URL to repo

If you find this dataset useful, please cite as follows:

Dainis Boumber, Yifan Zhang and A. Mukherjee. "Experiments with convolutional neural networks for multi-label authorship attribution." Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, 2018. European Language Resources Association (ELRA).
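The exact pre-processing code is not reproduced in this description. A rough sketch of the bibliography-stripping step, under the assumption that references begin at a "References" or "Bibliography" heading on its own line, might look like:

```python
def strip_references(text):
    """Drop everything from the first 'References'/'Bibliography' heading onward.

    Heuristic stand-in for the actual pre-processing; the heading names and
    newline-anchored matching are assumptions, not the published pipeline.
    """
    lowered = text.lower()
    cut = len(text)
    for heading in ("\nreferences", "\nbibliography"):
        pos = lowered.find(heading)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].rstrip()

# Example: the reference list is removed, the body is kept.
paper = "Intro\nBody text.\nReferences\n[1] Some paper."
cleaned = strip_references(paper)  # "Intro\nBody text."
```

Title and author-block removal would need similar heuristics (or PDF layout information from pdfminer), since their position varies by venue template.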

Files

Steps to reproduce

Code available: https://github.com/dainis-boumber/AA_CNN

Categories

Computational Linguistics, Document Analysis, Natural Language Processing, Human Identification

Licence