Research data supporting "Machine learning in the processing of historical census data"

Published: 23-01-2020 | Version 1 | DOI: 10.17632/p4zptr98dh.1
Contributors:
Piero Montebruno,
Robert J. Bennett,
Harry J. Smith,
Carry van Lieshout

Description

This collection contains ground-truth (gold-standard) datasets for the problem of reconstructing employment status in historical census data. Different machine learning methods can be tested and compared on these datasets, as described in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C., an outcome of the ESRC project ES/M010953: Drivers of Entrepreneurship and Small Businesses, led by PI Prof. Robert J. Bennett.

The material consists of three raw text files (files 1 and 2 are random samples). The census identifier for individuals (RecID) is not included, so the datasets are fully anonymised and individuals cannot be tracked in any of the files. The files are described below.

1. "1891 1000 Ent". 1891 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own account business proprietors) and 1,000 labelled workers. Labelling derives from the known employment status reported on the night of the Census, which is available for the later 1891-1911 censuses, using the crosses recorded in the columns of the 1891 Census Enumerators' Books (CEBs).

2. "1851 1000 Ent". 1851 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own account) and 1,000 labelled workers. Labelling is by clerical control of the occupational strings for the extracted Groups of business proprietors in the 1851 Census.

3. "1851 MAX(Extracted)". 1851 Census of England and Wales, economically active individuals: 70,872 labelled Entrepreneurs (35,436 labelled Employers and 35,436 labelled Own account) and 70,872 labelled workers. This is the maximum possible balanced dataset, drawn from all the Employers and Own account proprietors identified by extracted Groups (1 for Employers; 3 and 5 for Own account). Labelling is by clerical control of the occupation strings for the extracted Groups of the 1851 Census.

The key variable OccString, which holds the full occupation strings, is also included. A detailed explanation of how these datasets were obtained, and of how to use them for machine learning reconstruction of employment status in historical census data, can be found in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C. (2020), Information Processing & Management.

This dataset should be cited as: Montebruno, Piero; Bennett, Robert J.; Smith, Harry J.; van Lieshout, Carry (2020), “Research data supporting "Machine learning in the processing of historical census data"”, Mendeley Data, http://dx.doi.org/10.17632/p4zptr98dh.1
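
As a rough illustration of the balanced composition described above (for example, 500 Employers + 500 Own account = 1,000 Entrepreneurs against 1,000 workers), the Python sketch below counts labels in a delimited text file and checks that balance. The file layout is not specified in this description, so the tab delimiter, the `Status` column name, its label values, and the toy rows are all assumptions made for illustration. Only OccString is a variable name taken from the description.

```python
import csv
import io
from collections import Counter

# Toy stand-in for one of the raw text files. These rows are invented for
# illustration only. The real delimiter, header names, and label coding are
# NOT given in the description: "Status" and its values are assumptions.
SAMPLE = (
    "OccString\tStatus\n"
    "EXAMPLE EMPLOYER STRING\tEmployer\n"
    "EXAMPLE OWN ACCOUNT STRING\tOwn account\n"
    "EXAMPLE WORKER STRING A\tWorker\n"
    "EXAMPLE WORKER STRING B\tWorker\n"
)


def class_counts(handle, label_column="Status", delimiter="\t"):
    """Count rows per label value in a delimited text file with a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row[label_column] for row in reader)


def is_balanced(counts, employer="Employer", own_account="Own account",
                worker="Worker"):
    """Check the balance described for the datasets: equal numbers of
    Employers and Own account, and as many workers as both combined."""
    entrepreneurs = counts[employer] + counts[own_account]
    return (counts[employer] == counts[own_account]
            and entrepreneurs == counts[worker])


# For a real file, replace io.StringIO(SAMPLE) with open(path, newline="")
# and adjust the assumed column name and delimiter to match the file.
counts = class_counts(io.StringIO(SAMPLE))
balanced = is_balanced(counts)
print(dict(counts), balanced)
```

Applied to the third file under these assumptions, the same check would correspond to 35,436 + 35,436 = 70,872 Entrepreneurs against 70,872 workers.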

Files