Research data supporting "Machine learning in the processing of historical census data"

Published: 23 Jan 2020 | Version 1 | DOI: 10.17632/p4zptr98dh.1

Description of this data

This collection contains ground-truth (gold standard) datasets for the problem of reconstructing employment status in historical census data. Different machine learning methods can be tested and compared on these datasets, as described in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C., an outcome of the ESRC project ES/M010953: Drivers of Entrepreneurship and Small Businesses, led by PI Prof. Robert J. Bennett.

The material consists of three raw text files (files 1 and 2 are random samples). The census identification variable for individuals (RecID) is not given, so the datasets are fully anonymised and it is not possible to track individuals in any of the files. The files are described below:

1. "1891 1000 Ent". 1891 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own account business proprietors) and 1,000 labelled workers. Labelling derives from the employment status reported on the night of the Census, which is known for the later 1891-1911 censuses, using the crosses reported in the columns of the 1891 Census Enumerators' Books (CEBs).
2. "1851 1000 Ent". 1851 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own account business proprietors) and 1,000 labelled workers. Labelling was done by clerical control of the occupation strings for the extracted Groups of business proprietors in the 1851 Census.
3. "1851 MAX(Extracted)". 1851 Census of England and Wales, economically active individuals: 70,872 labelled Entrepreneurs (35,436 labelled Employers and 35,436 labelled Own account business proprietors) and 70,872 labelled workers. This is the largest possible balanced dataset that can be drawn from all the Employers and Own account business proprietors identified by the extracted Groups (1 for Employers, and 3 and 5 for Own account). Labelling was done by clerical control of the occupation strings for the extracted Groups of the 1851 Census. The key variable OccString, which holds the full occupation strings, is also included.
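The composition of the three files can be checked with a short calculation. The following is a minimal sketch in Python that uses only the counts stated in the descriptions above; the class name Composition, its field names, and the FILES dictionary are hypothetical and chosen for illustration. It does not read the data files, whose delimiter and column layout are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Composition:
    """Label counts for one file, copied from the dataset description."""
    employers: int
    own_account: int
    workers: int

    @property
    def entrepreneurs(self) -> int:
        # Entrepreneurs are the employers plus the own-account proprietors.
        return self.employers + self.own_account

    @property
    def total(self) -> int:
        return self.entrepreneurs + self.workers

    @property
    def is_balanced(self) -> bool:
        # Balanced at both levels: entrepreneurs vs. workers,
        # and employers vs. own account within the entrepreneurs.
        return (self.entrepreneurs == self.workers
                and self.employers == self.own_account)


# Counts as stated in the description (the keys are the file labels).
FILES = {
    "1891 1000 Ent": Composition(500, 500, 1_000),
    "1851 1000 Ent": Composition(500, 500, 1_000),
    "1851 MAX(Extracted)": Composition(35_436, 35_436, 70_872),
}

for name, comp in FILES.items():
    print(f"{name}: {comp.total} records, balanced={comp.is_balanced}")
```

Under the stated counts, the two sampled files hold 2,000 records each and the maximal file holds 141,744, with every file balanced between entrepreneurs and workers.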

A detailed explanation of how these datasets were obtained, and of how to use them for the machine learning reconstruction of employment status in historical census data, can be found in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C. (2020), Information Processing & Management.

This dataset should be cited as:
Montebruno, Piero; Bennett, Robert J.; Smith, Harry J.; van Lieshout, Carry (2020), “Research data supporting "Machine learning in the processing of historical census data"”, Mendeley Data, v1, http://dx.doi.org/10.17632/p4zptr98dh.1

This data is associated with the following publication:

Machine learning classification of entrepreneurs in British historical census data

Published in: Information Processing and Management

Categories

Artificial Intelligence, Entrepreneurship, Machine Learning, Historical Analysis, Deep Learning

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?
You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY licence, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.
