Group Ownership Information from Annual Reports of SIX Swiss Exchange Companies

Published: 21 July 2025 | Version 1 | DOI: 10.17632/39kyzcp9r6.1

Description

This dataset contains group ownership information extracted from annual reports for the year 2021, published by companies listed on the SIX Swiss Exchange. It is used in the article "Discriminative meets generative: Automated information retrieval from unstructured corporate documents via (large) language models".

The labels for page identification are provided as a CSV file ("labels/page_identification_labels.csv"). Each row corresponds to a single page and includes its label. With a few exceptions, the dataset includes only pages that contain group ownership information (i.e., positive cases). Please note that the page numbers in this dataset were extracted using the Python library doc2data (https://pypi.org/project/doc2data/). They may not correspond to the printed page numbers in the original PDF files.

The dataset includes the following attributes:

  • file_name: name of the associated annual report file
  • company: company name
  • symbol: company ticker symbol
  • accounting_rules: accounting standards applied in the annual report (i.e., US GAAP, IFRS, Swiss GAAP FER or Bank Law)
  • foreign_issuer: boolean indicating whether the company is a foreign issuer
  • no_subs: boolean indicating whether the annual report does not contain the target information (i.e., no subsidiaries)
  • dual_lang: boolean indicating whether the pages in the report contain content in two languages
  • language: language of the annual report (i.e., German, English, French or Italian)
  • page_nr: page number within the document to which the label refers
  • page_label: label assigned to the specific page, either primary range, secondary range or not relevant (for the information extraction task we only use the primary range)

The labels for information extraction are provided as a JSON file ("labels/bbox_labels.json"). The labels indicate whether a word, which is defined by its literal text and its bounding box on the page, belongs to a specified category of group ownership information.

The dataset is structured via nested dictionaries in the following way:

  • dictionary of files:
      - [file name as key for each annual report]: dictionary of pages
  • dictionary of pages:
      - [page number as key for each relevant page]: list of dictionaries, each representing a word
  • dictionary per word (four fixed keys):
      - id: unique identifier for each word on the page
      - bbox: bounding box coordinates as four normalized values [left, top, right, bottom] representing the position of the word on the page
      - text: the extracted text content from that bounding box region
      - label: label assigned to the word

The labels correspond to the following classes: LE (name of the legal entity), C (location of the legal entity), Own (total ownership percentage of the issuer in the legal entity), and Other (non-relevant words).

Finally, we also publish the exact data split ("data_split.csv") that was used for model training and evaluation in the associated article.
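
The nested structure of "labels/bbox_labels.json" described above could be traversed roughly as in the following sketch. It is an illustration only, not code from the associated article: the sample file name, page key, word values and the helper names load_labels and words_by_label are invented here, and the type of the id field is assumed. One point that does follow from the JSON format itself is that object keys are always strings, so page numbers arrive as strings after loading.

```python
import json
from collections import defaultdict

# Tiny in-memory sample that mimics the documented nesting:
# file name -> page number -> list of word dictionaries.
# File name, page key and all values are invented for illustration.
sample = {
    "example_report.pdf": {
        "12": [
            {"id": 0, "bbox": [0.10, 0.20, 0.18, 0.22], "text": "Example", "label": "LE"},
            {"id": 1, "bbox": [0.19, 0.20, 0.23, 0.22], "text": "AG", "label": "LE"},
            {"id": 2, "bbox": [0.40, 0.20, 0.48, 0.22], "text": "Zurich", "label": "C"},
            {"id": 3, "bbox": [0.70, 0.20, 0.74, 0.22], "text": "100", "label": "Own"},
            {"id": 4, "bbox": [0.10, 0.90, 0.14, 0.92], "text": "Page", "label": "Other"},
        ]
    }
}


def load_labels(path="labels/bbox_labels.json"):
    """Load the label file from disk (path taken from the description)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def words_by_label(data):
    """Group word texts by (file, page, label), skipping the 'Other' class."""
    grouped = defaultdict(list)
    for file_name, pages in data.items():
        for page_nr, words in pages.items():
            for word in words:
                if word["label"] != "Other":
                    key = (file_name, page_nr, word["label"])
                    grouped[key].append(word["text"])
    return dict(grouped)


# Works on the sample here; with the real file one would call
# words_by_label(load_labels()) instead.
result = words_by_label(sample)
for key, texts in result.items():
    print(key, texts)
```

Grouping by label per page is only one possible view. Assembling complete entity records (name, location, ownership share) would additionally require using the bbox coordinates to decide which words belong to the same row, which this sketch does not attempt.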

Steps to reproduce

The companies listed on the SIX Swiss Exchange were obtained from https://www.six-group.com/de/market-data/shares/companies.html (accessed in May 2022). Annual reports for the fiscal year 2021 (folder "annual_reports") were downloaded in PDF format from the websites of the companies. The reports were downloaded in the languages in which they were available, as firms could report in multiple languages. In some instances, our target information is in a different language from the rest of the report (e.g., the annual report is in German, but the page on business group information is in English and equivalent to the annual report published in English). To avoid duplicates, we do not keep those reports. Overall, we downloaded 354 annual reports. For our analysis, we excluded annual reports of companies that are not primarily listed on the SIX Swiss Exchange (i.e., foreign issuers). This results in 319 reports, which is the number reported in the article. Naturally, reports that do not contain our target information (i.e., no subsidiaries) can only be utilized as negative samples.
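
The exclusion of foreign issuers and the identification of negative-only reports could be reproduced from the page identification CSV along the lines of the sketch below. It rests on several assumptions that should be checked against the actual file: that boolean columns are stored as strings such as "True"/"False", that every report appears in the CSV at least once, and that the page_label strings match the wording in the description. The sample rows and the helper names as_bool and summarize_reports are invented for illustration, and the sketch makes no claim that the real file yields exactly the counts stated above.

```python
import csv
import io

# Invented sample rows using a subset of the documented column names.
SAMPLE_CSV = """file_name,company,symbol,foreign_issuer,no_subs,page_nr,page_label
a_2021.pdf,A AG,AAA,False,False,10,primary range
a_2021.pdf,A AG,AAA,False,False,11,secondary range
b_2021.pdf,B Ltd,BBB,True,False,55,primary range
c_2021.pdf,C SA,CCC,False,True,1,not relevant
"""


def as_bool(value):
    """Interpret common textual encodings of a boolean (assumed encoding)."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def summarize_reports(rows):
    """Return (reports kept after dropping foreign issuers, negative-only reports)."""
    # The CSV has one row per page, so collapse to one entry per report first.
    reports = {}
    for row in rows:
        reports.setdefault(row["file_name"], row)
    kept = {
        name: row
        for name, row in reports.items()
        if not as_bool(row["foreign_issuer"])
    }
    negatives = sorted(name for name, row in kept.items() if as_bool(row["no_subs"]))
    return sorted(kept), negatives


# For the real data, replace io.StringIO(SAMPLE_CSV) with
# open("labels/page_identification_labels.csv", newline="", encoding="utf-8").
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept, negatives = summarize_reports(rows)
print(len(kept), "reports kept;", len(negatives), "usable only as negative samples")
```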

Institutions

  • Albert-Ludwigs-Universität Freiburg
  • Zürcher Hochschule für Angewandte Wissenschaften

Categories

Document Management Processing, Accounting Information System, Business Informatics, Ownership Structure, Document Layout Analysis
