Portuguese Financial, Legal, and Property Documents Dataset
Description
This dataset was developed to support research on automatic document classification with Large Language Models (LLMs) in the financial and banking domain. The underlying research hypothesis is that a dataset that is diverse in document structure, origin, and linguistic variation can enable reliable evaluation of LLM-based classification pipelines despite its relatively small size. The data is intended to show how well modern language models distinguish between document types that share overlapping terminology, complex structure, or noisy text, all of which are common conditions in real-world financial institutions.

The dataset contains 270 publicly available Portuguese PDF documents, evenly distributed across nine classes (30 documents each). These classes represent document types commonly handled in financial and administrative processes: Annual Report, Trial Balance, Commercial Registry Certificate, Deed, Property Valuation Report, Land Registry Certificate, Property and Urban Register, Energy Certificate, and a final Other category. The “Other” class includes documents that either partially resemble the main categories or present unique characteristics, which makes it possible to evaluate how classification models handle out-of-distribution content.

A notable feature of the dataset is its heterogeneity: it includes born-digital PDFs, digitized and scanned documents, and documents containing handwritten notes or annotations, which introduce realistic noise and OCR challenges. This variation reflects what is commonly found in typical banking workflows and also increases the relevance of the dataset for studying preprocessing and extraction strategies in LLM-based pipelines.

Preliminary examination of the dataset highlights clear structural and semantic patterns across classes, for example the tabular and numerical structure of Trial Balances versus the legal and narrative style of Deeds.
At the same time, it underscores subtle distinctions between similar categories such as Land Registry Certificates and Property and Urban Registers, where classification may rely on contextual cues rather than simple keyword matching. These characteristics make the dataset a useful testbed for LLM-based document understanding tasks.

Researchers can use this dataset in several ways: to train and evaluate LLM-based solutions; to benchmark OCR + LLM pipelines on mixed-quality PDFs; to analyze linguistic and structural ambiguity between related financial and legal documents; and to develop end-to-end document understanding systems, including text extraction, summarization, or metadata tagging.

Overall, this dataset offers a compact yet realistic sample of financial and legal documents encountered in practice and is well suited for research on document classification, OCR robustness, and LLM-based document processing.
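
As one illustration of the evaluation use case mentioned above, the following minimal Python sketch scores a classifier's predictions against the nine classes. The label strings follow the class names in this description. The function name `evaluate` and the toy labels are hypothetical, and the sketch assumes nothing about how the files are organized or how predictions are produced.

```python
from collections import Counter

# Class names as listed in the dataset description.
CLASSES = [
    "Annual Report",
    "Trial Balance",
    "Commercial Registry Certificate",
    "Deed",
    "Property Valuation Report",
    "Land Registry Certificate",
    "Property and Urban Register",
    "Energy Certificate",
    "Other",
]


def evaluate(true_labels, predicted_labels):
    """Return overall accuracy, per-class recall, and confusion counts.

    Hypothetical helper for illustration; not part of the dataset.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    unknown = (set(true_labels) | set(predicted_labels)) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    # Count every (true, predicted) pair; diagonal entries are correct predictions.
    confusion = Counter(zip(true_labels, predicted_labels))
    support = Counter(true_labels)
    correct = sum(confusion[(c, c)] for c in CLASSES)
    accuracy = correct / len(true_labels) if true_labels else 0.0
    # Recall is only defined for classes that occur in the true labels.
    recall = {c: confusion[(c, c)] / support[c] for c in CLASSES if support[c]}
    return accuracy, recall, confusion


# Toy example with made-up labels (not real results).
y_true = ["Deed", "Deed", "Land Registry Certificate", "Property and Urban Register"]
y_pred = ["Deed", "Other", "Property and Urban Register", "Property and Urban Register"]
accuracy, recall, confusion = evaluate(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.2f}")
```

The off-diagonal entries of `confusion`, for instance the pair (Land Registry Certificate, Property and Urban Register), are where the ambiguity between closely related classes would show up.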