pt-br2libras-gloss

Published: 28 May 2025 | Version 2 | DOI: 10.17632/ryj88ckjww.2
Contributors:

Description

This dataset is a UTF-8 encoded Comma-Separated Values (CSV) file containing a bilingual parallel corpus of 127,349 aligned sentence pairs in Brazilian Portuguese and LIBRAS gloss. The file has four columns:

pt-br: The original sentence in Brazilian Portuguese.

libras-gloss: The corresponding translation in LIBRAS gloss notation. Together with pt-br, it forms the primary aligned pair.

is_government_source: A boolean field indicating whether the source sentence was extracted from an official Brazilian Federal Government website (True) or from a non-governmental source (False).

english_translation: An automatically generated English translation of the Brazilian Portuguese sentence. It is supplementary metadata for general understanding and is not part of the core bilingual alignment.

A total of 55,047 of the sentence pairs originate from government sources. The dataset is primarily intended to support research on bilingual corpora, machine translation, and sign language processing, specifically applications involving Brazilian Portuguese and Brazilian Sign Language (LIBRAS).
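As a rough illustration of how the four columns might be consumed, the sketch below reads rows with Python's standard csv module and keeps only the government-sourced pairs. It makes several assumptions not confirmed by this page: that the header row uses exactly the column names listed in the description, and that the boolean is stored as the text True or False. The helper names read_pairs and government_pairs are hypothetical, and the sample rows are invented placeholders, not sentences from the dataset.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Tuple

# Column names as listed in the dataset description (assumed to match the header row).
COLUMNS = ("pt-br", "libras-gloss", "is_government_source", "english_translation")


def read_pairs(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row, with is_government_source parsed to a bool."""
    reader = csv.DictReader(lines)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        # Assumption: the flag is serialized as the text "True" or "False".
        flag = row["is_government_source"].strip().lower() == "true"
        yield {**row, "is_government_source": flag}


def government_pairs(rows: Iterable[Dict[str, object]]) -> List[Tuple[str, str]]:
    """Return (pt-br, libras-gloss) pairs whose source is a government website."""
    return [
        (str(r["pt-br"]), str(r["libras-gloss"]))
        for r in rows
        if r["is_government_source"]
    ]


# Invented placeholder rows, used only to exercise the functions.
SAMPLE = (
    "pt-br,libras-gloss,is_government_source,english_translation\n"
    "frase um,FRASE UM,True,sentence one\n"
    "frase dois,FRASE DOIS,False,sentence two\n"
)

rows = list(read_pairs(io.StringIO(SAMPLE)))
print(government_pairs(rows))
```

To read the real file instead of the sample, the same function can be given a file object opened with encoding="utf-8" and newline="". If the counts in the description hold, filtering the full file this way should yield 55,047 pairs.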

Files

Institutions

Universidade Federal da Paraíba

Categories

Natural Language Processing, Machine Translation, Sign Language, Portuguese Language

Funding

Secretaria Nacional dos Direitos das Pessoas com Deficiência (SNDPD), Ministério dos Direitos Humanos e da Cidadania, Brasil

Secretaria de Governo Digital (SGD), Ministério da Gestão e da Inovação em Serviços Públicos, Brasil

Licence