Digitization of Old Indonesian Documents using YOLO and OCR

Name: Digitization of Old Indonesian Documents using YOLO and OCR
Creator: Pradah Suherli
Published: 2025-12-25T03:28:15.860Z
Keywords: Transcription, Optical Character Recognition, Object Detection

Suherli, Pradah; Sudiardjo, Kennard; Tjang, Kevin; Anderies, Anderies

doi:10.17632/ms6sjjpzwr.1

Published: 25 December 2025 | Version 1 | DOI: 10.17632/ms6sjjpzwr.1

Contributors:

Pradah Suherli, Kennard Sudiardjo, Kevin Tjang, Anderies Anderies

Description

NOTE: The dataset for TrOCR is available on HuggingFace (NusaAksara): https://huggingface.co/datasets/NusaAksara/NusaAksara/viewer/Image%20Transcription%20(OCR)/train?f%5Bscript%5D%5Bvalue%5D=%27jawa%27&views%5B%5D=image_transcription_ocr

This dataset combines two sources. The first is the Jawa Aksara text transcription data from NusaAksara, used for TrOCR. The second is a collection for YOLOv8 that was manually created and annotated in Roboflow. It consists of document images from the NusaAksara Jawa Aksara Image Segmentation dataset and single-page images of historical documents from the British Library's Endangered Archives Programme (EAP).

Link to NusaAksara (OCR Transcription): https://huggingface.co/datasets/NusaAksara/NusaAksara/viewer/Image%20Transcription%20(OCR)/train?f%5Bscript%5D%5Bvalue%5D=%27jawa%27&row=4707&views%5B%5D=image_transcription_ocr

Link to NusaAksara (OCR Segmentation): https://huggingface.co/datasets/NusaAksara/NusaAksara/viewer/Image%20Segmentation/train?f%5Blanguage%5D%5Bvalue%5D=%27Jawa%27&views%5B%5D=image_segmentation

Link to EAP: https://eap.bl.uk/search

Steps to reproduce

YOLO:
1. The YOLO dataset is a YOLOv8-format export from Roboflow. Upload the dataset folder to Google Drive, then copy the path of that folder in Drive so it can be used for training or manipulation in Google Colab.

TrOCR:
2. The TrOCR dataset is hosted on HuggingFace. Load it in Python with load_dataset from the datasets library, then restrict the train split to the Jawa-script OCR transcription records, as shown below:

    from datasets import load_dataset

    # Load the OCR transcription configuration of NusaAksara
    dataset = load_dataset("NusaAksara/NusaAksara", "Image Transcription (OCR)")
    print(dataset)
    # Keep only the rows whose script is Jawa
    TARGET_SCRIPTS = {"jawa"}
    train_ds = dataset["train"].filter(lambda x: x["script"] in TARGET_SCRIPTS)
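For the YOLO step, it can help to check that the uploaded folder is complete before training starts in Colab. The following is a minimal sketch, not part of the original instructions. It assumes the usual layout of a Roboflow YOLOv8 export (a data.yaml file plus train/images and train/labels folders) and that Drive has already been mounted in Colab, typically with drive.mount("/content/drive") from google.colab. The helper name missing_entries and the folder path are hypothetical.

```python
from pathlib import Path

# Assumed layout of a Roboflow YOLOv8 export; adjust if your export differs.
EXPECTED_ENTRIES = ("data.yaml", "train/images", "train/labels")


def missing_entries(dataset_dir):
    """Return the expected entries that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Hypothetical Drive path; replace it with the folder you uploaded.
    dataset_dir = "/content/drive/MyDrive/yolo_dataset"
    missing = missing_entries(dataset_dir)
    if missing:
        print("Missing from dataset folder:", ", ".join(missing))
    else:
        print("Dataset folder looks complete:", Path(dataset_dir) / "data.yaml")
```

If nothing is reported missing, the path to data.yaml is the value a YOLOv8 training run would normally be pointed at.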

Institutions

Bina Nusantara University
