Financial Operational Guidelines (Petunjuk Operasi Keuangan) of Central Java
Description
This research aims to develop an automated text summarization system for financial Operational Guidelines (POK) documents using the Bidirectional Encoder Representations from Transformers (BERT) model. Financial POK documents, issued by government institutions, are typically long and complex, with dense technical terminology and numerical data, which makes them difficult and time-consuming for stakeholders to read.

To address this issue, the study applies an extractive summarization approach, which selects the most important sentences from the original document without altering them. This approach is chosen to preserve factual accuracy and the integrity of financial information. The model is based on BERT and adapted using the BERTSUM architecture, which improves sentence-level representation by adding special tokens around each sentence and stacking inter-sentence Transformer layers on top of the encoder. The study uses IndoBERT, a pre-trained language model for Indonesian, and fine-tunes it on a dataset of financial POK documents obtained from official government sources such as data.go.id. The model is expected to learn the characteristics of Indonesian financial documents, including bureaucratic writing styles and domain-specific terminology.

The model is evaluated using standard summarization metrics, namely ROUGE-1, ROUGE-2, and ROUGE-L, which measure the overlap between the generated summaries and human-written references. A qualitative evaluation additionally assesses the coherence, readability, and completeness of the generated summaries. Overall, this research contributes to Natural Language Processing (NLP) applications in the Indonesian public sector, particularly by making government financial documents more accessible and faster to understand.
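To make the ROUGE metrics named above concrete, the following is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. It is an illustration, not the evaluation code used in this study: it assumes lowercased whitespace tokenization with no stemming, and the function names (`rouge_n`, `rouge_l`) and the sample sentences are hypothetical. An established ROUGE package may produce slightly different numbers.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens, no stemming or punctuation handling.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    # Harmonic mean of precision and recall; 0.0 when there is no overlap.
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    # Clipped n-gram overlap between candidate and reference.
    cand = ngram_counts(tokenize(candidate), n)
    ref = ngram_counts(tokenize(reference), n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Length of the longest common subsequence, computed row by row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the dataset.
    generated = "anggaran belanja pegawai naik"
    reference = "anggaran belanja pegawai tahun ini naik"
    print(rouge_n(generated, reference, 1))
    print(rouge_n(generated, reference, 2))
    print(rouge_l(generated, reference))
```

ROUGE-1 and ROUGE-2 count shared unigrams and bigrams, while ROUGE-L rewards the longest in-order word sequence shared by both texts, so it is sensitive to sentence-level word order without requiring contiguous matches.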
Steps to reproduce
1. Data Collection: Collect Financial Operational Guidelines (POK) documents from official sources such as data.go.id. Ensure the dataset contains complete and relevant financial documents in Indonesian.
2. Data Preprocessing: Clean the collected data by removing noise, irrelevant symbols, and formatting issues. Perform sentence segmentation and tokenization using NLP libraries (e.g., NLTK or similar tools).
3. Dataset Preparation: Structure the dataset into input-output pairs, where the input is the full document and the output is the reference summary (manual or predefined). Split the dataset into training, validation, and testing sets.
4. Model Initialization: Load the pre-trained IndoBERT model (indobenchmark/indobert-base-p1) as the base model for the summarization task.
5. Model Adaptation (BERTSUM): Modify the BERT architecture by adding special tokens ([CLS], [SEP]) for each sentence and applying inter-sentence Transformer layers to support extractive summarization.
6. Model Training (Fine-Tuning): Fine-tune the model using the prepared dataset. Adjust hyperparameters such as learning rate, batch size, and number of epochs to optimize performance and prevent overfitting.
7. Model Evaluation: Evaluate the trained model using ROUGE-1, ROUGE-2, and ROUGE-L metrics to measure the similarity between generated summaries and reference summaries.
8. Qualitative Analysis: Analyze the generated summaries based on readability, coherence, and completeness. Optionally, involve domain experts (e.g., financial practitioners) for validation.
9. Result Interpretation: Interpret the evaluation results to determine the effectiveness of the model in summarizing financial POK documents.
10. Deployment (Optional): Implement the trained model into a simple system or interface to allow users to automatically generate summaries from new POK documents.
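The preprocessing and BERTSUM adaptation steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the study's implementation: the regex sentence splitter stands in for NLTK, whitespace tokens stand in for IndoBERT's subword tokenizer, and all function names, sample text, and scores are hypothetical. It shows how each sentence is wrapped in [CLS] and [SEP], how alternating segment ids distinguish neighbouring sentences, and how the highest-scoring sentences are returned in their original order.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? when followed by
    # whitespace and an uppercase letter. Amounts like "1.500.000" stay intact.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def build_bertsum_input(sentences):
    # Wrap every sentence in [CLS] ... [SEP] and record where each [CLS] sits;
    # in BERTSUM the vector at each [CLS] position represents that sentence.
    tokens, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        # Whitespace tokens are a stand-in for real subword tokenization.
        sentence_tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        cls_positions.append(len(tokens))
        tokens.extend(sentence_tokens)
        # Interval segment ids: 0 for even-indexed sentences, 1 for odd ones.
        segment_ids.extend([i % 2] * len(sentence_tokens))
    return tokens, segment_ids, cls_positions


def select_top_k(sentences, scores, k):
    # Extractive selection: keep the k best-scoring sentences, unchanged,
    # and return them in document order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    # Hypothetical text and scores; real scores would come from the trained model.
    text = "Belanja pegawai naik 5 persen. Belanja modal turun. Pagu anggaran tetap."
    sentences = split_sentences(text)
    tokens, segment_ids, cls_positions = build_bertsum_input(sentences)
    print(tokens)
    print(segment_ids)
    print(cls_positions)
    print(select_top_k(sentences, [0.2, 0.9, 0.5], k=2))
```

In an actual pipeline, the token list would be converted to vocabulary ids by the IndoBERT tokenizer and truncated to the model's maximum input length, and the per-sentence scores would be produced by the classifier on top of the inter-sentence Transformer layers.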
Institutions
- Universitas Islam Negeri Maulana Malik Ibrahim, Malang, East Java