Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis

Name: Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis
Creator: Faisal Rahutomo
Published: 2018-08-14T04:11:53.013Z
Keywords: Information Retrieval, Semantics, Natural Language Processing, Similarity Measure, Indonesian Language

Rahutomo, Faisal; Hafidh Ayatullah, Ahmad

doi:10.17632/d7vx5cc92y.1


Published: 14 August 2018 | Version 1 | DOI: 10.17632/d7vx5cc92y.1

Contributors:

Faisal Rahutomo, Ahmad Hafidh Ayatullah

Description

The Microsoft Research Video Description Corpus is an open dataset containing about 120K sentences. The sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but the corpus has no Indonesian descriptions. This dataset is an Indonesian expansion of the Microsoft Research Video Description Corpus. The collection consists of 43,753 description texts for 1,959 short videos, parallel with Microsoft's dataset. To add further value, similarity metrics were calculated for the texts. The metrics are cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08 respectively.
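The four metrics named in the description can be illustrated with a minimal Python sketch. The description does not say how the texts were tokenized or vectorized, so this example assumes lowercase whitespace tokenization and raw term-frequency vectors. The function names `tokenize` and `similarity_metrics` and the two sample sentences are hypothetical, not taken from the dataset, and the sketch is not the authors' actual procedure.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the dataset's real
    # preprocessing is not described in the source.
    return text.lower().split()


def similarity_metrics(text_a, text_b):
    """Compare two texts using term-frequency vectors (an assumed representation)."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    vocab = set(tf_a) | set(tf_b)

    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(tf_a[w] * tf_b[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Jaccard similarity: shared distinct terms over all distinct terms.
    jaccard = len(set(tf_a) & set(tf_b)) / len(vocab) if vocab else 0.0

    # Euclidean and Manhattan distances between the two count vectors.
    euclidean = math.sqrt(sum((tf_a[w] - tf_b[w]) ** 2 for w in vocab))
    manhattan = sum(abs(tf_a[w] - tf_b[w]) for w in vocab)

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "euclidean": euclidean,
        "manhattan": manhattan,
    }


# Hypothetical pair of Indonesian descriptions of the same video.
scores = similarity_metrics(
    "seorang pria bermain gitar",
    "seorang pria memainkan gitar",
)
print(scores)
```

Cosine and Jaccard are similarities, where higher means more alike, while Euclidean and Manhattan are distances, where lower means more alike. This is consistent with the reported averages falling on different scales.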
