Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis

Published: 14 August 2018 | Version 1 | DOI: 10.17632/d7vx5cc92y.1
Contributors:

Description

The Microsoft Research Video Description Corpus is an openly available dataset containing about 120K sentences. The sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but Indonesian descriptions are not included in that dataset. This dataset is an Indonesian expansion of the Microsoft Research Video Description Corpus. The collection consists of 43,753 description texts for 1,959 short videos, parallel with Microsoft’s dataset. To add further value, similarity metrics were calculated on the texts. The metrics are cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08, respectively.
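The four metrics named above can be illustrated with a minimal sketch for a single pair of descriptions. The description does not say how the texts were tokenized or vectorized, so the lowercase whitespace tokenizer, the raw term-count vectors, and the function names `tokenize` and `similarity_metrics` below are all assumptions made for illustration, not the authors' procedure. Cosine and Jaccard are similarities (higher means more alike), while Euclidean and Manhattan are distances (lower means more alike).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the dataset description
    # does not state which tokenizer was used.
    return text.lower().split()


def similarity_metrics(text_a, text_b):
    """Compare two sentences using term-count vectors (an assumed representation)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Shared vocabulary so both vectors have the same dimensions.
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[w] for w in vocab]
    vec_b = [counts_b[w] for w in vocab]

    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Jaccard similarity: shared distinct tokens over all distinct tokens.
    shared = set(counts_a) & set(counts_b)
    jaccard = len(shared) / len(vocab) if vocab else 0.0

    # Euclidean (L2) and Manhattan (L1) distances between the count vectors.
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))
    manhattan = sum(abs(x - y) for x, y in zip(vec_a, vec_b))

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "euclidean": euclidean,
        "manhattan": manhattan,
    }


if __name__ == "__main__":
    # Two hypothetical Indonesian descriptions of the same clip
    # (not taken from the dataset).
    scores = similarity_metrics(
        "seorang pria bermain gitar",
        "seorang pria memainkan gitar",
    )
    for name, value in scores.items():
        print(f"{name}: {value:.2f}")
```

Averaging such pairwise scores over all description pairs would yield corpus-level figures of the kind reported above, although the exact values depend on the tokenization, weighting, and pairing choices, which are not specified here.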

Files

Categories

Information Retrieval, Semantics, Natural Language Processing, Similarity Measure, Indonesian Language

Licence