Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

Published: 24 February 2022| Version 1 | DOI: 10.17632/msyc6mzvhg.1
Contributors:
Tokinori Suzuki, Kota Nagamizo,

Description

## Brief Explanation

This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to an image record for database management in digital libraries. It is intended for evaluating the following task: given an image and its assigned tags, select an appropriate Wikipedia page for each tag.

A main characteristic of the dataset is that it includes ambiguous tags, so the visual content of the images is not unique to their tags. For example, it includes the tag 'mouse', which can refer not only to a mammal but also to a computer input device. The annotations are the Wikipedia articles that human annotators judged to be the correct entities for the tags.

The dataset offers both data and programs that reproduce experiments on the above task. The data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) that run several baseline methods for the task and evaluate the results.

## Structure of the Dataset

1. data directory

1.1. image_URL.txt
This file lists the URLs of the image files.

1.2. rels.txt
This file lists the correct Wikipedia pages for each topic in topics.txt.

1.3. topics.txt
This file lists the target pairs of an image and a tag to be disambiguated. Each such pair is called a topic in this dataset.

1.4. enwiki_20171001.xml
This file contains text extracted from the titles and bodies of English Wikipedia articles as of 1 October 2017. It is a modified version of the Wikipedia dump data (https://archive.org/download/enwiki-20171001).

2. img directory
This is a placeholder directory for the downloaded image files.

3. results directory
This is a placeholder directory for storing result files for evaluation. It holds the results of three baseline methods in sub-directories. Each sub-directory contains JSON files, one per topic, which can be evaluated with the evaluation script in scripts.ipynb as a reference for both usage and performance.

4. scripts.ipynb
This Jupyter notebook file contains the scripts for running the baseline methods and the evaluation.
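The img directory is filled from the URL list in the data directory. The following is a minimal sketch of that step, not the code in scripts.ipynb. It assumes that data/image_URL.txt holds one URL per line and that the last path component of each URL is a usable file name. The function names `read_urls`, `local_name`, and `download_images` are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_urls(list_path):
    """Return the non-empty lines of the URL list (assumed: one URL per line)."""
    with open(list_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def local_name(url):
    """Use the last path component of the URL as the local file name (assumption)."""
    return PurePosixPath(urlparse(url).path).name


def download_images(list_path="data/image_URL.txt", img_dir="img"):
    """Fetch every listed image into img_dir, skipping files already present."""
    out = Path(img_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in read_urls(list_path):
        target = out / local_name(url)
        if not target.exists():
            urlretrieve(url, str(target))
```

Calling `download_images()` from the dataset root would populate img. Because existing files are skipped, an interrupted run can simply be restarted.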

Steps to reproduce

## How the data was collected

The data were created by collecting images and their assigned tags from Flickr (https://flickr.com). We collected images that had been assigned one of the 14 ambiguous tags we defined, which are listed below. Annotators then judged which Wikipedia article was relevant to the tag of each collected image. For more details on the data collection and the annotation process, please refer to the paper.

The defined tags:

1. albatross
2. bee
3. bison
4. boar
5. coyote
6. cricket
7. jaguar
8. kite
9. llama
10. mouse
11. quail
12. stingray
13. tiger shark
14. whippet

We also note the preprocessing used to produce enwiki_20171001.xml from the Wikipedia dump data. The dump used is 'enwiki-20171001-pages-articles.xml.bz2' from the Wikipedia archive site (https://archive.org/download/enwiki-20171001). We extracted the body of the Wikipedia articles from the dump using the Python library WikiExtractor (https://github.com/attardi/wikiextractor) by running the following command:

    python -m wikiextractor.WikiExtractor enwiki-20171001-pages-articles.xml.bz2

After that, we concatenated all the output files into one XML file.

## How to reproduce the experiments

All of the scripts in this dataset are contained in the Jupyter notebook file scripts.ipynb. It covers every step from downloading the images to evaluating the task. You can reproduce the experiments by running the scripts in that file. Python 3 and a Jupyter Notebook environment are required. We confirmed that the scripts work in the following environment:

- Python 3.8.5
- Jupyter Notebook (Jupyter core 4.7.1, jupyter-notebook 6.2.0)
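The concatenation of the WikiExtractor output into a single file can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: it assumes WikiExtractor wrote its output into a directory tree such as text/AA/wiki_00 (its default layout), and the function name `concatenate_extracted` is hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_extracted(extracted_dir, merged_path):
    """Append every file under extracted_dir to merged_path, in sorted path order.

    Returns the number of files that were merged.
    """
    # Collect the file list first so the traversal order is deterministic.
    files = sorted(p for p in Path(extracted_dir).rglob("*") if p.is_file())
    with open(merged_path, "wb") as merged:
        for part in files:
            with open(part, "rb") as src:
                # Stream in chunks; the extracted dump is far too large to hold in memory.
                shutil.copyfileobj(src, merged)
    return len(files)
```

For example, `concatenate_extracted("text", "data/enwiki_20171001.xml")` would merge the default output tree. The merged file should be written outside the extracted directory so that it is not picked up as an input on a later run.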

Institutions

Kyushu Daigaku

Categories

Digital Library, Wikis, Metadata, Image Database

Licence