Published: 1 July 2024| Version 1 | DOI: 10.17632/sf238jg557.1


Geometric features play an important role in image captioning, and many researchers now focus on objects' geometric properties and their interrelationships. However, the geometric features of MSCOCO (Lin et al., 2014) have not been publicly available in an integrated form. To address this gap, we created a dataset named "GF-FRCNN MSCOCO": Geometric Features extracted from the 36 Faster R-CNN (Ren et al., 2015) bounding boxes (Anderson et al., 2018) detected for each image in the MSCOCO image captioning dataset. The dataset covers all 123,287 MSCOCO images and provides essential spatial information about each object.

To ensure scale-invariant models that generalize across different image sizes, we normalized the bounding box features relative to the image dimensions. Specifically, we compute relative values for the top-left and bottom-right coordinates, width, height, area, and center coordinates of each bounding box. Derived features, such as the perimeter, diagonal length, margins from the image edges, and distance to the image center, are normalized as well. The relative features are then aggregated into a comprehensive set and clipped to the range [0, 1] to guarantee valid values. The normalized and clipped features provide a robust representation of the geometric properties of the objects in each image.

For more details about the dataset structure and the content of each folder, please refer to the readme.pdf file included with the dataset.

References

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, Springer, pp. 740–755.

Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28.
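The normalization procedure described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the dataset's extraction code: the function name and the exact normalizers for the derived features (image perimeter, image diagonal, and half-diagonal for the distance to the image center) are our assumptions; the readme.pdf included with the dataset defines the actual scheme.

```python
import math

def normalized_box_features(x1, y1, x2, y2, img_w, img_h):
    """Return a dict of geometric bounding-box features scaled to [0, 1].

    Box is given as absolute pixel coordinates (x1, y1) top-left and
    (x2, y2) bottom-right inside an img_w x img_h image. The normalizers
    for the derived features are illustrative assumptions.
    """
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    feats = {
        # corner and center coordinates, relative to image dimensions
        "x1": x1 / img_w, "y1": y1 / img_h,
        "x2": x2 / img_w, "y2": y2 / img_h,
        "cx": cx / img_w, "cy": cy / img_h,
        # size features
        "width": w / img_w, "height": h / img_h,
        "area": (w * h) / (img_w * img_h),
        # perimeter relative to the image perimeter (assumed normalizer)
        "perimeter": (2 * (w + h)) / (2 * (img_w + img_h)),
        # diagonal relative to the image diagonal (assumed normalizer)
        "diagonal": math.hypot(w, h) / math.hypot(img_w, img_h),
        # margins from each image edge
        "margin_left": x1 / img_w, "margin_top": y1 / img_h,
        "margin_right": (img_w - x2) / img_w,
        "margin_bottom": (img_h - y2) / img_h,
        # distance from box center to image center, relative to the
        # image half-diagonal (assumed normalizer)
        "center_dist": math.hypot(cx - img_w / 2, cy - img_h / 2)
                       / (math.hypot(img_w, img_h) / 2),
    }
    # clip to [0, 1] to guard against boxes that extend past the frame
    return {k: min(max(v, 0.0), 1.0) for k, v in feats.items()}
```

Applying this per box to the 36 Faster R-CNN detections of an image yields a 36-row feature matrix; because every entry is relative to the image dimensions, the same box occupies the same feature values regardless of the image's pixel resolution.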



University of Science and Technology of China


Object Detection, Features Detection, Image Analysis