Testing Dataset for Head Segmentation Accuracy for the Algorithms in the ‘BGSLibrary’ v3.0.0 Developed by Andrews Sobral

Published: 15-06-2020 | Version 1 | DOI: 10.17632/yw5k28z97d.1
Seng Cheong Loke,
Bruce MacDonald,
Matthew Parsons,
Burkhard Wünsche


This dataset consists of video files created to test the accuracy of the background segmentation algorithms contained in the C++ wrapper ‘BGSLibrary’ v3.0.0 developed by Andrews Sobral. The comparison is based on the segmentation accuracy of the algorithms on a series of indoor color-depth video clips of a single person’s head and upper body, each highlighting a common factor that can influence the accuracy of foreground-background segmentation. The algorithms are run on the color image data, while the ‘ground truth’ is semi-automatically extracted from the depth data. The camera chosen for capturing the videos features paired color-depth image sensors, with the color sensor having specifications typical of mobile devices and webcams, which cover most of the use cases for these algorithms. The factors chosen for testing were identified, in a literature review accompanying the dataset, as being able to influence the efficacy of background segmentation. The assessment criteria for the results were set based on the requirements of common use cases such as gamecasting and mobile communications, so that readers can judge the merits of each algorithm for their own purposes. A description of the algorithms in the BGSLibrary, the factors tested, and the abbreviations used in labeling the data files and folders can be found in the file 'Mendeley Data Tables.pdf'. The files in GAU10-GAU40 and UNI05-UNI20 have been compressed to save space.


Steps to reproduce

The clips were captured with an Intel RealSense Depth Camera D435 using the following camera settings: structured light projector on, autofocus enabled, autoexposure disabled, automatic white balancing disabled, backlight compensation disabled, and powerline frequency compensation disabled. Capture resolution was 640 x 480 pixels at 30 fps for color data and 90 fps for depth data, with the depth data processed using temporal and spatial smoothing with hole-filling to reduce artefacts. Synthetic paired color and depth frames were motion-interpolated from the source frames to generate video clips without any inter-frames. The clips were then saved to AVI files using FFmpeg. Color clips were encoded using the MPEG-4 Part 2 codec at 4 Mbps, except for the noise clips, which were encoded at 12 Mbps to preserve the noise artefacts.
The clips were captured at night under controlled bidirectional diagonal and side lighting with Philips Hue Color bulbs set to the ‘Energize’ preset. The camera was placed 120 cm in front of either a plain green screen (standard), a cream-colored screen (camouflage), or with the screen removed (complex), with the subject standing 60 cm in front of the screen and no intervening objects. This resulted in a foreground area that was consistently about half the total background size. The factors affecting segmentation were then applied.
All clips were 40 seconds long, with the first 10 seconds showing just the background. The subject entered the scene at the 10-second mark and stood in the center of the frame while keeping a neutral expression, with the face and upper body fully visible. The comparison period was set to all frames between the 20- and 40-second marks inclusive. The brightness for all clips was normalized by applying the appropriate constant gamma correction to keep the average pixel brightness throughout the clip at 50% of the maximum brightness.
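The brightness normalization step can be illustrated with a short sketch. The dataset description does not say how the constant gamma was found, so the bisection search below, the function names, and the search bounds are all assumptions made for illustration. Pixel values are assumed to be already scaled to the range 0 to 1 and pooled over every frame of a clip, so that a single gamma applies to the whole clip.

```python
def mean_after_gamma(pixels, gamma):
    """Mean brightness after applying p -> p ** gamma to pixels in [0, 1]."""
    return sum(p ** gamma for p in pixels) / len(pixels)


def solve_gamma(pixels, target=0.5, lo=0.05, hi=20.0, iters=60):
    """Find a constant gamma that brings the mean brightness to `target`.

    Hypothetical helper: for pixels in [0, 1] the mean falls as gamma
    grows, so a bisection on [lo, hi] converges when a solution exists
    inside those (assumed) bounds.
    """
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since gamma is a ratio
        if mean_after_gamma(pixels, mid) > target:
            lo = mid  # still too bright: a larger gamma is needed
        else:
            hi = mid  # too dark (or exact): a smaller gamma is needed
    return (lo * hi) ** 0.5


# Toy "clip": a handful of dark pixel values pooled over all frames.
clip_pixels = [0.1, 0.2, 0.3, 0.4]
gamma = solve_gamma(clip_pixels)
normalized = [p ** gamma for p in clip_pixels]
```

For this dark toy clip the solved gamma is below 1, which brightens the pixels. A clip consisting only of pure black and pure white pixels has no solution, because gamma correction leaves both values unchanged.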
In clips where a subject was present, the face location was determined using libfacedetection by Shiqi Yu, and a seed point and its depth were obtained from the center of the bounding rectangle. The ‘ground truth’ foreground was then extracted with a floating range flood fill starting at the seed point, with a maximum difference of 2 cm between adjacent pixels. Since the depth data from disparity mapping is coarse and lacks edge accuracy, an automated GrabCut algorithm was used to refine the edges of the foreground. The ‘ground truth’ clips were verified by visual inspection at 4 fps, and adjustments were made with manual GrabCut assistance where the areas of inaccuracy exceeded 5% of the foreground. For clips without a subject, the depth data was ignored and the ‘ground truth’ foreground was set to zero. The segmentation foreground was obtained by processing the color data using the appropriate ‘BGSLibrary’ algorithm with default settings. The ‘ground truth’ and segmentation clips were saved to AVI files encoded with the lossless FFV1 codec.
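The floating range flood fill used for the initial ‘ground truth’ extraction can be sketched as follows. This is a minimal illustration only: the implementation actually used is not named in the description, and the function name, the 4-connected neighborhood, and the toy depth map are assumptions. "Floating range" here means that each candidate pixel is compared with the already-filled neighbor it is reached from, not with the seed pixel. The GrabCut edge refinement is not shown.

```python
from collections import deque


def floating_flood_fill(depth_cm, seed, max_step_cm=2.0):
    """Return a binary mask grown from `seed` over a 2D depth map in cm.

    A pixel joins the region when its depth differs from an adjacent,
    already-filled pixel by at most `max_step_cm` (4-connectivity assumed).
    """
    rows, cols = len(depth_cm), len(depth_cm[0])
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                # Compare with the neighbor just visited, not with the seed.
                if abs(depth_cm[nr][nc] - depth_cm[r][c]) <= max_step_cm:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


# Toy depth map: subject about 60 cm from the camera, screen at 120 cm.
depth = [
    [120, 120, 120, 120, 120],
    [120,  61,  60,  61, 120],
    [120,  62,  60,  62, 120],
    [120,  63,  61,  63, 120],
]
mask = floating_flood_fill(depth, seed=(2, 2))
```

In the toy map, the pixel at 63 cm differs from the seed by 3 cm but is still included, because it is within 2 cm of a filled neighbor. The screen pixels at 120 cm are never reached.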