Testing Dataset for Fast Background Segmentation of the Head and Upper Body

Published: 22-09-2020 | Version 1 | DOI: 10.17632/n7cb3f53g9.1
Seng Cheong Loke,
Bruce A. MacDonald,
Matthew Parsons,
Burkhard C. Wünsche


Portrait segmentation is the process whereby the head and upper body of a person are separated from the background of an image or video stream. This is difficult to achieve accurately, although good results have been obtained with deep learning methods, which cope well with occlusion, pose and illumination changes. These methods, however, are either slow or require a powerful system to operate in real time. We present a new method of portrait segmentation called FaceSeg, which uses fast DBSCAN clustering combined with smart face tracking and can replicate the benefits and accuracy of deep learning methods at a much faster speed. In a direct comparison using a standard testing suite, our method achieved a segmentation speed of 150 fps for a 640x480 video stream, with median accuracy and F1 scores of 99.96% and 99.93% respectively on simple backgrounds, and 98.81% and 98.13% on complex backgrounds. The state-of-the-art deep learning based FastPortrait / Mobile Neural Network method achieved 15 fps with 99.95% accuracy and 99.91% F1 score on simple backgrounds, and 99.01% accuracy and 98.43% F1 score on complex backgrounds. An efficacy-boosted implementation of FaceSeg can achieve 75 fps with 99.23% accuracy and 98.79% F1 score on complex backgrounds.


Steps to reproduce

The paper by Loke et al. included a test suite designed to measure indoor segmentation efficacy for the head and upper body in response to several common factors: 1) image noise, 2) camera jitter and movement, 3) illumination and shadows, and 4) color camouflage. The test suite comprises a series of video clips that were captured with an Intel RealSense Depth Camera D435 at a resolution of 640x480 pixels at 30 fps for color data. Depth data was also captured at 90 fps using temporal and spatial smoothing with hole-filling to reduce artefacts. Synthetic paired color and depth frames were motion interpolated from the source frames to generate video clips without any inter-frames.
The ‘ground truth’ for the clips was derived semi-automatically: a face detector supplied the face location, and the center of its bounding rectangle gave a seed point and depth. The ‘ground truth’ was then extracted with a floating range flood fill starting at the seed point, with a maximum difference of 2 cm between adjacent pixels. The clips were subsequently inspected, and touch-ups were done with a GrabCut procedure.
Comparisons were made with the best performing segmentation algorithms from the ‘BGSLibrary’ using their default settings: 1) Gaussian Mixture Model (DPZivkovicAGMM), 2) Kernel Density Estimate (KDE), 3) Self-Organizing Neural Network (LBAdaptiveSOM), and 4) Local Binary Similarity Patterns (LOBSTER), as well as the FastPortrait segmenter. Our algorithm FaceSeg was paired with two different landmark detectors (OpenFace 2.0 and libfacedetection) to test whether the choice of detector affected the segmentation results. The detectors were set to run every frame in the first instance and every ten frames in the second instance to test the effectiveness of the tracking routine.
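The floating range flood fill used for the ground truth can be illustrated with a minimal Python sketch. It is not the authors' EmguCV implementation: the function name, the 4-connected neighbourhood, the depth map in metres, and the toy data are all assumptions made for illustration. Only the 2 cm limit between adjacent pixels and the seed-point start come from the description above.

```python
from collections import deque

import numpy as np


def floating_range_flood_fill(depth_m, seed, max_step_m=0.02):
    """Grow a region from `seed` over 4-connected neighbours.

    A neighbour is accepted when its depth differs from the adjacent,
    already accepted pixel by at most `max_step_m` (2 cm by default).
    Hypothetical helper, written only to illustrate the idea.
    """
    height, width = depth_m.shape
    mask = np.zeros((height, width), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        row, col = queue.popleft()
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n_row, n_col = row + d_row, col + d_col
            if not (0 <= n_row < height and 0 <= n_col < width):
                continue  # outside the image
            if mask[n_row, n_col]:
                continue  # already part of the region
            # "Floating" range: compare with the adjacent pixel, not the seed.
            if abs(depth_m[n_row, n_col] - depth_m[row, col]) <= max_step_m:
                mask[n_row, n_col] = True
                queue.append((n_row, n_col))
    return mask


# Toy depth map in metres: a subject near 1 m in front of a wall at 3 m.
depth = np.full((6, 8), 3.0)
depth[1:5, 2:6] = 1.0 + 0.01 * np.arange(4)  # 1 cm steps across the subject

# The seed stands in for the center of the face bounding rectangle.
person = floating_range_flood_fill(depth, seed=(2, 3))
print(int(person.sum()), "pixels selected")
```

In the toy data the depth changes by 3 cm across the subject, more than the 2 cm limit, yet the whole subject is selected because each step between neighbouring pixels is only 1 cm. A fixed range fill that compares every pixel with the seed would drop the far edge.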
The four test combinations were: 1) OpenFace every frame - FaceSegOF(1), 2) OpenFace every ten frames - FaceSegOF(10), 3) libfacedetection every frame - FaceSegLFD(1), and 4) libfacedetection every ten frames - FaceSegLFD(10).
Testing was performed on a system running Microsoft Windows 10 with a four-core Intel Xeon E3 processor at 3.5 GHz, 16 GB RAM allocated, and an NVIDIA GeForce GTX 1060 6GB graphics card installed, using Visual Basic and Visual C++ 2019 in a 64-bit address space. Image processing was done using the libraries in EmguCV 3.2.0 and Accord.Net 3.8.0. The system speed was rated at 475 million floating-point operations per second (MFLOPS) using the Intel Processor Diagnostic Tool 2.10, 64-bit version.
Segmentation efficacy was measured in terms of accuracy and F1 score. Processing time was calculated as the median time taken to process each frame over the comparison period, while CPU usage was the average total CPU utilization during the test. GPU optimizations were turned off for all algorithms to give a fair comparison.
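The efficacy and timing measures above can be sketched as follows. This is an illustrative Python example, not the test harness used for the dataset: the function name, the toy masks, and the frame times are hypothetical, and returning an F1 of 1.0 when both masks are empty is an assumed convention. Accuracy and F1 follow their standard definitions for binary masks.

```python
import statistics

import numpy as np


def accuracy_and_f1(predicted, truth):
    """Pixel-wise accuracy and F1 score for two binary masks.

    Hypothetical helper for illustration only.
    """
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(predicted & truth))    # foreground correctly labelled
    tn = int(np.sum(~predicted & ~truth))  # background correctly labelled
    fp = int(np.sum(predicted & ~truth))   # background labelled as foreground
    fn = int(np.sum(~predicted & truth))   # foreground that was missed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    f1 = 2 * tp / denominator if denominator else 1.0
    return accuracy, f1


# Toy ground truth: a 2x2 foreground block in a 4x4 frame.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True

# Toy prediction: the same block plus one false-positive pixel.
predicted = truth.copy()
predicted[0, 0] = True

accuracy, f1 = accuracy_and_f1(predicted, truth)
print(f"accuracy={accuracy:.4f}, F1={f1:.4f}")

# Made-up per-frame processing times in milliseconds. The median is not
# distorted by the single slow frame, unlike the mean.
frame_times_ms = [6.5, 6.7, 6.6, 20.0, 6.8]
median_ms = statistics.median(frame_times_ms)
frames_per_second = 1000.0 / median_ms
print(f"median={median_ms} ms, about {frames_per_second:.1f} fps")
```

Per-frame scores computed this way can be summarized with a median over a clip, in the same way as the processing time.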