TEM virus dataset

Published: 11 May 2021 | Version 3 | DOI: 10.17632/x4dwwfwtw3.3
Contributors:
Damian Matuszewski,

Description

The dataset contains a total of 1245 images of 22 virus classes captured with two different electron microscopes: an LEO (Zeiss, Oberkochen, Germany) with a Morada (Olympus) camera and a Tecnai 10 (FEI, Hillsboro, OR, USA) with a MegaView III (Olympus, Münster, Germany) camera. Before imaging, all samples were treated with 10% phosphate-buffered saline, placed on carbon-coated TEM grids, and stained with 2% phosphotungstic acid following standard procedures.

The virus image dataset is challenging for several reasons: limited annotation (a center point or center line rather than a full segmentation mask), a relatively small number of images per class, diffuse virus boundaries, imperfect focus, noise, different magnifications, and large variation in the appearance of the virus, background, and debris. The virus classes are strongly unbalanced in both the number of images (from 9 to 129) and the number of virus particles (from 38 to 1934). All RAW images are either 1376 x 1032 or 2048 x 2048 pixels, depending on the electron microscope used, but the pixel sizes vary from 0.26 to 5.57 nm because the images were acquired at different magnifications.

Each virus particle is annotated only with its approximate center: a single point for isolated spherical particles, or a center line for elongated viruses and for clustered viruses where individual particles cannot be distinguished visually. The annotations are stored as coordinate points in a separate text file for each image.

We also include image patches of 256 x 256 pixels (256 x 256 nm). The patches were cropped from rescaled images around the annotation points: the manually selected center points for spherical virus particles and all center-line vertices for elongated virus particles. The exception is elongated particles annotated with a center line of only 2 points, which usually indicates an oval-shaped virus particle (as in, e.g., Orf); in this case, we selected a new point between the two as the center of the patch. As a result, virus classes with elongated particles contain more image patches than particles, particularly Marburg, Ebola, Influenza, Lassa, and Nipah.

Because many annotation points lie relatively close to each other (due to natural clustering of the virus particles and/or complex shapes of the elongated particles), patches cropped from the same image sometimes overlap to some degree. However, this did not lead to data leakage between the training, validation, and test sets, because the sets were established at the image level and special care was taken to remove images that captured the same virus particles (i.e., overlapping images). We augmented the training set image patches by flipping and by rotations in multiples of 90 degrees so that each class contains 736 input samples. Classes that originally contained more than 736 patches were randomly reduced to this number.
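
The patch-center selection, cropping, and augmentation steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code. All function names are hypothetical, images are plain lists of rows to avoid dependencies, and several details are assumptions: that "rescaled" means 1 nm per pixel (inferred from 256 pixels = 256 nm), that the new point for a 2-point center line is the midpoint, that patches reaching past the image border are skipped, and that balancing draws randomly from the flipped and rotated variants.

```python
import random

PATCH = 256  # patch side in pixels; the description equates this with 256 nm


def patch_centers(points):
    """Patch centers for one annotation given as a list of (x, y) points.

    A 2-point center line yields one new point between its ends (the
    midpoint is an assumption); otherwise every point is used as-is.
    """
    if len(points) == 2:
        (x0, y0), (x1, y1) = points
        return [((x0 + x1) / 2, (y0 + y1) / 2)]
    return list(points)


def to_rescaled(points, pixel_size_nm, target_nm=1.0):
    """Map raw-image coordinates into the rescaled image (assumed 1 nm/pixel)."""
    scale = pixel_size_nm / target_nm
    return [(x * scale, y * scale) for x, y in points]


def crop_patch(image, center, size=PATCH):
    """Crop a size x size patch around center=(x, y).

    Returns None when the patch would leave the image (assumed border policy).
    """
    x, y = round(center[0]), round(center[1])
    top, left = y - size // 2, x - size // 2
    if top < 0 or left < 0 or top + size > len(image) or left + size > len(image[0]):
        return None
    return [row[left:left + size] for row in image[top:top + size]]


def rot90(patch):
    """Rotate a patch by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*patch)][::-1]


def dihedral_variants(patch):
    """The 8 combinations of 90-degree rotations and a horizontal flip."""
    rotations = [patch]
    for _ in range(3):
        rotations.append(rot90(rotations[-1]))
    return rotations + [[row[::-1] for row in r] for r in rotations]


def balance_class(patches, target=736, seed=0):
    """Bring one class to `target` samples (one plausible reading of the text).

    Larger classes are randomly reduced; smaller ones are topped up with
    randomly chosen flipped/rotated variants. A class too small to reach the
    target even with all variants is returned with fewer samples.
    """
    rng = random.Random(seed)
    if len(patches) >= target:
        return rng.sample(patches, target)
    extra = [v for p in patches for v in dihedral_variants(p)[1:]]
    need = target - len(patches)
    return patches + rng.sample(extra, min(need, len(extra)))
```

In practice an array library would replace the list-based helpers. The point of the sketch is only the order of operations: choose centers per annotation, convert them to the rescaled coordinate system, crop, and then balance each training class.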

Institutions

Uppsala Universitet

Categories

Natural Sciences, Virology, Virus, Machine Learning, Electron Microscopy, Image Database, Deep Learning

Licence