SydneyMTL - Gastritis Sydney System Golden Dataset
Description
Data Description: Consensus-Based Hibou-L Embeddings for Gastritis Research (Updated Sydney System)

Overview & Objective: The Updated Sydney System (USS) for chronic gastritis often suffers from subjective diagnostic thresholds. This dataset, curated via pathologist consensus, provides a high-fidelity "Golden Dataset" to support the development of standardized, objective AI models for gastritis grading.

Data Content: The original gastric biopsy specimens were collected from Seegene Medical Foundation, Seoul, South Korea (https://pr.seegenemedical.com/). This dataset contains 366 feature embeddings of gastric biopsy whole slide images (WSIs), extracted using the Hibou-L foundational model. The embeddings represent the morphological signatures of the gastric mucosa across five key attributes: H. pylori, Neutrophil activity, Mononuclear cells, Glandular atrophy, and Intestinal metaplasia. The samples were curated using a stratified joint-distribution strategy to ensure a balanced representation of all severity grades (0–3).

Structure & Interpretation:
1. HDF5 Files (.h5): Each file contains features (1024-D embeddings), coordinates, and addresses for the patches within one WSI.
2. labels_and_prediction.csv: Provides the ground truth consensus grades and the predictions from the SydneyMTL model. Note that atrophy=4 indicates cases where grading was not applicable due to the absence of the muscularis mucosae, reflecting real-world clinical constraints. Model predictions are formatted as "label(confidence)".

Applications: This dataset is specifically designed to advance research and development in gastritis diagnostics. It can be utilized to:
- Develop and validate automated gastritis grading systems based on the Sydney System.
- Benchmark Multi-Instance Learning (MIL) models for multi-attribute classification in gastric pathology.
- Analyze morphological feature representations of various gastritis phenotypes using foundational pathology encoders.
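As an illustration of the "label(confidence)" prediction format and the atrophy=4 convention described above, the following is a minimal sketch rather than the dataset authors' own tooling. It assumes that a prediction cell looks like `2(0.87)`, i.e. an integer grade followed by a decimal confidence in parentheses; the exact formatting in labels_and_prediction.csv should be checked against the file itself. The names `parse_prediction`, `atrophy_grade`, and `Prediction` are hypothetical helpers introduced only for this example.

```python
import re
from typing import NamedTuple, Optional

# Assumed cell format: integer grade, then a decimal confidence in parentheses,
# e.g. "2(0.87)". Verify against the actual CSV before relying on this pattern.
_PRED_RE = re.compile(r"^\s*(\d+)\s*\(\s*([0-9]*\.?[0-9]+)\s*\)\s*$")

# Per the description, atrophy=4 means grading was not applicable
# (muscularis mucosae absent), not a fifth severity grade.
ATROPHY_NOT_APPLICABLE = 4


class Prediction(NamedTuple):
    label: int
    confidence: float


def parse_prediction(cell: str) -> Prediction:
    """Split a 'label(confidence)' string into its two parts."""
    match = _PRED_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected prediction format: {cell!r}")
    return Prediction(int(match.group(1)), float(match.group(2)))


def atrophy_grade(value: int) -> Optional[int]:
    """Return the atrophy grade, or None when grading was not applicable."""
    return None if value == ATROPHY_NOT_APPLICABLE else value


if __name__ == "__main__":
    print(parse_prediction("2(0.87)"))
    print(atrophy_grade(4))
```

Mapping atrophy=4 to `None` keeps the not-applicable cases out of any ordinal metric computed over grades 0–3; whether to exclude or model them separately is a design choice left to the user.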
Steps to reproduce
1. Preprocessing and Patching: Tissue-containing regions were identified from WSIs and tiled into non-overlapping patches (224x224 pixels) at 40x magnification.
2. Feature Extraction: 1024-dimensional feature embeddings were extracted from each patch using the Hibou-L foundational model.
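The non-overlapping tiling in step 1 can be sketched as a simple coordinate grid. This is only an illustration under stated assumptions: the source does not say how partial patches at the slide edge were handled (they are dropped here), and the tissue-detection step is omitted because its method is not described. The function name `patch_grid` is hypothetical.

```python
from typing import Iterator, Tuple


def patch_grid(width: int, height: int, patch_size: int = 224) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) pixel coordinates of non-overlapping patches.

    Assumption: patches that would extend past the image border are dropped.
    """
    # A stride equal to the patch size gives non-overlapping tiles.
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            yield (x, y)


if __name__ == "__main__":
    # A 500x450 region yields a 2x2 grid of full 224x224 patches.
    for coord in patch_grid(500, 450):
        print(coord)
```

In a real pipeline, each coordinate would additionally be filtered by a tissue mask before the patch is passed to the feature extractor.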