Datasets API
Downloadable Datasets
- class pathml.datasets.PanNukeDataModule(data_dir, download=False, shuffle=True, transforms=None, nucleus_type_labels=False, split=None, batch_size=8, hovernet_preprocess=False)
DataModule for the PanNuke Dataset. Contains 256px image patches from 19 tissue types with annotations for 5 nucleus types. For more information, see: https://warwick.ac.uk/fac/sci/dcs/research/tia/data/pannuke
- Parameters:
data_dir (str) – Path to the directory where PanNuke data is stored.
download (bool, optional) – Whether to download the data. If True, checks whether data files exist in data_dir and downloads them to data_dir if not. If False, checks to make sure that data files exist in data_dir. Defaults to False.
shuffle (bool, optional) – Whether to shuffle images. Defaults to True.
transforms (optional) – Data augmentation transforms to apply to images. Each transform must accept two arguments, mask and image, and return a dict with “image” and “mask” keys. See an example here: https://albumentations.ai/docs/getting_started/mask_augmentation/
nucleus_type_labels (bool, optional) – Whether to provide nucleus type labels, or binary nucleus labels. If True, then masks will be returned with six channels, corresponding to:
Neoplastic cells
Inflammatory
Connective/Soft tissue cells
Dead Cells
Epithelial
Background
If False, then the returned mask will have a single channel, with zeros for background pixels and ones for nucleus pixels (i.e. the inverse of the Background mask). Defaults to False.
split (int, optional) – How to divide the three folds into train, test, and validation splits. Must be one of {1, 2, 3, None} corresponding to the following splits:
1. Training: Fold 1; Validation: Fold 2; Testing: Fold 3
2. Training: Fold 2; Validation: Fold 1; Testing: Fold 3
3. Training: Fold 3; Validation: Fold 2; Testing: Fold 1
If None, then the entire PanNuke dataset will be used. Defaults to None.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.
hovernet_preprocess (bool) – Whether to perform preprocessing specific to HoVer-Net architecture. If True, the center of mass of each nucleus will be computed, and an additional mask will be returned with the distance of each nuclear pixel to its center of mass in the horizontal and vertical dimensions. This corresponds to Gamma(I) from the HoVer-Net paper. Defaults to False.
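The fold assignments selected by split can be written out as a lookup table. The sketch below only restates the mapping documented above; SPLIT_TO_FOLDS and folds_for_split are hypothetical names used for illustration and are not part of PathML.

```python
# Fold assignment for each documented value of `split`.
# This table mirrors the parameter description; it is not PathML's internal code.
SPLIT_TO_FOLDS = {
    1: {"train": 1, "valid": 2, "test": 3},
    2: {"train": 2, "valid": 1, "test": 3},
    3: {"train": 3, "valid": 2, "test": 1},
}


def folds_for_split(split):
    """Return the train/valid/test fold numbers for a `split` value.

    Returns None when split is None, meaning the entire dataset is used.
    """
    if split is None:
        return None
    if split not in SPLIT_TO_FOLDS:
        raise ValueError(f"split must be one of 1, 2, 3, or None; got {split!r}")
    return SPLIT_TO_FOLDS[split]
```

Note that fold 3 is the test set for both split=1 and split=2, so only the training and validation folds differ between those two settings.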
- References
Gamper, J., Koohbanani, N.A., Benet, K., Khuram, A. and Rajpoot, N., 2019, April. PanNuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. In European Congress on Digital Pathology (pp. 11-19). Springer, Cham.
Gamper, J., Koohbanani, N.A., Graham, S., Jahanifar, M., Khurram, S.A., Azam, A., Hewitt, K. and Rajpoot, N., 2020. PanNuke Dataset Extension, Insights and Baselines. arXiv preprint arXiv:2003.10778.
- property test_dataloader
Dataloader for test set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net
- property train_dataloader
Dataloader for training set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net
- property valid_dataloader
Dataloader for validation set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net
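Because the dataloaders yield either three or four items per batch, downstream code may want to name the items explicitly. The following is a minimal sketch under that assumption; unpack_pannuke_batch and first_training_batch are hypothetical helpers, not PathML functions, and the PathML import is deferred so the first helper can be used on its own.

```python
def unpack_pannuke_batch(batch):
    """Name the items of a batch yielded by a PanNuke dataloader.

    Batches are documented as (image, mask, tissue_type), or
    (image, mask, hv, tissue_type) for HoVer-Net.
    """
    if len(batch) == 3:
        image, mask, tissue_type = batch
        hv = None  # no HoVer-Net distance maps in the three-item form
    elif len(batch) == 4:
        image, mask, hv, tissue_type = batch
    else:
        raise ValueError(f"expected 3 or 4 items per batch, got {len(batch)}")
    return {"image": image, "mask": mask, "hv": hv, "tissue_type": tissue_type}


def first_training_batch(data_dir):
    """Fetch and unpack one training batch (requires PathML and local data)."""
    # Imported lazily so unpack_pannuke_batch works without PathML installed.
    from pathml.datasets import PanNukeDataModule

    datamodule = PanNukeDataModule(data_dir=data_dir, download=False, split=1)
    # train_dataloader is documented as a property, so it is accessed without parentheses.
    return unpack_pannuke_batch(next(iter(datamodule.train_dataloader)))
```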
- class pathml.datasets.DeepFocusDataModule(data_dir, download=False, shuffle=True, transforms=None, batch_size=8)
DataModule for the DeepFocus dataset. The DeepFocus dataset comprises four slides from different patients, each with four different stains (H&E, Ki67, CD21, and CD10) for a total of 16 whole-slide images. For each slide, a region of interest (ROI) of approximately 6 mm^2 was scanned at 40x magnification with an Aperio ScanScope on nine different focal planes, generating 216,000 samples with varying amounts of blurriness. Tiles with offset values within [-0.5 μm, 0.5 μm] are labeled as in-focus and all other tiles are labeled as blurry.
See: https://github.com/cialab/DeepFocus
- Parameters:
data_dir (str) – File path to directory containing data.
download (bool, optional) – Whether to download the data. If True, checks whether data files exist in data_dir and downloads them to data_dir if not. If False, checks to make sure that data files exist in data_dir. Defaults to False.
shuffle (bool, optional) – Whether to shuffle images. Defaults to True.
transforms (optional) – Data augmentation transforms to apply to images.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.
- Reference:
Senaras, C., Niazi, M.K.K., Lozanski, G. and Gurcan, M.N., 2018. DeepFocus: detection of out-of-focus regions in whole slide digital images using deep learning. PloS one, 13(10), p.e0205387.
- property test_dataloader
- property train_dataloader
- property valid_dataloader
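The DeepFocus labeling rule described above (tiles whose focal offset lies in the closed interval from -0.5 μm to 0.5 μm are in-focus) can be expressed as a one-line predicate. This is an illustrative sketch only; is_in_focus and IN_FOCUS_LIMIT_UM are hypothetical names, and the way PathML encodes the resulting labels is not shown here.

```python
# Half-width of the in-focus interval, in micrometers, taken from the dataset description.
IN_FOCUS_LIMIT_UM = 0.5


def is_in_focus(offset_um):
    """Return True if a tile's focal offset (in micrometers) counts as in-focus."""
    # The interval is closed: offsets of exactly -0.5 or 0.5 are in-focus.
    return -IN_FOCUS_LIMIT_UM <= offset_um <= IN_FOCUS_LIMIT_UM
```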
ML Dataset classes
- class pathml.datasets.TileDataset(file_path)
PyTorch Dataset class for h5path files
Each item is a tuple of (tile_image, tile_masks, tile_labels, slide_labels) where:
tile_image is a torch.Tensor of shape (C, H, W) or (T, Z, C, H, W)
tile_masks is a torch.Tensor of shape (n_masks, tile_height, tile_width)
tile_labels is a dict
slide_labels is a dict
This is designed to be wrapped in a PyTorch DataLoader for feeding tiles into ML models. Note that label dictionaries are not standardized, as users are free to store whatever labels they want. For that reason, PyTorch cannot automatically stack labels into batches. When creating a DataLoader from a TileDataset, it may therefore be necessary to create a custom collate_fn to specify how to create batches of labels. See: https://discuss.pytorch.org/t/how-to-use-collate-fn/27181
- Parameters:
file_path (str) – Path to .h5path file on disk
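One possible shape for the custom collate_fn mentioned above is to stack the image and mask tensors and leave the label dicts as plain lists. The sketch below assumes every tile in the batch has masks of the same shape; collate_tiles and its stack argument are hypothetical, with stack existing only so the grouping logic can be exercised without PyTorch.

```python
def collate_tiles(batch, stack=None):
    """Collate TileDataset items into one batch.

    batch is a list of (tile_image, tile_masks, tile_labels, slide_labels)
    tuples. Images and masks are stacked; label dicts are kept as lists
    because their keys are not standardized.
    """
    if stack is None:
        # DataLoader calls collate_fn with the batch only, so this is the normal path.
        import torch

        stack = torch.stack
    images, masks, tile_labels, slide_labels = zip(*batch)
    return stack(images), stack(masks), list(tile_labels), list(slide_labels)


# Usage sketch (the file path is a placeholder):
#   from torch.utils.data import DataLoader
#   from pathml.datasets import TileDataset
#   loader = DataLoader(TileDataset("path/to/file.h5path"), batch_size=8,
#                       collate_fn=collate_tiles)
```

Keeping labels as lists of dicts defers any decision about which label keys matter to the training loop, which suits datasets where different slides carry different labels.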
- class pathml.datasets.EntityDataset(cell_dir=None, tissue_dir=None, assign_dir=None)
Torch Geometric Dataset class for storing cell or tissue graphs. Each item returns a pathml.graph.utils.HACTPairData object.
- Parameters:
cell_dir (str) – Path to folder containing cell graphs
tissue_dir (str) – Path to folder containing tissue graphs
assign_dir (str) – Path to folder containing assignment matrices