Datasets API

Downloadable Datasets

class pathml.datasets.PanNukeDataModule(data_dir, download=False, shuffle=True, transforms=None, nucleus_type_labels=False, split=None, batch_size=8, hovernet_preprocess=False)

DataModule for the PanNuke Dataset. Contains 256px image patches from 19 tissue types with annotations for 5 nucleus types. For more information, see: https://warwick.ac.uk/fac/sci/dcs/research/tia/data/pannuke

Parameters:

data_dir (str) – Path to the directory containing the PanNuke data.
download (bool, optional) – Whether to download the data. If True, checks whether data files exist in data_dir and downloads them to data_dir if not. If False, checks to make sure that data files exist in data_dir. Defaults to False.
shuffle (bool, optional) – Whether to shuffle images. Defaults to True.
transforms (optional) – Data augmentation transforms to apply to images. The transform must accept two arguments (image and mask) and return a dict with “image” and “mask” keys. See an example here: https://albumentations.ai/docs/getting_started/mask_augmentation/
nucleus_type_labels (bool, optional) –
Whether to provide nucleus type labels or binary nucleus labels. If True, then masks will be returned with six channels, corresponding to:
1. Neoplastic cells
2. Inflammatory
3. Connective/Soft tissue cells
4. Dead Cells
5. Epithelial
6. Background
If False, then the returned mask will have a single channel, with zeros for background pixels and ones for nucleus pixels (i.e. the inverse of the Background mask). Defaults to False.
split (int, optional) –
How to divide the three folds into train, test, and validation splits. Must be one of {1, 2, 3, None} corresponding to the following splits:
1. Training: Fold 1; Validation: Fold 2; Testing: Fold 3
2. Training: Fold 2; Validation: Fold 1; Testing: Fold 3
3. Training: Fold 3; Validation: Fold 2; Testing: Fold 1
If None, then the entire PanNuke dataset will be used. Defaults to None.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.
hovernet_preprocess (bool, optional) – Whether to perform preprocessing specific to the HoVer-Net architecture. If True, the center of mass of each nucleus will be computed, and an additional mask will be returned with the distance of each nuclear pixel to its center of mass in the horizontal and vertical dimensions. This corresponds to Gamma(I) from the HoVer-Net paper. Defaults to False.
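As a rough illustration of the nucleus_type_labels option, the sketch below collapses a six-channel mask into the single-channel binary mask described above. It assumes a channel-first layout of shape (6, H, W) with the background channel last. The function name multiclass_to_binary and the toy mask are hypothetical and are not part of the PathML API.

```python
import numpy as np


def multiclass_to_binary(mask):
    """Collapse a (6, H, W) mask into an (H, W) binary nucleus mask.

    Assumption: channels 0-4 hold the five nucleus types and
    channel 5 is the background channel.
    """
    # Sum the five nucleus-type channels; any non-zero pixel is a nucleus pixel.
    nuclei = mask[:-1].sum(axis=0)
    return (nuclei > 0).astype(np.uint8)


# Toy 4x4 mask with one neoplastic and one inflammatory nucleus.
mask = np.zeros((6, 4, 4), dtype=np.int32)
mask[0, 0:2, 0:2] = 1                   # channel 0: neoplastic nucleus
mask[1, 3, 3] = 2                       # channel 1: inflammatory nucleus
mask[5] = mask[:5].sum(axis=0) == 0     # channel 5: background

binary = multiclass_to_binary(mask)
```

In this toy case the binary mask is the pixel-wise inverse of the background channel, which matches the description of nucleus_type_labels=False.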

References

Gamper, J., Koohbanani, N.A., Benet, K., Khuram, A. and Rajpoot, N., 2019, April. PanNuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. In European Congress on Digital Pathology (pp. 11-19). Springer, Cham.

Gamper, J., Koohbanani, N.A., Graham, S., Jahanifar, M., Khurram, S.A., Azam, A., Hewitt, K. and Rajpoot, N., 2020. PanNuke Dataset Extension, Insights and Baselines. arXiv preprint arXiv:2003.10778.

property test_dataloader: Dataloader for test set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net

property train_dataloader: Dataloader for training set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net

property valid_dataloader: Dataloader for validation set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) for HoVer-Net

class pathml.datasets.DeepFocusDataModule(data_dir, download=False, shuffle=True, transforms=None, batch_size=8)

DataModule for the DeepFocus dataset. The DeepFocus dataset comprises four slides from different patients, each with four different stains (H&E, Ki67, CD21, and CD10) for a total of 16 whole-slide images. For each slide, a region of interest (ROI) of approximately 6 mm^2 was scanned at 40x magnification with an Aperio ScanScope on nine different focal planes, generating 216,000 samples with varying amounts of blurriness. Tiles with offset values in the range [-0.5 μm, 0.5 μm] are labeled as in-focus and the remaining tiles are labeled as blurry.

See: https://github.com/cialab/DeepFocus
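The labeling rule described above can be sketched in a few lines. This is only an illustration of the threshold: the helper is_in_focus and the example offsets are hypothetical, and the closed interval (both endpoints counted as in-focus) is read from the bracket notation rather than confirmed against the dataset code.

```python
# Half-width of the in-focus range in micrometers, taken from the description above.
IN_FOCUS_LIMIT_UM = 0.5


def is_in_focus(offset_um):
    """Return True if a tile's focal offset falls within [-0.5 um, 0.5 um]."""
    return abs(offset_um) <= IN_FOCUS_LIMIT_UM


# Hypothetical focal offsets in micrometers, for illustration only.
offsets_um = [-2.0, -0.5, 0.0, 0.5, 1.5]
labels = [is_in_focus(offset) for offset in offsets_um]
```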

Parameters:

data_dir (str) – Path to the directory containing the data.
download (bool, optional) – Whether to download the data. If True, checks whether data files exist in data_dir and downloads them to data_dir if not. If False, checks to make sure that data files exist in data_dir. Defaults to False.
shuffle (bool, optional) – Whether to shuffle images. Defaults to True.
transforms (optional) – Data augmentation transforms to apply to images.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.

Reference: Senaras, C., Niazi, M.K.K., Lozanski, G. and Gurcan, M.N., 2018. DeepFocus: detection of out-of-focus regions in whole slide digital images using deep learning. PloS one, 13(10), p.e0205387.

property test_dataloader

property train_dataloader

property valid_dataloader

ML Dataset classes

class pathml.datasets.TileDataset(file_path)

PyTorch Dataset class for h5path files

Each item is a tuple of (tile_image, tile_masks, tile_labels, slide_labels) where:

tile_image is a torch.Tensor of shape (C, H, W) or (T, Z, C, H, W)

tile_masks is a torch.Tensor of shape (n_masks, tile_height, tile_width)

tile_labels is a dict

slide_labels is a dict

This is designed to be wrapped in a PyTorch DataLoader for feeding tiles into ML models. Note that label dictionaries are not standardized, as users are free to store whatever labels they want. For that reason, PyTorch cannot automatically stack labels into batches. When creating a DataLoader from a TileDataset, it may therefore be necessary to create a custom collate_fn to specify how to create batches of labels. See: https://discuss.pytorch.org/t/how-to-use-collate-fn/27181

Parameters: file_path (str) – Path to .h5path file on disk
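One possible shape for the custom collate_fn mentioned above is sketched here. It stacks the image and mask tensors and keeps the label dicts as plain Python lists, one dict per tile. The function name collate_tiles is hypothetical, and the list of fake items stands in for a real TileDataset so that the example runs without an .h5path file. It also assumes every tile in a batch has the same image and mask shape.

```python
import torch
from torch.utils.data import DataLoader


def collate_tiles(batch):
    """Batch (tile_image, tile_masks, tile_labels, slide_labels) tuples."""
    images, masks, tile_labels, slide_labels = zip(*batch)
    # Tensors are stacked along a new batch dimension;
    # label dicts are left unstacked as lists.
    return (
        torch.stack(images),
        torch.stack(masks),
        list(tile_labels),
        list(slide_labels),
    )


# Stand-in items mimicking the tuple structure described above (hypothetical labels).
fake_items = [
    (torch.zeros(3, 8, 8), torch.zeros(1, 8, 8), {"tumor": i % 2}, {"slide": "a"})
    for i in range(4)
]

loader = DataLoader(fake_items, batch_size=2, collate_fn=collate_tiles)
images, masks, tile_labels, slide_labels = next(iter(loader))
```

Keeping the labels as lists of dicts defers any decision about how to combine them, which may be the safer choice when different tiles carry different label keys.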

class pathml.datasets.EntityDataset(cell_dir=None, tissue_dir=None, assign_dir=None)

Torch Geometric Dataset class for storing cell or tissue graphs. Each item returns a pathml.graph.utils.HACTPairData object.

Parameters:

cell_dir (str, optional) – Path to folder containing cell graphs. Defaults to None.
tissue_dir (str, optional) – Path to folder containing tissue graphs. Defaults to None.
assign_dir (str, optional) – Path to folder containing assignment matrices. Defaults to None.