Datasets API

PanNuke

class pathml.datasets.PanNukeDataModule(data_dir, download=False, shuffle=True, transforms=None, nucleus_type_labels=False, split=None, batch_size=8, hovernet_preprocess=False)

DataModule for the PanNuke Dataset. Contains 256px image patches from 19 tissue types with annotations for 5 nucleus types. For more information, see: https://warwick.ac.uk/fac/sci/dcs/research/tia/data/pannuke

Parameters:
  • data_dir (str) – Path to the directory containing the PanNuke data.

  • download (bool, optional) – Whether to download the data. If True, checks whether the data files exist in data_dir and downloads them there if they are missing. If False, only verifies that the data files exist in data_dir. Defaults to False.

  • shuffle (bool, optional) – Whether to shuffle images. Defaults to True.

  • transforms (optional) – Data augmentation transforms to apply to images. The transform must accept two arguments (image and mask) and return a dict with “image” and “mask” keys. See an example here: https://albumentations.ai/docs/getting_started/mask_augmentation/

  • nucleus_type_labels (bool, optional) –

    Whether to provide nucleus type labels or binary nucleus labels. If True, masks are returned with six channels, corresponding to:

    1. Neoplastic cells

    2. Inflammatory

    3. Connective/Soft tissue cells

    4. Dead cells

    5. Epithelial

    6. Background

    If False, then the returned mask will have a single channel, with zeros for background pixels and ones for nucleus pixels (i.e. the inverse of the Background mask). Defaults to False.

  • split (int, optional) –

    How to divide the three folds into training, validation, and test splits. Must be one of {1, 2, 3, None}, corresponding to the following splits:

    1. Training: Fold 1; Validation: Fold 2; Testing: Fold 3

    2. Training: Fold 2; Validation: Fold 1; Testing: Fold 3

    3. Training: Fold 3; Validation: Fold 2; Testing: Fold 1

    If None, then the entire PanNuke dataset will be used. Defaults to None.

  • batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.

  • hovernet_preprocess (bool, optional) – Whether to perform preprocessing specific to the HoVer-Net architecture. If True, the center of mass of each nucleus is computed, and an additional mask is returned containing the horizontal and vertical distances of each nuclear pixel to its nucleus's center of mass. This corresponds to Gamma(I) from the HoVer-Net paper. Defaults to False.
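
The two mask-related options above can be illustrated with a minimal NumPy sketch. It is not PathML's implementation: the function names are hypothetical, the six-channel mask is assumed to be channels-first with Background last (following the order listed above), the instance mask format is an assumption, and any normalization of the distances is omitted.

```python
import numpy as np


def binary_nucleus_mask(type_mask: np.ndarray) -> np.ndarray:
    """Collapse a six-channel mask into a single binary channel.

    Assumes a channels-first (6, H, W) layout with Background as the
    last channel; the layout used by the library may differ.
    """
    background = type_mask[-1] > 0
    # Nucleus pixels are exactly the pixels that are not background.
    return (~background).astype(np.uint8)


def hv_distance_maps(instance_mask: np.ndarray) -> np.ndarray:
    """Return a (2, H, W) array of horizontal and vertical offsets.

    `instance_mask` is a hypothetical (H, W) integer array in which 0 is
    background and every nucleus carries its own positive label.
    """
    hv = np.zeros((2,) + instance_mask.shape, dtype=np.float32)
    for label in np.unique(instance_mask):
        if label == 0:
            continue  # skip background
        rows, cols = np.nonzero(instance_mask == label)
        # With uniform weights, the center of mass is the mean coordinate.
        hv[0, rows, cols] = cols - cols.mean()  # horizontal offset
        hv[1, rows, cols] = rows - rows.mean()  # vertical offset
    return hv


if __name__ == "__main__":
    instances = np.array([[1, 1, 1],
                          [0, 0, 0],
                          [2, 0, 0]])
    print(hv_distance_maps(instances)[0])
```

Background pixels stay at zero in both distance channels, so the maps carry information only inside nuclei.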

References

Gamper, J., Koohbanani, N.A., Benet, K., Khuram, A. and Rajpoot, N., 2019, April. PanNuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. In European Congress on Digital Pathology (pp. 11-19). Springer, Cham.

Gamper, J., Koohbanani, N.A., Graham, S., Jahanifar, M., Khurram, S.A., Azam, A., Hewitt, K. and Rajpoot, N., 2020. PanNuke Dataset Extension, Insights and Baselines. arXiv preprint arXiv:2003.10778.

property test_dataloader

Dataloader for the test set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) when HoVer-Net preprocessing is enabled.

property train_dataloader

Dataloader for the training set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) when HoVer-Net preprocessing is enabled.

property valid_dataloader

Dataloader for the validation set. Yields (image, mask, tissue_type), or (image, mask, hv, tissue_type) when HoVer-Net preprocessing is enabled.
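
Because the dataloaders yield tuples of two different lengths, downstream code may want to handle both shapes in one place. The sketch below uses a hypothetical helper, unpack_batch, which is not part of PathML; it relies only on the tuple layouts described above.

```python
def unpack_batch(batch):
    """Name the elements of a PanNuke batch.

    Accepts the 3-tuple (image, mask, tissue_type) or the 4-tuple
    (image, mask, hv, tissue_type); `hv` is None when it is absent.
    """
    if len(batch) == 4:
        image, mask, hv, tissue_type = batch
    elif len(batch) == 3:
        image, mask, tissue_type = batch
        hv = None
    else:
        raise ValueError(f"expected 3 or 4 elements, got {len(batch)}")
    return {"image": image, "mask": mask, "hv": hv, "tissue_type": tissue_type}


if __name__ == "__main__":
    # Strings stand in for the tensors a real dataloader would yield.
    print(unpack_batch(("img", "msk", "breast"))["hv"])
```

A training loop could call this helper on each batch and branch on whether the "hv" entry is None.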

DeepFocus

class pathml.datasets.DeepFocusDataModule(data_dir, download=False, shuffle=True, transforms=None, batch_size=8)

DataModule for the DeepFocus dataset. The DeepFocus dataset comprises four slides from different patients, each with four different stains (H&E, Ki67, CD21, and CD10), for a total of 16 whole-slide images. For each slide, a region of interest (ROI) of approximately 6 mm^2 was scanned at 40x magnification with an Aperio ScanScope on nine different focal planes, generating 216,000 samples with varying amounts of blurriness. Tiles with offset values in the range [-0.5 μm, 0.5 μm] are labeled as in-focus; all other tiles are labeled as blurry.

See: https://github.com/cialab/DeepFocus
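
The labeling rule described above amounts to a simple threshold on the offset value. The sketch below is illustrative only: focus_label is a hypothetical function, the string labels are placeholders for however the dataset actually encodes its classes, and the square brackets are read here as an inclusive interval.

```python
IN_FOCUS_MIN_UM = -0.5  # lower bound of the in-focus range, in micrometers
IN_FOCUS_MAX_UM = 0.5   # upper bound of the in-focus range, in micrometers


def focus_label(offset_um: float) -> str:
    """Label a tile from its offset value, treating the bounds as inclusive."""
    if IN_FOCUS_MIN_UM <= offset_um <= IN_FOCUS_MAX_UM:
        return "in-focus"
    return "blurry"


if __name__ == "__main__":
    for offset in (-2.5, -0.5, 0.0, 0.5, 1.0):
        print(offset, focus_label(offset))
```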

Parameters:
  • data_dir (str) – Path to the directory containing the data.

  • download (bool, optional) – Whether to download the data. If True, checks whether the data files exist in data_dir and downloads them there if they are missing. If False, only verifies that the data files exist in data_dir. Defaults to False.

  • shuffle (bool, optional) – Whether to shuffle images. Defaults to True.

  • transforms (optional) – Data augmentation transforms to apply to images.

  • batch_size (int, optional) – Batch size for dataloaders. Defaults to 8.

Reference:

Senaras, C., Niazi, M.K.K., Lozanski, G. and Gurcan, M.N., 2018. DeepFocus: detection of out-of-focus regions in whole slide digital images using deep learning. PloS one, 13(10), p.e0205387.

property test_dataloader

property train_dataloader

property valid_dataloader