Core API

SlideData

The central class in PathML for representing a whole-slide image.

class pathml.core.SlideData(filepath, name=None, masks=None, tiles=None, labels=None, backend=None, slide_type=None, stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None, counts=None, dtype=None)

Main class representing a slide and its annotations.

Parameters:
  • filepath (str) – Path to file on disk.

  • name (str, optional) – name of slide. If None, and a filepath is provided, name defaults to filepath.

  • masks (pathml.core.Masks, optional) – object containing {key, mask} pairs

  • tiles (pathml.core.Tiles, optional) – object containing {coordinates, tile} pairs

  • labels (collections.OrderedDict, optional) – dictionary containing {key, label} pairs

  • backend (str, optional) – backend to use for interfacing with slide on disk. Must be one of {“OpenSlide”, “BioFormats”, “DICOM”, “h5path”} (case-insensitive). Note that for supported image formats, OpenSlide performance can be significantly better than BioFormats. Consider specifying backend = "openslide" when possible. If None, and a filepath is provided, tries to infer the correct backend from the file extension. Defaults to None.

  • slide_type (pathml.core.SlideType, optional) – slide type specification. Must be a SlideType object. Alternatively, slide type can be specified by using the parameters stain, tma, rgb, volumetric, and time_series.

  • stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.

  • platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra, etc.). Defaults to None. Ignored if slide_type is specified.

  • tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to None. Ignored if slide_type is specified.

  • rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.

  • volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.

  • time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.

  • counts (anndata.AnnData) – object containing counts matrix associated with image quantification

property counts

extract_region(location, size, *args, **kwargs)

Extract a region of the image. This is a convenience method which passes arguments through to the extract_region() method of whichever backend is in use. Refer to documentation for each backend.

Parameters:
  • location (Tuple[int, int]) – Location of top-left corner of tile (i, j)

  • size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.

  • *args – positional arguments passed through to extract_region() method of the backend.

  • **kwargs – keyword arguments passed through to extract_region() method of the backend.

Returns:

image at the specified region

Return type:

np.ndarray
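
The (i, j) convention used by extract_region() can be illustrated without any backend. The following is a minimal sketch, not PathML code: crop_region is a hypothetical helper, and a plain nested list stands in for the image array. It assumes that i indexes rows and j indexes columns, consistent with the (height, width) ordering of size.

```python
from typing import List, Tuple, Union

Size = Union[int, Tuple[int, int]]


def crop_region(image: List[List[int]], location: Tuple[int, int], size: Size) -> List[List[int]]:
    """Hypothetical helper: crop a (height, width) region whose top-left corner is (i, j)."""
    # A single integer means a square region, as described for `size`.
    height, width = (size, size) if isinstance(size, int) else size
    i, j = location
    # Rows are selected with i, columns with j.
    return [row[j:j + width] for row in image[i:i + height]]


# 4x5 toy image whose pixel value encodes its own position: 10 * row + column.
image = [[10 * r + c for c in range(5)] for r in range(4)]

square = crop_region(image, location=(1, 2), size=2)
rect = crop_region(image, location=(0, 0), size=(1, 3))
print(square)  # [[12, 13], [22, 23]]
print(rect)    # [[0, 1, 2]]
```

With a real backend the returned object is an np.ndarray rather than a nested list, and additional arguments such as level are forwarded to the backend.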

generate_tiles(shape=3000, stride=None, pad=False, **kwargs)

Generator over Tile objects containing regions of the image. Calls generate_tiles() method of the backend. Tries to add the corresponding slide-level masks to each tile, if possible. Adds slide-level labels to each tile, if possible.

Parameters:
  • shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Defaults to 3000px.

  • stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.

  • pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.

  • **kwargs – Other arguments passed through to generate_tiles() method of the backend.

Yields:

pathml.core.tile.Tile – Extracted Tile object

plot(ax=None)

View a thumbnail of the image, using matplotlib. Not supported by all backends.

Parameters:

ax – matplotlib axis object on which to plot the thumbnail. Optional.

run(pipeline, distributed=True, client=None, tile_size=256, tile_stride=None, level=0, tile_pad=False, overwrite_existing_tiles=False, write_dir=None, **kwargs)

Run a preprocessing pipeline on SlideData. Tiles are generated by calling self.generate_tiles() and pipeline is applied to each tile.

Parameters:
  • pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.

  • distributed (bool) – Whether to distribute model using client. Defaults to True.

  • client – dask.distributed client

  • tile_size (int, optional) – Size of each tile. Defaults to 256px

  • tile_stride (int, optional) – Stride between tiles. If None, uses tile_stride = tile_size for non-overlapping tiles. Defaults to None.

  • level (int, optional) – Level to extract tiles from. Defaults to 0.

  • tile_pad (bool) – How to handle chunks on the edges. If True, these edge chunks will be zero-padded symmetrically and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.

  • overwrite_existing_tiles (bool) – Whether to overwrite existing tiles. If False, running a pipeline will fail if tiles is not None. Defaults to False.

  • write_dir (str) – Path to directory to write the processed slide to. The processed SlideData object will be written to the directory immediately after the pipeline has completed running. The filepath will default to “<write_dir>/<slide.name>.h5path”. Defaults to None.

  • **kwargs – Other arguments passed through to generate_tiles() method of the backend.
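
The control flow described for run() can be pictured with a small stand-in. This is a minimal sketch under stated assumptions, not PathML's implementation: run_sketch and its arguments are hypothetical, the "pipeline" is a plain callable, and each tile is represented only by its (i, j) coordinates.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[int, int]


def run_sketch(
    tile_coords: Iterable[Coords],
    pipeline: Callable[[Coords], str],
    existing_tiles: Optional[Dict[Coords, str]] = None,
    overwrite_existing_tiles: bool = False,
) -> Dict[Coords, str]:
    """Hypothetical stand-in for the control flow described for run()."""
    # Fail if tiles already exist, unless overwriting was explicitly requested.
    if existing_tiles is not None and not overwrite_existing_tiles:
        raise ValueError("tiles already exist; pass overwrite_existing_tiles=True")
    # Apply the pipeline to every tile and index each result by tile coordinates.
    return {coords: pipeline(coords) for coords in tile_coords}


coords = [(0, 0), (0, 256), (256, 0), (256, 256)]
processed = run_sketch(coords, pipeline=lambda c: f"processed tile at {c}")
print(len(processed))    # 4
print(processed[(0, 256)])  # processed tile at (0, 256)
```

Calling run_sketch a second time with existing_tiles=processed raises ValueError unless overwrite_existing_tiles=True is passed, mirroring the documented behaviour of overwrite_existing_tiles.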

property shape

Convenience method for getting the image shape. Calling wsi.shape is equivalent to calling wsi.slide.get_image_shape() with default arguments.

Returns:

Shape of image (H, W)

Return type:

Tuple[int, int]

write(path)

Write contents to disk in h5path format.

Parameters:

path (Union[str, bytes, os.PathLike]) – path to file to be written

Convenience SlideData Classes

class pathml.core.HESlide(*args, **kwargs)

Convenience class to load a SlideData object for H&E slides. Passes through all arguments to SlideData(), along with slide_type = types.HE flag. Refer to SlideData for full documentation.

class pathml.core.VectraSlide(*args, **kwargs)

Convenience class to load a SlideData object for Vectra (Polaris) slides. Passes through all arguments to SlideData(), along with slide_type = types.Vectra flag and default backend = "bioformats". Refer to SlideData for full documentation.

class pathml.core.MultiparametricSlide(*args, **kwargs)

Convenience class to load a SlideData object for multiparametric immunofluorescence slides. Passes through all arguments to SlideData(), along with slide_type = types.IF flag and default backend = "bioformats". Refer to SlideData for full documentation.

class pathml.core.CODEXSlide(*args, **kwargs)

Convenience class to load a SlideData object from Akoya Biosciences CODEX format. Passes through all arguments to SlideData(), along with slide_type = types.CODEX flag and default backend = "bioformats". Refer to SlideData for full documentation.

TODO: hierarchical biaxial gating (flow-style analysis).

Slide Types

class pathml.core.SlideType(stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None)

SlideType objects define types based on a set of image parameters.

Parameters:
  • stain (str, optional) – One of [‘HE’, ‘IHC’, ‘Fluor’]. Flag indicating type of slide stain. Defaults to None.

  • platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra, etc.).

  • tma (bool, optional) – Flag indicating whether the slide is a tissue microarray (TMA). Defaults to None.

  • rgb (bool, optional) – Flag indicating whether image is in RGB color. Defaults to None.

  • volumetric (bool, optional) – Flag indicating whether image is volumetric. Defaults to None.

  • time_series (bool, optional) – Flag indicating whether image is time-series. Defaults to None.

Examples

>>> from pathml.core import SlideType, types
>>> he_type = SlideType(stain = "HE", rgb = True)    # define slide type manually
>>> types.HE == he_type    # can also use pre-made types for convenience
True

asdict()

Convert to a dictionary. None values are represented as zeros and empty strings for compatibility with h5py attributes.

If a is a SlideType object, then a == SlideType(**a.asdict()) will be True.

We also provide instantiations of common slide types for convenience:

Type                       stain     platform   rgb     tma     volumetric   time_series
pathml.core.types.HE       ‘HE’      None       True    False   False        False
pathml.core.types.IHC      ‘IHC’     None       True    False   False        False
pathml.core.types.IF       ‘Fluor’   None       False   False   False        False
pathml.core.types.CODEX    ‘Fluor’   ‘CODEX’    False   False   False        False
pathml.core.types.Vectra   ‘Fluor’   ‘Vectra’   False   False   False        False

Tile

class pathml.core.Tile(image, coords, name=None, masks=None, labels=None, counts=None, slide_type=None, stain=None, tma=None, rgb=None, volumetric=None, time_series=None)

Object representing a tile extracted from an image. Holds the array for the tile, as well as the (i, j) coordinates of the top-left corner of the tile in the original image. The (i, j) coordinate system is based on labelling the top-leftmost pixel as (0, 0).

Parameters:
  • image (np.ndarray) – Image array of tile

  • coords (tuple) – Coordinates of tile relative to the whole-slide image. The (i,j) coordinate system is based on labelling the top-leftmost pixel of the WSI as (0, 0).

  • name (str, optional) – Name of tile

  • masks (dict) – masks belonging to tile. If masks are supplied, all masks must be the same shape as the tile.

  • labels – labels belonging to tile

  • counts (AnnData) – counts matrix for the tile.

  • slide_type (pathml.core.SlideType, optional) – slide type specification. Must be a SlideType object. Alternatively, slide type can be specified by using the parameters stain, tma, rgb, volumetric, and time_series.

  • stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.

  • tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to None. Ignored if slide_type is specified.

  • rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.

  • volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.

  • time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.

plot(ax=None)

View the tile image, using matplotlib. Currently supports only RGB images.

Parameters:

ax – matplotlib axis object on which to plot the tile image. Optional.

property shape

Convenience property. Calling tile.shape is equivalent to calling tile.image.shape.

SlideDataset

class pathml.core.SlideDataset(slides)

Container for a dataset of WSIs

Parameters:

slides – list of SlideData objects

run(pipeline, client=None, distributed=True, **kwargs)

Runs a preprocessing pipeline on all slides in the dataset

Parameters:
  • pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.

  • client – dask.distributed client

  • distributed (bool) – Whether to distribute model using client. Defaults to True.

  • kwargs (dict) – keyword arguments passed to run() for each slide

write(dir, filenames=None)

Write all SlideData objects to the specified directory. Calls .write() method for each slide in the dataset. Optionally pass a list of filenames to use, otherwise filenames will be created from .name attributes of each slide.

Parameters:
  • dir (Union[str, bytes, os.PathLike]) – Path to directory where slides are to be saved

  • filenames (List[str], optional) – list of filenames to be used.
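
How the optional filenames argument could be resolved is sketched below. This is an illustration only: resolve_output_paths is a hypothetical helper, the fallback of "<name>.h5path" is an assumption borrowed from the default path documented for write_dir in SlideData.run(), and POSIX-style paths are used so the output is the same on every platform.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def resolve_output_paths(
    directory: str, slide_names: List[str], filenames: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical helper pairing each slide with the path it would be written to."""
    if filenames is None:
        # Assumption: derive "<name>.h5path" from each slide's .name attribute.
        filenames = [f"{name}.h5path" for name in slide_names]
    elif len(filenames) != len(slide_names):
        raise ValueError("expected exactly one filename per slide")
    return [str(PurePosixPath(directory) / filename) for filename in filenames]


names = ["slide_a", "slide_b"]
default_paths = resolve_output_paths("processed", names)
custom_paths = resolve_output_paths("processed", names, filenames=["a.h5path", "b.h5path"])
print(default_paths)  # ['processed/slide_a.h5path', 'processed/slide_b.h5path']
print(custom_paths)   # ['processed/a.h5path', 'processed/b.h5path']
```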

Tiles and Masks helper classes

class pathml.core.Tiles(h5manager, tiles=None)

Object wrapping a dict of tiles.

Parameters:

tiles (Union[dict[tuple[int], pathml.core.tile.Tile], list[pathml.core.tile.Tile]]) – tile objects

add(tile)

Add tile indexed by tile.coords to tiles.

Parameters:

tile (Tile) – tile object

property keys

remove(key)

Remove tile from tiles.

Parameters:

key (str) – key (coords) indicating tile to be removed

property tile_shape

update(tile)

Update a tile.

Parameters:

tile (pathml.core.tile.Tile) – tile to be updated

class pathml.core.Masks(h5manager, masks=None)

Object wrapping a dict of masks.

Parameters:
  • h5manager (pathml.core.h5managers.h5pathManager) – manager that interfaces with the h5path data on disk.

  • masks (dict) – dictionary of np.ndarray objects representing, e.g., labels or segmentations.

add(key, mask)

Add mask indexed by key to self.h5manager.

Parameters:
  • key (str) – key

  • mask (np.ndarray) – array of mask. Must contain elements of type int8

property keys

remove(key)

Remove mask.

Parameters:

key (str) – key indicating mask to be removed

slice(slicer)

Slice all masks in self.h5manager, extending numpy array slicing.

Parameters:

slicer – list where each element is an object of type slice indicating how the corresponding dimension should be sliced

Slide Backends

OpenSlideBackend

class pathml.core.OpenSlideBackend(filename)

Use OpenSlide to interface with image files.

Depends on openslide-python which wraps the openslide C library.

Parameters:

filename (str) – path to image file on disk

extract_region(location, size, level=None)

Extract a region of the image

Parameters:
  • location (Tuple[int, int]) – Location of top-left corner of tile (i, j)

  • size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.

  • level (int) – level from which to extract chunks. Level 0 is highest resolution.

Returns:

image at the specified region

Return type:

np.ndarray

generate_tiles(shape=3000, stride=None, pad=False, level=0)

Generator over tiles.

Padding works as follows. In both cases, the first tile starts flush with the edge of the image and tile locations increment according to the specified stride. If pad is False, generation stops with the last tile that is fully contained in the image. If pad is True, generation stops with the last tile that starts in the image, and regions outside the image are padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tiles start at offsets 0 and 2 along each axis when pad=False, giving 2 x 2 = 4 tiles in total; when pad=True, tiles start at offsets 0, 2, and 4 along each axis, giving 3 x 3 = 9 tiles.

Parameters:
  • shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.

  • stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.

  • pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.

  • level (int, optional) – For slides with multiple levels, which level to extract tiles from. Defaults to 0 (highest resolution).

Yields:

pathml.core.tile.Tile – Extracted Tile object
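
The padding rule described for generate_tiles() can be checked with a few lines of arithmetic. The following is a minimal sketch that follows the rule as worded in this section; axis_offsets and tile_origins are hypothetical helpers, and the backend's actual implementation may differ in edge cases. Under that rule, the 5x5 example with tile size 3 and stride 2 has tiles starting at offsets 0 and 2 along each axis without padding, and at offsets 0, 2, and 4 with padding.

```python
from itertools import product
from typing import List, Optional, Tuple, Union


def axis_offsets(length: int, tile: int, stride: int, pad: bool) -> List[int]:
    """Start offsets of tiles along one axis."""
    if pad:
        # Keep every tile that starts inside the image; any overhang is zero-padded.
        return list(range(0, length, stride))
    # Keep only tiles that fit entirely inside the image.
    return list(range(0, length - tile + 1, stride))


def tile_origins(
    image_shape: Tuple[int, int],
    shape: Union[int, Tuple[int, int]],
    stride: Optional[int] = None,
    pad: bool = False,
) -> List[Tuple[int, int]]:
    """Top-left (i, j) coordinates of every tile, in row-major order."""
    tile_h, tile_w = (shape, shape) if isinstance(shape, int) else shape
    # stride=None means non-overlapping tiles, i.e. the stride equals the tile shape.
    stride_i, stride_j = (tile_h, tile_w) if stride is None else (stride, stride)
    rows = axis_offsets(image_shape[0], tile_h, stride_i, pad)
    cols = axis_offsets(image_shape[1], tile_w, stride_j, pad)
    return list(product(rows, cols))


unpadded = tile_origins((5, 5), shape=3, stride=2, pad=False)
padded = tile_origins((5, 5), shape=3, stride=2, pad=True)
print(unpadded)     # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(len(padded))  # 9
```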

get_image_shape(level=0)

Get the shape of the image at specified level.

Parameters:

level (int) – Which level to get shape from. Level 0 is highest resolution. Defaults to 0.

Returns:

Shape of image at target level, in (i, j) coordinates.

Return type:

Tuple[int, int]

get_thumbnail(size)

Get a thumbnail of the slide.

Parameters:

size (Tuple[int, int]) – the maximum size of the thumbnail

Returns:

RGB thumbnail image

Return type:

np.ndarray

BioFormatsBackend

class pathml.core.BioFormatsBackend(filename, dtype=None)

Use BioFormats to interface with image files.

Now supports multi-level images. Depends on python-bioformats, which wraps the OME Bio-Formats Java library, parses pixel data and metadata of proprietary formats, and converts all formats to OME-TIFF. Please cite: https://pubmed.ncbi.nlm.nih.gov/20513764/

Parameters:
  • filename (str) – path to image file on disk

  • dtype (numpy.dtype) – data type of image. If None, will use BioFormats to infer the data type from the image’s OME metadata. Defaults to None.

Note

While the Bio-Formats convention uses XYZCT dimension order, we use YXZCT for compatibility with the rest of PathML, which is based on the (i, j) coordinate system.

extract_region(location, size, level=0, series_as_channels=False, normalize=True)

Extract a region of the image. All bioformats images have 5 dimensions representing (i, j, z, channel, time). Even if an image does not have multiple z-series or time-series, those dimensions will still be kept. For example, a standard RGB image will be of shape (i, j, 1, 3, 1). If a tuple with len < 5 is passed, missing dimensions will be retrieved in full.

Parameters:
  • location (Tuple[int, int]) – (i, j) location of corner of extracted region closest to the origin.

  • size (Tuple[int, int, ...]) – (i, j) size of each region. If an integer is passed, it is converted to a tuple and a square region is extracted. If a tuple with len < 5 is passed, missing dimensions will be retrieved in full.

  • level (int) – level from which to extract chunks. Level 0 is highest resolution. Defaults to 0.

  • series_as_channels (bool) – Whether to treat image series as channels. If True, multi-level images are not supported. Defaults to False.

  • normalize (bool, optional) – Whether to normalize the image to int8 before returning. Defaults to True. If False, image will be returned as-is immediately after reading, typically in float64.

Returns:

image at the specified region. 5-D array of (i, j, z, c, t)

Return type:

np.ndarray
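
The way a partial size is completed to five dimensions can be sketched as follows. This is a hedged illustration of the behaviour described for size, not the backend's code: complete_size is a hypothetical helper, and the image shape is supplied by hand rather than read from a file.

```python
from typing import Tuple, Union


def complete_size(
    size: Union[int, Tuple[int, ...]], image_shape: Tuple[int, int, int, int, int]
) -> Tuple[int, ...]:
    """Hypothetical helper: expand `size` to all five (i, j, z, c, t) dimensions."""
    if isinstance(size, int):
        # A single integer means a square region in (i, j).
        size = (size, size)
    if len(size) > 5:
        raise ValueError("size may have at most 5 dimensions")
    # Dimensions that were not specified are taken in full from the image.
    return tuple(size) + tuple(image_shape[len(size):])


# A standard RGB image: one z-plane, three channels, one time point.
rgb_shape = (2000, 3000, 1, 3, 1)
print(complete_size(256, rgb_shape))            # (256, 256, 1, 3, 1)
print(complete_size((100, 200, 1), rgb_shape))  # (100, 200, 1, 3, 1)
```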

generate_tiles(shape=3000, stride=None, pad=False, level=0, **kwargs)

Generator over tiles.

Padding works as follows. In both cases, the first tile starts flush with the edge of the image and tile locations increment according to the specified stride. If pad is False, generation stops with the last tile that is fully contained in the image. If pad is True, generation stops with the last tile that starts in the image, and regions outside the image are padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tiles start at offsets 0 and 2 along each axis when pad=False, giving 2 x 2 = 4 tiles in total; when pad=True, tiles start at offsets 0, 2, and 4 along each axis, giving 3 x 3 = 9 tiles.

Parameters:
  • shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.

  • stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.

  • pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.

  • **kwargs – Other arguments passed through to extract_region() method.

Yields:

pathml.core.tile.Tile – Extracted Tile object

get_image_shape(level=None)

Get the shape of the image at the specified level.

Parameters:

level (int) – Which level to get shape from. If level is None, returns the shape of the biggest level. Defaults to None.

Returns:

Shape of image (i, j) at target level

Return type:

Tuple[int, int]

get_thumbnail(size=None)

Get a thumbnail of the image. Since there is no default thumbnail for multiparametric, volumetric images, this function supports downsampling of all image dimensions.

Parameters:

size (Tuple[int, int]) – thumbnail size

Returns:

RGB thumbnail image

Return type:

np.ndarray

Example

Get a 1000x1000 thumbnail of a 7-channel fluorescent image.

>>> shape = data.slide.get_image_shape()
>>> thumb = data.slide.get_thumbnail(size=(1000, 1000, shape[2], shape[3], shape[4]))

DICOMBackend

class pathml.core.DICOMBackend(filename)

Interface with DICOM files on disk. Provides efficient access to individual Frame items contained in the Pixel Data element without loading the entire element into memory. Assumes that frames are non-overlapping. DICOM does not support multi-level images.

Parameters:

filename (str) – Path to the DICOM Part10 file on disk

extract_region(location, size=None, level=None)

Extract a single frame from the DICOM image.

Parameters:
  • location (Union[int, Tuple[int, int]]) – coordinate location of top-left corner of frame, or integer index of frame.

  • size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must be the same as the frame size.

Returns:

image at the specified region

Return type:

np.ndarray

generate_tiles(shape, stride, pad, level=0, **kwargs)

Generator over tiles. For DICOMBackend, each tile corresponds to a frame.

Parameters:
  • shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must match frame size.

  • stride (int) – Ignored for DICOMBackend. Frames are yielded individually.

  • pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.

Yields:

pathml.core.tile.Tile – Extracted Tile object

static get_bot(fp)

Reads the value of the Basic Offset Table. This table is used to access individual frames without loading the entire file into memory

Parameters:

fp (pydicom.filebase.DicomFile) – pydicom DicomFile object

Returns:

Offset of each Frame of the Pixel Data element following the Basic Offset Table

Return type:

list

get_image_shape()

Get the shape of the image.

Returns:

Shape of image (H, W)

Return type:

Tuple[int, int]

abstract get_thumbnail(size, **kwargs)

h5pathManager

class pathml.core.h5managers.h5pathManager(h5path=None, slidedata=None)

Interface between slidedata object and data management on disk by h5py.

add_mask(key, mask)

Add mask to h5. This manages slide-level masks.

Parameters:
  • key (str) – mask key

  • mask (np.ndarray) – mask array

add_tile(tile)

Add a tile to h5path.

Parameters:

tile (pathml.core.tile.Tile) – Tile object

get_mask(item, slicer=None)

get_slidetype()

get_tile(item)

Retrieve tile from h5manager by key or index.

Parameters:

item (int, str, tuple) – key or index of tile to be retrieved

Returns:

Tile(pathml.core.tile.Tile)

remove_mask(key)

Remove mask by key.

Parameters:

key (str) – key indicating mask to be removed

remove_tile(key)

Remove tile from self.h5 by key.

slice_masks(slicer)

Generator slicing all masks, extending numpy array slicing.

Parameters:

slicer – List where each element is an object of type slice https://docs.python.org/3/c-api/slice.html indicating how the corresponding dimension should be sliced. The list length should correspond to the dimension of the tile. For 2D H&E images, pass a length 2 list of slice objects.

Yields:

key (str) – mask key

val (np.ndarray) – mask array
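
A generator of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the h5pathManager code: slice_masks_sketch is a hypothetical function, masks live in an ordinary dict instead of an h5 file, and nested lists stand in for 2D arrays.

```python
from typing import Dict, Iterator, List, Tuple

Mask = List[List[int]]


def slice_masks_sketch(masks: Dict[str, Mask], slicer: List[slice]) -> Iterator[Tuple[str, Mask]]:
    """Hypothetical generator: apply one slice per dimension to every 2D mask."""
    if len(slicer) != 2:
        raise ValueError("this sketch handles 2D masks only")
    row_slice, col_slice = slicer
    for key, mask in masks.items():
        # Slice rows first, then columns within each remaining row.
        yield key, [row[col_slice] for row in mask[row_slice]]


masks = {
    "tissue": [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]],
    "tumor": [[0, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
}
sliced = dict(slice_masks_sketch(masks, [slice(0, 2), slice(1, 3)]))
print(sliced["tissue"])  # [[1, 0], [1, 0]]
print(sliced["tumor"])   # [[1, 0], [1, 1]]
```

With NumPy arrays the same selection is written mask[tuple(slicer)], which is why the slicer needs one slice object per array dimension.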

update_mask(key, mask)

Update a mask.

Parameters:
  • key (str) – key indicating mask to be updated

  • mask (np.ndarray) – mask