Core API
SlideData
The central class in PathML for representing a whole-slide image.
- class pathml.core.SlideData(filepath, name=None, masks=None, tiles=None, labels=None, backend=None, slide_type=None, stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None, counts=None, dtype=None)
Main class representing a slide and its annotations.
- Parameters
filepath (str) – Path to file on disk.
name (str, optional) – Name of slide. If None and a filepath is provided, name defaults to filepath.
masks (pathml.core.Masks, optional) – object containing {key, mask} pairs
tiles (pathml.core.Tiles, optional) – object containing {coordinates, tile} pairs
labels (collections.OrderedDict, optional) – dictionary containing {key, label} pairs
backend (str, optional) – Backend to use for interfacing with the slide on disk. Must be one of {“OpenSlide”, “BioFormats”, “DICOM”, “h5path”} (case-insensitive). For supported image formats, OpenSlide performance can be significantly better than BioFormats, so consider specifying backend = "openslide" when possible. If None and a filepath is provided, tries to infer the correct backend from the file extension. Defaults to None.
slide_type (pathml.core.SlideType, optional) – Slide type specification. Must be a SlideType object. Alternatively, the slide type can be specified using the parameters stain, tma, rgb, volumetric, and time_series.
stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.
platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra). Defaults to None. Ignored if slide_type is specified.
tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to False. Ignored if slide_type is specified.
rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.
volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.
time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.
counts (anndata.AnnData) – object containing counts matrix associated with image quantification
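The case-insensitive backend check described for the backend parameter can be sketched as follows. This is a minimal illustration and not PathML's own implementation: the helper name normalize_backend is hypothetical, and inference from the file extension is left out because the extension-to-backend mapping is not listed here.

```python
from typing import Optional

# Backend names listed in the parameter description, lowercased.
VALID_BACKENDS = {"openslide", "bioformats", "dicom", "h5path"}


def normalize_backend(backend: Optional[str]) -> Optional[str]:
    """Hypothetical helper: validate a backend name case-insensitively."""
    if backend is None:
        # None means "infer from the file extension"; not sketched here.
        return None
    name = backend.lower()
    if name not in VALID_BACKENDS:
        raise ValueError(
            f"backend must be one of {sorted(VALID_BACKENDS)}, got {backend!r}"
        )
    return name


print(normalize_backend("OpenSlide"))  # openslide
```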
- property counts(self)
- extract_region(self, location, size, *args, **kwargs)
Extract a region of the image. This is a convenience method which passes arguments through to the extract_region() method of whichever backend is in use. Refer to the documentation for each backend.
- Parameters
location (Tuple[int, int]) – Location of top-left corner of tile (i, j)
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
*args – positional arguments passed through to the extract_region() method of the backend.
**kwargs – keyword arguments passed through to the extract_region() method of the backend.
- Returns
image at the specified region
- Return type
np.ndarray
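To make the location and size conventions concrete, the sketch below crops a region from a toy image stored as a nested list. It only illustrates the (i, j) top-left corner and the integer-or-tuple size rule; the helpers as_hw and crop are hypothetical and do not call any PathML backend.

```python
from typing import List, Tuple, Union

Size = Union[int, Tuple[int, int]]


def as_hw(size: Size) -> Tuple[int, int]:
    """Expand an integer size to a square (height, width) tuple."""
    if isinstance(size, int):
        return (size, size)
    return (size[0], size[1])


def crop(image: List[List[int]], location: Tuple[int, int], size: Size) -> List[List[int]]:
    """Return rows i..i+h-1 and columns j..j+w-1 of a 2-D nested list."""
    i, j = location
    h, w = as_hw(size)
    return [row[j:j + w] for row in image[i:i + h]]


# 4x5 toy "image" whose pixel value encodes its (i, j) position as 10*i + j.
image = [[10 * i + j for j in range(5)] for i in range(4)]
region = crop(image, location=(1, 2), size=2)
print(region)  # [[12, 13], [22, 23]]
```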
- generate_tiles(self, shape=3000, stride=None, pad=False, **kwargs)
Generator over Tile objects containing regions of the image. Calls the generate_tiles() method of the backend. Tries to add the corresponding slide-level masks to each tile, if possible. Adds slide-level labels to each tile, if possible.
- Parameters
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Defaults to 3000px.
stride (int) – Stride between tiles. If None, uses stride = shape for non-overlapping tiles. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other tiles. If False, incomplete edge tiles will be ignored. Defaults to False.
**kwargs – Other arguments passed through to the generate_tiles() method of the backend.
- Yields
pathml.core.tile.Tile – Extracted Tile object
- plot(self, ax=None)
View a thumbnail of the image, using matplotlib. Not supported by all backends.
- Parameters
ax – matplotlib axis object on which to plot the thumbnail. Optional.
- run(self, pipeline, distributed=True, client=None, tile_size=256, tile_stride=None, level=0, tile_pad=False, overwrite_existing_tiles=False, write_dir=None, **kwargs)
Run a preprocessing pipeline on SlideData. Tiles are generated by calling self.generate_tiles() and pipeline is applied to each tile.
- Parameters
pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.
distributed (bool) – Whether to distribute processing using client. Defaults to True.
client – dask.distributed client
tile_size (int, optional) – Size of each tile. Defaults to 256px.
tile_stride (int, optional) – Stride between tiles. If None, uses tile_stride = tile_size for non-overlapping tiles. Defaults to None.
level (int, optional) – Level to extract tiles from. Defaults to 0.
tile_pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded symmetrically and yielded with the other tiles. If False, incomplete edge tiles will be ignored. Defaults to False.
overwrite_existing_tiles (bool) – Whether to overwrite existing tiles. If False, running a pipeline will fail if tiles is not None. Defaults to False.
write_dir (str) – Path to directory to write the processed slide to. The processed SlideData object will be written to the directory immediately after the pipeline has completed running. The filepath will default to “<write_dir>/<slide.name>.h5path”. Defaults to None.
**kwargs – Other arguments passed through to the generate_tiles() method of the backend.
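A typical call to run() might look like the sketch below, which uses only parameters documented in this section. The wrapper name preprocess_he_slide and its arguments are hypothetical, and the pipeline object is assumed to be a pathml.preprocessing.pipeline.Pipeline built by the caller. PathML is imported inside the function so that the sketch can be defined without PathML installed.

```python
def preprocess_he_slide(filepath, pipeline, out_dir=None):
    """Load an H&E slide, run `pipeline` on its tiles, and return the SlideData.

    `filepath` and `out_dir` are placeholders supplied by the caller.
    """
    # Lazy import: only needed when the function is actually called.
    from pathml.core import SlideData

    wsi = SlideData(filepath, backend="openslide", stain="HE", rgb=True)
    wsi.run(
        pipeline,
        distributed=False,  # run without distributing via a dask client
        tile_size=256,
        tile_stride=None,   # None -> non-overlapping tiles
        tile_pad=False,     # incomplete edge tiles are ignored
        write_dir=out_dir,  # if given, the processed slide is written here
    )
    return wsi
```

Calling preprocess_he_slide(path, pipeline, out_dir="processed") would, per the write_dir description, leave the processed slide in that directory once the pipeline has finished.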
- property shape(self)
Convenience method for getting the image shape. Calling wsi.shape is equivalent to calling wsi.slide.get_image_shape() with default arguments.
- Returns
Shape of image (H, W)
- Return type
Tuple[int, int]
- write(self, path)
Write contents to disk in h5path format.
- Parameters
path (Union[str, bytes, os.PathLike]) – path to file to be written
Convenience SlideData Classes
- class pathml.core.HESlide(*args, **kwargs)
Convenience class to load a SlideData object for H&E slides. Passes through all arguments to SlideData(), along with the slide_type = types.HE flag. Refer to SlideData for full documentation.
- class pathml.core.VectraSlide(*args, **kwargs)
Convenience class to load a SlideData object for Vectra (Polaris) slides. Passes through all arguments to SlideData(), along with the slide_type = types.Vectra flag and default backend = "bioformats". Refer to SlideData for full documentation.
- class pathml.core.MultiparametricSlide(*args, **kwargs)
Convenience class to load a SlideData object for multiparametric immunofluorescence slides. Passes through all arguments to SlideData(), along with the slide_type = types.IF flag and default backend = "bioformats". Refer to SlideData for full documentation.
- class pathml.core.CODEXSlide(*args, **kwargs)
Convenience class to load a SlideData object from Akoya Biosciences CODEX format. Passes through all arguments to SlideData(), along with the slide_type = types.CODEX flag and default backend = "bioformats". Refer to SlideData for full documentation.
TODO: hierarchical biaxial gating (flow-style analysis)
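The pass-through pattern shared by these convenience classes can be sketched with stand-in classes. Nothing below is PathML code: SlideSketch, HESketch and CODEXSketch are hypothetical, the strings "HE" and "CODEX" stand in for types.HE and types.CODEX, and treating the "default backend" as something the caller can override is an assumption.

```python
class SlideSketch:
    """Stand-in for SlideData that only records three of its arguments."""

    def __init__(self, filepath, slide_type=None, backend=None):
        self.filepath = filepath
        self.slide_type = slide_type
        self.backend = backend


class HESketch(SlideSketch):
    """Always sets the slide type; everything else passes through."""

    def __init__(self, *args, **kwargs):
        kwargs["slide_type"] = "HE"
        super().__init__(*args, **kwargs)


class CODEXSketch(SlideSketch):
    """Sets the slide type and supplies a backend only if none was given."""

    def __init__(self, *args, **kwargs):
        kwargs["slide_type"] = "CODEX"
        kwargs.setdefault("backend", "bioformats")
        super().__init__(*args, **kwargs)


he = HESketch("slide.svs", backend="openslide")
codex = CODEXSketch("slide.qptiff")
print(he.slide_type, he.backend)        # HE openslide
print(codex.slide_type, codex.backend)  # CODEX bioformats
```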
Slide Types
- class pathml.core.SlideType(stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None)
SlideType objects define types based on a set of image parameters.
- Parameters
stain (str, optional) – One of [‘HE’, ‘IHC’, ‘Fluor’]. Flag indicating type of slide stain. Defaults to None.
platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra, etc.).
tma (bool, optional) – Flag indicating whether the slide is a tissue microarray (TMA). Defaults to False.
rgb (bool, optional) – Flag indicating whether image is in RGB color. Defaults to False.
volumetric (bool, optional) – Flag indicating whether image is volumetric. Defaults to False.
time_series (bool, optional) – Flag indicating whether image is time-series. Defaults to False.
Examples
>>> from pathml import SlideType, types
>>> he_type = SlideType(stain = "HE", rgb = True)  # define slide type manually
>>> types.HE == he_type  # can also use pre-made types for convenience
True
- asdict(self)
Convert to a dictionary. None values are represented as zeros and empty strings for compatibility with h5py attributes.
If a is a SlideType object, then a == SlideType(**a.asdict()) will be True.
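The round-trip property of asdict() can be illustrated with a stand-in dataclass. This is a sketch under stated assumptions rather than PathML's implementation: SlideTypeSketch is hypothetical, and comparing objects through their encoded dictionaries is just one way to make None, the empty string and 0 interchangeable.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(eq=False)
class SlideTypeSketch:
    """Stand-in for SlideType with the same six parameters."""

    stain: Optional[str] = None
    platform: Optional[str] = None
    tma: Optional[bool] = None
    rgb: Optional[bool] = None
    volumetric: Optional[bool] = None
    time_series: Optional[bool] = None

    def asdict(self):
        """Replace None with "" for string fields and 0 for flag fields."""
        out = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None:
                value = "" if f.name in ("stain", "platform") else 0
            out[f.name] = value
        return out

    def __eq__(self, other):
        # Compare the encoded form so that None, "" and 0 are interchangeable.
        return isinstance(other, SlideTypeSketch) and self.asdict() == other.asdict()


a = SlideTypeSketch(stain="HE", rgb=True)
print(a.asdict())
print(a == SlideTypeSketch(**a.asdict()))  # True
```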
We also provide instantiations of common slide types for convenience:
Type                      stain    platform  rgb    tma    volumetric  time_series
pathml.core.types.HE      ‘HE’     None      True   False  False       False
pathml.core.types.IHC     ‘IHC’    None      True   False  False       False
pathml.core.types.IF      ‘Fluor’  None      False  False  False       False
pathml.core.types.CODEX   ‘Fluor’  ‘CODEX’   False  False  False       False
pathml.core.types.Vectra  ‘Fluor’  ‘Vectra’  False  False  False       False
Tile
- class pathml.core.Tile(image, coords, name=None, masks=None, labels=None, counts=None, slide_type=None, stain=None, tma=None, rgb=None, volumetric=None, time_series=None)
Object representing a tile extracted from an image. Holds the array for the tile, as well as the (i,j) coordinates of the top-left corner of the tile in the original image. The (i,j) coordinate system is based on labelling the top-leftmost pixel as (0, 0).
- Parameters
image (np.ndarray) – Image array of tile
coords (tuple) – Coordinates of tile relative to the whole-slide image. The (i,j) coordinate system is based on labelling the top-leftmost pixel of the WSI as (0, 0).
name (str, optional) – Name of tile
masks (dict) – masks belonging to tile. If masks are supplied, all masks must be the same shape as the tile.
labels – labels belonging to tile
counts (AnnData) – counts matrix for the tile.
slide_type (pathml.core.SlideType, optional) – Slide type specification. Must be a SlideType object. Alternatively, the slide type can be specified using the parameters stain, tma, rgb, volumetric, and time_series.
stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.
tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to False. Ignored if slide_type is specified.
rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.
volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.
time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.
- plot(self, ax=None)
View the tile image, using matplotlib. Currently only supports RGB images.
- Parameters
ax – matplotlib axis object on which to plot the thumbnail. Optional.
- property shape(self)
Convenience method. Calling tile.shape is equivalent to calling tile.image.shape.
SlideDataset
- class pathml.core.SlideDataset(slides)
Container for a dataset of WSIs
- Parameters
slides – list of SlideData objects
- run(self, pipeline, client=None, distributed=True, **kwargs)
Runs a preprocessing pipeline on all slides in the dataset
- Parameters
pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.
client – dask.distributed client
distributed (bool) – Whether to distribute model using client. Defaults to True.
kwargs (dict) – keyword arguments passed to run() for each slide
- write(self, dir, filenames=None)
Write all SlideData objects to the specified directory. Calls the .write() method for each slide in the dataset. Optionally pass a list of filenames to use; otherwise filenames will be created from the .name attribute of each slide.
- Parameters
dir (Union[str, bytes, os.PathLike]) – Path to directory where slides are to be saved
filenames (List[str], optional) – list of filenames to be used.
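The filename fallback can be sketched with pathlib. The helper resolve_output_paths is hypothetical, and appending ".h5path" to each slide name is an assumption borrowed from the default path described for SlideData.run(); the actual naming used by SlideDataset.write() may differ.

```python
from pathlib import Path
from typing import List, Optional


def resolve_output_paths(names: List[str], out_dir: str,
                         filenames: Optional[List[str]] = None) -> List[Path]:
    """Hypothetical helper: choose one output path per slide."""
    if filenames is None:
        # Assumption: derive "<name>.h5path" from each slide's .name attribute.
        filenames = [f"{name}.h5path" for name in names]
    if len(filenames) != len(names):
        raise ValueError("need exactly one filename per slide")
    return [Path(out_dir) / fn for fn in filenames]


paths = resolve_output_paths(["slide_a", "slide_b"], "processed")
print([p.as_posix() for p in paths])
# ['processed/slide_a.h5path', 'processed/slide_b.h5path']
```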
Tiles and Masks helper classes
- class pathml.core.Tiles(h5manager, tiles=None)
Object wrapping a dict of tiles.
- Parameters
tiles (Union[dict[tuple[int], pathml.core.tile.Tile], list[pathml.core.tile.Tile]]) – tile objects
- property keys(self)
- remove(self, key)
Remove tile from tiles.
- Parameters
key (str) – key (coords) indicating tile to be removed
- property tile_shape(self)
- update(self, tile)
Update a tile.
- Parameters
tile (pathml.core.tile.Tile) – tile to be updated
- class pathml.core.Masks(h5manager, masks=None)
Object wrapping a dict of masks.
- Parameters
h5manager (pathml.core.h5pathManager) –
masks (dict) – dictionary of np.ndarray objects representing, e.g., labels or segmentations.
- add(self, key, mask)
Add mask indexed by key to self.h5manager.
- Parameters
key (str) – key
mask (np.ndarray) – array of mask. Must contain elements of type int8
- property keys(self)
- remove(self, key)
Remove mask.
- Parameters
key (str) – key indicating mask to be removed
- slice(self, slicer)
Slice all masks in self.h5manager, extending numpy array slicing.
- Parameters
slicer – list where each element is an object of type slice indicating how the corresponding dimension should be sliced
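Applying one list of slice objects to every mask can be sketched in plain Python, with nested lists standing in for np.ndarray masks. The function below is a hypothetical illustration for 2-D masks only; with NumPy arrays the equivalent indexing is mask[tuple(slicer)].

```python
from typing import Dict, List

Mask = List[List[int]]


def slice_masks(masks: Dict[str, Mask], slicer: List[slice]) -> Dict[str, Mask]:
    """Apply the same (row, column) slices to every 2-D mask."""
    rows, cols = slicer
    return {key: [row[cols] for row in mask[rows]] for key, mask in masks.items()}


masks = {
    "tissue": [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]],
    "nuclei": [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]],
}
sliced = slice_masks(masks, [slice(0, 2), slice(1, 3)])
print(sliced)  # {'tissue': [[1, 0], [1, 0]], 'nuclei': [[1, 0], [0, 1]]}
```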
Slide Backends
OpenSlideBackend
- class pathml.core.OpenSlideBackend(filename)
Use OpenSlide to interface with image files.
Depends on openslide-python which wraps the openslide C library.
- Parameters
filename (str) – path to image file on disk
- extract_region(self, location, size, level=None)
Extract a region of the image
- Parameters
location (Tuple[int, int]) – Location of top-left corner of tile (i, j)
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
level (int) – level from which to extract chunks. Level 0 is highest resolution.
- Returns
image at the specified region
- Return type
np.ndarray
- generate_tiles(self, shape=3000, stride=None, pad=False, level=0)
Generator over tiles.
Padding works as follows: If pad is False, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile that is fully contained in the image. If pad is True, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile which starts in the image. Regions outside the image will be padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tile generation with pad=False will create 4 tiles total (start positions 0 and 2 along each axis), compared to 9 tiles if pad=True (start positions 0, 2, and 4 along each axis).
- Parameters
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
stride (int) – Stride between tiles. If None, uses stride = shape for non-overlapping tiles. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other tiles. If False, incomplete edge tiles will be ignored. Defaults to False.
level (int, optional) – For slides with multiple levels, which level to extract tiles from. Defaults to 0 (highest resolution).
- Yields
pathml.core.tile.Tile – Extracted Tile object
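The padding rule can be checked by enumerating tile start positions along one axis. This is a sketch of the rule as worded in the description, not the backend's actual code, and tile_starts is a hypothetical helper. For the 5x5 example with tile size 3 and stride 2 it gives 2 x 2 = 4 tiles without padding and 3 x 3 = 9 tiles with padding.

```python
from typing import List


def tile_starts(length: int, tile: int, stride: int, pad: bool) -> List[int]:
    """Start offsets along one axis under the padding rule described above."""
    if pad:
        # Keep every tile that starts inside the image.
        return list(range(0, length, stride))
    # Keep only tiles that fit entirely inside the image.
    return list(range(0, length - tile + 1, stride))


for pad in (False, True):
    starts = tile_starts(length=5, tile=3, stride=2, pad=pad)
    print(pad, starts, len(starts) ** 2)
# False [0, 2] 4
# True [0, 2, 4] 9
```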
- get_image_shape(self, level=0)
Get the shape of the image at specified level.
- Parameters
level (int) – Which level to get shape from. Level 0 is highest resolution. Defaults to 0.
- Returns
Shape of image at target level, in (i, j) coordinates.
- Return type
Tuple[int, int]
- get_thumbnail(self, size)
Get a thumbnail of the slide.
- Parameters
size (Tuple[int, int]) – the maximum size of the thumbnail
- Returns
RGB thumbnail image
- Return type
np.ndarray
BioFormatsBackend
- class pathml.core.BioFormatsBackend(filename, dtype=None)
Use BioFormats to interface with image files.
Supports multi-level images. Depends on python-bioformats, which wraps the OME Bio-Formats Java library, parses pixel data and metadata of proprietary formats, and converts all formats to OME-TIFF. Please cite: https://pubmed.ncbi.nlm.nih.gov/20513764/
- Parameters
filename (str) – path to image file on disk
dtype (numpy.dtype) – data type of image. If
None
, will use BioFormats to infer the data type from the image’s OME metadata. Defaults toNone
.
Note
While the Bio-Formats convention uses XYZCT channel order, we use YXZCT for compatibility with the rest of PathML which is based on (i, j) coordinate system.
- extract_region(self, location, size, level=0, series_as_channels=False, normalize=True)
Extract a region of the image. All bioformats images have 5 dimensions representing (i, j, z, channel, time). Even if an image does not have multiple z-series or time-series, those dimensions will still be kept. For example, a standard RGB image will be of shape (i, j, 1, 3, 1). If a tuple with len < 5 is passed, missing dimensions will be retrieved in full.
- Parameters
location (Tuple[int, int]) – (i, j) location of corner of extracted region closest to the origin.
size (Tuple[int, int, ...]) – (i, j) size of each region. If an integer is passed, it is converted to a tuple. If a tuple with fewer than 5 elements is passed, the missing dimensions will be retrieved in full.
level (int) – level from which to extract chunks. Level 0 is highest resolution. Defaults to 0.
series_as_channels (bool) – Whether to treat image series as channels. If True, multi-level images are not supported. Defaults to False.
normalize (bool, optional) – Whether to normalize the image to int8 before returning. Defaults to True. If False, image will be returned as-is immediately after reading, typically in float64.
- Returns
image at the specified region. 5-D array of (i, j, z, c, t)
- Return type
np.ndarray
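How a short size is filled out to the five (i, j, z, c, t) dimensions can be sketched as follows. The helper expand_size is hypothetical, and reading an integer size as a square (i, j) region is an assumption; the rule that missing dimensions are retrieved in full comes from the description above.

```python
from typing import Sequence, Tuple, Union


def expand_size(size: Union[int, Sequence[int]],
                image_shape: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical helper: fill a region size out to all image dimensions."""
    if isinstance(size, int):
        size = (size, size)  # assumption: an integer means a square (i, j) region
    size = tuple(size)
    if len(size) > len(image_shape):
        raise ValueError("size has more dimensions than the image")
    # Dimensions not given in `size` are taken in full from the image.
    return size + tuple(image_shape[len(size):])


shape = (2000, 3000, 1, 7, 1)  # (i, j, z, c, t) for a 7-channel image
print(expand_size(256, shape))         # (256, 256, 1, 7, 1)
print(expand_size((256, 512), shape))  # (256, 512, 1, 7, 1)
```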
- generate_tiles(self, shape=3000, stride=None, pad=False, level=0, **kwargs)
Generator over tiles.
Padding works as follows: If pad is False, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile that is fully contained in the image. If pad is True, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile which starts in the image. Regions outside the image will be padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tile generation with pad=False will create 4 tiles total (start positions 0 and 2 along each axis), compared to 9 tiles if pad=True (start positions 0, 2, and 4 along each axis).
- Parameters
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
stride (int) – Stride between tiles. If None, uses stride = shape for non-overlapping tiles. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other tiles. If False, incomplete edge tiles will be ignored. Defaults to False.
**kwargs – Other arguments passed through to the extract_region() method.
- Yields
pathml.core.tile.Tile – Extracted Tile object
- get_image_shape(self, level=None)
Get the shape of the image at a specific level.
- Parameters
level (int) – Which level to get shape from. If level is None, returns the shape of the biggest level. Defaults to None.
- Returns
Shape of image (i, j) at target level
- Return type
Tuple[int, int]
- get_thumbnail(self, size=None)
Get a thumbnail of the image. Since there is no default thumbnail for multiparametric, volumetric images, this function supports downsampling of all image dimensions.
- Parameters
size (Tuple[int, int]) – thumbnail size
- Returns
RGB thumbnail image
- Return type
np.ndarray
Example
Get a 1000x1000 thumbnail of a 7-channel fluorescent image:
shape = data.slide.get_image_shape()
thumb = data.slide.get_thumbnail(size=(1000, 1000, shape[2], shape[3], shape[4]))
DICOMBackend
- class pathml.core.DICOMBackend(filename)
Interface with DICOM files on disk. Provides efficient access to individual Frame items contained in the Pixel Data element without loading the entire element into memory. Assumes that frames are non-overlapping. DICOM does not support multi-level images.
- Parameters
filename (str) – Path to the DICOM Part10 file on disk
- extract_region(self, location, size=None, level=None)
Extract a single frame from the DICOM image.
- Parameters
location (Union[int, Tuple[int, int]]) – coordinate location of top-left corner of frame, or integer index of frame.
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must be the same as the frame size.
- Returns
image at the specified region
- Return type
np.ndarray
- generate_tiles(self, shape, stride, pad, level=0, **kwargs)
Generator over tiles. For DICOMBackend, each tile corresponds to a frame.
- Parameters
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must match frame size.
stride (int) – Ignored for DICOMBackend. Frames are yielded individually.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other tiles. If False, incomplete edge tiles will be ignored. Defaults to False.
- Yields
pathml.core.tile.Tile – Extracted Tile object
- static get_bot(fp)
Reads the value of the Basic Offset Table. This table is used to access individual frames without loading the entire file into memory
- Parameters
fp (pydicom.filebase.DicomFile) – pydicom DicomFile object
- Returns
Offset of each Frame of the Pixel Data element following the Basic Offset Table
- Return type
list
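Reading a Basic Offset Table can be sketched with the standard struct module. This assumes the table is stored as an Item with tag (FFFE,E000), a 32-bit length, and 32-bit little-endian offsets, as in the DICOM encapsulated pixel data encoding. The function read_basic_offset_table is hypothetical, works on any binary file-like object positioned at the start of the item, and is not the implementation of get_bot().

```python
import io
import struct
from typing import BinaryIO, List


def read_basic_offset_table(fp: BinaryIO) -> List[int]:
    """Read the Basic Offset Table item at the current position of `fp`."""
    group, element, length = struct.unpack("<HHL", fp.read(8))
    if (group, element) != (0xFFFE, 0xE000):
        raise ValueError("expected an Item tag (FFFE,E000)")
    if length % 4:
        raise ValueError("Basic Offset Table length must be a multiple of 4")
    # Each entry is a 32-bit unsigned little-endian byte offset to one frame.
    return list(struct.unpack(f"<{length // 4}L", fp.read(length)))


# Build a toy table with three frame offsets and parse it back.
offsets = [0, 4096, 8192]
raw = struct.pack("<HHL", 0xFFFE, 0xE000, 4 * len(offsets)) + struct.pack("<3L", *offsets)
print(read_basic_offset_table(io.BytesIO(raw)))  # [0, 4096, 8192]
```

With the offsets in hand, a reader can seek directly to one frame instead of loading the whole Pixel Data element.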
- get_image_shape(self)
Get the shape of the image.
- Returns
Shape of image (H, W)
- Return type
Tuple[int, int]
- abstract get_thumbnail(self, size, **kwargs)
h5pathManager
- class pathml.core.h5managers.h5pathManager(h5path=None, slidedata=None)
Interface between slidedata object and data management on disk by h5py.
- add_mask(self, key, mask)
Add mask to h5. This manages slide-level masks.
- Parameters
key (str) – mask key
mask (np.ndarray) – mask array
- add_tile(self, tile)
Add a tile to h5path.
- Parameters
tile (pathml.core.tile.Tile) – Tile object
- get_mask(self, item, slicer=None)
- get_slidetype(self)
- get_tile(self, item)
Retrieve tile from h5manager by key or index.
- Parameters
item (int, str, tuple) – key or index of tile to be retrieved
- Returns
Tile(pathml.core.tile.Tile)
- remove_mask(self, key)
Remove mask by key.
- Parameters
key (str) – key indicating mask to be removed
- remove_tile(self, key)
Remove tile from self.h5 by key.
- slice_masks(self, slicer)
Generator slicing all masks, extending numpy array slicing.
- Parameters
slicer – List where each element is an object of type slice (https://docs.python.org/3/c-api/slice.html) indicating how the corresponding dimension should be sliced. The list length should correspond to the number of dimensions of the tile. For 2D H&E images, pass a length 2 list of slice objects.
- Yields
key (str) – mask key
val (np.ndarray) – mask
- update_mask(self, key, mask)
Update a mask.
- Parameters
key (str) – key indicating mask to be updated
mask (np.ndarray) – mask