Core API
SlideData
The central class in PathML for representing a whole-slide image.
- class pathml.core.SlideData(filepath, name=None, masks=None, tiles=None, labels=None, backend=None, slide_type=None, stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None, counts=None, dtype=None)
Main class representing a slide and its annotations.
- Parameters:
filepath (str) – Path to file on disk.
name (str, optional) – name of slide. If None, and a filepath is provided, name defaults to filepath.
masks (pathml.core.Masks, optional) – object containing {key, mask} pairs
tiles (pathml.core.Tiles, optional) – object containing {coordinates, tile} pairs
labels (collections.OrderedDict, optional) – dictionary containing {key, label} pairs
backend (str, optional) – backend to use for interfacing with slide on disk. Must be one of {“OpenSlide”, “BioFormats”, “DICOM”, “h5path”} (case-insensitive). Note that for supported image formats, OpenSlide performance can be significantly better than BioFormats. Consider specifying backend = "openslide" when possible. If None, and a filepath is provided, tries to infer the correct backend from the file extension. Defaults to None.
slide_type (pathml.core.SlideType, optional) – slide type specification. Must be a SlideType object. Alternatively, slide type can be specified by using the parameters stain, tma, rgb, volumetric, and time_series.
stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.
platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra, etc.). Defaults to None. Ignored if slide_type is specified.
tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to False. Ignored if slide_type is specified.
rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.
volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.
time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.
counts (anndata.AnnData) – object containing counts matrix associated with image quantification
- property counts
- extract_region(location, size, *args, **kwargs)
Extract a region of the image. This is a convenience method which passes arguments through to the extract_region() method of whichever backend is in use. Refer to documentation for each backend.
- Parameters:
location (Tuple[int, int]) – Location of top-left corner of tile (i, j)
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
*args – positional arguments passed through to the extract_region() method of the backend.
**kwargs – keyword arguments passed through to the extract_region() method of the backend.
- Returns:
image at the specified region
- Return type:
np.ndarray
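The size argument accepts either an integer or a (height, width) tuple. A minimal sketch of that normalization follows; the helper name normalize_size is hypothetical and is not part of PathML.

```python
from typing import Tuple, Union


def normalize_size(size: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (height, width); a single integer means a square region.

    Hypothetical helper for illustration only, not a PathML function.
    """
    if isinstance(size, int):
        return (size, size)
    height, width = size
    return (int(height), int(width))


print(normalize_size(256))         # (256, 256)
print(normalize_size((512, 256)))  # (512, 256)
```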
- generate_tiles(shape=3000, stride=None, pad=False, **kwargs)
Generator over Tile objects containing regions of the image. Calls the generate_tiles() method of the backend. Tries to add the corresponding slide-level masks to each tile, if possible. Adds slide-level labels to each tile, if possible.
- Parameters:
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Defaults to 3000.
stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.
**kwargs – Other arguments passed through to the generate_tiles() method of the backend.
- Yields:
pathml.core.tile.Tile – Extracted Tile object
- plot(ax=None)
View a thumbnail of the image, using matplotlib. Not supported by all backends.
- Parameters:
ax – matplotlib axis object on which to plot the thumbnail. Optional.
- run(pipeline, distributed=True, client=None, tile_size=256, tile_stride=None, level=0, tile_pad=False, overwrite_existing_tiles=False, write_dir=None, **kwargs)
Run a preprocessing pipeline on SlideData. Tiles are generated by calling self.generate_tiles() and pipeline is applied to each tile.
- Parameters:
pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.
distributed (bool) – Whether to distribute model using client. Defaults to True.
client – dask.distributed client
tile_size (int, optional) – Size of each tile. Defaults to 256px
tile_stride (int, optional) – Stride between tiles. If None, uses tile_stride = tile_size for non-overlapping tiles. Defaults to None.
level (int, optional) – Level to extract tiles from. Defaults to 0.
tile_pad (bool) – How to handle chunks on the edges. If True, these edge chunks will be zero-padded symmetrically and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.
overwrite_existing_tiles (bool) – Whether to overwrite existing tiles. If False, running a pipeline will fail if tiles is not None. Defaults to False.
write_dir (str) – Path to directory to write the processed slide to. The processed SlideData object will be written to the directory immediately after the pipeline has completed running. The filepath will default to “<write_dir>/<slide.name>.h5path”. Defaults to None.
**kwargs – Other arguments passed through to the generate_tiles() method of the backend.
- property shape
Convenience method for getting the image shape. Calling wsi.shape is equivalent to calling wsi.slide.get_image_shape() with default arguments.
- Returns:
Shape of image (H, W)
- Return type:
Tuple[int, int]
- write(path)
Write contents to disk in h5path format.
- Parameters:
path (Union[str, bytes, os.PathLike]) – path to file to be written
Convenience SlideData Classes
- class pathml.core.HESlide(*args, **kwargs)
Convenience class to load a SlideData object for H&E slides. Passes through all arguments to SlideData(), along with the slide_type = types.HE flag. Refer to SlideData for full documentation.
- class pathml.core.VectraSlide(*args, **kwargs)
Convenience class to load a SlideData object for Vectra (Polaris) slides. Passes through all arguments to SlideData(), along with the slide_type = types.Vectra flag and default backend = "bioformats". Refer to SlideData for full documentation.
- class pathml.core.MultiparametricSlide(*args, **kwargs)
Convenience class to load a SlideData object for multiparametric immunofluorescence slides. Passes through all arguments to SlideData(), along with the slide_type = types.IF flag and default backend = "bioformats". Refer to SlideData for full documentation.
- class pathml.core.CODEXSlide(*args, **kwargs)
Convenience class to load a SlideData object from Akoya Biosciences CODEX format. Passes through all arguments to SlideData(), along with the slide_type = types.CODEX flag and default backend = "bioformats". Refer to SlideData for full documentation.
TODO: hierarchical biaxial gating (flow-style analysis)
Slide Types
- class pathml.core.SlideType(stain=None, platform=None, tma=None, rgb=None, volumetric=None, time_series=None)
SlideType objects define types based on a set of image parameters.
- Parameters:
stain (str, optional) – One of [‘HE’, ‘IHC’, ‘Fluor’]. Flag indicating type of slide stain. Defaults to None.
platform (str, optional) – Flag indicating the imaging platform (e.g. CODEX, Vectra, etc.).
tma (bool, optional) – Flag indicating whether the slide is a tissue microarray (TMA). Defaults to False.
rgb (bool, optional) – Flag indicating whether image is in RGB color. Defaults to False.
volumetric (bool, optional) – Flag indicating whether image is volumetric. Defaults to False.
time_series (bool, optional) – Flag indicating whether image is time-series. Defaults to False.
Examples
>>> from pathml import SlideType, types
>>> he_type = SlideType(stain = "HE", rgb = True)  # define slide type manually
>>> types.HE == he_type  # can also use pre-made types for convenience
True
- asdict()
Convert to a dictionary. None values are represented as zeros and empty strings for compatibility with h5py attributes.
If a is a SlideType object, then a == SlideType(**a.asdict()) will be True.
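The round-trip property of asdict() can be illustrated with a small stand-in class. This is a sketch only: SlideTypeSketch is hypothetical and is not the PathML implementation. It assumes that string fields map None to an empty string, that boolean fields map None to 0, and that equality is defined on the dictionary form.

```python
class SlideTypeSketch:
    """Hypothetical stand-in for SlideType, used only to show the round trip."""

    STRING_FIELDS = ("stain", "platform")
    FLAG_FIELDS = ("tma", "rgb", "volumetric", "time_series")

    def __init__(self, stain=None, platform=None, tma=None, rgb=None,
                 volumetric=None, time_series=None):
        self.stain = stain
        self.platform = platform
        self.tma = tma
        self.rgb = rgb
        self.volumetric = volumetric
        self.time_series = time_series

    def asdict(self):
        # Assumption: None becomes "" for strings and 0 for flags,
        # because attribute stores such as h5py cannot hold None.
        out = {}
        for name in self.STRING_FIELDS:
            value = getattr(self, name)
            out[name] = "" if value is None else value
        for name in self.FLAG_FIELDS:
            value = getattr(self, name)
            out[name] = 0 if value is None else value
        return out

    def __eq__(self, other):
        # Assumption: two objects are equal when their dictionary forms match.
        return isinstance(other, SlideTypeSketch) and self.asdict() == other.asdict()


a = SlideTypeSketch(stain="HE", rgb=True)
print(a.asdict()["platform"] == "")            # True
print(a == SlideTypeSketch(**a.asdict()))      # True
```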
We also provide instantiations of common slide types for convenience:
Type                      stain    platform  rgb    tma    volumetric  time_series
pathml.core.types.HE      ‘HE’     None      True   False  False       False
pathml.core.types.IHC     ‘IHC’    None      True   False  False       False
pathml.core.types.IF      ‘Fluor’  None      False  False  False       False
pathml.core.types.CODEX   ‘Fluor’  ‘CODEX’   False  False  False       False
pathml.core.types.Vectra  ‘Fluor’  ‘Vectra’  False  False  False       False
Tile
- class pathml.core.Tile(image, coords, name=None, masks=None, labels=None, counts=None, slide_type=None, stain=None, tma=None, rgb=None, volumetric=None, time_series=None)
Object representing a tile extracted from an image. Holds the array for the tile, as well as the (i,j) coordinates of the top-left corner of the tile in the original image. The (i,j) coordinate system is based on labelling the top-leftmost pixel as (0, 0).
- Parameters:
image (np.ndarray) – Image array of tile
coords (tuple) – Coordinates of tile relative to the whole-slide image. The (i,j) coordinate system is based on labelling the top-leftmost pixel of the WSI as (0, 0).
name (str, optional) – Name of tile
masks (dict) – masks belonging to tile. If masks are supplied, all masks must be the same shape as the tile.
labels – labels belonging to tile
counts (AnnData) – counts matrix for the tile.
slide_type (pathml.core.SlideType, optional) – slide type specification. Must be a SlideType object. Alternatively, slide type can be specified by using the parameters stain, tma, rgb, volumetric, and time_series.
stain (str, optional) – Flag indicating type of slide stain. Must be one of [‘HE’, ‘IHC’, ‘Fluor’]. Defaults to None. Ignored if slide_type is specified.
tma (bool, optional) – Flag indicating whether the image is a tissue microarray (TMA). Defaults to False. Ignored if slide_type is specified.
rgb (bool, optional) – Flag indicating whether the image is in RGB color. Defaults to None. Ignored if slide_type is specified.
volumetric (bool, optional) – Flag indicating whether the image is volumetric. Defaults to None. Ignored if slide_type is specified.
time_series (bool, optional) – Flag indicating whether the image is a time series. Defaults to None. Ignored if slide_type is specified.
- plot(ax=None)
View the tile image, using matplotlib. Only RGB images are currently supported.
- Parameters:
ax – matplotlib axis object on which to plot the thumbnail. Optional.
- property shape
Convenience method. Calling tile.shape is equivalent to calling tile.image.shape.
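Because a tile stores the (i, j) coordinates of its top-left corner, a pixel position inside the tile can be mapped back to the whole-slide image by adding the two. The sketch below assumes i indexes rows and j indexes columns, consistent with the (height, width) ordering used for sizes; the helper tile_to_slide is hypothetical and not part of PathML.

```python
def tile_to_slide(tile_coords, offset):
    """Map an (i, j) offset inside a tile to (i, j) in the whole-slide image.

    Hypothetical helper for illustration only.
    """
    return (tile_coords[0] + offset[0], tile_coords[1] + offset[1])


# A tile whose top-left corner sits at row 512, column 1024 of the slide
print(tile_to_slide((512, 1024), (0, 0)))    # (512, 1024)
print(tile_to_slide((512, 1024), (10, 20)))  # (522, 1044)
```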
SlideDataset
- class pathml.core.SlideDataset(slides)
Container for a dataset of WSIs
- Parameters:
slides – list of SlideData objects
- run(pipeline, client=None, distributed=True, **kwargs)
Runs a preprocessing pipeline on all slides in the dataset
- Parameters:
pipeline (pathml.preprocessing.pipeline.Pipeline) – Preprocessing pipeline.
client – dask.distributed client
distributed (bool) – Whether to distribute model using client. Defaults to True.
kwargs (dict) – keyword arguments passed to run() for each slide
- write(dir, filenames=None)
Write all SlideData objects to the specified directory. Calls the .write() method for each slide in the dataset. Optionally pass a list of filenames to use; otherwise, filenames will be created from the .name attributes of each slide.
- Parameters:
dir (Union[str, bytes, os.PathLike]) – Path to directory where slides are to be saved
filenames (List[str], optional) – list of filenames to be used.
Tiles and Masks helper classes
- class pathml.core.Tiles(h5manager, tiles=None)
Object wrapping a dict of tiles.
- Parameters:
tiles (Union[dict[tuple[int], pathml.core.tiles.Tile], list[pathml.core.tiles.Tile]]) – tile objects
- property keys
- remove(key)
Remove tile from tiles.
- Parameters:
key (str) – key (coords) indicating tile to be removed
- property tile_shape
- update(tile)
Update a tile.
- Parameters:
tile (pathml.core.tile.Tile) – tile to be updated
- class pathml.core.Masks(h5manager, masks=None)
Object wrapping a dict of masks.
- Parameters:
h5manager (pathml.core.h5pathManager) –
masks (dict) – dictionary of np.ndarray objects representing, e.g., labels or segmentations.
- add(key, mask)
Add mask indexed by key to self.h5manager.
- Parameters:
key (str) – key
mask (np.ndarray) – array of mask. Must contain elements of type int8
- property keys
- remove(key)
Remove mask.
- Parameters:
key (str) – key indicating mask to be removed
- slice(slicer)
Slice all masks in self.h5manager, extending numpy array slicing.
- Parameters:
slicer – list where each element is an object of type slice indicating how the corresponding dimension should be sliced
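To illustrate applying one list of slice objects to every mask, here is a minimal sketch that uses nested Python lists in place of arrays so that it runs without dependencies. The function slice_all_masks is hypothetical; it is not the PathML implementation.

```python
def slice_all_masks(masks, slicer):
    """Yield (key, sliced_mask) for each 2-D mask stored as a list of rows.

    Hypothetical helper. With NumPy arrays the equivalent indexing
    would be mask[tuple(slicer)].
    """
    row_slice, col_slice = slicer
    for key, mask in masks.items():
        yield key, [row[col_slice] for row in mask[row_slice]]


masks = {
    "tissue": [[0, 1, 1, 0],
               [1, 1, 1, 1],
               [0, 1, 1, 0]],
}
# Keep rows 0-1 and columns 1-2 of every mask
slicer = [slice(0, 2), slice(1, 3)]
print(dict(slice_all_masks(masks, slicer)))  # {'tissue': [[1, 1], [1, 1]]}
```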
Slide Backends
OpenSlideBackend
- class pathml.core.OpenSlideBackend(filename)
Use OpenSlide to interface with image files.
Depends on openslide-python which wraps the openslide C library.
- Parameters:
filename (str) – path to image file on disk
- extract_region(location, size, level=None)
Extract a region of the image
- Parameters:
location (Tuple[int, int]) – Location of top-left corner of tile (i, j)
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
level (int) – level from which to extract chunks. Level 0 is highest resolution.
- Returns:
image at the specified region
- Return type:
np.ndarray
- generate_tiles(shape=3000, stride=None, pad=False, level=0)
Generator over tiles.
Padding works as follows: If pad is False, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile that is fully contained in the image. If pad is True, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile which starts in the image. Regions outside the image will be padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tile generation with pad=False will create 4 tiles total (start positions 0 and 2 along each axis), compared to 9 tiles if pad=True (start positions 0, 2, and 4 along each axis).
- Parameters:
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.
level (int, optional) – For slides with multiple levels, which level to extract tiles from. Defaults to 0 (highest resolution).
- Yields:
pathml.core.tile.Tile – Extracted Tile object
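The padding rule can be made concrete by enumerating tile start positions. The following is a sketch of the rule as stated in the text, counting start positions along each axis independently; tile_origins is a hypothetical helper and is not the backend's implementation.

```python
from itertools import product


def tile_origins(image_shape, tile_shape, stride, pad=False):
    """Yield (i, j) top-left corners of tiles under the stated padding rule.

    Hypothetical helper for illustration only.
    """
    counts = []
    for extent, tile, step in zip(image_shape, tile_shape, stride):
        if pad:
            # Every start position that lies inside the image
            n = (extent - 1) // step + 1
        else:
            # Only tiles that are fully contained in the image
            n = max(0, (extent - tile) // step + 1)
        counts.append(n)
    for a, b in product(range(counts[0]), range(counts[1])):
        yield (a * stride[0], b * stride[1])


no_pad = list(tile_origins((5, 5), (3, 3), (2, 2), pad=False))
padded = list(tile_origins((5, 5), (3, 3), (2, 2), pad=True))
print(len(no_pad), no_pad)  # 4 [(0, 0), (0, 2), (2, 0), (2, 2)]
print(len(padded))          # 9
```

For the 5x5 image with tile size 3 and stride 2, the start positions along each axis are 0 and 2 without padding, and 0, 2, and 4 with padding.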
- get_image_shape(level=0)
Get the shape of the image at specified level.
- Parameters:
level (int) – Which level to get shape from. Level 0 is highest resolution. Defaults to 0.
- Returns:
Shape of image at target level, in (i, j) coordinates.
- Return type:
Tuple[int, int]
- get_thumbnail(size)
Get a thumbnail of the slide.
- Parameters:
size (Tuple[int, int]) – the maximum size of the thumbnail
- Returns:
RGB thumbnail image
- Return type:
np.ndarray
BioFormatsBackend
- class pathml.core.BioFormatsBackend(filename, dtype=None)
Use BioFormats to interface with image files.
Now supports multi-level images. Depends on python-bioformats, which wraps the OME Bio-Formats Java library, parses pixel data and metadata of proprietary formats, and converts all formats to OME-TIFF. Please cite: https://pubmed.ncbi.nlm.nih.gov/20513764/
- Parameters:
filename (str) – path to image file on disk
dtype (numpy.dtype) – data type of image. If None, will use BioFormats to infer the data type from the image’s OME metadata. Defaults to None.
Note
While the Bio-Formats convention uses XYZCT dimension order, we use YXZCT for compatibility with the rest of PathML, which is based on the (i, j) coordinate system.
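As a small illustration of the two orderings, the sketch below converts a five-dimensional shape from XYZCT to YXZCT. The helper xyzct_to_yxzct is hypothetical and exists only for this example.

```python
def xyzct_to_yxzct(shape_xyzct):
    """Reorder a 5-D shape from (X, Y, Z, C, T) to (Y, X, Z, C, T).

    Hypothetical helper for illustration only.
    """
    x, y, z, c, t = shape_xyzct
    return (y, x, z, c, t)


# An image 3000 pixels wide and 2000 pixels tall, with 7 channels
print(xyzct_to_yxzct((3000, 2000, 1, 7, 1)))  # (2000, 3000, 1, 7, 1)
```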
- extract_region(location, size, level=0, series_as_channels=False, normalize=True)
Extract a region of the image. All bioformats images have 5 dimensions representing (i, j, z, channel, time). Even if an image does not have multiple z-series or time-series, those dimensions will still be kept. For example, a standard RGB image will be of shape (i, j, 1, 3, 1). If a tuple with len < 5 is passed, missing dimensions will be retrieved in full.
- Parameters:
location (Tuple[int, int]) – (i, j) location of corner of extracted region closest to the origin.
size (Tuple[int, int, ...]) – (i, j) size of each region. If an integer is passed, it will be converted to a tuple of (i, j) for a square region. If a tuple with len < 5 is passed, missing dimensions will be retrieved in full.
level (int) – level from which to extract chunks. Level 0 is highest resolution. Defaults to 0.
series_as_channels (bool) – Whether to treat image series as channels. If True, multi-level images are not supported. Defaults to False.
normalize (bool, optional) – Whether to normalize the image to int8 before returning. Defaults to True. If False, image will be returned as-is immediately after reading, typically in float64.
- Returns:
image at the specified region. 5-D array of (i, j, z, c, t)
- Return type:
np.ndarray
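The handling of a size with fewer than five dimensions can be sketched as follows. This assumes the missing dimensions are the trailing ones and are filled from the full image shape; complete_size is a hypothetical helper, not the backend's code.

```python
def complete_size(size, image_shape):
    """Extend size to the five dimensions (i, j, z, c, t).

    Hypothetical helper: an integer becomes a square (i, j) size, and any
    missing trailing dimensions are taken in full from image_shape.
    """
    if isinstance(size, int):
        size = (size, size)
    size = tuple(size)
    if len(size) > len(image_shape):
        raise ValueError("size has more dimensions than the image")
    return size + tuple(image_shape[len(size):])


image_shape = (2000, 3000, 1, 7, 1)
print(complete_size(256, image_shape))               # (256, 256, 1, 7, 1)
print(complete_size((100, 200, 1, 3), image_shape))  # (100, 200, 1, 3, 1)
```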
- generate_tiles(shape=3000, stride=None, pad=False, level=0, **kwargs)
Generator over tiles.
Padding works as follows: If pad is False, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile that is fully contained in the image. If pad is True, then the first tile will start flush with the edge of the image, and the tile locations will increment according to the specified stride, stopping with the last tile which starts in the image. Regions outside the image will be padded with 0. For example, for a 5x5 image with a tile size of 3 and a stride of 2, tile generation with pad=False will create 4 tiles total (start positions 0 and 2 along each axis), compared to 9 tiles if pad=True (start positions 0, 2, and 4 along each axis).
- Parameters:
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated.
stride (int) – stride between chunks. If None, uses stride = shape for non-overlapping chunks. Defaults to None.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.
**kwargs – Other arguments passed through to the extract_region() method.
- Yields:
pathml.core.tile.Tile – Extracted Tile object
- get_image_shape(level=None)
Get the shape of the image at a specific level.
- Parameters:
level (int) – Which level to get shape from. If level is None, returns the shape of the biggest level. Defaults to None.
- Returns:
Shape of image (i, j) at target level
- Return type:
Tuple[int, int]
- get_thumbnail(size=None)
Get a thumbnail of the image. Since there is no default thumbnail for multiparametric, volumetric images, this function supports downsampling of all image dimensions.
- Parameters:
size (Tuple[int, int]) – thumbnail size
- Returns:
RGB thumbnail image
- Return type:
np.ndarray
Example
Get a 1000x1000 thumbnail of a 7-channel fluorescent image:
shape = data.slide.get_image_shape()
thumb = data.slide.get_thumbnail(size=(1000, 1000, shape[2], shape[3], shape[4]))
DICOMBackend
- class pathml.core.DICOMBackend(filename)
Interface with DICOM files on disk. Provides efficient access to individual Frame items contained in the Pixel Data element without loading the entire element into memory. Assumes that frames are non-overlapping. DICOM does not support multi-level images.
- Parameters:
filename (str) – Path to the DICOM Part10 file on disk
- extract_region(location, size=None, level=None)
Extract a single frame from the DICOM image.
- Parameters:
location (Union[int, Tuple[int, int]]) – coordinate location of top-left corner of frame, or integer index of frame.
size (Union[int, Tuple[int, int]]) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must be the same as the frame size.
- Returns:
image at the specified region
- Return type:
np.ndarray
- generate_tiles(shape, stride, pad, level=0, **kwargs)
Generator over tiles. For DICOMBackend, each tile corresponds to a frame.
- Parameters:
shape (int or tuple(int)) – Size of each tile. May be a tuple of (height, width) or a single integer, in which case square tiles of that size are generated. Must match frame size.
stride (int) – Ignored for DICOMBackend. Frames are yielded individually.
pad (bool) – How to handle tiles on the edges. If True, these edge tiles will be zero-padded and yielded with the other chunks. If False, incomplete edge chunks will be ignored. Defaults to False.
- Yields:
pathml.core.tile.Tile – Extracted Tile object
- static get_bot(fp)
Reads the value of the Basic Offset Table. This table is used to access individual frames without loading the entire file into memory.
- Parameters:
fp (pydicom.filebase.DicomFile) – pydicom DicomFile object
- Returns:
Offset of each Frame of the Pixel Data element following the Basic Offset Table
- Return type:
list
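For readers unfamiliar with the Basic Offset Table, the sketch below decodes one from raw bytes using only the standard library. It assumes little-endian encapsulated pixel data, where the table is an Item (tag FFFE,E000) with a 4-byte length followed by 32-bit unsigned offsets. The function parse_basic_offset_table is hypothetical and is neither the PathML nor the pydicom implementation.

```python
import struct

ITEM_TAG = (0xFFFE, 0xE000)  # DICOM Item tag that introduces the table


def parse_basic_offset_table(buf):
    """Decode frame offsets from the bytes of a Basic Offset Table item.

    Hypothetical helper for illustration only.
    """
    group, element, length = struct.unpack_from("<HHL", buf, 0)
    if (group, element) != ITEM_TAG:
        raise ValueError("buffer does not start with an Item tag")
    if length % 4 != 0:
        raise ValueError("table length must be a multiple of 4 bytes")
    # Each entry is a 32-bit unsigned little-endian offset, measured from
    # the first byte of the first item that follows the table.
    return list(struct.unpack_from("<%dL" % (length // 4), buf, 8))


# Build a synthetic table holding three frame offsets
table = struct.pack("<HHL", 0xFFFE, 0xE000, 12) + struct.pack("<3L", 0, 1000, 2500)
print(parse_basic_offset_table(table))  # [0, 1000, 2500]
```

With the offsets in hand, a reader can seek directly to a single frame instead of loading the whole Pixel Data element.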
- get_image_shape()
Get the shape of the image.
- Returns:
Shape of image (H, W)
- Return type:
Tuple[int, int]
- abstract get_thumbnail(size, **kwargs)
h5pathManager
- class pathml.core.h5managers.h5pathManager(h5path=None, slidedata=None)
Interface between slidedata object and data management on disk by h5py.
- add_mask(key, mask)
Add mask to h5. This manages slide-level masks.
- Parameters:
key (str) – mask key
mask (np.ndarray) – mask array
- add_tile(tile)
Add a tile to h5path.
- Parameters:
tile (pathml.core.tile.Tile) – Tile object
- get_mask(item, slicer=None)
- get_slidetype()
- get_tile(item)
Retrieve tile from h5manager by key or index.
- Parameters:
item (int, str, tuple) – key or index of tile to be retrieved
- Returns:
Tile(pathml.core.tile.Tile)
- remove_mask(key)
Remove mask by key.
- Parameters:
key (str) – key indicating mask to be removed
- remove_tile(key)
Remove tile from self.h5 by key.
- slice_masks(slicer)
Generator slicing all masks, extending numpy array slicing.
- Parameters:
slicer – List where each element is an object of type slice (see https://docs.python.org/3/c-api/slice.html) indicating how the corresponding dimension should be sliced. The list length should correspond to the dimension of the tile. For 2D H&E images, pass a length 2 list of slice objects.
- Yields:
key (str) – mask key
val (np.ndarray) – mask
- update_mask(key, mask)
Update a mask.
- Parameters:
key (str) – key indicating mask to be updated
mask (np.ndarray) – mask