Preprocessing API

Pipeline

class pathml.preprocessing.Pipeline(transform_sequence=None)

Compose a sequence of Transforms

Parameters: transform_sequence (list) – Sequence of transforms to be applied consecutively. List of pathml.core.Transform objects.

apply(tile): modify Tile object in-place

save(filename)

save pipeline to disk

Parameters: filename (str) – save path on disk
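As a rough illustration of what composing transforms means, the sketch below chains two stand-in transforms and applies them in order. ToyTile, AddOffset, Scale, and ToyPipeline are hypothetical names that are not part of PathML; only the F(image) / apply(tile) convention is taken from this page.

```python
class ToyTile:
    """Hypothetical stand-in for a Tile: holds a flat list of pixel values."""

    def __init__(self, image):
        self.image = image


class AddOffset:
    """Hypothetical transform: add a constant to every pixel."""

    def __init__(self, offset):
        self.offset = offset

    def F(self, image):
        # functional implementation: returns a new image
        return [v + self.offset for v in image]

    def apply(self, tile):
        # modify the tile in-place
        tile.image = self.F(tile.image)


class Scale:
    """Hypothetical transform: multiply every pixel by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def F(self, image):
        return [v * self.factor for v in image]

    def apply(self, tile):
        tile.image = self.F(tile.image)


class ToyPipeline:
    """Hypothetical pipeline: applies each transform in sequence order."""

    def __init__(self, transform_sequence=None):
        self.transforms = list(transform_sequence or [])

    def apply(self, tile):
        for transform in self.transforms:
            transform.apply(tile)


tile = ToyTile([1, 2, 3])
ToyPipeline([AddOffset(1), Scale(10)]).apply(tile)
print(tile.image)  # [20, 30, 40]
```

Order matters: swapping the two transforms would give [11, 21, 31] instead.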

Transforms

class pathml.preprocessing.MedianBlur(kernel_size=5)

Median blur kernel.

Parameters: kernel_size (int) – Width of kernel. Must be an odd number. Defaults to 5.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.GaussianBlur(kernel_size=5, sigma=5)

Gaussian blur kernel.

Parameters:

kernel_size (int) – Width of kernel. Must be an odd number. Defaults to 5.
sigma (float) – Standard deviation of Gaussian kernel, assumed to be equal in X and Y axes. Defaults to 5.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.BoxBlur(kernel_size=5)

Box (average) blur kernel.

Parameters: kernel_size (int) – Width of kernel. Defaults to 5.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.BinaryThreshold(mask_name=None, use_otsu=True, threshold=0, inverse=False)

Binary thresholding transform to create a binary mask. If input image is RGB it is first converted to greyscale, otherwise the input must have 1 channel.

Parameters:

mask_name (str) – Name of mask that is created.
use_otsu (bool) – Whether to use Otsu’s method to automatically determine optimal threshold. Defaults to True.
threshold (int) – Specified threshold. Ignored if use_otsu is True. Defaults to 0.
inverse (bool) – Whether to use inverse threshold. If using inverse threshold, pixels below the threshold will be returned as 1. Otherwise pixels below the threshold will be returned as 0. Defaults to False.

References

Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1), pp.62-66.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.MorphOpen(mask_name=None, kernel_size=5, n_iterations=1)

Morphological opening. First applies erosion operation, then dilation. Reduces noise by removing small objects from the background. Operates on a binary mask.

Parameters:

mask_name (str) – Name of mask on which to apply transform
kernel_size (int) – Size of kernel for default square kernel. Ignored if a custom kernel is specified. Defaults to 5.
n_iterations (int) – Number of opening operations to perform. Defaults to 1.

F(mask): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.MorphClose(mask_name=None, kernel_size=5, n_iterations=1)

Morphological closing. First applies dilation operation, then erosion. Reduces noise by closing small holes in the foreground. Operates on a binary mask.

Parameters:

mask_name (str) – Name of mask on which to apply transform
kernel_size (int) – Size of kernel for default square kernel. Ignored if a custom kernel is specified. Defaults to 5.
n_iterations (int) – Number of closing operations to perform. Defaults to 1.

F(mask): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.ForegroundDetection(mask_name=None, min_region_size=5000, max_hole_size=1500, outer_contours_only=False)

Foreground detection for binary masks. Identifies regions that have a total area greater than specified threshold. Supports including holes within foreground regions, or excluding holes above a specified area threshold.

Parameters:

min_region_size (int) – Minimum area of detected foreground regions, in pixels. Defaults to 5000.
max_hole_size (int) – Maximum size of allowed holes in foreground regions, in pixels. Ignored if outer_contours_only is True. Defaults to 1500.
outer_contours_only (bool) – If true, ignore holes in detected foreground regions. Defaults to False.
mask_name (str) – Name of mask on which to apply transform

References

Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M. and Mahmood, F., 2020. Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images. arXiv preprint arXiv:2004.09666.

F(mask): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.SuperpixelInterpolation(region_size=10, n_iter=30)

Divide input image into superpixels using SLIC algorithm, then interpolate each superpixel with average color. SLIC superpixel algorithm described in Achanta et al. 2012.

Parameters:

region_size (int) – region_size parameter used for superpixel creation. Defaults to 10.
n_iter (int) – Number of iterations to run SLIC algorithm. Defaults to 30.

References

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P. and Süsstrunk, S., 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11), pp.2274-2282.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.StainNormalizationHE(target='normalize', stain_estimation_method='macenko', optical_density_threshold=0.15, regularizer=0.1, angular_percentile=0.01, background_intensity=245, stain_matrix_target_od=np.array([[0.5626, 0.2159], [0.7201, 0.8012], [0.4062, 0.5581]]).T, max_c_target=np.array([[1.9705, 1.0308]]))

Normalize H&E stained images to a reference slide. Can also be used to separate hematoxylin and eosin channels.

H&E images are assumed to be composed of two stains, each one having a vector of its characteristic RGB values. The stain matrix is a 2x3 matrix where the first row corresponds to the hematoxylin stain vector and the second corresponds to eosin stain vector. The stain matrix can be estimated from a reference image in a number of ways; here we provide implementations of two such algorithms from Macenko et al. and Vahadane et al.

After estimating the stain matrix for an image, the next step is to assign stain concentrations to each pixel. Each pixel is assumed to be a linear combination of the two stain vectors, where the coefficients are the intensities of each stain vector at that pixel. To solve for the intensities, we use least squares in the Macenko method and lasso in the Vahadane method.

The image can then be reconstructed by applying those pixel intensities to a stain matrix. This allows you to standardize the appearance of an image by reconstructing it using a reference stain matrix. Using this method of normalization may help account for differences in slide appearance arising from variations in staining procedure, differences between scanners, etc. Images can also be reconstructed using only a single stain vector, e.g. to separate the hematoxylin and eosin channels of an H&E image.

This code is based in part on StainTools: https://github.com/Peter554/StainTools

Parameters:

target (str) – one of ‘normalize’, ‘hematoxylin’, or ‘eosin’. Defaults to ‘normalize’
stain_estimation_method (str) – method for estimating stain matrix. Must be one of ‘macenko’ or ‘vahadane’. Defaults to ‘macenko’.
optical_density_threshold (float) – Threshold for removing low-optical density pixels when estimating stain vectors. Defaults to 0.15
regularizer (float) – Regularization parameter for dictionary learning when estimating stain vector using Vahadane method. Ignored if stain_estimation_method != 'vahadane'. Defaults to 0.1
angular_percentile (float) – Percentile for stain vector selection when estimating stain vector using Macenko method. Ignored if stain_estimation_method != 'macenko'. Defaults to 0.01
background_intensity (int) – Intensity of background light. Must be an integer between 0 and 255. Defaults to 245.
stain_matrix_target_od (np.ndarray) – Stain matrix for reference slide. Matrix of H and E stain vectors in optical density (OD) space. Stain matrix is (2, 3) and first row corresponds to hematoxylin. Default stain matrix can be used, or you can also fit to a reference slide of your choosing by calling fit_to_reference().
max_c_target (np.ndarray) – Maximum concentrations of each stain in reference slide. Default can be used, or you can also fit to a reference slide of your choosing by calling fit_to_reference().

References

Macenko, M., Niethammer, M., Marron, J.S., Borland, D., Woosley, J.T., Guan, X., Schmitt, C. and Thomas, N.E., 2009, June. A method for normalizing histology slides for quantitative analysis. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro (pp. 1107-1110). IEEE.

Vahadane, A., Peng, T., Sethi, A., Albarqouni, S., Wang, L., Baust, M., Steiger, K., Schlitter, A.M., Esposito, I. and Navab, N., 2016. Structure-preserving color normalization and sparse stain separation for histological images. IEEE transactions on medical imaging, 35(8), pp.1962-1971.

F(image): functional implementation

apply(tile): modify Tile object in-place

fit_to_reference(target)
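The concentration step described above can be sketched for a single pixel. The example uses the default reference stain matrix from the signature, solves the two-stain least-squares problem through its 2x2 normal equations, and reconstructs a hematoxylin-only pixel. The function names are hypothetical, and the Beer-Lambert style conversion OD = -ln(I / I0) with I0 set to background_intensity is an assumption made for illustration; stain matrix estimation (Macenko or Vahadane) is not shown.

```python
import math

# Default reference stain matrix from the signature, written as (2, 3):
# row 0 is hematoxylin, row 1 is eosin.
STAIN_MATRIX = [
    [0.5626, 0.7201, 0.4062],
    [0.2159, 0.8012, 0.5581],
]


def rgb_to_od(rgb, background_intensity=245):
    """Assumed conversion OD = -ln(I / I0); I is clamped to [1, I0] to avoid log(0)."""
    return [
        -math.log(min(max(v, 1), background_intensity) / background_intensity)
        for v in rgb
    ]


def od_to_rgb(od, background_intensity=245):
    """Inverse of rgb_to_od for values inside the clamped range."""
    return [background_intensity * math.exp(-v) for v in od]


def solve_concentrations(od, stain_matrix=STAIN_MATRIX):
    """Least-squares fit of od ~ c_h * H + c_e * E via the 2x2 normal equations."""
    h, e = stain_matrix
    hh = sum(a * a for a in h)
    ee = sum(a * a for a in e)
    he = sum(a * b for a, b in zip(h, e))
    h_od = sum(a * b for a, b in zip(h, od))
    e_od = sum(a * b for a, b in zip(e, od))
    det = hh * ee - he * he
    c_h = (ee * h_od - he * e_od) / det
    c_e = (hh * e_od - he * h_od) / det
    return c_h, c_e


# Synthesize a pixel with known concentrations, then recover them.
true_c_h, true_c_e = 1.2, 0.5
od_pixel = [true_c_h * h + true_c_e * e for h, e in zip(*STAIN_MATRIX)]
c_h, c_e = solve_concentrations(od_pixel)

# Reconstruct the pixel using only the hematoxylin stain vector.
hematoxylin_rgb = od_to_rgb([c_h * h for h in STAIN_MATRIX[0]])
print(round(c_h, 6), round(c_e, 6))
print([round(v) for v in hematoxylin_rgb])
```

Reconstructing with a different reference stain matrix in place of STAIN_MATRIX is what standardizes appearance across slides.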

class pathml.preprocessing.NucleusDetectionHE(mask_name=None, stain_estimation_method='vahadane', superpixel_region_size=10, n_iter=30, **stain_kwargs)

Simple nucleus detection algorithm for H&E stained images. Works by first separating hematoxylin channel, then doing interpolation using superpixels, and finally using Otsu’s method for binary thresholding.

Parameters:

stain_estimation_method (str) – Method for estimating stain matrix. Defaults to “vahadane”
superpixel_region_size (int) – region_size parameter used for superpixel creation. Defaults to 10.
n_iter (int) – Number of iterations to run SLIC superpixel algorithm. Defaults to 30.
mask_name (str) – Name of mask that is created.
stain_kwargs (dict) – other arguments passed to StainNormalizationHE()

References

Hu, B., Tang, Y., Eric, I., Chang, C., Fan, Y., Lai, M. and Xu, Y., 2018. Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks. IEEE journal of biomedical and health informatics, 23(3), pp.1316-1328.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.TissueDetectionHE(mask_name=None, use_saturation=True, blur_ksize=17, threshold=None, morph_n_iter=3, morph_k_size=7, min_region_size=5000, max_hole_size=1500, outer_contours_only=False)

Detect tissue regions from H&E stained slide. First applies a median blur, then binary thresholding, then morphological opening and closing, and finally foreground detection.

Parameters:

use_saturation (bool) – Whether to convert to HSV and use saturation channel for tissue detection. If False, convert from RGB to greyscale and use greyscale image for tissue detection. Defaults to True.
blur_ksize (int) – kernel size used to apply median blurring. Defaults to 17.
threshold (int) – threshold for binary thresholding. If None, uses Otsu’s method. Defaults to None.
morph_n_iter (int) – number of iterations of morphological opening and closing to apply. Defaults to 3.
morph_k_size (int) – kernel size for morphological opening and closing. Defaults to 7.
min_region_size (int) – Minimum area of detected foreground regions, in pixels. Defaults to 5000.
max_hole_size (int) – Maximum size of allowed holes in foreground regions, in pixels. Ignored if outer_contours_only=True. Defaults to 1500.
outer_contours_only (bool) – If true, ignore holes in detected foreground regions. Defaults to False.
mask_name (str) – name for new mask

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.LabelArtifactTileHE(label_name=None)

Applies a rule-based method to identify whether or not an image contains artifacts (e.g. pen marks). Based on criteria from Kothari et al. 2012 ACM-BCB 218-225.

Parameters: label_name (str) – name for new label

References

Kothari, S., Phan, J.H., Osunkoya, A.O. and Wang, M.D., 2012, October. Biological interpretation of morphological patterns in histopathological whole-slide images. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (pp. 218-225).

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.LabelWhiteSpaceHE(label_name=None, greyscale_threshold=230, proportion_threshold=0.5)

Simple threshold method to label an image as majority whitespace. Converts image to greyscale. If the proportion of pixels exceeding the greyscale threshold is greater than the proportion threshold, then the image is labelled as whitespace.

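Parameters:

label_name (str) – name for new label
greyscale_threshold (int) – Greyscale intensity above which a pixel is counted as white. Defaults to 230.
proportion_threshold (float) – Proportion of white pixels above which the image is labelled as whitespace. Defaults to 0.5.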

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.SegmentMIF(model='mesmer', nuclear_channel=None, cytoplasm_channel=None, image_resolution=0.5, preprocess_kwargs=None, postprocess_kwargs_nuclear=None, postprocess_kwargs_whole_cell=None)

Transform applying segmentation to MIF images.

Input image must be formatted (c, x, y) or (batch, c, x, y). z and t dimensions must be selected before calling SegmentMIF

Supported models:

Mesmer: Mesmer uses a human-in-the-loop pipeline to train a ResNet50 backbone with Feature Pyramid Network segmentation model on 1.3 million cell annotations and 1.2 million nuclear annotations (TissueNet dataset). Model outputs predictions for centroid and boundary of every nucleus and cell, then centroid and boundary predictions are used as inputs to a watershed algorithm that creates segmentation masks.

Note

Mesmer model requires installation of deepcell dependency: pip install deepcell

Parameters:

model (str) – string indicating which segmentation model to use. Currently only ‘mesmer’ is supported.
nuclear_channel (int) – channel that defines cell nucleus
cytoplasm_channel (int) – channel that defines cell membrane or cytoplasm
image_resolution (float) – pixel resolution of image in microns
preprocess_kwargs (dict) – keyword arguments to pass to pre-processing function
postprocess_kwargs_nuclear (dict) – keyword arguments to pass to post-processing function for nuclear segmentation
postprocess_kwargs_whole_cell (dict) – keyword arguments to pass to post-processing function for whole-cell segmentation

References

Greenwald, N.F., Miller, G., Moen, E. et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-01094-0

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.SegmentMIFRemote(model_path='temp.onnx', nuclear_channel=None, cytoplasm_channel=None, image_resolution=0.5, preprocess_kwargs=None, postprocess_kwargs_nuclear=None, postprocess_kwargs_whole_cell=None)

Transform applying segmentation to MIF images using a Mesmer model. Mesmer uses a human-in-the-loop pipeline to train a ResNet50 backbone with Feature Pyramid Network segmentation model on 1.3 million cell annotations and 1.2 million nuclear annotations (TissueNet dataset). Model outputs predictions for centroid and boundary of every nucleus and cell, then centroid and boundary predictions are used as inputs to a watershed algorithm that creates segmentation masks.

Implements pathml.inference.RemoteMesmer in the backend.

Input image must be formatted (c, x, y) or (batch, c, x, y). z and t dimensions must be selected before calling SegmentMIFRemote

Parameters:

model_path (str) – path where the ONNX model is downloaded
nuclear_channel (int) – channel that defines cell nucleus
cytoplasm_channel (int) – channel that defines cell membrane or cytoplasm
image_resolution (float) – pixel resolution of image in microns. Currently only supports 0.5
preprocess_kwargs (dict) – keyword arguments to pass to pre-processing function
postprocess_kwargs_nuclear (dict) – keyword arguments to pass to post-processing function for nuclear segmentation
postprocess_kwargs_whole_cell (dict) – keyword arguments to pass to post-processing function for whole-cell segmentation

References

Greenwald, N.F., Miller, G., Moen, E. et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-01094-0

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.QuantifyMIF(segmentation_mask)

Convert segmented image into an anndata.AnnData counts object. Counts objects are used to interface with Scanpy, the Python single-cell analysis ecosystem. The counts object contains a summary of channel statistics in each cell along with its coordinate.

Parameters: segmentation_mask (str) – key indicating which mask to use as label image

F(img, segmentation, coords_offset=(0, 0))

Functional implementation

Parameters:

img (np.ndarray) – Input image of shape (i, j, n_channels)
segmentation (np.ndarray) – Segmentation map of shape (i, j) or (i, j, 1). Zeros are background. Regions should be labelled with unique integers.
coords_offset (tuple, optional) – Coordinates (i, j) used to convert tile-level coordinates to slide-level. Defaults to (0, 0) for no offset.

Returns:

Counts matrix

apply(tile): modify Tile object in-place

class pathml.preprocessing.CollapseRunsVectra

Coerce Vectra output to standard format. For compatibility with transforms, tiles need to have their shape collapsed to (x, y, c)

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.CollapseRunsCODEX(z)

Coerce CODEX output to standard format. CODEX format is (x, y, z, c, t) where c=4 (4 runs per cycle) and t is the number of cycles. Output format is (x, y, c) where all cycles are collapsed into c (c = 4 * # of cycles).

Parameters: z (int) – in-focus z-plane

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.RescaleIntensity(in_range='image', out_range='dtype')

Return image after stretching or shrinking its intensity levels. The desired intensity ranges of the input and output, in_range and out_range respectively, are used to stretch or shrink the intensity range of the input image. This function is a wrapper for ‘rescale_intensity’ function from scikit-image: https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.rescale_intensity

Parameters:

in_range (str or 2-tuple, optional) – Min and max intensity values of input image. The possible values for this parameter are enumerated below. ‘image’ : Use image min/max as the intensity range. ‘dtype’ : Use min/max of the image’s dtype as the intensity range. ‘dtype-name’ : Use intensity range based on desired dtype. Must be valid key in DTYPE_RANGE. ‘2-tuple’ : Use range_values as explicit min/max intensities.
out_range (str or 2-tuple, optional) – Min and max intensity values of output image. The possible values for this parameter are enumerated below. ‘image’ : Use image min/max as the intensity range. ‘dtype’ : Use min/max of the image’s dtype as the intensity range. ‘dtype-name’ : Use intensity range based on desired dtype. Must be valid key in DTYPE_RANGE. ‘2-tuple’ : Use range_values as explicit min/max intensities.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.HistogramEqualization(nbins=256, mask=None)

Return image after histogram equalization. This function is a wrapper for ‘equalize_hist’ function from scikit-image: https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.equalize_hist

Parameters:

nbins (int, optional) – Number of gray bins for histogram. Note: this argument is ignored for integer images, for which each integer is its own bin.
mask (ndarray of bools or 0s and 1s, optional) – Array of same shape as image. Only points at which mask == True are used for the equalization, which is applied to the whole image.

F(image): functional implementation

apply(tile): modify Tile object in-place

class pathml.preprocessing.AdaptiveHistogramEqualization(kernel_size=None, clip_limit=0.3, nbins=256)

Contrast Limited Adaptive Histogram Equalization (CLAHE). An algorithm for local contrast enhancement, that uses histograms computed over different tile regions of the image. Local details can therefore be enhanced even in regions that are darker or lighter than most of the image. This function is a wrapper for ‘equalize_adapthist’ function from scikit-image: https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.equalize_adapthist

Parameters:

kernel_size (int or array_like, optional) – Defines the shape of contextual regions used in the algorithm. If iterable is passed, it must have the same number of elements as image.ndim (without color channel). If integer, it is broadcasted to each image dimension. By default, kernel_size is 1/8 of image height by 1/8 of its width.
clip_limit (float) – Clipping limit, normalized between 0 and 1 (higher values give more contrast).
nbins (int) – Number of gray bins for histogram (“data range”).

F(image): functional implementation

apply(tile): modify Tile object in-place

TileStitching

This section covers the TileStitcher class, which is specialized for stitching tiled images, particularly useful in digital pathology.

class pathml.preprocessing.tilestitcher.TileStitcher(qupath_jarpath=[], java_path=None, memory='40g', bfconvert_dir='./')

A Python class for stitching tiled images, specifically designed for spectrally unmixed images in a pyramidal OME-TIFF format.

This class is a Python implementation of Pete Bankhead’s script for image stitching, available at https://gist.github.com/petebankhead/b5a86caa333de1fdcff6bdee72a20abe. It requires QuPath and JDK to be installed prior to use.

Parameters:

qupath_jarpath (list) – Paths to QuPath JAR files.
java_path (str) – Path to Java installation.
memory (str) – Memory allocation for the JVM.
bfconvert_dir (str) – Directory for Bio-Formats conversion tools.

checkTIFF(file)

Check if a given file is a valid TIFF file.

This method reads the first few bytes of the file to determine if it conforms to TIFF specifications.

Parameters: file (str) – Path to the file to be checked.
Returns: True if the file is a valid TIFF file, False otherwise.
Return type: bool

static format_jvm_options(qupath_jars, memory)

is_bfconvert_available(): Check if bfconvert is available.

parseRegion(file, z=0, t=0)

Parse an image region from a given TIFF file.

Parameters:

file (str) – Path to the TIFF file.
z (int, optional) – Z-position of the image. Defaults to 0.
t (int, optional) – Time point of the image. Defaults to 0.

Returns:

An ImageRegion object representing the parsed region.

Return type:

ImageRegion

parse_regions(infiles)

Parse image regions from a list of TIFF files and build a sparse image server.

Parameters: infiles (list) – List of paths to TIFF files.
Returns: A server containing the parsed image regions.
Return type: SparseImageServer

run_bfconvert(stitched_image_path, bfconverted_path=None, delete_original=True)

Run the Bio-Formats conversion tool on a stitched image.

Parameters:

stitched_image_path (str) – Path to the stitched image.
bfconverted_path (str, optional) – Path for the converted image. If None, a default path is generated.
delete_original (bool) – If True, delete the original stitched image after conversion.

run_image_stitching(input_dir, output_filename, downsamples=[1, 8], separate_series=False)

Perform image stitching on the provided TIFF files and output a stitched OME-TIFF image.

Parameters:

input_dir (str) – Directory containing the input TIFF files.
output_filename (str) – Filename for the output stitched image.
downsamples (list, optional) – List of downsample levels. Defaults to [1, 8].
separate_series (bool, optional) – Whether to separate the series. Defaults to False.

setup_bfconvert(bfconvert_dir)

Set up Bio-Formats conversion tool (bfconvert) in the given directory.

Parameters: bfconvert_dir (str) – Directory path for setting up bfconvert.
Returns: Path to the bfconvert tool.
Return type: str

shutdown(): Shut down the Java Virtual Machine (JVM) if it’s running.

toShort(b1, b2)

Convert two bytes to a short integer.

This helper function is used for interpreting the binary data in file headers, particularly for TIFF files.

Parameters:

b1 (byte) – The first byte.
b2 (byte) – The second byte.

Returns:

The short integer represented by the two bytes.

Return type:

int
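Combining two bytes into a short integer is a shift and a bitwise OR. The sketch below assumes b1 is the high-order byte (big-endian order); that ordering is an assumption about toShort, and to_short is a hypothetical stand-alone name.

```python
def to_short(b1, b2):
    """Combine two bytes into a 16-bit unsigned value, with b1 as the high byte (assumed order)."""
    return ((b1 & 0xFF) << 8) | (b2 & 0xFF)


print(to_short(0x00, 0x2A))       # 42, the classic TIFF magic number
print(hex(to_short(0x4D, 0x4D)))  # 0x4d4d, the bytes "MM"

# The same result using the built-in conversion:
print(int.from_bytes(bytes([0x00, 0x2A]), "big"))  # 42
```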