Brightfield Imaging: Quickstart

Here we demonstrate a typical workflow for preprocessing H&E images. The image used in this example, CMU-1.svs, is publicly available for download.

a. Load the image

from pathml.core import SlideData, types

# load the image
wsi = SlideData("../../data/CMU-1.svs", name = "example", slide_type = types.HE)

b. Define a preprocessing pipeline

Pipelines are created by composing a sequence of modular transformations. In this example we apply a blur to reduce noise in the image, followed by tissue detection.

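from pathml.preprocessing import Pipeline, BoxBlur, TissueDetectionHE

pipeline = Pipeline([
    # blur first to reduce noise (the kernel size here is an assumed, illustrative value)
    BoxBlur(kernel_size=15),
    # then detect tissue and store the result as a mask named "tissue"
    TissueDetectionHE(mask_name = "tissue", min_region_size=500,
                      threshold=30, outer_contours_only=True)
])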
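
The composition idea can be illustrated without PathML. The sketch below is a toy stand-in, not PathML's implementation: `ToyPipeline`, `box_blur_rows`, and `threshold_mask` are hypothetical names, and the "image" is a plain list of pixel rows rather than a slide. It only shows that each transformation receives the output of the previous one, so the order of steps matters.

```python
def box_blur_rows(image):
    """Blur each row with a 3-pixel moving average (edges use the neighbours that exist)."""
    blurred = []
    for row in image:
        new_row = []
        for i in range(len(row)):
            window = row[max(0, i - 1): i + 2]
            new_row.append(sum(window) / len(window))
        blurred.append(new_row)
    return blurred


def threshold_mask(image, threshold=30):
    """Return a binary mask: 1 where the pixel value exceeds the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


class ToyPipeline:
    """Hypothetical minimal pipeline: applies transforms in the order given."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def apply(self, image):
        for transform in self.transforms:
            image = transform(image)  # output of one step feeds the next
        return image


image = [[0, 90, 0]]
blur_then_mask = ToyPipeline([box_blur_rows, threshold_mask]).apply(image)
mask_only = ToyPipeline([threshold_mask]).apply(image)
```

Here the blurred row is `[45.0, 30.0, 45.0]`, so blurring before thresholding gives a different mask than thresholding alone.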

c. Run preprocessing

Now that we have constructed our pipeline, we are ready to run it on our WSI. PathML supports distributed computing, which speeds up processing by distributing tiles across many workers that run in parallel rather than processing each tile sequentially on a single worker. This is supported by Dask.distributed on the backend, and is highly scalable for very large datasets.

The first step is to create a Client object. In this case, we will use a simple cluster running locally; however, Dask supports other setups including Kubernetes, SLURM, etc. See the PathML documentation for more information.

from dask.distributed import Client, LocalCluster

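# start a local cluster with 6 workers and connect a client to it
cluster = LocalCluster(n_workers=6)
client = Client(cluster)

# run the pipeline on the slide, distributing tiles across the workers
wsi.run(pipeline, distributed=True, client=client)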
print(f"Total number of tiles extracted: {len(wsi.tiles)}")
Total number of tiles extracted: 150
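
To make the sequential-versus-parallel distinction concrete without a Dask cluster, here is a minimal sketch that uses the standard library's `ThreadPoolExecutor` as a stand-in for Dask workers. The functions `split_into_tiles`, `process_tile`, `run_sequential`, and `run_parallel` are hypothetical and do not reflect how PathML schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(width, height, tile_size):
    """Return the (x, y) top-left corners of non-overlapping tiles covering the region."""
    return [(x, y)
            for y in range(0, height, tile_size)
            for x in range(0, width, tile_size)]


def process_tile(corner):
    """Hypothetical per-tile work; here it only builds a label for the tile."""
    x, y = corner
    return f"tile@({x},{y})"


def run_sequential(tiles):
    """Process tiles one after another on a single worker."""
    return [process_tile(tile) for tile in tiles]


def run_parallel(tiles, n_workers=6):
    """Process tiles concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_tile, tiles))


tiles = split_into_tiles(width=1024, height=512, tile_size=256)
sequential_results = run_sequential(tiles)
parallel_results = run_parallel(tiles)
```

Because each tile is processed independently, both strategies produce the same results; only the wall-clock time differs once the per-tile work becomes expensive.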

e. Save results to disk

The resulting preprocessed data is written to disk, leveraging the HDF5 data specification optimized for efficiently manipulating larger-than-memory data.
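
A minimal sketch of this step, assuming the slide object exposes a `write(path)` method as a PathML `SlideData` object does. The helper `save_preprocessed` is a hypothetical convenience wrapper, and the output path in the usage comment is the one read back in the DataLoader step.

```python
from pathlib import Path


def save_preprocessed(slide, out_path):
    """Create the output directory if needed, then write the slide to disk.

    `slide` is assumed to provide a `write(path)` method;
    `save_preprocessed` itself is a hypothetical helper, not part of PathML.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    slide.write(str(out_path))
    return out_path


# Usage with the slide from the previous steps (requires PathML):
# save_preprocessed(wsi, "./data/CMU-1-preprocessed.h5path")
```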


f. Create PyTorch DataLoader

The DataLoader provides an interface to any machine learning model built on the PyTorch ecosystem.

from pathml.ml import TileDataset
from torch.utils.data import DataLoader

dataset = TileDataset("./data/CMU-1-preprocessed.h5path")
dataloader = DataLoader(dataset, batch_size = 16, num_workers = 4)
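
As a rough guide to what this DataLoader yields, the sketch below works out the batch sizes for a given number of tiles. It assumes the dataset contains exactly the 150 tiles extracted earlier and that the DataLoader keeps its default behavior of returning a smaller final batch (`drop_last=False`). The function `batch_sizes` is a hypothetical helper, not part of PyTorch or PathML.

```python
def batch_sizes(n_items, batch_size, drop_last=False):
    """Sizes of consecutive batches when n_items are grouped into batches of batch_size."""
    full_batches, remainder = divmod(n_items, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_last:
        sizes.append(remainder)  # final, smaller batch
    return sizes


# Assumption: 150 tiles in the dataset, batch_size = 16 as in the DataLoader call
sizes = batch_sizes(n_items=150, batch_size=16)
n_batches = len(sizes)      # 9 full batches plus 1 partial batch
last_batch = sizes[-1]      # the 6 tiles left over after 9 * 16 = 144
```

Under those assumptions, one pass over the DataLoader yields 10 batches, the last of which holds 6 tiles.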