HDF5 Integration
Note
For advanced users
Overview
A single whole-slide image (WSI) may contain on the order of 10^10 pixels, making it infeasible to
process entire images in RAM. PathML supports efficient manipulation of large-scale imaging data via
the h5path format, a hierarchical data structure which allows users to access small regions of the
processed WSI without loading the entire image. This feature reduces the RAM required to run a PathML
workflow (pipelines can be run on a consumer laptop), simplifies the reading and writing of processed
WSIs, improves data exploration utilities, and enables fast reading for downstream tasks (e.g. PyTorch
DataLoaders). Since slides are managed on disk, your drive must have sufficient free storage.
Performance benefits from storage with fast read/write speeds (SSD, NVMe).
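To make the memory argument concrete, the following back-of-the-envelope sketch estimates the RAM needed to hold one slide uncompressed. The three-channel (RGB) layout and the helper name slide_memory_gib are assumptions made for illustration; they are not part of PathML.

```python
def slide_memory_gib(n_pixels, n_channels, bytes_per_value):
    """Return the uncompressed in-memory size of an image, in GiB."""
    return n_pixels * n_channels * bytes_per_value / 2**30


# Assumption: a 10^10-pixel slide with 3 channels (RGB).
N_PIXELS = 10**10
for dtype_name, nbytes in [("uint8", 1), ("float16", 2)]:
    size = slide_memory_gib(N_PIXELS, 3, nbytes)
    print(f"{dtype_name}: {size:.1f} GiB")
```

Under these assumptions a single slide needs roughly 28 GiB as uint8 and roughly 56 GiB as float16, which is why the data is kept on disk and read region by region.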
How it Works
Each SlideData object is backed by an .h5path file on disk.
All interaction with the .h5path file is handled automatically by the h5pathManager.
For example, when a user calls slidedata.tiles[tile_key], the h5pathManager retrieves the tile from
disk and returns it, so the user never needs to access the HDF5 file directly.
As tiles are extracted and passed to a preprocessing pipeline, the h5pathManager also
aggregates the processed tiles into the .h5path file.
At the conclusion of preprocessing, the h5py object can optionally be
written permanently to disk in .h5path format via the SlideData.write() method.
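The on-disk access pattern described above can be illustrated with plain h5py. This is a minimal sketch of partial reads from an HDF5 file, not the h5pathManager implementation; the file name, dataset name, array size, and chunk shape are all assumptions chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_slide.h5")

    # Write a toy "slide" one band of rows at a time, so this script never
    # holds the full array in memory.
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "array", shape=(2048, 2048, 3), dtype="uint8", chunks=(256, 256, 3)
        )
        for row in range(0, 2048, 256):
            band = np.full((256, 2048, 3), row // 256, dtype="uint8")
            dset[row:row + 256] = band

    # Slicing the dataset reads only the chunks that overlap the region.
    with h5py.File(path, "r") as f:
        region = f["array"][512:768, 0:256, :]

print(region.shape, region.dtype)
```

Only the requested 256 x 256 region is materialized as a NumPy array; the rest of the dataset stays on disk.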
About HDF5
The internals of PathML as well as the .h5path file format are based on the hierarchical data format
HDF5, implemented by h5py.
The HDF5 format consists of 3 types of elements:
Groups | A “container,” similar to a directory in a filesystem. Groups may contain Datasets, Attributes, or other Groups.
Datasets | Rectangular collection of data elements.
Attributes | Small named metadata elements. Each attribute is attached to a Group or Dataset.
Groups are container-like and can be queried like dictionaries:
import h5py
root = h5py.File('path/to/file.h5path', 'r')
masks = root['masks']
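The dictionary-like behavior extends beyond key lookup. The sketch below builds a small in-memory HDF5 file so it runs without an existing .h5path file; the file name and the mask names tissue and nuclei are made up for the example.

```python
import h5py
import numpy as np

# In-memory HDF5 file (never written to disk) with a toy layout.
f = h5py.File("toy_groups.h5", "w", driver="core", backing_store=False)
masks = f.create_group("masks")
masks.create_dataset("tissue", data=np.zeros((4, 4), dtype="float16"))
masks.create_dataset("nuclei", data=np.ones((4, 4), dtype="float16"))

has_masks = "masks" in f                 # membership test, as with a dict
names = sorted(f["masks"].keys())        # names of the group's members
tissue_shape = f["masks/tissue"].shape   # nested members via a path-like key

print(has_masks, names, tissue_shape)
f.close()
```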
Datasets can be treated like numpy.ndarray objects:
Important
To retrieve a numpy.ndarray object from an h5py.Dataset, you must slice the Dataset using
NumPy-style indexing: for example, [...] retrieves the full array, and [a:b, ...]
returns the array with its first dimension sliced to the half-open interval [a, b).
import h5py
root = h5py.File('path/to/file.h5path', 'r')
im = root['tiles']['(0, 0)']['array'][...]
im_slice = root['tiles']['(0, 0)']['array'][0:100, 0:100, :]
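The difference between a Dataset handle and the array obtained by slicing it can be checked directly. This sketch uses an in-memory file with a made-up name and dataset so that it is self-contained.

```python
import h5py
import numpy as np

f = h5py.File("toy_dataset.h5", "w", driver="core", backing_store=False)
data = np.arange(24, dtype="float16").reshape(4, 3, 2)
dset = f.create_dataset("array", data=data)

full = dset[...]        # whole dataset as a numpy.ndarray
rows = dset[1:3, ...]   # rows 1 and 2 only (the end index is excluded)

print(type(dset).__name__)   # Dataset
print(type(full).__name__)   # ndarray
print(rows.shape)            # (2, 3, 2)
f.close()
```

Indexing without slicing (for example, passing the Dataset itself to a function) hands over the on-disk handle rather than the data, which stops working once the file is closed.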
Attributes are stored in a .attrs object which can be queried like a dictionary:
import h5py
root = h5py.File('path/to/file.h5path', 'r')
tile_shape = root['tiles'].attrs['tile_shape']
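Attributes can also be written and listed through the same .attrs interface. In this sketch the file is an in-memory toy and the attribute values are invented; note that h5py returns a tuple stored as an attribute as a NumPy array, so the example converts it back.

```python
import h5py

f = h5py.File("toy_attrs.h5", "w", driver="core", backing_store=False)
tiles = f.create_group("tiles")
tiles.attrs["tile_shape"] = (256, 256)
tiles.attrs["tile_stride"] = (256, 256)

# Attribute values come back as NumPy arrays; convert to a plain tuple.
tile_shape = tuple(int(v) for v in tiles.attrs["tile_shape"])
attr_names = sorted(tiles.attrs.keys())

print(tile_shape)
print(attr_names)
f.close()
```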
.h5path File Format
The h5path format uses a self-describing hierarchical layout similar to that of SlideData.
Here we examine the h5path file format in detail:
root/                           (Group)
├── fields/                     (Group)
│   ├── name                    (Attribute, str)
│   ├── shape                   (Attribute, tuple)
│   ├── labels                  (Group)
│   │   ├── label1              (Attribute, [str, int, float, array])
│   │   ├── label2              (Attribute, [str, int, float, array])
│   │   └── etc...
│   └── slide_type              (Group)
│       ├── stain               (Attribute, str)
│       ├── tma                 (Attribute, bool)
│       ├── rgb                 (Attribute, bool)
│       ├── volumetric          (Attribute, bool)
│       └── time_series         (Attribute, bool)
├── masks/                      (Group)
│   ├── mask1                   (Dataset, array)
│   ├── mask2                   (Dataset, array)
│   └── etc...
├── counts                      (Group)
│   └── `.h5ad` format
└── tiles/                      (Group)
    ├── tile_shape              (Attribute, tuple)
    ├── tile_stride             (Attribute, tuple)
    ├── tile_key1/              (Group)
    │   ├── array               (Dataset, array)
    │   ├── masks/              (Group)
    │   │   ├── mask1           (Dataset, array)
    │   │   ├── mask2           (Dataset, array)
    │   │   └── etc...
    │   ├── coords              (Attribute, tuple)
    │   ├── name                (Attribute, str)
    │   └── labels/             (Group)
    │       ├── label1          (Attribute, [str, int, float, array])
    │       ├── label2          (Attribute, [str, int, float, array])
    │       └── etc...
    ├── tile_key2/              (Group)
    │   └── etc...
    └── etc...
Slide-level metadata is stored in the fields/ group.
Slide-level counts matrix metadata is stored in the counts/ group.
The tiles/ group stores tile-level data. Each tile occupies its own group, and tile coordinates are
used as keys for indexing tiles within the tiles/ group. Within each tile’s group, the array dataset
contains the tile image, the masks/ group contains tile-level masks, and other metadata, including
name, labels, and coords, is stored as attributes. Slide-level metadata about tiling, including tile
shape and stride, is stored as attributes of the tiles/ group.
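The tiles/ layout described above can be explored with h5py alone. The sketch below builds a toy file that imitates only part of that layout (the array dataset, the masks/ group, and the coords attribute); it is not produced by PathML, and the tile sizes and coordinates are invented. Tile keys are assumed to be the string form of the coordinate tuple, as in the '(0, 0)' key used earlier on this page.

```python
import ast

import h5py
import numpy as np

f = h5py.File("toy_layout.h5path", "w", driver="core", backing_store=False)
tiles = f.create_group("tiles")
tiles.attrs["tile_shape"] = (2, 2)
for coords in [(0, 0), (0, 2)]:
    tile = tiles.create_group(str(coords))   # key is e.g. "(0, 0)"
    tile.create_dataset("array", data=np.zeros((2, 2, 3), dtype="float16"))
    tile.create_group("masks")
    tile.attrs["coords"] = coords

entries = []


def record(name, obj):
    """Collect the path and kind of every Group and Dataset in the file."""
    kind = "Dataset" if isinstance(obj, h5py.Dataset) else "Group"
    entries.append((name, kind))


f.visititems(record)
for name, kind in entries:
    print(f"{name} ({kind})")

# Turn the string keys back into coordinate tuples.
tile_coords = sorted(ast.literal_eval(key) for key in tiles.keys())
print(tile_coords)
f.close()
```

visititems walks Groups and Datasets only; Attributes are reached through the .attrs object of each visited item.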
Whole-slide masks are stored in the masks/ group. All masks must have the same shape as the image array.
However, when a pipeline is run, these masks are moved to the tile level and stored within the tile groups.
The slide-level masks are therefore not saved when calling SlideData.write().
We use float16 as the data type for all Datasets.
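A practical consequence of float16 storage is shown in the sketch below, which uses NumPy only and assumes 8-bit image data as the input. float16 represents every integer up to 2048 exactly, so 8-bit pixel values survive a round trip, while the largest finite value it can hold is 65504.

```python
import numpy as np

rgb_values = np.arange(256, dtype=np.uint8)    # every possible 8-bit value
as_f16 = rgb_values.astype(np.float16)
lossless = np.array_equal(as_f16.astype(np.uint8), rgb_values)

f16_max = float(np.finfo(np.float16).max)      # largest finite float16
bytes_per_value = np.dtype(np.float16).itemsize

print(lossless, f16_max, bytes_per_value)
```

Data with a wider range or finer precision than float16 supports (for example, large raw intensity counts) would be rounded or overflow when stored this way, so it is worth checking value ranges before writing.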
Note
Be aware that the h5path format specification may change between major versions.
Reading and Writing
SlideData objects are easily written to h5path format by calling SlideData.write().
All files with .h5 or .h5path extensions are loaded into SlideData objects automatically.
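A round trip through the format might look like the sketch below. It rests on assumptions that should be checked against the API reference for your PathML version: that SlideData is importable from pathml.core, that its constructor takes a file path as the first argument, and that write() takes the output path. The helper name roundtrip_h5path is hypothetical.

```python
def roundtrip_h5path(src_path, dst_path):
    """Load a slide, save it as .h5path, and load the saved copy.

    Assumptions (verify against your PathML version): SlideData lives in
    pathml.core, accepts a file path, and write() accepts the output path.
    """
    from pathml.core import SlideData  # imported lazily; requires PathML

    slide = SlideData(src_path)
    slide.write(dst_path)        # dst_path should end in .h5path
    return SlideData(dst_path)   # recognized by its extension on load
```

Keeping the import inside the function lets the helper be defined in environments where PathML is not installed; it only fails when actually called.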