Intro to SpatialData#

This notebook introduces the core concepts and basic operations in SpatialData using a small subset of a spatial transcriptomics murine liver dataset (Guilliams et al. 2022). We will also introduce essential plotting with the spatialdata-plot companion library.

Note

The purpose of the SpatialData framework is to lay the foundation for methods and pipelines to analyze spatial omics data. It is not intended to analyze the data itself, but to provide a common data model and APIs for manipulating spatial omics data. Browse the other tutorials, in particular the “External tutorials” section, for demos of how various analysis packages work with SpatialData.

Key terms and data model#

SpatialData objects are representations of spatial omics data, which are manipulated with the SpatialData framework. On disk, they are stored as .zarr stores, which are essentially folders containing data and metadata in a structured but flexible format.

We can think of a SpatialData object as a container for various Elements. An Element is either a SpatialElement (Images, Labels, Points, Shapes) or a Table. Here is a brief description:

  • Images: H&E, staining images

  • Labels: pixel-level segmentation

  • Points: transcript locations with gene information, landmark points

  • Shapes: cell/nucleus boundaries, subcellular structures, anatomical annotations, regions of interest (ROIs)

  • Tables: sparse/dense matrices annotating the SpatialElements or storing arbitrary (non-spatial) metadata. They do not contain spatial coordinates.

We can categorize the SpatialElements into two broad types:

  • Rasters: data made up of pixels; this includes Images and Labels

  • Vectors: data made up of points and lines; polygons are also vectors, since they are simply a list of connected points. Points and Shapes are elements of this type.

We will see examples of manipulating these below and leave advanced usage to other tutorials.

SpatialData Elements
import matplotlib as mpl
import matplotlib.pyplot as plt
import spatialdata as sd
import spatialdata_plot

Load SpatialData .zarr data from disk#

Download the example dataset from here and unzip it. You should see a folder with the .zarr extension, which we will set as the data_path to load from disk. You can also use symlinks to make the data_path point to data stored outside the current working directory.

SpatialData objects are stored in the SpatialData Zarr format (find more in our design doc and developer resources), which uses Zarr for on-disk storage. Again, Zarr stores are simply folders containing data and metadata in a standardized, flexible format.

The spatialdata framework has three ways to construct SpatialData objects:

  • You can read a SpatialData object that has already been saved to .zarr in the SpatialData Zarr format …

  • You can use the reader functions from spatialdata-io for supported spatial omics platforms (e.g. Visium HD, MERSCOPE, etc.).

  • You can construct a SpatialData object from scratch using our Python spatialdata APIs …

    • … using the SpatialData class initializer.

    • … by adding elements incrementally (a small sketch of both approaches follows below).
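
As a rough sketch of the third way, a tiny SpatialData object can be assembled from in-memory arrays and dataframes using the parser methods in sd.models; the element names and values below are made up for illustration, and the exact parser arguments may vary slightly between spatialdata versions.

import numpy as np
import pandas as pd

# Minimal sketch: build a tiny SpatialData object from scratch (illustrative data).
image = sd.models.Image2DModel.parse(
    np.random.randint(0, 255, size=(1, 64, 64), dtype=np.uint8),
    dims=("c", "y", "x"),
)
points = sd.models.PointsModel.parse(
    pd.DataFrame({"x": [10.0, 20.0], "y": [15.0, 25.0], "gene": pd.Categorical(["A", "B"])}),
    coordinates={"x": "x", "y": "y"},
    feature_key="gene",
)

# ... using the SpatialData class initializer ...
sdata_scratch = sd.SpatialData(images={"small_image": image})
# ... or by adding elements incrementally with dict-like assignment.
sdata_scratch["small_points"] = points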

See Also

Below we will simply read a SpatialData object from disk; see the advanced data structures tutorial for technical details and considerations.

data_path = "mouse_liver.zarr"
sdata = sd.read_zarr(data_path)
sdata
SpatialData object, with associated Zarr store: /Users/macbook/embl/projects/basel/spatialdata-sandbox/mouse_liver/data.zarr
├── Images
│     └── 'raw_image': DataTree[cyx] (1, 6432, 6432), (1, 1608, 1608)
├── Labels
│     └── 'segmentation_mask': DataArray[yx] (6432, 6432)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     └── 'nucleus_boundaries': GeoDataFrame shape: (3375, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (3375, 99)
with coordinate systems:
    ▸ 'global', with elements:
        raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)

Explore the Elements of the SpatialData object#

Let’s explore each element type and how they are represented:

See Also

For advanced manipulation of Elements and how to handle annotations with Tables, see the “SpatialElements and tables” tutorial and the “Tables” tutorial.
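
Each element category is also exposed as a dict-like attribute on the object (sdata.images, sdata.labels, sdata.points, sdata.shapes, sdata.tables), which is convenient for programmatic access; a quick sketch:

# Dict-like views over each element category (a quick sketch).
for name, image in sdata.images.items():
    print("image:", name, type(image))
print("labels:", list(sdata.labels))
print("points:", list(sdata.points))
print("shapes:", list(sdata.shapes))
print("tables:", list(sdata.tables))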

Images#

Images are represented as xarray.DataArray or xarray.DataTree objects, for single-scale and multi-scale images respectively. They are either 2D images (cyx axes) or 3D images (czyx axes), where the c axis indexes the channels and the z, y, and x axes hold the spatial coordinates.

We can access elements by key, e.g. sdata[key], similar to a dictionary.

sdata["raw_image"]
DataTree 'raw_image' (multiscale image): scale levels (1, 6432, 6432) and (1, 1608, 1608)

Images can have multiple scales, stored as an image pyramid of progressively downsampled representations that make visualization and analysis more efficient. We can access a single scale using the get_pyramid_levels function, where 0 is the highest resolution, 1 is downsampled by a factor (usually 2), 2 is downsampled by a further factor (again usually 2, relative to the previous scale), and so on.

In this case, our image has 2 scales. Let’s access the highest resolution image.

sd.get_pyramid_levels(sdata["raw_image"], n=0)
<xarray.DataArray 'image' (c: 1, y: 6432, x: 6432)> Size: 83MB
dask.array<from-zarr, shape=(1, 6432, 6432), dtype=uint16, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * c        (c) int64 8B 0
  * y        (y) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03
  * x        (x) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03
Attributes:
    transform:  {'global': Identity }

To view image properties such as axes and channel names, SpatialData provides some helper functions. More helper functions can be found in the API page.

sd.models.get_axes_names(sdata["raw_image"])
('c', 'y', 'x')

Our image has only one channel, corresponding to the DAPI staining:

sd.models.get_channel_names(sdata["raw_image"])
[np.int64(0)]
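
Each scale level is an xarray.DataArray backed by dask, so standard xarray indexing works and stays lazy; a small sketch selecting the DAPI channel and cropping a corner of the full-resolution level:

# Standard, lazy xarray indexing on the full-resolution level (a small sketch).
full_res = sd.get_pyramid_levels(sdata["raw_image"], n=0)
dapi = full_res.sel(c=0)     # select the single DAPI channel
corner = dapi[:512, :512]    # crop the top-left 512 x 512 pixels
print(corner.shape)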

Let’s use spatialdata-plot to show the image (DAPI staining for nuclei).

sdata.pl.render_images("raw_image", cmap="gray").pl.show()

Points#

Points are represented as dask.DataFrame objects (a lazy version of pandas.DataFrame) containing the point coordinates. By lazy-loading the dataframe, we reduce memory usage and only bring data into memory when it is needed for computations.

For spatial transcriptomics datasets measuring single molecules, the coordinates are stored in columns x and y (optionally z for 3D), with an additional column annotating the gene identity.

See Also

Points instances can be 2D or 3D, though some operations, such as rasterize() and visualization, currently do not support the 3D component.

sdata["transcripts"]
Dask DataFrame Structure:
                     x        y               gene
npartitions=1
               float64  float64  category[unknown]
                   ...      ...                ...
Dask Name: read-parquet, 1 graph layer

We can easily convert the dask.DataFrame to a pandas.DataFrame, which is easier to work with. Note that this will load the entire object into memory; if the data is large, you can use the dask APIs directly to manipulate the dataframe, as sketched after the table below.

sdata["transcripts"].compute()
x y gene
0 433.0 1217.0 Adgre1
1 151.0 1841.0 Adgre1
2 139.0 1983.0 Adgre1
3 1349.0 1601.0 Adgre1
4 784.0 1732.0 Adgre1
... ... ... ...
1998863 5743.0 5233.0 Lyve1
1998864 5721.0 4581.0 Lyve1
1998865 5807.0 4842.0 Lyve1
1998866 5843.0 5309.0 Lyve1
1998869 5580.0 5226.0 Lyve1

1153548 rows × 3 columns
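
For instance, with the dask APIs we can filter lazily and materialize only the rows of interest; a small sketch selecting the transcripts of the Vwf gene, which we will also plot below:

# Lazily filter for a single gene, then materialize only those rows (a small sketch).
transcripts = sdata["transcripts"]
vwf = transcripts[transcripts["gene"] == "Vwf"].compute()
print(len(vwf))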

Let’s visualize the points. Note that plotting automatically switches from matplotlib to datashader when the number of points is large, ensuring performant rendering. As you can see below, it will be more useful to plot subsets of transcripts, such as specific genes.

sdata.pl.render_points("transcripts").pl.show()
INFO     Using 'datashader' backend with 'None' as reduction method to speed up plotting. Depending on the         
         reduction method, the value range of the plot might change. Set method to 'matplotlib' do disable this    
         behaviour.

Let’s plot the transcripts colored by gene. We can also use the groups argument to plot transcripts for a subset of genes.

Note

If the color column has many unique values (e.g. 1000s of genes), plotting will slow down dramatically.

sdata.pl.render_points("transcripts", color="gene", groups="Vwf", palette="red").pl.show()

Shapes#

Shapes are represented as geopandas.GeoDataFrame objects containing geometric objects like polygons. For a comprehensive guide on working with geopandas.GeoDataFrame objects, please refer to the geopandas documentation.

sdata["nucleus_boundaries"]
geometry
cell_ID
99 POLYGON ((6277 798, 6277 799, 6272 799, 6272 8...
142 POLYGON ((6427 6123, 6423 6123, 6423 6124, 641...
208 POLYGON ((3747 858, 3747 859, 3746 859, 3746 8...
235 POLYGON ((752 6144, 752 6145, 763 6145, 763 61...
336 POLYGON ((5174 935, 5174 936, 5172 936, 5172 9...
... ...
7652 POLYGON ((1094 6028, 1094 6029, 1091 6029, 109...
7680 POLYGON ((1324 6063, 1324 6064, 1320 6064, 132...
7694 POLYGON ((1692 6063, 1692 6064, 1686 6064, 168...
7708 POLYGON ((768 6072, 768 6073, 765 6073, 765 60...
7723 POLYGON ((374 6074, 374 6076, 373 6076, 373 60...

3375 rows × 1 columns
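
Because Shapes are plain geopandas.GeoDataFrame objects, standard geopandas functionality applies directly; a quick sketch computing per-nucleus areas and centroids (in pixel units of the global coordinate system):

# Standard geopandas operations on the nucleus polygons (a quick sketch).
nuclei = sdata["nucleus_boundaries"]
areas = nuclei.geometry.area          # polygon area of each nucleus
centroids = nuclei.geometry.centroid  # centroid of each nucleus polygon
print(areas.describe())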

sdata.pl.render_shapes("nucleus_boundaries").pl.show()

Tables#

Annotated matrices are represented as anndata.AnnData objects. Ingested datasets will usually have a table at sdata['table'] containing a count/abundance matrix. This enables downstream analysis with tools like Scanpy, scVI, etc.

For different spatial technologies this quantifies:

  • transcriptomics: transcript counts;

  • spatial proteomics: marker abundances;

  • slide-based assays: spot/grid abundances.

See Also

For more advanced usage such as multi-table annotations and joining, see the Tables page.

sdata["table"]
AnnData object with n_obs × n_vars = 3375 × 99
    obs: 'cell_ID', 'fov_labels', 'annotation'
    uns: 'annotation_colors', 'spatialdata_attrs'
    obsm: 'spatial'
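
Because this is a regular anndata.AnnData object, the usual AnnData accessors apply; a quick sketch inspecting the cell type annotations and pulling out the values of the Hal gene used in the next plot:

# Standard AnnData accessors on the table (a quick sketch).
table = sdata["table"]
print(table.obs["annotation"].value_counts())  # number of cells per annotated cell type
hal = table[:, "Hal"].X                        # expression values of a single gene, one per cell
print(hal.shape)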

Here is an example of plotting numerical information on the cells (“Hal” gene expression).

sdata.pl.render_shapes("nucleus_boundaries", color="Hal").pl.show()

Here is an example of plotting categorical information (cell types).

sdata.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Let’s visualize all the layers together. We can easily chain the operations to render multiple Elements overlaid on top of each other in the same plot.

(
    sdata.pl.render_shapes(
        "nucleus_boundaries",
        color="annotation",
        outline_alpha=1.0,
        outline_width=0.5,
    )
    .pl.render_points(
        "transcripts",
        color="gene",
        groups="Vwf",
        palette="cyan",
    )
    .pl.render_points(
        "transcripts",
        color="gene",
        groups="Clec4f",
        palette="magenta",
    )
    .pl.show(figsize=(8, 8))
)

Spatial Querying and Subsetting#

Given a single-modal or multi-modal dataset, spatial queries allow us to retrieve all the spatial elements and instances that fall within a given rectangular window or polygonal shape.

For the purpose of this tutorial, our dataset is conveniently aligned to a single coordinate system, global.

See Also

For advanced usage of coordinate systems and transformations, see the Transformations notebooks; for a more in-depth look at query operations, see the Spatial Queries notebook.

A bounding box query subsets a given SpatialData object to a rectangular window. We can get the bounds of the dataset to help define the min/max coordinates.

sd.get_extent(sdata)
{'y': (np.float64(0.0), np.float64(6432.0)),
 'x': (np.float64(0.0), np.float64(6432.0))}
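
For example, we could derive a query window programmatically from this extent; a small sketch selecting the central half of the dataset along each axis (in the query below we instead pass explicit coordinates):

# Derive min/max query coordinates from the dataset extent (a small sketch).
extent = sd.get_extent(sdata)
(ymin, ymax), (xmin, xmax) = extent["y"], extent["x"]
min_coordinate = [ymin + (ymax - ymin) / 4, xmin + (xmax - xmin) / 4]
max_coordinate = [ymax - (ymax - ymin) / 4, xmax - (xmax - xmin) / 4]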

To perform a bounding box query, we provide the axes, the min/max coordinates of the window, and the coordinate system in which those coordinates are expressed.

crop_sdata1 = sd.bounding_box_query(
    sdata, axes=("y", "x"), min_coordinate=[1000, 1000], max_coordinate=[2000, 2000], target_coordinate_system="global"
)
crop_sdata1
SpatialData object
├── Images
│     └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│     └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     └── 'nucleus_boundaries': GeoDataFrame shape: (96, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (96, 99)
with coordinate systems:
    ▸ 'global', with elements:
        raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata1.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Similarly, we can perform a polygon query by using a shapely.Polygon object and passing the target coordinate system.

from shapely import Polygon

polygon = Polygon([(1000, 1000), (1500, 1000), (2000, 1500), (1500, 2000)])
crop_sdata2 = sd.polygon_query(sdata, polygon, target_coordinate_system="global")
crop_sdata2
SpatialData object
├── Images
│     └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│     └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     └── 'nucleus_boundaries': GeoDataFrame shape: (45, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (45, 99)
with coordinate systems:
    ▸ 'global', with elements:
        raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata2.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Saving SpatialData Objects#

Currently, the cropped SpatialData object only exists in memory. Let’s save it by writing it to disk in a new .zarr store.

Note

It is not recommended to modify the original .zarr store in place. For more information on incremental reading and writing of SpatialData objects, see here.

from pathlib import Path
from tempfile import TemporaryDirectory

tmpdir = TemporaryDirectory()
crop_sdata2.write(Path(tmpdir.name) / "crop.zarr")
INFO     The SpatialData object is not self-contained (i.e. it contains some elements that are Dask-backed from    
         locations outside /tmp/tmppc7yy78v/crop.zarr). Please see the documentation of `is_self_contained()` to   
         understand the implications of working with SpatialData objects that are not self-contained.              
INFO     The Zarr backing store has been changed from None the new file path: /tmp/tmppc7yy78v/crop.zarr
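
To check that the write succeeded, we can read the cropped object back from the new store; a quick sketch:

# Read the newly written store back from disk (a quick sketch).
crop_reloaded = sd.read_zarr(Path(tmpdir.name) / "crop.zarr")
crop_reloaded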

Saving new elements#

Let’s say we want to add a new element to our cropped SpatialData object. For example, we can subset the nucleus boundaries to include only a single cell type, cholangiocytes, and save the result as a new element. First, retrieve the cell type annotations from the table.

cell_types = sd.get_values("annotation", crop_sdata2["table"])["annotation"]

Slice the nucleus boundaries to include only cholangiocytes and store the result as a new element.

crop_sdata2["cholangiocytes"] = crop_sdata2["nucleus_boundaries"][cell_types == "Cholangiocytes"]
crop_sdata2.pl.render_shapes("cholangiocytes").pl.show()

Note that crop_sdata2['cholangiocytes'] only exists in memory until we explicitly write it to disk with crop_sdata2.write_element(). Let's do that now.

crop_sdata2.write_element("cholangiocytes")
tmpdir.cleanup()

Next Steps#

Congratulations! Now you know the basics of SpatialData and how to use the core APIs. Check out the other tutorials for more advanced usage, integrated packages, and technology-specific examples. In particular, the “External tutorials” section links to analysis notebooks from third-party libraries that use SpatialData objects as their data model.