Intro to SpatialData#
This notebook introduces the core concepts and basic operations in SpatialData
using a small subset of a spatial transcriptomics murine liver dataset (Guilliams et al. 2022). We will also introduce essential plotting with the spatialdata-plot
companion library.
Note
The purpose of the
SpatialData
framework is to lay the foundation for methods/pipelines to analyze spatial omics data. It is not intended to analyze data itself, but to provide a common data model and APIs to manipulate spatial omics data. Browse the other tutorials for a demo of how various analysis packages work withSpatialData
: see the “External tutorials” section.
Key terms and data model#
SpatialData
objects are representations of spatial omics data, which are manipulated with the SpatialData
framework. On-disk, these are stored as .zarr
stores, which are essentially folders containing data and metadata in a structured, but flexible format.
We can think of a SpatialData
object as a container for various Elements
. An Element
is either a SpatialElement
(Images
, Labels
, Points
, Shapes
) or a Table
. Here is a brief description:
Images
: H&E, staining imagesLabels
: pixel-level segmentationPoints
: transcripts locations with gene information, landmarks pointsShapes
: cell/nucleus boundaries, subcellular structures, anatomical annotations, regions of interest (ROIs)Tables
: sparse/dense matrices annotating the theSpatialElements
or storing arbitrary (non-spatial) metadata. They do not contain spatial coordinates.
We can categorize the SpatialElements
into two broad types:
Rasters
: Data made up of pixels: includingImages
andLabels
Vectors
: Data made up of points and lines. Polygons are also vectors, since they are a simply a list of connected points.Points
andShapes
are elements of this type.
We will see examples of manipulating these below and leave advanced usage in other tutorials.
See Also
Glossary for other key terms
Advanced Data Structures part 1 and Advanced Data Structures part 2

import matplotlib as mpl
import matplotlib.pyplot as plt
import spatialdata as sd
import spatialdata_plot
Load SpatialData .zarr
data from disk#
Download the example dataset from here and unzip it. You should see a folder with the .zarr
extension, which we will set as the data_path
to load from disk. You can also use symlinks to make the data_path point to data stored outside the current working directory.
SpatialData
objects are stored in the SpatialData Zarr format (find more in our design doc and developer resources), which uses Zarr for on-disk storage. Again, Zarr stores are simply folders containing data and metadata in a standardized, flexible format.
The spatialdata framework has three ways to construct SpatialData objects:
You can read a
SpatialData
object that has already been saved to.zarr
in the SpatialData Zarr format …… from disk.
… from the cloud (find more here).
You can use the reader functions from
spatialdata-io
for supported spatial omics platforms (e.g. Visium HD, MERSCOPE, etc.).You can construct a
SpatialData
object from scratch using our Pythonspatialdata
APIs …… using the
SpatialData
class initializer.… by adding elements incrementally.
See Also
Here below we will simply read a
SpatialData
object from disk. See the advanced data structures tutorial for technical details and considerations.
data_path = "mouse_liver.zarr"
sdata = sd.read_zarr(data_path)
sdata
SpatialData object, with associated Zarr store: /Users/macbook/embl/projects/basel/spatialdata-sandbox/mouse_liver/data.zarr
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 6432, 6432), (1, 1608, 1608)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (6432, 6432)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (3375, 1) (2D shapes)
└── Tables
└── 'table': AnnData (3375, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
Explore the Elements
of the SpatialData
object#
Let’s explore each element type and how they are represented:
Images:
xarray.DataArray
orxarray.DataTree
objects respectively for single-scale or multi-scale imagesLabels:
xarray.DataArray
orxarray.DataTree
objects containing integer codes for different labelsPoints:
dask.DataFrame
objects containing point coordinates, (lazy version ofpandas.DataFrame
objects)Shapes:
geopandas.GeoDataFrame
objects containing geometric objects like polygonsTables:
anndata.AnnData
objects for tabular data with annotations
See Also
For advanced manipulation of
Elements
and how to handle annotations withTables
, see the “SpatialElements and tables” tutorial and the “Tables” tutorial.
Images#
Images are represented as xarray.DataArray
or xarray.DataTree
objects respectively for single-scale or multi-scale images. They will either be 2D (cyx) or 3D (czyx) arrays where the c
axis is for channels and the z
, y
, and x
axes are spatial coordinates.
We can access elements using keys e.g. sdata[key]
, similar to dictionaries.
sdata["raw_image"]
<xarray.DatasetView> Size: 0B Dimensions: () Data variables: *empty*
Images can have multiple scales, which are stored in a pyramid for downsampled representations useful for more efficient visualization and analysis. We can access a single scale using the get_pyramid_levels
function, where 0
is the highest resolution, 1
is downsampled by a factor of (usually 2), 2
is downsampled by another factor (usually 2, with respect to the previous scale), etc.
In this case, our image has 2 scales. Let’s access the highest resolution image.
sd.get_pyramid_levels(sdata["raw_image"], n=0)
<xarray.DataArray 'image' (c: 1, y: 6432, x: 6432)> Size: 83MB dask.array<from-zarr, shape=(1, 6432, 6432), dtype=uint16, chunksize=(1, 256, 256), chunktype=numpy.ndarray> Coordinates: * c (c) int64 8B 0 * y (y) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03 * x (x) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03 Attributes: transform: {'global': Identity }
To view image properties such as axes and channel names, SpatialData
provides some helper functions. More helper functions can be found in the API page.
sd.models.get_axes_names(sdata["raw_image"])
('c', 'y', 'x')
Here we see that our image only has one channel for DAPI staining.
sd.models.get_channel_names(sdata["raw_image"])
[np.int64(0)]
Let’s use spatialdata-plot
to show the image (DAPI staining for nuclei).
sdata.pl.render_images("raw_image", cmap="gray").pl.show()

Points#
Points are represented as dask.DataFrame
objects containing point coordinates, (lazy version of pandas.DataFrame
objects). By lazy-loading the dataframe, we can reduce memory usage and only loading in-memory for computations.
For spatial transcriptomics datasets measuring single-molecules, coordinates are stored in columns x
and y
(optionally z
for 3D) and have an additional column annotating gene identity.
See Also
Points
instances can be in 2D or 3D though some operations, such asrasterize()
, or visualization operations currently do not support the 3D component.
sdata["transcripts"]
x | y | gene | |
---|---|---|---|
npartitions=1 | |||
float64 | float64 | category[unknown] | |
... | ... | ... |
We can easily convert the dask.DataFrame
to a pandas.DataFrame
so it is easier to work with. Note that this will load the entire object into memory. If the data is large you can used directly the dask
APIs for the manipulation of the dataframe.
sdata["transcripts"].compute()
x | y | gene | |
---|---|---|---|
0 | 433.0 | 1217.0 | Adgre1 |
1 | 151.0 | 1841.0 | Adgre1 |
2 | 139.0 | 1983.0 | Adgre1 |
3 | 1349.0 | 1601.0 | Adgre1 |
4 | 784.0 | 1732.0 | Adgre1 |
... | ... | ... | ... |
1998863 | 5743.0 | 5233.0 | Lyve1 |
1998864 | 5721.0 | 4581.0 | Lyve1 |
1998865 | 5807.0 | 4842.0 | Lyve1 |
1998866 | 5843.0 | 5309.0 | Lyve1 |
1998869 | 5580.0 | 5226.0 | Lyve1 |
1153548 rows × 3 columns
Let’s visualize the points. Note that plotting automatically switches from matplotlib
to datashader
when the number of points is large, ensuring performant rendering. As you can see below, it will be more useful to plot subsets of transcripts, such as specific genes.
sdata.pl.render_points("transcripts").pl.show()
INFO Using 'datashader' backend with 'None' as reduction method to speed up plotting. Depending on the
reduction method, the value range of the plot might change. Set method to 'matplotlib' do disable this
behaviour.

Let’s plot the transcripts colored by gene. We can also use the groups
argument to plot transcripts for a subset of genes.
Note
If the color column has many unique values (e.g. 1000s of genes), plotting will slow down dramatically.
sdata.pl.render_points("transcripts", color="gene", groups="Vwf", palette="red").pl.show()

Shapes#
Shapes are represented as geopandas.GeoDataFrame
objects containing geometric objects like polygons. For a comprehensive guide on working with geopandas.GeoDataFrame
objects, please refer to the geopandas
documentation.
sdata["nucleus_boundaries"]
geometry | |
---|---|
cell_ID | |
99 | POLYGON ((6277 798, 6277 799, 6272 799, 6272 8... |
142 | POLYGON ((6427 6123, 6423 6123, 6423 6124, 641... |
208 | POLYGON ((3747 858, 3747 859, 3746 859, 3746 8... |
235 | POLYGON ((752 6144, 752 6145, 763 6145, 763 61... |
336 | POLYGON ((5174 935, 5174 936, 5172 936, 5172 9... |
... | ... |
7652 | POLYGON ((1094 6028, 1094 6029, 1091 6029, 109... |
7680 | POLYGON ((1324 6063, 1324 6064, 1320 6064, 132... |
7694 | POLYGON ((1692 6063, 1692 6064, 1686 6064, 168... |
7708 | POLYGON ((768 6072, 768 6073, 765 6073, 765 60... |
7723 | POLYGON ((374 6074, 374 6076, 373 6076, 373 60... |
3375 rows × 1 columns
sdata.pl.render_shapes("nucleus_boundaries").pl.show()

Tables#
Annotated matrices are represented as anndata.AnnData
objects. Ingested datasets will usually have a table located at sdata['table']
for a count/abundance matrix. This enables downstream analysis with tools like Scanpy, scVI, etc.
For different spatial technologies this quantifies:
transcriptomics: transcript counts;
spatial proteomics: marker abundances;
slide-based assays: spot/grid abundances.
See Also
For more advanced usage such as multi-table annotations and joining, see the Tables page.
sdata["table"]
AnnData object with n_obs × n_vars = 3375 × 99
obs: 'cell_ID', 'fov_labels', 'annotation'
uns: 'annotation_colors', 'spatialdata_attrs'
obsm: 'spatial'
Here is an example of plotting numerical information on the cells (“Hal” gene expression).
sdata.pl.render_shapes("nucleus_boundaries", color="Hal").pl.show()

Now is an example of plotting categorical information (cell types).
sdata.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Let’s visualize all the layers together. We can easily chain the operations to render multiple Elements
overlayed on top of each other in the same plot.
(
sdata.pl.render_shapes(
"nucleus_boundaries",
color="annotation",
outline_alpha=1.0,
outline_width=0.5,
)
.pl.render_points(
"transcripts",
color="gene",
groups="Vwf",
palette="cyan",
)
.pl.render_points(
"transcripts",
color="gene",
groups="Clec4f",
palette="magenta",
)
.pl.show(figsize=(8, 8))
)

Spatial Querying and Subsetting#
Given a single-modal or multi-modal dataset, spatial queries allow to retrieve all the spatial elements and instances that are within a given rectangular window or polygonal shape.
For the purpose of this tutorial, our dataset is conveniently aligned to a single coordinate system global
.
See Also
For advanced usage of coordinate systems and transformations see the Transformations notebooks, for a more in-depth look at query operations, see the Spatial Queries notebook.
Bounding box query subsets a given SpatialData
object within a rectangular window. We can get the bounds of the dataset to help define min/max coordinates.
sd.get_extent(sdata)
{'y': (np.float64(0.0), np.float64(6432.0)),
'x': (np.float64(0.0), np.float64(6432.0))}
To perform a bounding box query, we provide the axes, min/max coordinates, and the target coordinate system for the bounding box coordinates.
crop_sdata1 = sd.bounding_box_query(
sdata, axes=("y", "x"), min_coordinate=[1000, 1000], max_coordinate=[2000, 2000], target_coordinate_system="global"
)
crop_sdata1
SpatialData object
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (96, 1) (2D shapes)
└── Tables
└── 'table': AnnData (96, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata1.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Similarly, we can perform a polygon query by using a shapely.Polygon
object and passing the target coordinate system.
from shapely import Polygon
polygon = Polygon([(1000, 1000), (1500, 1000), (2000, 1500), (1500, 2000)])
crop_sdata2 = sd.polygon_query(sdata, polygon, target_coordinate_system="global")
crop_sdata2
SpatialData object
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (45, 1) (2D shapes)
└── Tables
└── 'table': AnnData (45, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata2.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()

Saving SpatialData
Objects#
Currently, the cropped SpatialData
object only exists in memory. Let’s save it by writing it to disk in a new .zarr
store.
Note
It is not recommended to modify this in place. For more information on incremental reading and writing of
SpatialData
objects, see more here.
from pathlib import Path
from tempfile import TemporaryDirectory
tmpdir = TemporaryDirectory()
crop_sdata2.write(Path(tmpdir.name) / "crop.zarr")
INFO The SpatialData object is not self-contained (i.e. it contains some elements that are Dask-backed from
locations outside /tmp/tmppc7yy78v/crop.zarr). Please see the documentation of `is_self_contained()` to
understand the implications of working with SpatialData objects that are not self-contained.
INFO The Zarr backing store has been changed from None the new file path: /tmp/tmppc7yy78v/crop.zarr
Saving new elements#
Let’s say we want to add a new element to our cropped SpatialData
object. For example, we can subset the nucleus boundaries to only include a single cell type, cholangiocytes and save it as a new element. First retrieve the cell type annotations from the table.
cell_types = sd.get_values("annotation", crop_sdata2["table"])["annotation"]
Slice the nucleus boundaries to only include cholangiocytes and save it as a new element.
crop_sdata2["cholangiocytes"] = crop_sdata2["nucleus_boundaries"][cell_types == "Cholangiocytes"]
crop_sdata2.pl.render_shapes("cholangiocytes").pl.show()

Note that sdata[‘cholangiocytes’] only exists in-memory until we explicity write it to disk with crop_sdata2.write_element()
. Let’s do that now.
crop_sdata2.write_element("cholangiocytes")
tmpdir.cleanup()
Next Steps#
Congratulations! Now you know the basics of SpatialData
and how to use the core APIs. Check out the other tutorials for more advanced usage, integrated packages, and technology-specific examples. In particular, the section “External tutorials” will link to analysis notebooks, available in third-party libraries that use SpatialData
objects as a data model.