Intro to SpatialData#
This notebook introduces the core concepts and basic operations in SpatialData using a small subset of a spatial transcriptomics murine liver dataset (Guilliams et al. 2022). We will also introduce essential plotting with the spatialdata-plot companion library.
Note
The purpose of the
SpatialDataframework is to lay the foundation for methods/pipelines to analyze spatial omics data. It is not intended to analyze data itself, but to provide a common data model and APIs to manipulate spatial omics data. Browse the other tutorials for a demo of how various analysis packages work withSpatialData: see the “External tutorials” section.
Key terms and data model#
SpatialData objects are representations of spatial omics data, which are manipulated with the SpatialData framework. On-disk, these are stored as .zarr stores, which are essentially folders containing data and metadata in a structured, but flexible format.
We can think of a SpatialData object as a container for various Elements. An Element is either a SpatialElement (Images, Labels, Points, Shapes) or a Table. Here is a brief description:
Images: H&E, staining imagesLabels: pixel-level segmentationPoints: transcripts locations with gene information, landmarks pointsShapes: cell/nucleus boundaries, subcellular structures, anatomical annotations, regions of interest (ROIs)Tables: sparse/dense matrices annotating the theSpatialElementsor storing arbitrary (non-spatial) metadata. They do not contain spatial coordinates.
We can categorize the SpatialElements into two broad types:
Rasters: Data made up of pixels: includingImagesandLabelsVectors: Data made up of points and lines. Polygons are also vectors, since they are a simply a list of connected points.PointsandShapesare elements of this type.
We will see examples of manipulating these below and leave advanced usage in other tutorials.
See Also
Glossary for other key terms
Advanced Data Structures part 1 and Advanced Data Structures part 2
import spatialdata as sd
import spatialdata_plot # noqa: F401
Download the mouse_liver data from https://spatialdata.scverse.org/en/stable/tutorials/notebooks/datasets/README.html and unzip it. You should see a folder with the .zarr extension, which we will set as the data_path to load from disk. You can also use symlinks to make the data_path point to data stored outside the current working directory.
SpatialData objects are stored in the SpatialData Zarr format (find more in our design doc and developer resources), which uses Zarr for on-disk storage. Again, Zarr stores are simply folders containing data and metadata in a standardized, flexible format.
The spatialdata framework has three ways to construct SpatialData objects:
You can read a
SpatialDataobject that has already been saved to.zarrin the SpatialData Zarr format …… from disk.
… from the cloud (only partially implemented and not available in the current state, see more here).
You can use the reader functions from
spatialdata-iofor supported spatial omics platforms (e.g. Visium HD, MERSCOPE, etc.).You can construct a
SpatialDataobject from scratch using our PythonspatialdataAPIs …… using the
SpatialDataclass initializer.… by adding elements incrementally.
See Also
Here below we will simply read a
SpatialDataobject from disk. See the advanced data structures tutorial for technical details and considerations.
Information regarding data licensing for the dataset used below, and attribution, is available at: https://github.com/scverse/spatialdata-notebooks/tree/main/datasets.
data_path = "mouse_liver.zarr"
sdata = sd.read_zarr(data_path)
sdata
SpatialData object, with associated Zarr store: /Users/macbook/embl/projects/basel/spatialdata-sandbox/mouse_liver/data.zarr
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 6432, 6432), (1, 1608, 1608)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (6432, 6432)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (3375, 1) (2D shapes)
└── Tables
└── 'table': AnnData (3375, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
Explore the Elements of the SpatialData object#
Let’s explore each element type and how they are represented:
Images:
xarray.DataArrayorxarray.DataTreeobjects respectively for single-scale or multi-scale imagesLabels:
xarray.DataArrayorxarray.DataTreeobjects containing integer codes for different labelsPoints:
dask.DataFrameobjects containing point coordinates, (lazy version ofpandas.DataFrameobjects)Shapes:
geopandas.GeoDataFrameobjects containing geometric objects like polygonsTables:
anndata.AnnDataobjects for tabular data with annotations
See Also
For advanced manipulation of
Elementsand how to handle annotations withTables, see the “SpatialElements and tables” tutorial and the “Tables” tutorial.
Images#
Images are represented as xarray.DataArray or xarray.DataTree objects respectively for single-scale or multi-scale images. They will either be 2D (cyx) or 3D (czyx) arrays where the c axis is for channels and the z, y, and x axes are spatial coordinates.
We can access elements using keys e.g. sdata[key], similar to dictionaries.
sdata["raw_image"]
<xarray.DatasetView> Size: 0B
Dimensions: ()
Data variables:
*empty*Images can have multiple scales, which are stored in a pyramid for downsampled representations useful for more efficient visualization and analysis. We can access a single scale using the get_pyramid_levels function, where 0 is the highest resolution, 1 is downsampled by a factor of (usually 2), 2 is downsampled by another factor (usually 2, with respect to the previous scale), etc.
In this case, our image has 2 scales. Let’s access the highest resolution image.
sd.get_pyramid_levels(sdata["raw_image"], n=0)
<xarray.DataArray 'image' (c: 1, y: 6432, x: 6432)> Size: 83MB
dask.array<from-zarr, shape=(1, 6432, 6432), dtype=uint16, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
* c (c) int64 8B 0
* y (y) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03
* x (x) float64 51kB 0.5 1.5 2.5 3.5 ... 6.43e+03 6.43e+03 6.432e+03
Attributes:
transform: {'global': Identity }To view image properties such as axes and channel names, SpatialData provides some helper functions. More helper functions can be found in the API page.
sd.models.get_axes_names(sdata["raw_image"])
('c', 'y', 'x')
Here we see that our image only has one channel for DAPI staining.
sd.models.get_channel_names(sdata["raw_image"])
[np.int64(0)]
Let’s use spatialdata-plot to show the image (DAPI staining for nuclei).
sdata.pl.render_images("raw_image", cmap="gray").pl.show()
Points#
Points are represented as dask.DataFrame objects containing point coordinates, (lazy version of pandas.DataFrame objects). By lazy-loading the dataframe, we can reduce memory usage and only loading in-memory for computations.
For spatial transcriptomics datasets measuring single-molecules, coordinates are stored in columns x and y (optionally z for 3D) and have an additional column annotating gene identity.
See Also
Pointsinstances can be in 2D or 3D though some operations, such asrasterize(), or visualization operations currently do not support the 3D component.
sdata["transcripts"]
| x | y | gene | |
|---|---|---|---|
| npartitions=1 | |||
| float64 | float64 | category[unknown] | |
| ... | ... | ... |
We can easily convert the dask.DataFrame to a pandas.DataFrame so it is easier to work with. Note that this will load the entire object into memory. If the data is large you can used directly the dask APIs for the manipulation of the dataframe.
sdata["transcripts"].compute()
| x | y | gene | |
|---|---|---|---|
| 0 | 433.0 | 1217.0 | Adgre1 |
| 1 | 151.0 | 1841.0 | Adgre1 |
| 2 | 139.0 | 1983.0 | Adgre1 |
| 3 | 1349.0 | 1601.0 | Adgre1 |
| 4 | 784.0 | 1732.0 | Adgre1 |
| ... | ... | ... | ... |
| 1998863 | 5743.0 | 5233.0 | Lyve1 |
| 1998864 | 5721.0 | 4581.0 | Lyve1 |
| 1998865 | 5807.0 | 4842.0 | Lyve1 |
| 1998866 | 5843.0 | 5309.0 | Lyve1 |
| 1998869 | 5580.0 | 5226.0 | Lyve1 |
1153548 rows × 3 columns
Let’s visualize the points. Note that plotting automatically switches from matplotlib to datashader when the number of points is large, ensuring performant rendering. As you can see below, it will be more useful to plot subsets of transcripts, such as specific genes.
sdata.pl.render_points("transcripts").pl.show()
INFO Using 'datashader' backend with 'None' as reduction method to speed up plotting. Depending on the
reduction method, the value range of the plot might change. Set method to 'matplotlib' do disable this
behaviour.
Let’s plot the transcripts colored by gene. We can also use the groups argument to plot transcripts for a subset of genes.
Note
If the color column has many unique values (e.g. 1000s of genes), plotting will slow down dramatically.
sdata.pl.render_points("transcripts", color="gene", groups="Vwf", palette="red").pl.show()
Shapes#
Shapes are represented as geopandas.GeoDataFrame objects containing geometric objects like polygons. For a comprehensive guide on working with geopandas.GeoDataFrame objects, please refer to the geopandas documentation.
sdata["nucleus_boundaries"]
| geometry | |
|---|---|
| cell_ID | |
| 99 | POLYGON ((6277 798, 6277 799, 6272 799, 6272 8... |
| 142 | POLYGON ((6427 6123, 6423 6123, 6423 6124, 641... |
| 208 | POLYGON ((3747 858, 3747 859, 3746 859, 3746 8... |
| 235 | POLYGON ((752 6144, 752 6145, 763 6145, 763 61... |
| 336 | POLYGON ((5174 935, 5174 936, 5172 936, 5172 9... |
| ... | ... |
| 7652 | POLYGON ((1094 6028, 1094 6029, 1091 6029, 109... |
| 7680 | POLYGON ((1324 6063, 1324 6064, 1320 6064, 132... |
| 7694 | POLYGON ((1692 6063, 1692 6064, 1686 6064, 168... |
| 7708 | POLYGON ((768 6072, 768 6073, 765 6073, 765 60... |
| 7723 | POLYGON ((374 6074, 374 6076, 373 6076, 373 60... |
3375 rows × 1 columns
sdata.pl.render_shapes("nucleus_boundaries").pl.show()
Tables#
Annotated matrices are represented as anndata.AnnData objects. Ingested datasets will usually have a table located at sdata['table'] for a count/abundance matrix. This enables downstream analysis with tools like Scanpy, scVI, etc.
For different spatial technologies this quantifies:
transcriptomics: transcript counts;
spatial proteomics: marker abundances;
slide-based assays: spot/grid abundances.
See Also
For more advanced usage such as multi-table annotations and joining, see the Tables page.
sdata["table"]
AnnData object with n_obs × n_vars = 3375 × 99
obs: 'cell_ID', 'fov_labels', 'annotation'
uns: 'annotation_colors', 'spatialdata_attrs'
obsm: 'spatial'
Here is an example of plotting numerical information on the cells (“Hal” gene expression).
sdata.pl.render_shapes("nucleus_boundaries", color="Hal").pl.show()
Now is an example of plotting categorical information (cell types).
sdata.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()
Let’s visualize all the layers together. We can easily chain the operations to render multiple Elements overlayed on top of each other in the same plot.
(
sdata.pl.render_shapes(
"nucleus_boundaries",
color="annotation",
outline_alpha=1.0,
outline_width=0.5,
)
.pl.render_points(
"transcripts",
color="gene",
groups="Vwf",
palette="cyan",
)
.pl.render_points(
"transcripts",
color="gene",
groups="Clec4f",
palette="magenta",
)
.pl.show(figsize=(8, 8))
)
Spatial Querying and Subsetting#
Given a single-modal or multi-modal dataset, spatial queries allow to retrieve all the spatial elements and instances that are within a given rectangular window or polygonal shape.
For the purpose of this tutorial, our dataset is conveniently aligned to a single coordinate system global.
See Also
For advanced usage of coordinate systems and transformations see the Transformations notebooks, for a more in-depth look at query operations, see the Spatial Queries notebook.
Bounding box query subsets a given SpatialData object within a rectangular window. We can get the bounds of the dataset to help define min/max coordinates.
sd.get_extent(sdata)
{'y': (np.float64(0.0), np.float64(6432.0)),
'x': (np.float64(0.0), np.float64(6432.0))}
To perform a bounding box query, we provide the axes, min/max coordinates, and the target coordinate system for the bounding box coordinates.
crop_sdata1 = sd.bounding_box_query(
sdata, axes=("y", "x"), min_coordinate=[1000, 1000], max_coordinate=[2000, 2000], target_coordinate_system="global"
)
crop_sdata1
SpatialData object
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (96, 1) (2D shapes)
└── Tables
└── 'table': AnnData (96, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata1.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()
Similarly, we can perform a polygon query by using a shapely.Polygon object and passing the target coordinate system.
from shapely import Polygon
polygon = Polygon([(1000, 1000), (1500, 1000), (2000, 1500), (1500, 2000)])
crop_sdata2 = sd.polygon_query(sdata, polygon, target_coordinate_system="global")
crop_sdata2
SpatialData object
├── Images
│ └── 'raw_image': DataTree[cyx] (1, 1000, 1000), (1, 250, 250)
├── Labels
│ └── 'segmentation_mask': DataArray[yx] (1000, 1000)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│ └── 'nucleus_boundaries': GeoDataFrame shape: (45, 1) (2D shapes)
└── Tables
└── 'table': AnnData (45, 99)
with coordinate systems:
▸ 'global', with elements:
raw_image (Images), segmentation_mask (Labels), transcripts (Points), nucleus_boundaries (Shapes)
crop_sdata2.pl.render_shapes("nucleus_boundaries", color="annotation").pl.show()
Saving SpatialData Objects#
Currently, the cropped SpatialData object only exists in memory. Let’s save it by writing it to disk in a new .zarr store.
Note
It is not recommended to modify this in place. For more information on incremental reading and writing of
SpatialDataobjects, see more here.
from pathlib import Path
from tempfile import TemporaryDirectory
tmpdir = TemporaryDirectory()
crop_sdata2.write(Path(tmpdir.name) / "crop.zarr")
INFO The SpatialData object is not self-contained (i.e. it contains some elements that are Dask-backed from
locations outside /tmp/tmppc7yy78v/crop.zarr). Please see the documentation of `is_self_contained()` to
understand the implications of working with SpatialData objects that are not self-contained.
INFO The Zarr backing store has been changed from None the new file path: /tmp/tmppc7yy78v/crop.zarr
Saving new elements#
Let’s say we want to add a new element to our cropped SpatialData object. For example, we can subset the nucleus boundaries to only include a single cell type, cholangiocytes and save it as a new element. First retrieve the cell type annotations from the table.
cell_types = sd.get_values("annotation", crop_sdata2["table"])["annotation"]
Slice the nucleus boundaries to only include cholangiocytes and save it as a new element.
crop_sdata2["cholangiocytes"] = crop_sdata2["nucleus_boundaries"][cell_types == "Cholangiocytes"]
crop_sdata2.pl.render_shapes("cholangiocytes").pl.show()
Note that sdata[‘cholangiocytes’] only exists in-memory until we explicity write it to disk with crop_sdata2.write_element(). Let’s do that now.
crop_sdata2.write_element("cholangiocytes")
tmpdir.cleanup()
Next Steps#
Congratulations! Now you know the basics of SpatialData and how to use the core APIs. Check out the other tutorials for more advanced usage, integrated packages, and technology-specific examples. In particular, the section “External tutorials” will link to analysis notebooks, available in third-party libraries that use SpatialData objects as a data model.