Filtering SpatialData elements with Table Queries
See also the “Tables” tutorial for a discussion of tables and annotations.
Introduction
The spatialdata framework supports both the representation of SpatialElements (images, labels, points, shapes) and of annotations for these elements. As we explored in the tables notebook, some types of SpatialElements can contain annotations within themselves, but the general approach we take is to represent SpatialElements and annotations in separate objects using AnnData tables.
In this notebook we introduce table queries, a filtering mechanism that allows you to subset both the annotations (tables) and their corresponding spatial elements using an expressive query syntax. This functionality is provided by the filter_by_table_query() function, which uses the annsel library for building query expressions. Under the hood, annsel uses narwhals, an “extremely lightweight and extensible compatibility layer between dataframe libraries”. This notebook assumes that you have familiarized yourself with the content of the tables notebook.
Setup and Data Loading
Let's start by importing the necessary libraries and loading the example blobs dataset.
from pathlib import Path
import annsel as an
import numpy as np
import spatialdata as sd
from spatialdata.datasets import blobs
blobs_sdata = blobs()
blobs_sdata
SpatialData object
├── Images
│ ├── 'blobs_image': DataArray[cyx] (3, 512, 512)
│ └── 'blobs_multiscale_image': DataTree[cyx] (3, 512, 512), (3, 256, 256), (3, 128, 128)
├── Labels
│ ├── 'blobs_labels': DataArray[yx] (512, 512)
│ └── 'blobs_multiscale_labels': DataTree[yx] (512, 512), (256, 256), (128, 128)
├── Points
│ └── 'blobs_points': DataFrame with shape: (<Delayed>, 4) (2D points)
├── Shapes
│ ├── 'blobs_circles': GeoDataFrame shape: (5, 2) (2D shapes)
│ ├── 'blobs_multipolygons': GeoDataFrame shape: (2, 1) (2D shapes)
│ └── 'blobs_polygons': GeoDataFrame shape: (5, 1) (2D shapes)
└── Tables
└── 'table': AnnData (26, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_image (Images), blobs_multiscale_image (Images), blobs_labels (Labels), blobs_multiscale_labels (Labels), blobs_points (Points), blobs_circles (Shapes), blobs_multipolygons (Shapes), blobs_polygons (Shapes)
The table in the blobs dataset is rather minimal, so we will artificially add a few columns (cell_type, cell_type_granular, and area) to help illustrate the functionality.
rng = np.random.default_rng(123456)
blobs_sdata.tables["table"].obs["cell_type"] = rng.choice(
["A", "B", "C", "C", "AA", "BB", "CC"], size=blobs_sdata.tables["table"].n_obs
)
blobs_sdata.tables["table"].obs["cell_type_granular"] = rng.choice(
["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"], size=blobs_sdata.tables["table"].n_obs
)
blobs_sdata.tables["table"].obs["area"] = rng.choice(
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100], size=blobs_sdata.tables["table"].n_obs
)
Supported Operations
Basic Filtering Examples
Now let’s explore how to filter our blobs SpatialData object using table queries.
The most common use case is to filter based on observations (obs):
blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name="table", obs_expr=an.col("cell_type") == "A")
blobs_sdata_filtered
SpatialData object
├── Labels
│ └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
└── 'table': AnnData (5, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_labels (Labels)
print(
f"\nObservations reduced from {blobs_sdata.tables['table'].n_obs} to {blobs_sdata_filtered.tables['table'].n_obs}"
)
Observations reduced from 26 to 5
Breaking Down an.col("cell_type") == "A"
What is an.col("cell_type")?
an.col("cell_type") creates a column reference to the “cell_type” column. The reference itself does not say whether the column lives in obs or var; passing it as the obs_expr argument tells the function to filter the obs component of the AnnData table. Think of it as saying “I want to work with the cell_type column”.
What does == "A" do?
The == "A" comparison is applied to that column reference, creating a boolean condition that is True for rows where cell_type equals “A” and False everywhere else.
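To make the idea of a deferred column expression concrete, here is a toy model in plain Python. It is not how annsel or narwhals are implemented; the `Col` and `Predicate` classes are hypothetical names used only for this illustration. The point is that `Col("cell_type") == "A"` builds an object that describes the comparison, and nothing is evaluated until rows are supplied.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Predicate:
    """A deferred boolean condition over a single row (toy stand-in)."""

    func: Callable[[Row], bool]

    def __call__(self, row: Row) -> bool:
        return self.func(row)


class Col:
    """A toy column reference; hypothetical, not the annsel API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Predicate:  # type: ignore[override]
        # Return a description of the comparison instead of a bool.
        return Predicate(lambda row: row[self.name] == other)


# Invented rows standing in for a few lines of table.obs.
rows = [
    {"instance_id": 1, "cell_type": "A"},
    {"instance_id": 2, "cell_type": "AA"},
    {"instance_id": 3, "cell_type": "A"},
]

is_type_a = Col("cell_type") == "A"  # nothing is evaluated yet
kept_ids = [row["instance_id"] for row in rows if is_type_a(row)]
print(kept_ids)  # [1, 3]
```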
Why This Syntax Design?
Under the hood, the expressions are built and evaluated by narwhals. If you have a keen eye, you may notice that the syntax is similar to Polars: the Narwhals API follows the ergonomics of Polars as closely as it can.
Let's take a look at another example. This time we want to select the observations that belong to the blobs_labels region.
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
obs_expr=an.col("region") == "blobs_labels",
)
blobs_sdata_filtered
SpatialData object
├── Labels
│ └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
└── 'table': AnnData (26, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_labels (Labels)
Since all the observations in the table are from the blobs_labels element, the table query returns a SpatialData object whose AnnData table is unchanged. Of the SpatialElements, however, only blobs_labels is kept.
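This behaviour can be pictured with a small, library-free sketch: after the rows are filtered, only the elements still named in the region column are retained. It is a conceptual model under that assumption, not spatialdata's actual implementation, and the dictionaries below are made-up stand-ins for a SpatialData object.

```python
# Made-up stand-ins: element names mapped to placeholder payloads.
elements = {
    "blobs_image": "image data",
    "blobs_labels": "label data",
    "blobs_circles": "circle shapes",
}

# A tiny stand-in for table.obs: each row is annotated by exactly one region.
obs_rows = [
    {"instance_id": 1, "region": "blobs_labels"},
    {"instance_id": 2, "region": "blobs_labels"},
    {"instance_id": 3, "region": "blobs_labels"},
]

# Step 1: filter the rows (here every row passes, as in the region query).
kept_rows = [row for row in obs_rows if row["region"] == "blobs_labels"]

# Step 2: keep only the elements that the surviving rows still annotate.
annotated = {row["region"] for row in kept_rows}
kept_elements = {name: data for name, data in elements.items() if name in annotated}

print(len(kept_rows), sorted(kept_elements))  # 3 ['blobs_labels']
```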
You can also filter based on numeric values, as you’d expect.
blobs_sdata_filtered = sd.filter_by_table_query(blobs_sdata, table_name="table", obs_expr=an.col("instance_id") <= 10)
blobs_sdata_filtered
SpatialData object
├── Labels
│ └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
└── 'table': AnnData (9, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_labels (Labels)
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata, table_name="table", obs_expr=an.col("instance_id").is_in([1, 3, 5, 8, 13])
)
blobs_sdata_filtered
SpatialData object
├── Labels
│ └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
└── 'table': AnnData (5, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_labels (Labels)
Supported Operators and Expressions
an.col("column_name"): reference a column in obs or var. Note: this can be multiple columns, e.g. an.col(["column_name1", "column_name2"])
Special “columns”:
an.obs_names: reference observation names (row indices, aka AnnData.obs_names)
an.var_names: reference variable names (column names, aka AnnData.var_names)
Comparison operators: >, >=, <, <=, ==, !=
Membership: .is_in([list])
String methods: .str.contains(), .str.starts_with(), .str.ends_with()
Logical: & (and), | (or), ~ (not)
As long as an expression does not perform an aggregation under the hood or change length, it can be used.
For a full list of supported operators and expressions, see the corresponding narwhals documentation.
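The requirement that an expression must not aggregate or change length can be illustrated without any library: a filter needs one boolean per row, which a comparison provides and an aggregation does not. The values below are invented for the illustration.

```python
areas = [10, 80, 60, 100, 30]

# A comparison is elementwise: one boolean per row, so it can act as a mask.
mask = [area > 50 for area in areas]
assert len(mask) == len(areas)
kept = [area for area, keep in zip(areas, mask) if keep]
print(kept)  # [80, 60, 100]

# An aggregation collapses the column to a single value, so it cannot
# say which individual rows to keep.
mean_area = sum(areas) / len(areas)
print(mean_area)  # 56.0
```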
We can also combine multiple expressions per table component (obs, var, etc.).
Here we select observations whose cell type starts with "A", or whose cell_type_granular is in ["A", "B", "C"].
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
obs_expr=((an.col("cell_type").str.starts_with("A")) | (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered
SpatialData object
├── Labels
│ └── 'blobs_labels': DataArray[yx] (512, 512)
└── Tables
└── 'table': AnnData (16, 3)
with coordinate systems:
▸ 'global', with elements:
blobs_labels (Labels)
There are two ways to express “and” in table queries:
Using the & operator between two expressions
Using a tuple of expressions
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
obs_expr=((an.col("cell_type").str.starts_with("A")), (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered.tables["table"].obs
| | instance_id | region | cell_type | cell_type_granular | area |
|---|---|---|---|---|---|
| 18 | 18 | blobs_labels | AA | C | 80 |
| 19 | 19 | blobs_labels | A | C | 60 |
| 26 | 26 | blobs_labels | AA | A | 100 |
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
obs_expr=((an.col("cell_type").str.starts_with("A")) & (an.col("cell_type_granular").is_in(["A", "B", "C"]))),
)
blobs_sdata_filtered.tables["table"].obs
| | instance_id | region | cell_type | cell_type_granular | area |
|---|---|---|---|---|---|
| 18 | 18 | blobs_labels | AA | C | 80 |
| 19 | 19 | blobs_labels | A | C | 60 |
| 26 | 26 | blobs_labels | AA | A | 100 |
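Both forms give the same rows because a tuple of conditions is treated as a conjunction. The sketch below shows that equivalence in plain Python, with ordinary functions standing in for annsel expressions. The rows are a handful copied from the obs tables shown in this notebook; the helper names are hypothetical.

```python
# (instance_id, cell_type, cell_type_granular) for a few rows of the blobs table.
rows = [
    {"instance_id": 1, "cell_type": "A", "cell_type_granular": "F"},
    {"instance_id": 4, "cell_type": "C", "cell_type_granular": "E"},
    {"instance_id": 13, "cell_type": "C", "cell_type_granular": "A"},
    {"instance_id": 16, "cell_type": "B", "cell_type_granular": "C"},
    {"instance_id": 18, "cell_type": "AA", "cell_type_granular": "C"},
    {"instance_id": 19, "cell_type": "A", "cell_type_granular": "C"},
    {"instance_id": 26, "cell_type": "AA", "cell_type_granular": "A"},
]


def starts_with_a(row):
    return row["cell_type"].startswith("A")


def granular_in_abc(row):
    return row["cell_type_granular"] in {"A", "B", "C"}


conditions = (starts_with_a, granular_in_abc)

# Tuple form: a row is kept only if every condition holds.
tuple_form = [r["instance_id"] for r in rows if all(cond(r) for cond in conditions)]

# Operator form: the same conjunction written out explicitly.
and_form = [r["instance_id"] for r in rows if starts_with_a(r) and granular_in_abc(r)]

# For contrast, "or" keeps rows that match either condition.
or_form = [r["instance_id"] for r in rows if starts_with_a(r) or granular_in_abc(r)]

print(tuple_form)  # [18, 19, 26]
print(and_form)    # [18, 19, 26]
print(or_form)     # [1, 13, 16, 18, 19, 26]
```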
blobs_sdata_filtered.tables["table"].var_names
Index(['channel_0_sum', 'channel_1_sum', 'channel_2_sum'], dtype='object')
In this example, suppose that channel_0_sum is important to you when its expression value for an observation is greater than 125. We can also filter on that column of the X matrix.
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
x_expr=an.col("channel_0_sum") > 125,
)
blobs_sdata_filtered.tables["table"].obs
| | instance_id | region | cell_type | cell_type_granular | area |
|---|---|---|---|---|---|
| 1 | 1 | blobs_labels | A | F | 20 |
| 2 | 2 | blobs_labels | AA | F | 10 |
| 3 | 3 | blobs_labels | BB | C | 80 |
| 4 | 4 | blobs_labels | C | E | 10 |
| 5 | 5 | blobs_labels | C | B | 50 |
| 6 | 6 | blobs_labels | A | D | 60 |
| 8 | 8 | blobs_labels | A | G | 30 |
| 9 | 9 | blobs_labels | CC | H | 50 |
| 10 | 10 | blobs_labels | B | I | 50 |
| 13 | 13 | blobs_labels | C | A | 40 |
| 15 | 15 | blobs_labels | C | I | 90 |
| 16 | 16 | blobs_labels | B | C | 90 |
| 17 | 17 | blobs_labels | C | E | 10 |
| 23 | 23 | blobs_labels | BB | H | 100 |
| 25 | 25 | blobs_labels | C | C | 70 |
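Conceptually, x_expr evaluates the condition on one column of the expression matrix X and keeps the matching observations. The sketch below mimics that with plain lists. The numbers are invented, and the code only illustrates the idea; it does not show how spatialdata or annsel evaluate the expression.

```python
var_names = ["channel_0_sum", "channel_1_sum", "channel_2_sum"]
obs_names = ["1", "2", "3", "4"]

# Invented expression matrix: one row per observation, one column per variable.
X = [
    [300.0, 12.0, 40.0],
    [90.0, 500.0, 10.0],
    [126.0, 0.0, 0.0],
    [125.0, 80.0, 310.0],
]

# Locate the column named in the expression, then test it row by row.
col_idx = var_names.index("channel_0_sum")
mask = [row[col_idx] > 125 for row in X]

# Only observations (rows) are dropped; the variables (columns) are untouched.
kept_obs = [name for name, keep in zip(obs_names, mask) if keep]
print(kept_obs)  # ['1', '3']
```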
And of course you can combine different filters across different AnnData Table components.
blobs_sdata_filtered = sd.filter_by_table_query(
blobs_sdata,
table_name="table",
obs_expr=an.col("cell_type") == "B",
x_expr=an.col("channel_0_sum") > 125,
)
blobs_sdata_filtered.tables["table"].obs
| | instance_id | region | cell_type | cell_type_granular | area |
|---|---|---|---|---|---|
| 10 | 10 | blobs_labels | B | I | 50 |
| 16 | 16 | blobs_labels | B | C | 90 |
Using a Real Dataset
To wrap up the notebook, we’ll take a look at querying the mibitof dataset. There is also a companion notebook in the documentation that explores and plots the dataset.
mibitof_zarr_path = Path("mibitof.zarr").expanduser()
mibitof_sdata = sd.read_zarr(mibitof_zarr_path)
mibitof_sdata
SpatialData object, with associated Zarr store: /Users/macbook/embl/projects/basel/spatialdata-sandbox/mibitof/data.zarr
├── Images
│ ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│ ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│ └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│ ├── 'point8_labels': DataArray[yx] (1024, 1024)
│ ├── 'point16_labels': DataArray[yx] (1024, 1024)
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (3309, 36)
with coordinate systems:
▸ 'point8', with elements:
point8_image (Images), point8_labels (Labels)
▸ 'point16', with elements:
point16_image (Images), point16_labels (Labels)
▸ 'point23', with elements:
point23_image (Images), point23_labels (Labels)
Let's also take a brief look at the obs component of the AnnData table. Here are a few columns of interest:
point: The name of the Field of View (FOV) that an observation belongs to (in this case the observations are cells)
cell_size: The area of a cell
donor: The donor that the cell is from
Cluster: The cluster / cell type that the cell belongs to
batch: The batch that the cell is from (usually with respect to the donor or point / FOV)
library_id: An identifier pointing to which SpatialElement the observation belongs to.
mibitof_sdata.tables["table"].obs
| | row_num | point | cell_id | X1 | center_rowcoord | center_colcoord | cell_size | category | donor | Cluster | batch | library_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9376-1 | 9479 | 8 | 2 | 65222.0 | 37.0 | 6.0 | 474.0 | carcinoma | 90de | Epithelial | 1 | point8_labels |
| 9377-1 | 9480 | 8 | 4 | 65224.0 | 314.0 | 3.0 | 126.0 | carcinoma | 90de | Epithelial | 1 | point8_labels |
| 9378-1 | 9481 | 8 | 5 | 65225.0 | 407.0 | 6.0 | 398.0 | carcinoma | 90de | Epithelial | 1 | point8_labels |
| 9379-1 | 9482 | 8 | 6 | 65226.0 | 439.0 | 20.0 | 1749.0 | carcinoma | 90de | Epithelial | 1 | point8_labels |
| 9380-1 | 9483 | 8 | 7 | 65227.0 | 479.0 | 6.0 | 407.0 | carcinoma | 90de | Imm_other | 1 | point8_labels |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4270-0 | 4322 | 23 | 1479 | 61793.0 | 519.0 | 1018.0 | 125.0 | carcinoma | 21d7 | Tcell_CD4 | 0 | point23_labels |
| 4271-0 | 4323 | 23 | 1480 | 61794.0 | 929.0 | 1018.0 | 190.0 | carcinoma | 21d7 | Imm_other | 0 | point23_labels |
| 4272-0 | 4324 | 23 | 1481 | 61795.0 | 999.0 | 1019.0 | 173.0 | carcinoma | 21d7 | Imm_other | 0 | point23_labels |
| 4273-0 | 4325 | 23 | 1482 | 61796.0 | 322.0 | 1018.0 | 181.0 | carcinoma | 21d7 | Myeloid_CD11c | 0 | point23_labels |
| 4274-0 | 4326 | 23 | 1483 | 61797.0 | 785.0 | 1019.0 | 178.0 | carcinoma | 21d7 | Tcell_CD4 | 0 | point23_labels |
3309 rows × 12 columns
In this example, we’re picking donor “21d7” and keeping vars that either start with "CD" or are "ASCT2" or "ATP5A".
mibitof_sdata_filtered = sd.filter_by_table_query(
mibitof_sdata,
# filter_tables=False,
table_name="table",
obs_expr=an.col("donor") == "21d7",
var_names_expr=(an.var_names.is_in(["ASCT2", "ATP5A"]) | an.var_names.str.starts_with("CD")),
)
mibitof_sdata_filtered
SpatialData object
├── Labels
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (1241, 14)
with coordinate systems:
▸ 'point23', with elements:
point23_labels (Labels)
If your spatialdata object has a lot of SpatialElements and you only want to apply the filter to a subset of them, you can use the element_names parameter to specify which ones you want to use for the filter!
As a final example, let’s take it up a few notches and use most of the features of the filter_by_table_query function. We will also use the method version of the query instead of the function. They behave the same way, except that the method version passes in its own SpatialData object.
We’ll subset to specific SpatialElements and apply filters across the obs, var, and X components of the AnnData table with a variety of queries.
mibitof_sdata_filtered = mibitof_sdata_filtered.filter_by_table_query(
table_name="table",
element_names=["point23_labels", "point8_labels"],
# Filter observations (obs) based on multiple conditions
obs_expr=(
# Cells from donor 21d7 OR 90de
an.col("donor").is_in(["21d7", "90de"])
# AND cells with size greater than 400
& (an.col("cell_size") > 400)
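# AND cells in the Epithelial cluster
& (an.col("Cluster") == "Epithelial")
# OR any cell whose cluster name contains "Tcell". Note: & binds tighter than |,
# so this alternative is not restricted by the donor and size conditions above.
| (an.col("Cluster").str.contains("Tcell"))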
),
# Filter variables (var) based on multiple conditions
var_names_expr=(
# Select columns that start with CD
an.var_names.str.starts_with("CD")
# OR columns that contain "ATP"
| an.var_names.str.contains("ATP")
# OR specific columns
| an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
),
# Filter based on expression values
x_expr=(
# Keep cells where ASCT2 is greater than 0.1
(an.col("ASCT2") > 0.1)
# AND less than 2 for ASCT2
& (an.col("ASCT2") < 2)
),
how="right",
)
mibitof_sdata_filtered
SpatialData object
├── Labels
│ └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
└── 'table': AnnData (103, 14)
with coordinate systems:
▸ 'point23', with elements:
point23_labels (Labels)
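When mixing & and | in one expression, keep in mind that Python gives & higher precedence than |, so a & b | c is read as (a & b) | c. Use explicit parentheses to group the alternatives you intend. A quick check with plain booleans (the variable names are only illustrative):

```python
donor_ok = False       # e.g. the donor condition fails for this cell
is_epithelial = False
is_tcell = True

# Without parentheses: parsed as (donor_ok & is_epithelial) | is_tcell
ungrouped = donor_ok & is_epithelial | is_tcell

# With parentheses: the donor condition applies to both alternatives.
grouped = donor_ok & (is_epithelial | is_tcell)

print(ungrouped, grouped)  # True False
```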
To wrap up, there are a few things to note:
NOTE:
SpatialElements are filtered, and when the elements are not labels, the components within those elements are also filtered thanks to the parameter how="right".
In the case above we have a Labels element, so when we filter by the obs table and get a subset of the Labels SpatialElement, the individual segmentation masks are not modified; they are exactly the same masks as in the original Labels SpatialElement.
A layer of a given AnnData table can be used by specifying the layer parameter in the filter_by_table_query function.
You can use either the method or the function; they behave exactly the same.