Advanced probeset selection

In this tutorial we look into customizing the gene set selection for more specific use cases and explore the parameters of the selection procedure.

[3]:
../_images/_tutorials_spapros_tutorial_advanced_selection_3_0.png

The Spapros probeset selection pipeline simultaneously optimizes for capturing transcriptional variation and cell type identification while taking into account prior knowledge, technical constraints and probe design. The main steps, as visualized in the figure above, are:

Constrains

In the first step, genes are filtered based on technology specific probe design constraints. This part is captured in the end-to-end probeset selection tutorial. (Besides probe design constraints also expression level constraints can be applied as shown in the Selection with experession constraints example below. Those affect multiple steps of the pipeline though)

Variable genes

The next step is a PCA based selection of genes that capture general variation. The subsequent steps of the pipeline are encouraged to select from that set of genes. If you want the selection to be focused only on marker genes and cell type classification, the PCA-based selection step can be omitted, see the Cell type classification only example for this case.

Distinguish celltypes

For each cell type, a binary classification tree is trained on the previously selected gene set. Here the pipeline builds for each cell type a combinatorial rule how to distinguish it from other cell types.

DE gene addition

Similarly to the last step, trees are also trained on DE genes and compared to the trees trained on the PCA based selection. If a DE tree performs better with respect to cell type classification, then DE genes are added to the already selected gene set.

Prior knowledge

There are two possibilities to incorporate prior knowledge. For both, we present use cases in this tutorial:

Probe design

Design the final technology specific probes for the selected genes. As the Constraints step, this part is also only captured in the end-to-end probeset selection tutorial.

Content

  1. Import packages

  2. Load and preprocess data

  3. Use cases

    1. Select a few additional genes

    2. Cell type classification only

    3. Selection for high numbers of genes (> 150)

    4. Selection with curated marker list

    5. Selection with expression constraints

    6. Your use case is not captured?

  4. What’s next

0. Import packages

[4]:
import numpy as np
import scanpy as sc
import spapros as sp
[5]:
sc.settings.verbosity = 0
sc.logging.print_header()
print(f"spapros=={sp.__version__}")
scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.11 pynndescent==0.5.7
spapros==0.1.0

1. Load and preprocess data

For Spapros selections the count data should be log-normalised. Notably genes should not be scaled to mean=0 and std=1. We pre-select a given number of highly variable genes, here 1000. In real world applications we typically go for 8000.

The example data set that we use here has a raw version and a processed version with scaled counts. To get log normalised counts we use the raw version. Cell and gene filters, cell type annotations, and the umap embedding we get from the processed version.

[6]:
adata = sc.datasets.pbmc3k()
adata_tmp = sc.datasets.pbmc3k_processed()

# Get infos from the processed dataset
adata = adata[adata_tmp.obs_names, adata_tmp.var_names]
adata.obs['celltype'] = adata_tmp.obs['louvain']
adata.obsm['X_umap'] = adata_tmp.obsm['X_umap']
del adata_tmp

# Preprocess counts and get highly variable genes
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=1000)

adata

[6]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'celltype'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
    obsm: 'X_umap'

2. Use cases

We will go through some specific use cases for gene selection.

i. Select a few additional genes

In some situations you already selected multiple genes that you want to measure in a targeted spatial transcriptomics experiment but you still have capacity for further genes. For this scenario the preselected_genes parameter was introduced.

Let’s distinguish two scenarios

  1. We preselected 40 genes and want to add 10 (adds up to 50 in total)

  2. We preselected 40 genes and want to add 100 (adds up to 140 in total)

For the 2. scenario simply adding the first 40 genes in preselected_genes and going with default parameters is the right way to go. For the 1. scenario (which we look at in the code below) the parameter n_pca_genes should be reduced. The reason is that Spapros will build trees based on the pre selected genes and the genes from the PCA selection. If there are a lot of PCA genes in the pool, then the trees might ignore most of the preselected genes. This is not extremly problematic, but it’s preferable to build up our combinatorial rules on the pre selected genes. We’ll reduce n_pca_genes from 100 (default) to 20 for the given case. Note however that the preselected_genes can include genes that overlap with the pca genes, therefore we more specifically set n_pca_genes to the number that provides us 20 more pca genes on top of the preselected list.

[7]:
# list of pre selected genes
preselected_genes = [
    'PF4', 'HLA-DPB1', 'FCGR3A', 'GZMB', 'CCL5', 'S100A8', 'IL32', 'HLA-DQA1', 'NKG7', 'AIF1', 'CD79A', 'LTB', 'TYROBP',
    'HLA-DMA', 'GZMK', 'HLA-DRB1', 'FCN1', 'S100A11', 'GNLY', 'GZMH', 'CST3', 'RBM39', 'HLA-DPA1', 'LST1', 'LGALS2',
    'FCER1A', 'MS4A1', 'TRAF3IP3', 'PRF1', 'LGALS1', 'PRDX1', 'ATP5O', 'C1orf162', 'CTSS', 'SELL', 'CTSW', 'GIMAP4',
    'LINC00926', 'MS4A6A', 'GSTP1'
]

# Get `n_pca_genes` to obtain the top 20 pca genes that are not already in preselected_genes
tmp = sp.se.select_pca_genes(adata, n=adata.n_vars, inplace=False)["selection_ranking"]
n_pca_genes = tmp.loc[~tmp.index.isin(preselected_genes)].sort_values().iloc[:20].max().astype(int)

# create an instance of the ProbesetSelector class
selector = sp.se.ProbesetSelector(
    adata,
    n=50,
    n_pca_genes=n_pca_genes,
    preselected_genes=preselected_genes,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
)

# select probe set
selector.select_probeset()

Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().

Now we can investigate the selected gene set. The annotations in the table describe where genes originate from: - the first 40 genes are annotated with pre_selected=True - the added genes from 41 till 50 are annotated with selection=True and pre_selected=False

[8]:
display(selector.probeset.head(4))
display(selector.probeset.iloc[38:42])
display(selector.probeset.iloc[48:52])
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
PF4 1 True 1.0 1.0 1.0 1.000000 0.810399 True False False Megakaryocytes Megakaryocytes Megakaryocytes False True False
CD79A 2 True 1.0 1.0 1.0 0.895466 1.226536 True False False B cells B cells B cells False True False
CCL5 3 True 1.0 1.0 1.0 0.545002 2.553322 True False True CD8 T cells CD8 T cells CD8 T cells False True False
GZMB 4 True 1.0 1.0 1.0 0.530616 1.252898 True False False NK cells NK cells NK cells False True False
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
MS4A6A 39 True 1.0 NaN 3.0 0.003518 0.481730 True False False False False False
GIMAP4 40 True 1.0 NaN 3.0 0.001383 2.422619 True False True False False False
GPX1 41 True 2.0 2.0 1.0 0.002822 2.025547 False False True Megakaryocytes Megakaryocytes Megakaryocytes False True False
FCER1G 42 True 2.0 NaN 1.0 0.010767 1.324965 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes False False False
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
CD79B 49 True 3.0 NaN 3.0 0.022714 1.304566 False False True B cells B cells B cells False False False
ARL6IP5 50 True 4.0 NaN 4.0 0.016238 2.319210 False False True False False False
STK17A 51 False 5.0 NaN 5.0 0.006099 1.412628 False False True False False False
GIMAP7 52 False 5.0 NaN 5.0 0.001398 2.961351 False False True CD4 T cells CD4 T cells CD4 T cells False False False

Note that besides the preselected_genes parameter there is also the prior_genes parameter (for “prioritized genes”). This one acts similar to preselected_genes, however genes in prior_genes are not necessarily selected, they’re only added to the pool of PCA preselected genes from which we build trees.

ii. Cell type classification only

Spapros optimizes for two objectives: 1. cell type identification 2. recovery of transcriptomic variation

The procedure prioritzes the 1st objective over the 2nd. The variation recovery also leads to the selection of genes that show gradients in multiple cell types (see the example of FOS in our preprint, Fig. 3). In classical gene panel selection approaches the focus lies on marker genes, ideally markers that are specific to single cell types. In the downstream analysis of spatial data genes like FOS are not useful for identifying the major cell types, these genes should even be excluded from the clustering to get better separated cell type clusters. If you don’t want to select genes like FOS and only focus on the first objective you can skip the PCA-based selection by setting n_pca_genes=0. In that case only trees on DE genes are trained, the PCA related steps are neglected.

[9]:
# create an instance of the ProbesetSelector class
selector = sp.se.ProbesetSelector(
    adata,
    n=30,
    n_pca_genes=0,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
)
Note: No PCA selection will be performed since n_pca_genes = 0. The selected genes will only be based on the DE forests. In that case it can happen that fewer than n = 30 genes are selected. To get n = 30 exclude the selected genes of the first run from adata, rerun the method, and combine the results of the two runs.

Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().
[10]:
# select probe set
selector.select_probeset()
[11]:
display(selector.probeset[selector.probeset["selection"]].head(3))
display(selector.probeset[selector.probeset["selection"]].tail(3))
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
PF4 1 True 1.0 1.0 1.0 1.000000 0 False False False Megakaryocytes Megakaryocytes Megakaryocytes False True False
CD79A 2 True 1.0 1.0 1.0 0.914911 0 False False False B cells B cells B cells False True False
S100A8 3 True 1.0 1.0 1.0 0.831315 0 False False False CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False True False
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
FGFBP2 28 True 1.0 NaN 1.0 0.009360 0 False False False NK cells NK cells NK cells False False False
HLA-DQB1 29 True 1.0 NaN 1.0 0.008832 0 False False False Dendritic cells Dendritic cells Dendritic cells False False False
CTSW 30 True 1.0 NaN 1.0 0.008201 0 False False False CD8 T cells CD8 T cells CD8 T cells False False False

iii. Selection for high numbers of genes (> 150)

Spapros works well over different ranges of numbers of genes to select. However, for higher numbers of genes (~ > 150, can be lower/higher for lower/higher nr of cell types) a dual selection can make sense. This is e.g. relevant for protocols like MERFISH, and also if you want to increase the robustness on the cell type classification objective (less focus on recovery of variation that is independent of cell type labels).

The idea is to run a selection first (we do not fix n and we do not use PCA selected genes for that run), exclude the genes of that selection, run a second (standard) selection, and combine the results of both runs.

[12]:
# First run

# create an instance of the ProbesetSelector class
selector = sp.se.ProbesetSelector(
    adata,
    n=None,
    n_pca_genes=0,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
)

# select probe set
selector.select_probeset()

Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().
[13]:
# Save genes of the first selection
first_selection = selector.probeset[selector.probeset["selection"]].index.tolist()
n_genes_first_selection = len(first_selection)
print(n_genes_first_selection)

33

As the number of cell types is quite low in our example dataset the first selection only lead to 33 genes. Next we select the rest of genes to get to 150.

[14]:
# Second run

# create an instance of the ProbesetSelector class
selector2 = sp.se.ProbesetSelector(
    adata[:,[v for v in adata.var_names if v not in first_selection]],
    n=150 - n_genes_first_selection,
    n_pca_genes=100,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
)

# select probe set
selector2.select_probeset()
Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().
[15]:
second_selection = selector2.probeset[selector2.probeset["selection"]].index.tolist()
final_selection = first_selection + second_selection
print(len(np.unique(final_selection)))
150

This adds up to a final gene set of 150 genes.

For even higher numbers of genes you might want to consider running multiple additional pre selections with n_pca_genes=0.

Side note: unfortunately our plotting functions are only accessible with each individual selector and not the combination of multiple selectors.

iv. Selection with curated marker list

Often you already have some curated list of markers. Such a list can be added as prior knowledge to achieve the following aims

  • make sure that the signal of the markers in the list are captured by the selected probe set otherwise add markers from the list

  • add markers for cell types that don’t occur in adata.obs[celltype_key]

The criteria how markers are added or how the “signal is captured” is described in the following. Note however that if you just want to add certain genes you can also add them with the preselected_genes (see i.) parameter.

Let’s group the cell types in the marker_list into:

  • ct_group1: cell types that also occur in adata.obs[celltype_key]

  • ct_group2: cell types that do not occur in adata.obs[celltype_key]

The key parameters of the ProbesetSelector that define how marker signals are checked for and if markers should be added are:

  • n_list_markers

  • marker_corr_th

You can set n_list_markers to an int, then the method will take care to have captured at least n_list_markers per cell type. Note however that the ct_group1 cell types will also already be captured by the genes of the trained trees. Therefore it can make sense to have different numbers for the two groups, which is achieved by setting e.g. n_list_markers = {'adata_celltypes':2, 'list_celltypes':3}. As mentioned above the markers for ct_group1 and ct_group2 are handeled differently. For ct_group1 the method checks if it already selected n_list_markers that have correlations above marker_corr_th (default: 0.5) with at least n_list_markers for a given cell type, otherwise markers from the list are added. For ct_group2 the correlation criteria doesn’t make much sense since the correlation is calculated on the given cell types in adata and those shouldn’t express a marker for a cell type of ct_group2. Therefore from ct_group2 n_list_markers markers are just added.

Note that there is also the n_min_markers parameter. This one makes sure that enough genes either from DE tests during the selection or (if given) from the marker_list are selected per cell type of ct_group1. So n_min_markers also applies if no marker_list is given.

Let’s define an example marker list. We added 'cell type A' as a ct_group2 (i.e. not in adata.obs[celltype_key]) cell type example.

[16]:
marker_list = {
    'CD4 T cells': ['IL7R'],
    'CD14+ Monocytes': ['CD14', 'LYZ'],
    'B cells': ['MS4A1'],
    'CD8 T cells': ['CD8A'],
    'NK cells': ['GNLY', 'NKG7'],
    'FCGR3A+ Monocytes': ['FCGR3A', 'MS4A7'],
    'Dendritic cells': ['FCER1A', 'CST3'],
    'Megakaryocytes': ['NAPA-AS1', 'PPBP'],
    'cell type A': ["Gene X", "Gene Y", "CAMK2N1"],
}
[17]:
# create an instance of the ProbesetSelector class
selector = sp.se.ProbesetSelector(
    adata,
    n=None,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
    marker_list=marker_list,
    n_list_markers={'adata_celltypes': 2, 'list_celltypes': 3},
)

# select probe set
selector.select_probeset()

Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().
[18]:
selector.probeset[selector.probeset["selection"]]
[18]:
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
FCGR3A 1 True 1.0 1.0 1.0 0.612390 1.321373 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes,FCGR3A+ Monocytes False True False
NKG7 2 True 1.0 1.0 1.0 0.248463 1.658470 False False True NK cells,CD8 T cells NK cells,CD8 T cells NK cells,CD8 T cells,NK cells False True False
CST3 3 True 1.0 1.0 1.0 0.056431 1.557674 False False True CD14+ Monocytes,Dendritic cells CD14+ Monocytes,Dendritic cells CD14+ Monocytes,Dendritic cells,Dendritic cells False True False
MS4A1 4 True 1.0 1.0 1.0 0.012243 0.814808 False False True B cells False True False
HLA-DPB1 5 True 1.0 2.0 1.0 0.649235 1.997253 False False True B cells,Dendritic cells B cells,Dendritic cells B cells,Dendritic cells False True False
CCL5 6 True 1.0 2.0 1.0 0.478418 2.553322 False False True CD8 T cells CD8 T cells CD8 T cells False True False
IL32 7 True 1.0 2.0 1.0 0.185111 3.377428 False False True CD4 T cells CD4 T cells CD4 T cells False True False
GNLY 8 True 1.0 2.0 1.0 0.075589 2.167588 False False True NK cells NK cells NK cells,NK cells False True False
FCER1A 9 True 1.0 2.0 1.0 0.018118 0.298179 False False False Dendritic cells Dendritic cells Dendritic cells,Dendritic cells False True False
PF4 10 True 1.0 NaN 1.0 1.000000 0.810399 False False True Megakaryocytes Megakaryocytes Megakaryocytes False False False
HLA-DQA1 11 True 1.0 NaN 1.0 0.562045 0.755408 False False True Dendritic cells Dendritic cells Dendritic cells False False False
GZMB 12 True 1.0 NaN 1.0 0.512924 1.252898 False False True NK cells NK cells NK cells False False False
AIF1 13 True 1.0 NaN 1.0 0.238008 1.527263 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes False False False
CD79A 14 True 1.0 NaN 1.0 0.222419 1.226536 False False True B cells B cells B cells False False False
S100A8 15 True 1.0 NaN 1.0 0.203852 2.254612 False False True CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False False False
HLA-DMA 16 True 1.0 NaN 1.0 0.167341 0.694292 False False True False False False
LTB 17 True 1.0 NaN 1.0 0.139606 3.029459 False False True CD4 T cells CD4 T cells CD4 T cells False False False
GZMK 18 True 1.0 NaN 1.0 0.124733 1.680233 False False True CD8 T cells CD8 T cells CD8 T cells False False False
HLA-DRB1 19 True 1.0 NaN 1.0 0.097574 1.538738 False False True False False False
TYROBP 20 True 1.0 NaN 1.0 0.093040 1.352422 False False True CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False False False
FCN1 21 True 1.0 NaN 1.0 0.088904 1.070598 False False True CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False False False
S100A11 22 True 1.0 NaN 1.0 0.078515 2.468161 False False True False False False
GZMH 23 True 1.0 NaN 1.0 0.058625 1.264788 False False True False False False
RBM39 24 True 1.0 NaN 1.0 0.036062 0.657891 False False True False False False
HLA-DPA1 25 True 1.0 NaN 1.0 0.024234 1.429546 False False True Dendritic cells Dendritic cells Dendritic cells False False False
LST1 26 True 1.0 NaN 1.0 0.023708 1.263540 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes False False False
LGALS2 27 True 1.0 NaN 1.0 0.023468 1.126618 False False True CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False False False
TRAF3IP3 28 True 1.0 NaN 1.0 0.011468 2.615410 False False True False False False
PRF1 29 True 1.0 NaN 1.0 0.010033 0.807865 False False True NK cells NK cells NK cells False False False
LGALS1 30 True 1.0 NaN 1.0 0.009938 2.895099 False False True False False False
PRDX1 31 True 1.0 NaN 1.0 0.009259 0.625947 False False True False False False
ATP5O 32 True 1.0 NaN 1.0 0.007816 2.455435 False False True False False False
C1orf162 33 True 1.0 NaN 1.0 0.007754 0.889644 False False True False False False
CTSS 34 True 1.0 NaN 1.0 0.007547 2.285013 False False True CD14+ Monocytes CD14+ Monocytes CD14+ Monocytes False False False
SELL 35 True 1.0 NaN 1.0 0.007499 2.800353 False False True False False False
CTSW 36 True 1.0 NaN 1.0 0.006245 1.300789 False False True CD8 T cells CD8 T cells CD8 T cells False False False
GIMAP4 37 True 1.0 NaN 1.0 0.005317 2.422619 False False True False False False
LINC00926 38 True 1.0 NaN 1.0 0.004536 0.563208 False False True False False False
MS4A6A 39 True 1.0 NaN 1.0 0.004139 0.481730 False False True False False False
GSTP1 40 True 1.0 NaN 1.0 0.003561 1.483085 False False True False False False
PDIA3 41 True 1.0 NaN 1.0 0.003147 0.749059 False False True False False False
HOPX 42 True 1.0 NaN 1.0 0.003119 0.910448 False False True False False False
ERH 43 True 1.0 NaN 1.0 0.002811 0.586232 False False True False False False
COTL1 44 True 1.0 NaN 1.0 0.000730 2.993346 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes False False False
CAMK2N1 45 True 2.0 1.0 NaN NaN 0.029484 False False False cell type A True True False
Gene X 46 True 2.0 2.0 NaN NaN 0.000000 False False False cell type A True True False
Gene Y 47 True 2.0 3.0 NaN NaN 0.000000 False False False cell type A True True False
PPBP 48 True 3.0 1.0 2.0 1.000000 1.163331 False False True Megakaryocytes Megakaryocytes Megakaryocytes,Megakaryocytes False True False
CD14 49 True 3.0 1.0 NaN NaN 0.000000 False False False CD14+ Monocytes False True False
CD8A 50 True 3.0 1.0 NaN NaN 0.000000 False False False CD8 T cells False True False
IL7R 51 True 3.0 1.0 NaN NaN 0.000000 False False False CD4 T cells False True False
NAPA-AS1 52 True 4.0 2.0 NaN NaN 0.035167 False False False Megakaryocytes False True False
LYZ 53 True 4.0 2.0 NaN NaN 0.000000 False False False CD14+ Monocytes False True False
MS4A7 54 True 4.0 2.0 NaN NaN 0.000000 False False False FCGR3A+ Monocytes False True False

The required markers that don’t occur in the trees (tree_rank: NaN) are added to fulfill the n_list_markers criteria. Note that the markers for cell type A are ranked higher compared to the markers of the other cell types as those cell types are already captured by the tree_rank=1 genes.

v. Selection with expression constraints

Why epxression constraints?

For different technologies there can be different constraints on the expression levels of genes, e.g. to prevent optical crowding for methods with combinatorial barcoding (MERFISH, ISS, ..).

How expression constraints work

We support expression constraints by setting thresholds for gene expression levels. In practice it is really difficult to set exact thresholds, therefore we introduce smoothed penalty functions that penalize genes based on their expression. I.e. if a gene is just a bit above the threshold but is very important for the selection it will still be selected.

In general you can apply any function to penalize genes or just assign values to genes. The value for each gene needs to be given in adata.obs[penalty_key] and should be in [0,1] (0: maximal penalty, 1: no penalty). During the selection the penalties apply at 3 steps:

  • PCA-based selection - for each gene the pca score is multipled with the penalty value

  • DE-based selection - for each gene the DE score is multipled with the penalty value

  • adding markers from a curated list (marker_list) - filter genes that have a penalty value below 1 (the threshold can be adjusted via marker_selection_hparams['penalty_threshold']))

High and low expression constraints

We will construct two constraints: one for too highly expressed genes and one for too lowly expressed genes. We will base them on quantiles of the genes’ expression distributions.

  • we consider a gene as too highly expressed if at least 1% of cells are above the threshold, i.e. the 0.99 quantile > upper_threshold

  • we consider a gene as too lowly expressed if at least 10% of cells that do express the gene are below the threshold, i.e. the 0.9 quantile over cells that express the gene < lower_threshold

How to construct the constraints

  • First you need to find expression level thresholds from experimental experience. Ideally you have spatial measurements of some reference genes that are too highly (or too lowly) expressed (good candidates are abundant genes like MALAT1). Then you can estimate the thresholds based on the expression of those reference genes in a matched scRNA-seq data set. In our example we go with the tresholds 1 and 2 based on log-normalised data.

  • With the thresholds we construct a smoothed penalty function which needs to be calibrated by testing it on example selections.

[57]:
# Set thresholds
lower_th = 1.0
upper_th = 2

# Calculate quantiles
sp.ut.get_expression_quantile(adata, q=0.99, normalise=False, log1p=False, zeros_to_nan=False)
sp.ut.get_expression_quantile(adata, q=0.9, normalise=False, log1p=False, zeros_to_nan=True)
[69]:
# Calibrate constraints #

# Try different factors (also note that we fixed different base variances 0.1 and 0.5 for lower and upper penalties)
FACTORS = [0.1,1.0,2.0,10]

penalty_fcts = {}
test_selections = {}
for f in FACTORS:
    # Get penalty functions for given factor
    penalty_fcts[f"lower_{f}"] = sp.ut.plateau_penalty_kernel(var=0.1 * f, x_min=lower_th, x_max=None)
    penalty_fcts[f"upper_{f}"] = sp.ut.plateau_penalty_kernel(var=0.5*f, x_min=None, x_max=upper_th)
    # Calculate each gene's penalty value
    adata.var[f"expr_penalty_lower_{f}"] = penalty_fcts[f"lower_{f}"](adata.var['quantile_0.9 expr > 0'])
    adata.var[f"expr_penalty_upper_{f}"] = penalty_fcts[f"upper_{f}"](adata.var['quantile_0.99'])
    # PCA and DE selections with penalties
    penalty_keys = [f"expr_penalty_lower_{f}",f"expr_penalty_upper_{f}"]
    test_selections[f"PCA_{f}"] = sp.se.select_pca_genes(adata, n=100, penalty_keys=penalty_keys, inplace=False)
    test_selections[f"DE_{f}"] = sp.se.select_DE_genes(adata, n=100, obs_key="celltype", penalty_keys=penalty_keys, inplace=False)

# Plot selections for different factors
for method in ["DE", "PCA"]:
    print(f"### {method} ###")
    for f in FACTORS:
        used_penalties = {f"lower_{f}": penalty_fcts[f"lower_{f}"], f"upper_{f}": penalty_fcts[f"upper_{f}"]}
        sp.pl.selection_histogram(
            adata,
            {f"{method}_{f}": test_selections[f"{method}_{f}"]},
            x_axis_keys={f"lower_{f}": 'quantile_0.9 expr > 0', f"upper_{f}": 'quantile_0.99'},
            background_key="all",
            penalty_kernels={f"{method}_{f}": used_penalties},
        )

### DE ###
../_images/_tutorials_spapros_tutorial_advanced_selection_37_1.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_2.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_3.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_4.png
### PCA ###
../_images/_tutorials_spapros_tutorial_advanced_selection_37_6.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_7.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_8.png
../_images/_tutorials_spapros_tutorial_advanced_selection_37_9.png

In this setting we choose the factor 0.1 to not select too many highly expressed genes. Lowly expressed genes are not selected anyway for the given selection thresholds.

We apply the expression penalties on multiple selection steps (see below). The “m_penalties<…>” would not be necessary since we don’t provide a marker list. For the case that a marker list is given the penalties should be applied as below: no lower penalty for m_penalties_list_celltypes since we don’t expect cells from that cell type expressing those markers in the given data set.

[70]:
# Run selection with the calibrated expession constraints

FACTOR = 0.1

adata.var["expr_penalty_lower"] = adata.var[f"expr_penalty_lower_{FACTOR}"]
adata.var["expr_penalty_upper"] = adata.var[f"expr_penalty_upper_{FACTOR}"]

# create an instance of the ProbesetSelector class
selector = sp.se.ProbesetSelector(
    adata,
    n=None,
    celltype_key="celltype",
    verbosity=1,
    save_dir=None,
    pca_penalties=["expr_penalty_lower", "expr_penalty_upper"],
    DE_penalties=["expr_penalty_lower", "expr_penalty_upper"],
    m_penalties_adata_celltypes=["expr_penalty_lower", "expr_penalty_upper"],
    m_penalties_list_celltypes=["expr_penalty_upper"],
)


# select probe set
selector.select_probeset()

Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().

We can then investigate if the genes selected with spapros are within he expression constraints: (the selector.plot_histogram fct probably fails in spapros version 0.1.0, this will be fixed with the next release)

[75]:
selector.plot_histogram(
    selections = ["pca", "DE"],
    x_axis_keys={"expr_penalty_lower": 'quantile_0.9 expr > 0', "expr_penalty_upper": 'quantile_0.99'},
    unapplied_penalty_keys={}
)

../_images/_tutorials_spapros_tutorial_advanced_selection_41_0.png

vi. Your use case is not captured?

If you face a special use case and you’re unsure if the described schemes would fit to this one, please raise an issue at our github issue section and describe the use case.

3. What’s next?