Probeset selection with Spapros

In this tutorial we showcase how to select a gene set with Spapros.

Based on a scRNA-seq reference dataset Spapros selects genes that can distinguish the cell types in that data set and capture transcriptomic varation beyond cell type labels.

[2]:
../_images/_tutorials_spapros_tutorial_basic_selection_2_0.png

In this tutorial we will use the default selection settings only. Check out the advanced selection tutorial for adjustments for specific use cases and the end-to-end probe set selection tutorial which includes the probe design.

Import packages

[3]:
import scanpy as sc
import spapros as sp
[4]:
sc.settings.verbosity = 0
sc.logging.print_header()
print(f"spapros=={sp.__version__}")
scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.11 pynndescent==0.5.7
spapros==0.1.0

Load and preprocess data

For Spapros selections the count data should be log-normalised. Notably genes should not be scaled to mean=0 and std=1. We pre-select a given number of highly variable genes, here 1000. In real world applications we typically go for 8000.

The example data set that we use here has a raw version and a processed version with scaled counts. To get log normalised counts we use the raw version. Cell and gene filters, cell type annotations, and the umap embedding we get from the processed version.

[5]:
adata = sc.datasets.pbmc3k()
adata_tmp = sc.datasets.pbmc3k_processed()

# Get infos from the processed dataset
adata = adata[adata_tmp.obs_names,adata_tmp.var_names]
adata.obs['celltype'] = adata_tmp.obs['louvain']
adata.obsm['X_umap'] = adata_tmp.obsm['X_umap']
del adata_tmp

# Preprocess counts and get highly variable genes
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata,flavor="cell_ranger",n_top_genes=1000)

adata
[5]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'celltype'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
    obsm: 'X_umap'
[6]:
sc.pl.umap(adata,color="celltype")
../_images/_tutorials_spapros_tutorial_basic_selection_9_0.png

Run selection

First we create an instance of the ProbesetSelector class. We specify the number of genes n (20) and the key in adata.obs where we find the cell type annotations (celltype_key="celltype"). Note that you can specify a save_dir to save results during selection and reload them next time a ProbesetSelector with the given save_dir is instantiated.

[7]:
selector = sp.se.ProbesetSelector(adata, n=20, celltype_key="celltype", verbosity=1, save_dir=None)
Note: The following celltypes' test set sizes for forest training are below min_test_n (=20):
         Dendritic cells : 9
         Megakaryocytes  : 3
The genes selected for those cell types potentially don't generalize well. Find the genes for each of those cell types in self.genes_of_primary_trees after running self.select_probeset().

We get a warning that the cell type clusters for dendritic cells and megakaroycytes are quite small and therefore the genes that are selected to identify these cell types potentially don’t generalize very well. The method won’t exclude these cell types automatically, it could be done manually though by setting the parameter celltypes to a subset of cell types instead of celltypes="all".

Run the central method for probe set selection. (Takes ~5 min for the given example. Note that the progress bars should be properly displayed with jupyter notebook, however jupyter lab or e.g. vscode can lead to incomplete display)

[8]:
selector.select_probeset()
The selected probeset can be examined like this:
(See the ProbesetSelector.probeset attribute for a detailed description of the output table)
[9]:
selector.probeset
[9]:
gene_nr selection rank marker_rank tree_rank importance_score pca_score pre_selected prior_selected pca_selected celltypes_DE_1vsall celltypes_DE_specific celltypes_DE celltypes_marker list_only_ct_marker required_marker required_list_marker
PF4 1 True 1.0 1.0 1.0 1.000000 0.810399 False False True Megakaryocytes Megakaryocytes Megakaryocytes False True False
HLA-DPB1 2 True 1.0 1.0 1.0 0.649235 1.997253 False False True B cells,Dendritic cells B cells,Dendritic cells B cells,Dendritic cells False True False
FCGR3A 3 True 1.0 1.0 1.0 0.612390 1.321373 False False True FCGR3A+ Monocytes FCGR3A+ Monocytes FCGR3A+ Monocytes False True False
GZMB 4 True 1.0 1.0 1.0 0.512924 1.252898 False False True NK cells NK cells NK cells False True False
CCL5 5 True 1.0 1.0 1.0 0.478418 2.553322 False False True CD8 T cells CD8 T cells CD8 T cells False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ARHGAP19 996 False NaN NaN NaN NaN 0.016906 False False False False False False
SLC22A4 997 False NaN NaN NaN NaN 0.016718 False False False False False False
DRAXIN 998 False NaN NaN NaN NaN 0.016547 False False False False False False
C1QC 999 False NaN NaN NaN NaN 0.014339 False False False False False False
PARS2 1000 False NaN NaN NaN NaN 0.012845 False False False False False False

1000 rows × 17 columns

The list of selected genes can be retrieved with

[10]:
selector.probeset.index[selector.probeset.selection]
[10]:
Index(['PF4', 'HLA-DPB1', 'FCGR3A', 'GZMB', 'CCL5', 'S100A8', 'IL32',
       'HLA-DQA1', 'NKG7', 'AIF1', 'CD79A', 'LTB', 'TYROBP', 'HLA-DMA', 'GZMK',
       'HLA-DRB1', 'FCN1', 'S100A11', 'GNLY', 'GZMH'],
      dtype='object')

Vizualize and investigate the results

Let’s first plot the mean expression in each cell type cluster with some additional information for the selected genes.

[11]:
sp.pl.masked_dotplot(adata, selector)
../_images/_tutorials_spapros_tutorial_basic_selection_21_0.png

For each cell type the Spapros markers are those that build a combinatorial rule to classify the cell type. Often you find cell type specific markers, and negative markers. When selecting genes for high numbers of cell types including similar clusters the rules tend to show more sophisticated combinatorics.

The Spapros selection procedure consists of multiple steps, we can also investigate how many genes originate from the different steps:

[12]:
selector.plot_gene_overlap()
../_images/_tutorials_spapros_tutorial_basic_selection_24_0.png
  • All finally selected 20 genes occured in the PCA-based pre-selection which selects genes that capture a high amount of variation

  • they strongly overlap with the (16) differentially expressed genes since marker genes are one source of high variation in transcriptomics data. The overlap is especially high for datasets with only a few cell type clusters that are also well separated as in the given example.

  • The first DE tests to train optimal classification trees are tests between each cell type and all other cell types as background (DE_1vsall). To further optimise the trees we run DE tests between cell type clusters that are more difficult to distinguish (DE_specific). We find 4 genes that originate from the specific tests that didn’t occur in the 1 vs all tests.

What’s next?

You can now evaluate the selected probeset. See our basic evaluation tutorial.

If you want to customize and improve the probeset selection, have a look at the advanced selection tutorial.

Check out the end-to-end selection tutorial for the full pipline that includes probe design and gene panel selection.