Evaluation of probe sets

Spapros provides metrics to evaluate the quality of a gene panel. For evaluation, we mainly focus on 2 metric categories: general variation recovery and cell type classification. Additionally as a 3rd category we measure gene redundancy based on correlation between gene pairs. In each category we further distinguish between metrics that measure different aspects: - Cell type identification metrics measure cell type classification accuracy and the percentage of captured cell types (and additionally - if a curated marker list is provided - how well the marker expressions of a literature derived list are captured, via marker correlation and cell type balanced marker correlation). - Variation recovery metrics measure how well cellular variation of the full transcriptome is recovered with only a subset of features. These comprise coarse and fine clustering similarity, which quantify how well the gene set recovers cluster structure at different levels of granularity, and neighborhood similarity, which measures how well the local cell neighborhoods are preserved. - The amount of redundant genes in the gene set is assessed via gene correlation and the percentage of highly correlated genes

The evaluation pipeline measures summary metrics for each category. We can also get more detailed information (e.g. classification accuracy of each individual cell type). These fine grained evaluations are showcased in the advanced evaluation tutorial.

In this tutorial we show how to compare different gene sets based on the calculation of summary metrics. We examplary evaluate the probeset of 20 genes that was selected in our basic selection tutoral.

The following figure shows how the comparison of multiple selections can look like. As mentioned we’ll concentrate on the three categories: variation recovery (green), cell type classification (violet), and gene redundancy (blue).

Import packages and setup

[4]:
import scanpy as sc
import spapros as sp
[5]:
sc.settings.verbosity = 1
sc.logging.print_header()
print(f"spapros=={sp.__version__}")
scanpy==1.10.0 anndata==0.10.6 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==1.5.3 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11
spapros==0.1.5

Load dataset

[6]:
adata = sc.datasets.pbmc3k()
adata_tmp = sc.datasets.pbmc3k_processed()

# Get infos from the processed dataset
adata = adata[adata_tmp.obs_names, adata_tmp.var_names]
adata.obs['celltype'] = adata_tmp.obs['louvain']
adata.obsm['X_umap'] = adata_tmp.obsm['X_umap']
del adata_tmp

# Preprocess counts and get highly variable genes
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=1000)

adata
[6]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'celltype'
    var: 'gene_ids', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
    obsm: 'X_umap'

Get probesets

[7]:
# Probeset (selected with Spapros, see basic selection tutorial)
probeset = [
    'PF4', 'HLA-DPB1', 'FCGR3A', 'GZMB', 'CCL5', 'S100A8', 'IL32', 'HLA-DQA1', 'NKG7', 'AIF1', 'CD79A', 'LTB', 'TYROBP',
    'HLA-DMA', 'GZMK', 'HLA-DRB1', 'FCN1', 'S100A11', 'GNLY', 'GZMH'
]

# Reference probesets
reference_sets = sp.se.select_reference_probesets(adata, n=20)

Evaluate sets

[8]:
evaluator = sp.ev.ProbesetEvaluator(adata, verbosity=2, results_dir=None)
[9]:
evaluator.evaluate_probeset(probeset, set_id="Spapros")
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
The following cell types are not included in forest classifications since they have fewer 
than 40 cells: ['Dendritic cells', 'Megakaryocytes']
[10]:
for set_id, df in reference_sets.items():
    gene_set = df[df["selection"]].index.to_list()
    evaluator.evaluate_probeset(gene_set, set_id=set_id)

The following cell types are not included in forest classifications since they have fewer 
than 40 cells: ['Dendritic cells', 'Megakaryocytes']
The following cell types are not included in forest classifications since they have fewer 
than 40 cells: ['Dendritic cells', 'Megakaryocytes']
The following cell types are not included in forest classifications since they have fewer 
than 40 cells: ['Dendritic cells', 'Megakaryocytes']
The following cell types are not included in forest classifications since they have fewer 
than 40 cells: ['Dendritic cells', 'Megakaryocytes']
[11]:
evaluator.summary_results

[11]:
knn_overlap mean_overlap_AUC forest_clfs accuracy forest_clfs perct acc > 0.8 gene_corr 1 - mean gene_corr perct max < 0.8
DE 0.160027 0.916620 0.955688 0.757868 0.792178
random (seed=0) 0.023879 0.575532 0.007778 0.976067 1.000000
Spapros 0.167580 0.923499 0.987302 0.772479 0.993647
PCA 0.293958 0.887955 0.897735 0.820372 1.000000
HVG 0.056358 0.819195 0.720651 0.840755 0.900000

Visualize the results

[12]:
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 60
../_images/_tutorials_spapros_tutorial_basic_evaluation_18_0.png

In the generated summary table we see that the selection based on PCA shows the highest (fine level) variation recovery and Spapros shows the highest cell type classification accuracy.

Note that we didn’t measure all metrics in this tutorial since the ProbesetEvaluator sets as a default argument scheme='quick'. In the advanced evaluation tutorial we showcase how to adjust the evaluation and then go into more detailed comparisons.

Note: After running the spapros evaluation pipeline, the results are stored in a directory, that is by default called probeset_evaluation. If you run another evaluation after this tutorial, be careful that you either specify another dir or delete the previously created results directory because otherwise parts will be overwritten and parts will be falsely reused! If you do not want to save results, initialize the ProbesetEvaluator with dir=None, like in this tutorial.