spapros.ev.ProbesetEvaluator

class spapros.ev.ProbesetEvaluator(adata, celltype_key='celltype', results_dir='./probeset_evaluation/', scheme='quick', metrics=None, metrics_params={}, marker_list=None, reference_name='adata1', reference_dir=None, verbosity=1, n_jobs=-1)

General class for probe set evaluation, comparison, and plotting.

Notes

The evaluator works on one given dataset and calculates metrics/analyses with respect to that dataset.

The calculation steps of the metrics can be divided into:

  1. calculations that need to be run only once for the given dataset (not all metrics have this step)

  2. calculations that need to be run for each probe set

    a. calculations independent of step 1

    b. calculations dependent on step 1 (if step 1 exists for the given metric)

  3. summarizing the results into summary statistics

Run evaluations

Evaluate a single probeset:

evaluator = ProbesetEvaluator(adata)
evaluator.evaluate_probeset(gene_set)
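
The computed results are also kept on the evaluator instance and can be inspected directly (attribute names as listed under Attributes below):

evaluator.summary_results   # table of summary statistics
evaluator.results           # probe set specific metric results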

In a pipeline to evaluate multiple probesets you would run:

  • sequential setup:

    evaluator = ProbesetEvaluator(adata)
    for i, gene_set in enumerate(sets):
        evaluator.evaluate_probeset(gene_set, set_id=f"set_{i}")
    
  • parallelised setup (see the multiprocessing sketch after this list):

    evaluator = ProbesetEvaluator(adata)
    # 1. step:
    evaluator.compute_or_load_shared_results()
    # 2. step: parallelised processes
    evaluator.evaluate_probeset(gene_set, set_id, update_summary=False, pre=True) # parallelised over set_ids
    # 3. step: parallelised processes (needs 1. to be finished)
    evaluator.evaluate_probeset(gene_set, set_id, update_summary=False) # parallelised over set_ids
    # 4. step: (needs 3. to be finished)
    evaluator.summary_statistics()
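
How steps 2 and 3 are distributed is up to you, e.g. via a workflow manager or separate jobs. Below is a minimal sketch using Python's built-in multiprocessing, assuming sets is a dict mapping set ids to gene lists and the default fork start method (so workers inherit adata); the helper function and pool size are illustrative:

from multiprocessing import Pool

def evaluate_one(item):
    set_id, gene_set = item
    # each worker uses its own evaluator; results are shared via the files in results_dir
    ev = ProbesetEvaluator(adata)
    ev.evaluate_probeset(gene_set, set_id=set_id, update_summary=False)

shared = ProbesetEvaluator(adata)
shared.compute_or_load_shared_results()               # 1. step
with Pool(processes=4) as pool:                       # 2./3. step, parallelised over set_ids
    pool.map(evaluate_one, list(sets.items()))
shared.summary_statistics(set_ids=list(sets.keys()))  # 4. step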
    

Reference evaluations

In practice, the evaluations are most meaningful when there are reference evaluations to compare against.

A simple way to get reference probe sets:

reference_sets = spapros.selection.select_reference_probesets(adata)

Evaluate them (we also provide ids to keep track of the probesets):

evaluator = ProbesetEvaluator(adata)
for set_id, gene_set in reference_sets.items():
    evaluator.evaluate_probeset(gene_set, set_id=set_id)
evaluator.plot_summary()

Evaluation schemes

Some metrics take very long to compute; therefore we prepared different metric sets for a quick or a full evaluation. You can also specify the list of metrics yourself by setting scheme="custom". Note that in any scheme it might still be reasonable to adjust metrics_params.
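
For example, a custom evaluation restricted to two of the supported metrics could look like this:

evaluator = ProbesetEvaluator(adata, scheme="custom", metrics=["forest_clfs", "gene_corr"])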

Saving of results

If results_dir is not None we save the results in files.

Why:

  • some computations are time-consuming; especially when you evaluate multiple sets, it’s reasonable to keep the results.

  • previous results are loaded when initializing a ProbesetEvaluator, which makes it very easy to access and compare old results.

Two saving directories need to be distinguished:

  1. results_dir: each probeset’s evaluation results are saved here

  2. reference_dir: for shared reference dataset results (default: reference_dir = results_dir + "references")
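
For example, to place probeset results and shared reference results in specific folders (the paths and the reference name below are placeholders):

evaluator = ProbesetEvaluator(
    adata,
    results_dir="./my_evaluation/",
    reference_dir="./my_evaluation/references/",
    reference_name="my_dataset",
)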

In which files the results are saved:

  • Shared computations are saved as:

    reference_dir # (default: results_dir+"references")
    └── {reference_name}_{metric}.csv # shared computations for given reference dataset
    
  • The final probeset specific results are saved as:

    results_dir
    ├── {metric} # one folder for each metric
    │   ├── {reference_name}_{set_id}_pre.csv # pre results file for given set_id, reference dataset, and metric
    │   │                                     # (only for some metrics)
    │   └── {reference_name}_{set_id}.csv # result file for given set_id, reference dataset, and metric
    └── {reference_name}_summary.csv # summary statistics
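
Previously saved results can later be loaded into a fresh evaluator for comparison, e.g. (a sketch; the directory is illustrative and the exact argument format of load_results may differ, see its docstring under Methods):

evaluator = ProbesetEvaluator(adata)
evaluator.load_results(directories=["./probeset_evaluation/"])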
    

Plotting

Plot a summary metrics table to get an overall performance overview:

evaluator.plot_summary()

For each evaluation we provide a detailed plot, e.g.:

  • forest_clfs: heatmap of normalised confusion matrix

  • gene_corr: heatmap of ordered correlation matrix

Create detailed plots with:

evaluator.plot_evaluations()
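
The detailed plots can also be restricted to specific sets and metrics, e.g. (set ids are illustrative):

evaluator.plot_evaluations(set_ids=["set_0", "set_1"], metrics=["forest_clfs", "gene_corr"])
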
Parameters:
  • adata (AnnData) – An already preprocessed annotated data matrix. Typically we use log normalised data.

  • celltype_key (Union[str, List[str]]) – The adata.obs key for cell type annotations. Provide a list of keys to calculate the corresponding metrics on each of them.

  • results_dir (Optional[str]) – Directory where probeset results are saved. Defaults to ./probeset_evaluation/. Set to None if you don’t want to save results. When initializing the class we also check for existing results.

  • scheme (str) –

    Defines which metrics are calculated

    • 'quick': knn, forest classification, marker correlation (if a marker list is given), gene correlation

    • 'full': nmi, knn, forest classification, marker correlation (if a marker list is given), gene correlation

    • 'custom': define the metrics of interest via the metrics argument

  • metrics (Optional[List[str]]) –

    Define which metrics are calculated. This is set automatically if scheme != "custom". Supported are:

    • ’cluster_similarity’

    • ’knn_overlap’

    • ’forest_clfs’

    • ’marker_corr’

    • ’gene_corr’

  • metrics_params (Dict[str, Dict]) –

    Provide parameters for the calculation of each metric. E.g.:

    metrics_params = {
        "nmi":{
            "ns": [5,20],
            "AUC_borders": [[7, 14], [15, 20]],
        }
    }
    

    This overwrites the arguments ns and AUC_borders of the nmi metric. See get_metric_default_parameters() for the default values of each metric (a short sketch of inspecting these defaults follows after this parameter list).

  • marker_list (Union[str, Dict[str, List[str]]]) – Dictionary containing cell types as keys and the respective markers as a list as values (see the sketch after this parameter list).

  • reference_name (str) – Name of reference dataset. This is chosen automatically if None is given.

  • reference_dir (str) – Directory where reference results are saved. If None is given, reference_dir is set to results_dir + 'references/'.

  • verbosity (int) – Verbosity level.

  • n_jobs (int) – Number of CPUs for multiprocessing computations. Set to -1 to use all available CPUs.
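
A short sketch tying some of these arguments together: inspecting the metric defaults mentioned under metrics_params and passing a marker list as a plain dictionary. The import path of get_metric_default_parameters is an assumption (adjust it to your installation), and the cell types and genes are purely illustrative:

from spapros.evaluation import get_metric_default_parameters  # import path assumed

print(get_metric_default_parameters())  # default parameter values of each metric

evaluator = ProbesetEvaluator(
    adata,
    celltype_key="celltype",
    marker_list={"B cells": ["CD79A", "MS4A1"], "T cells": ["CD3D", "CD3E"]},  # illustrative markers
    n_jobs=8,
)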

Attributes:
  • adata – An already preprocessed annotated data matrix. Typically we use log normalised data.

  • celltype_key – The adata.obs key for cell type annotations or list of keys.

  • dir – Directory where probeset results are saved.

  • scheme – Defines which metrics are calculated

  • marker_list – Celltypes and the respective markers.

  • metrics_params – Parameters for the calculation of each metric. Either default or user specified.

  • metrics – The metrics to be calculated. Either custom or defined according to scheme.

  • ref_name – Name of reference dataset.

  • ref_dir – Directory where reference results are saved.

  • verbosity – Verbosity level.

  • n_jobs – Number of CPUs for multiprocessing computations. Set to -1 to use all available CPUs.

  • shared_results – Results of shared metric computations.

  • pre_results – Results of metric pre computations.

  • results – Results of probe set specific metric computations.

  • summary_results – Table of summary statistics.

Methods

compute_or_load_shared_results()

Compute results that are potentially reused for evaluations of different probesets.

evaluate_probeset(genes[, set_id, ...])

Compute probe set specific evaluations.

evaluate_probeset_pipeline(genes, set_id, ...)

Pipeline specific adaption of evaluate_probeset.

load_results([directories, reference_dir, ...])

Load existing results from files of one or multiple evaluation output directories.

pipeline_summary_statistics(result_files, ...)

Adaptation of the function summary_statistics for the spapros-pipeline.

plot_cluster_similarity([set_ids, ...])

Plot cluster similarity as NMI over the number of clusters.

plot_coexpression([set_ids])

Plot heatmaps of gene correlation matrices.

plot_confusion_matrix([set_ids])

Plot heatmaps of cell type classification confusion matrices.

plot_evaluations([set_ids, metrics, show, ...])

Plot detailed results plots for specified metrics.

plot_knn_overlap([set_ids, selections_info])

Plot mean knn overlap over k.

plot_marker_corr(**kwargs)

Plot maximal correlations with marker genes.

plot_summary([set_ids])

Plot heatmap of summary metrics.

summary_statistics(set_ids)

Compute summary statistics and update summary csv.