spapros.ev.ProbesetEvaluator
- class spapros.ev.ProbesetEvaluator(adata, celltype_key='celltype', results_dir='./probeset_evaluation/', scheme='quick', metrics=None, metrics_params={}, marker_list=None, reference_name='adata1', reference_dir=None, verbosity=1, n_jobs=-1)
General class for probe set evaluation, comparison, and plotting.
Notes
The evaluator works on one given dataset and calculates metrics/analyses with respect to that dataset.
The calculation steps of the metrics can be divided into:

1. Calculations that need to be run once for the given dataset (not all metrics have this step).
2. Calculations that need to be run for each probe set:
   - calculations independent of 1.
   - calculations dependent on 1. (if 1. exists for a given metric)
3. Summarizing results into summary statistics.
Run evaluations
Evaluate a single probeset:
evaluator = ProbesetEvaluator(adata)
evaluator.evaluate_probeset(gene_set)
In a pipeline to evaluate multiple probesets you would run
sequential setup:
evaluator = ProbesetEvaluator(adata)
for i, gene_set in enumerate(sets):
    evaluator.evaluate_probeset(gene_set, set_id=f"set_{i}")
parallelised setup:
evaluator = ProbesetEvaluator(adata)

# 1. step:
evaluator.compute_or_load_shared_results()

# 2. step: parallelised processes
evaluator.evaluate_probeset(gene_set, set_id, update_summary=False, pre=True)  # parallelised over set_ids

# 3. step: parallelised processes (needs 1. to be finished)
evaluator.evaluate_probeset(gene_set, set_id, update_summary=False)  # parallelised over set_ids

# 4. step: (needs 3. to be finished)
evaluator.summary_statistics()
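The dependency structure of the four steps above can be sketched with Python's `concurrent.futures`. The spapros calls are replaced here by hypothetical placeholder functions (`pre_step`, `main_step`), so this only illustrates the orchestration, not the real API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real evaluator calls:
#   pre_step  ~ evaluator.evaluate_probeset(..., pre=True)  (step 2)
#   main_step ~ evaluator.evaluate_probeset(...)            (step 3)
def pre_step(set_id):
    return f"{set_id}_pre"

def main_step(set_id):
    return f"{set_id}_result"

def run_pipeline(set_ids):
    # Step 1 would be the shared computations
    # (evaluator.compute_or_load_shared_results()).
    with ThreadPoolExecutor() as pool:
        list(pool.map(pre_step, set_ids))                           # step 2, parallel over set_ids
        results = dict(zip(set_ids, pool.map(main_step, set_ids)))  # step 3, parallel over set_ids
    # Step 4 would summarize the results
    # (evaluator.summary_statistics()).
    return results
```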
Reference evaluations
In practice, the evaluations are meaningful mainly when there are reference evaluations to compare to.
A simple way to get reference probe sets:
reference_sets = spapros.selection.select_reference_probesets(adata)
Evaluate them (we also provide ids to keep track of the probesets):
evaluator = ProbesetEvaluator(adata)
for set_id, gene_set in reference_sets.items():
    evaluator.evaluate_probeset(gene_set, set_id=set_id)
evaluator.plot_summary()
Evaluation schemes
Some metrics take very long to compute, so we prepared different metric sets for a quick or a full evaluation. You can also specify the list of metrics yourself by setting scheme="custom". Note that in any scheme it might still be reasonable to adjust metrics_params.

Saving of results
If results_dir is not None we save the results in files. Why:

- Some computations are time demanding; especially when you evaluate multiple sets it is reasonable to keep results.
- Previous results are loaded when initializing a ProbesetEvaluator, which makes it very easy to access and compare old results.

Two saving directories need to be distinguished:

- results_dir: each probeset's evaluation results are saved here
- reference_dir: for shared reference dataset results (default is reference_dir = results_dir + reference_name)
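The default relationship between the two directories can be written out explicitly; `default_reference_dir` is an illustrative helper for this documentation, not part of the spapros API:

```python
def default_reference_dir(results_dir: str, reference_name: str) -> str:
    # Per the text above: reference_dir = results_dir + reference_name.
    # (Illustrative helper; spapros builds this path internally.)
    return results_dir.rstrip("/") + "/" + reference_name

print(default_reference_dir("./probeset_evaluation/", "adata1"))
# → ./probeset_evaluation/adata1
```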
In which files the results are saved:
Shared computations are saved as:
reference_dir                        # (default: results_dir+"references")
└── {reference_name}_{metric}.csv    # shared computations for given reference dataset
The final probeset specific results are saved as:
results_dir
├── {metric}                              # one folder for each metric
│   ├── {reference_name}_{set_id}_pre.csv    # pre results file for given set_id, reference dataset, and metric
│   │                                        # (only for some metrics)
│   └── {reference_name}_{set_id}.csv        # result file for given set_id, reference dataset, and metric
└── {reference_name}_summary.csv             # summary statistics
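Assuming this layout, the expected per-set file locations can be derived with plain string formatting; `result_paths` is a hypothetical helper for illustration, not a spapros function:

```python
def result_paths(results_dir, reference_name, set_id, metrics):
    # One result file per metric, plus the shared summary file,
    # following the directory tree shown above.
    paths = {m: f"{results_dir}/{m}/{reference_name}_{set_id}.csv" for m in metrics}
    paths["summary"] = f"{results_dir}/{reference_name}_summary.csv"
    return paths
```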
Plotting
Plot a summary metrics table to get an overall performance overview:
evaluator.plot_summary()
For each evaluation we provide a detailed plot, e.g.:
forest_clfs: heatmap of normalised confusion matrix
gene_corr: heatmap of ordered correlation matrix
Create detailed plots with:
evaluator.plot_evaluations()
- Parameters:
adata (AnnData) – An already preprocessed annotated data matrix. Typically we use log normalised data.
celltype_key (Union[str, List[str]]) – The adata.obs key for cell type annotations. Provide a list of keys to calculate the according metrics on multiple keys.
results_dir (Optional[str]) – Directory where probeset results are saved. Defaults to ./probeset_evaluation/. Set to None if you don’t want to save results. When initializing the class we also check for existing results. Note if
scheme (str) –
Defines which metrics are calculated:
’quick’: knn, forest classification, marker correlation (if marker list given), gene correlation
’full’: nmi, knn, forest classification, marker correlation (if marker list given), gene correlation
’custom’: define metrics of interest in metrics
metrics (Optional[List[str]]) –
Define which metrics are calculated. This is set automatically if scheme != "custom". Supported are:
’cluster_similarity’
’knn_overlap’
’forest_clfs’
’marker_corr’
’gene_corr’
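How `scheme` and `metrics` interact can be sketched as a lookup plus a custom branch. The metric lists follow the scheme descriptions above, but the mapping and helper are assumptions made for illustration, not spapros internals:

```python
# Assumed mapping of evaluation schemes to metric names
# (metric names per the supported-metrics list above).
SCHEME_METRICS = {
    "quick": ["knn_overlap", "forest_clfs", "marker_corr", "gene_corr"],
    "full": ["cluster_similarity", "knn_overlap", "forest_clfs",
             "marker_corr", "gene_corr"],
}

def resolve_metrics(scheme, metrics=None):
    # For scheme="custom" the user-provided list is required;
    # otherwise the scheme determines the metrics.
    if scheme == "custom":
        if not metrics:
            raise ValueError("scheme='custom' requires an explicit metrics list")
        return metrics
    return SCHEME_METRICS[scheme]
```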
metrics_params (Dict[str, Dict]) –
Provide parameters for the calculation of each metric. E.g.:
metrics_params = {
    "nmi": {
        "ns": [5, 20],
        "AUC_borders": [[7, 14], [15, 20]],
    }
}
This overwrites the arguments ns and AUC_borders of the nmi metric. See get_metric_default_parameters() for the default values of each metric.
marker_list (Union[str, Dict[str, List[str]]]) – Dictionary containing celltypes as keys and the respective markers as a list as values.
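The override behaviour of `metrics_params` can be pictured as a per-metric dict update; `merge_metric_params` is an illustrative sketch (the text above states that spapros exposes the defaults via `get_metric_default_parameters()`):

```python
def merge_metric_params(defaults, overrides):
    # Start from the per-metric defaults, then overlay any user-specified
    # parameters; defaults not mentioned in the overrides are kept.
    # (Illustrative sketch, not the spapros implementation.)
    merged = {metric: dict(params) for metric, params in defaults.items()}
    for metric, params in overrides.items():
        merged.setdefault(metric, {}).update(params)
    return merged
```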
reference_name (str) – Name of reference dataset. This is chosen automatically if None is given.
reference_dir (str) – Directory where reference results are saved. If None is given, reference_dir is set to results_dir+'reference/'.
verbosity (int) – Verbosity level.
n_jobs (int) – Number of CPUs for multiprocessing computations. Set to -1 to use all available CPUs.
- Attributes:
adata – An already preprocessed annotated data matrix. Typically we use log normalised data.
celltype_key – The adata.obs key for cell type annotations or list of keys.
dir – Directory where probeset results are saved.
scheme – Defines which metrics are calculated
marker_list – Celltypes and the respective markers.
metrics_params – Parameters for the calculation of each metric. Either default or user specified.
metrics – The metrics to be calculated. Either custom or defined according to scheme.
ref_name – Name of reference dataset.
ref_dir – Directory where reference results are saved.
verbosity – Verbosity level.
n_jobs – Number of CPUs for multiprocessing computations. Set to -1 to use all available CPUs.
shared_results – Results of shared metric computations.
pre_results – Results of metric pre computations.
results – Results of probe set specific metric computations.
summary_results – Table of summary statistics.
Methods
compute_or_load_shared_results() – Compute results that are potentially reused for evaluations of different probesets.
evaluate_probeset(genes[, set_id, ...]) – Compute probe set specific evaluations.
evaluate_probeset_pipeline(genes, set_id, ...) – Pipeline specific adaptation of evaluate_probeset.
load_results([directories, reference_dir, ...]) – Load existing results from files of one or multiple evaluation output directories.
pipeline_summary_statistics(result_files, ...) – Adaptation of the function summary_statistics for the spapros-pipeline.
plot_cluster_similarity([set_ids, ...]) – Plot cluster similarity as NMI over number of clusters.
plot_coexpression([set_ids]) – Plot heatmaps of gene correlation matrices.
plot_confusion_matrix([set_ids]) – Plot heatmaps of cell type classification confusion matrices.
plot_evaluations([set_ids, metrics, show, ...]) – Plot detailed results plots for specified metrics.
plot_knn_overlap([set_ids, selections_info]) – Plot mean knn overlap over k.
plot_marker_corr(**kwargs) – Plot maximal correlations with marker genes.
plot_summary([set_ids]) – Plot heatmap of summary metrics.
summary_statistics(set_ids) – Compute summary statistics and update summary csv.