spapros.ev.single_forest_classifications

spapros.ev.single_forest_classifications(adata, selection, celltypes='all', ref_celltypes='all', ct_key='Celltypes', ct_spec_ref=None, save=False, seed=0, n_trees=50, max_depth=3, subsample=1000, test_subsample=3000, sort_by_tree_performance=True, verbose=False, return_clfs=False, n_jobs=1, backend='loky', progress=None, level=3, task='Train trees...')

Compute or load decision tree classification results.

Notes

Two metrics are used. The summary statistic is the macro F1 score; because we sample uniformly, it weights the celltype groups in ‘others’ uniformly. The reference-celltype-specific metric is specificity = TN/(FP+TN), chosen in part because FN and TP cannot occur in the given setting.
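The two metrics above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `macro_f1` and `specificity`, the labels, and the toy predictions are all made up for illustration, and it assumes a binary tree that predicts either its own celltype or "other".

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def specificity(y_true, y_pred, positive):
    """TN / (FP + TN), computed over cells whose true label is not `positive`."""
    negatives = [p for t, p in zip(y_true, y_pred) if t != positive]
    tn = sum(p != positive for p in negatives)
    fp = sum(p == positive for p in negatives)
    return tn / (fp + tn)


# Toy test set for a hypothetical tree trained on celltype "AT1".
y_true = ["AT1", "AT1", "other", "other", "other", "other"]
y_pred = ["AT1", "other", "other", "other", "AT1", "other"]
summary = macro_f1(y_true, y_pred)

# Assumed setting: a test sample drawn from one reference celltype only,
# so every true label is "other" and only TN and FP can be counted.
ref_true = ["other"] * 4
ref_pred = ["other", "other", "AT1", "other"]
ref_specificity = specificity(ref_true, ref_pred, positive="AT1")
```

In the second sample no cell has the true label "AT1", which is why only specificity is meaningful there.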

Parameters:

adata (AnnData) – An already preprocessed annotated data matrix. Typically we use log normalised data.
selection (Union[list, DataFrame]) – Trees are trained on genes of the list or genes defined in the bool column selection['selection'].
celltypes (Union[str, list]) – Celltypes to train trees on: a list of celltypes, or 'all'.
ref_celltypes (Union[str, list]) – List of celltypes used as reference or 'all'.
ct_key (str) – Column name of adata.obs with celltype annotations.
ct_spec_ref (Optional[Dict[str, List[str]]]) – Celltype specific references (e.g.: {'AT1':['AT1','AT2','Club'],'Pericytes':['Pericytes','Smooth muscle']}). This argument was introduced to train secondary trees.
save (Union[str, bool]) – If not False, load results if the given file exists; otherwise save results to that file after computation.
n_trees (int) – Number of trees to train.
seed (int) – Random seed.
max_depth (int) – max_depth argument of DecisionTreeClassifier.
subsample (int) – For each trained tree, at most `subsample` cells are sampled per celltype. If fewer cells are present for a given celltype, all of its cells are used.
test_subsample (int) – Number of random choices for drawing test sets.
sort_by_tree_performance (bool) – Whether to sort results and trees by tree performance (best first) per celltype.
verbose (bool) – Verbosity level > 1.
return_clfs (bool) – Whether to return the sklearn tree classifier objects. (If return_clfs and save are both set, only the results tables are saved; to save the classifiers, do so separately.)
n_jobs (int) – Multiprocessing number of processes.
backend (str) – Which backend to use for multiprocessing. See class joblib.Parallel for valid options.
progress (Optional[Progress]) – rich.Progress object if progress bars should be shown.
level (int) – Progress bar level.
task (str) – Description of progress task.
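The per-celltype sampling rule described for `subsample` can be sketched in plain Python. This is a minimal illustration under assumptions, not the library's code: the helper `subsample_per_celltype` and the toy labels are hypothetical, and the real function may sample differently.

```python
import random
from collections import defaultdict


def subsample_per_celltype(labels, subsample, seed=0):
    """Pick at most `subsample` cell indices per celltype.

    Celltypes with fewer cells than `subsample` keep all of their cells.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, celltype in enumerate(labels):
        by_type[celltype].append(idx)

    chosen = []
    for celltype in sorted(by_type):
        idxs = by_type[celltype]
        if len(idxs) <= subsample:
            chosen.extend(idxs)  # too few cells: use them all
        else:
            chosen.extend(rng.sample(idxs, subsample))  # draw without replacement
    return sorted(chosen)


# Toy annotation: five "AT1" cells and two "AT2" cells.
labels = ["AT1"] * 5 + ["AT2"] * 2
picked = subsample_per_celltype(labels, subsample=3)
picked_labels = [labels[i] for i in picked]
```

With `subsample=3`, three of the five "AT1" cells are drawn and both "AT2" cells are kept.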

Returns:

tuple containing:

summary_metric: pd.DataFrame
Macro F1 scores for each celltype’s trees (ordered with the best-performing trees first).

ct_specific_metric: dict of pd.DataFrame
For each celltype’s tree: specificity (= TN / (FP+TN)) wrt each other celltype’s test sample

importances: dict of pd.DataFrame
Feature importances of the genes for each tree.

forests: dict
Only returned if return_clfs=True. In that case the other three return values are packed in a list: [summary_metric, ct_specific_metric, importances], forests.

Return type:

tuple
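Because the shape of the return value depends on return_clfs, a small wrapper can normalise both cases. The sketch below is an illustration only: `unpack_results` is a hypothetical helper, the stand-in values replace real DataFrames and classifiers, and it assumes that three values are returned when return_clfs=False.

```python
def unpack_results(result, return_clfs):
    """Return (summary_metric, ct_specific_metric, importances, forests).

    `forests` is None when the classifiers were not requested.
    """
    if return_clfs:
        # Documented shape: [summary_metric, ct_specific_metric, importances], forests
        tables, forests = result
        summary_metric, ct_specific_metric, importances = tables
    else:
        # Assumed shape: the three result tables only
        summary_metric, ct_specific_metric, importances = result
        forests = None
    return summary_metric, ct_specific_metric, importances, forests


# Stand-in values; real results would come from a call such as
# spapros.ev.single_forest_classifications(adata, selection, return_clfs=True)
with_clfs = (["summary", {"AT1": "specificity"}, {"AT1": "importances"}], {"AT1": ["tree"]})
without_clfs = ("summary", {"AT1": "specificity"}, {"AT1": "importances"})

summary_a, ct_spec_a, imp_a, forests_a = unpack_results(with_clfs, return_clfs=True)
summary_b, ct_spec_b, imp_b, forests_b = unpack_results(without_clfs, return_clfs=False)
```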

Note

In all output files, trees are ordered according to macro F1 performance.