spapros.ev.single_forest_classifications

spapros.ev.single_forest_classifications(adata, selection, celltypes='all', ref_celltypes='all', ct_key='Celltypes', ct_spec_ref=None, save=False, seed=0, n_trees=50, max_depth=3, subsample=1000, test_subsample=3000, sort_by_tree_performance=True, verbose=False, return_clfs=False, n_jobs=1, backend='loky', progress=None, level=3, task='Train trees...')

Compute or load decision tree classification results.

Notes

Metrics: the summary statistic is the macro F1 score. Because we sample uniformly, it is a uniformly weighted statistic with respect to the celltype groups in ‘others’. For the reference-celltype-specific metric we use specificity = TN / (FP + TN), since FN and TP cannot occur in the given setting.
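The two metrics can be illustrated with a minimal pure-Python sketch. This is not the spapros implementation; the function names `macro_f1` and `specificity` and the toy labels are assumptions made for illustration. It also shows why only TN and FP occur in the specificity case: the test sample consists solely of cells from another celltype, so no positives are present.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, computed from TP, FP and FN counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def specificity(y_pred, positive_label):
    """TN / (FP + TN) on a test sample that contains no positive cells."""
    tn = sum(p != positive_label for p in y_pred)
    fp = sum(p == positive_label for p in y_pred)
    return tn / (fp + tn)


# Hypothetical toy data: a tree that separates 'AT1' from 'other'.
y_true = ["AT1", "AT1", "other", "other", "other", "other"]
y_pred = ["AT1", "other", "other", "other", "AT1", "other"]
summary = macro_f1(y_true, y_pred)

# Predictions of the same tree on cells of one reference celltype (all negatives).
preds_on_reference = ["other", "other", "AT1", "other"]
spec = specificity(preds_on_reference, positive_label="AT1")
```

In this toy case the per-label F1 scores are 0.5 and 0.75, so the macro F1 is their plain average.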

Parameters:
  • adata (AnnData) – An already preprocessed annotated data matrix. Typically we use log normalised data.

  • selection (Union[list, DataFrame]) – Trees are trained on genes of the list or genes defined in the bool column selection['selection'].

  • celltypes (Union[str, list]) – List of celltypes to train trees on, or 'all'.

  • ref_celltypes (Union[str, list]) – List of celltypes used as reference or 'all'.

  • ct_key (str) – Column name of adata.obs with celltype information.

  • ct_spec_ref (Optional[Dict[str, List[str]]]) – Celltype specific references (e.g.: {'AT1':['AT1','AT2','Club'],'Pericytes':['Pericytes','Smooth muscle']}). This argument was introduced to train secondary trees.

  • save (Union[str, bool]) – If not False load results if the given file exists, otherwise save results after computation.

  • n_trees (int) – Number of trees to train.

  • seed (int) – Random seed.

  • max_depth (int) – max_depth argument of DecisionTreeClassifier.

  • subsample (int) – For each trained tree we use samples of at most `subsample` cells per celltype. If fewer cells are present for a given celltype, all cells are used.

  • test_subsample (int) – Number of random choices for drawing test sets.

  • sort_by_tree_performance (bool) – Whether to sort results and trees by tree performance (best first) per celltype.

  • verbose (bool) – Verbosity level > 1.

  • return_clfs (bool) – Whether to return the sklearn tree classifier objects. (If return_clfs is True and results are saved or loaded via save, only the results tables are saved; to save the classifiers, do so separately.)

  • n_jobs (int) – Multiprocessing number of processes.

  • backend (str) – Which backend to use for multiprocessing. See class joblib.Parallel for valid options.

  • progress (Optional[Progress]) – rich.Progress object if progress bars should be shown.

  • level (int) – Progress bar level.

  • task (str) – Description of progress task.
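The subsampling rule described for `subsample` can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `subsample_per_celltype` is hypothetical, and drawing without replacement is an assumption that the documentation above does not confirm.

```python
import random
from collections import defaultdict


def subsample_per_celltype(labels, subsample, seed=0):
    """Pick at most `subsample` cell indices per celltype.

    Assumption: sampling is done without replacement.
    """
    rng = random.Random(seed)

    # Group cell indices by their celltype label.
    indices_by_celltype = defaultdict(list)
    for idx, celltype in enumerate(labels):
        indices_by_celltype[celltype].append(idx)

    chosen = {}
    for celltype, indices in indices_by_celltype.items():
        if len(indices) <= subsample:
            # Fewer cells than requested: use all of them.
            chosen[celltype] = list(indices)
        else:
            chosen[celltype] = sorted(rng.sample(indices, subsample))
    return chosen


# Hypothetical labels: five 'A' cells and two 'B' cells.
labels = ["A"] * 5 + ["B"] * 2
picked = subsample_per_celltype(labels, subsample=3, seed=0)
```

Here 'A' is reduced to three cells, while 'B' keeps both of its cells because it has fewer than `subsample`.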

Returns:

tuple containing:

  • summary_metric: pd.DataFrame

    Macro F1 scores for each celltype’s trees (ordered by tree performance, best first).

  • ct_specific_metric: dict of pd.DataFrame

    For each celltype’s tree: specificity (= TN / (FP+TN)) wrt each other celltype’s test sample

  • importances: dict of pd.DataFrame

    Gene’s feature importances for each tree.

  • forests: dict

    Only returned if return_clfs=True. In that case the other three return values are packed in a list: [summary_metric, ct_specific_metric, importances], forests.

Return type:

tuple
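Because the shape of the return value depends on return_clfs, calling code may want to normalise it. The sketch below assumes the layout described above (three values, or a list of three values plus forests); the helper `unpack_results` is hypothetical and not part of spapros.

```python
def unpack_results(result, return_clfs):
    """Return (summary_metric, ct_specific_metric, importances, forests).

    `forests` is None when the classifiers were not requested.
    """
    if return_clfs:
        # Layout assumed from the docs: [summary, ct_specific, importances], forests
        (summary, ct_specific, importances), forests = result
    else:
        summary, ct_specific, importances = result
        forests = None
    return summary, ct_specific, importances, forests


# Placeholder objects stand in for the real DataFrames and dicts.
without_clfs = unpack_results(("summary", "ct_specific", "importances"), return_clfs=False)
with_clfs = unpack_results((["summary", "ct_specific", "importances"], "forests"), return_clfs=True)
```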

Note

In all output files, trees are ordered according to macro F1 performance.