spapros.se.ProbesetSelector

class spapros.se.ProbesetSelector(adata, celltype_key, genes_key='highly_variable', n=None, preselected_genes=[], prior_genes=[], n_pca_genes=100, min_mean_difference=None, n_min_markers=2, celltypes='all', marker_list=None, n_list_markers=2, marker_corr_th=0.5, pca_penalties=[], DE_penalties=[], m_penalties_adata_celltypes=[], m_penalties_list_celltypes=[], pca_selection_hparams={}, DE_selection_hparams={'n': 3, 'per_group': True}, forest_hparams={'n_trees': 50, 'subsample': 1000, 'test_subsample': 3000}, forest_DE_baseline_hparams={'max_step': 3, 'min_outlier_dif': 0.02, 'min_score': 0.9, 'n_DE': 1, 'n_stds': 1.0, 'n_terminal_repeats': 3}, add_forest_genes_hparams={'importance_th': 0, 'n_max_per_it': 5, 'performance_th': 0.02}, marker_selection_hparams={'penalty_threshold': 1}, verbosity=2, seed=0, save_dir=None, n_jobs=-1)

General class for probeset selection.

Notes

The selector creates a probeset which identifies the celltypes of interest and captures transcriptomic variation beyond cell type labels.

The Spapros selection pipeline combines basic feature selection building blocks while optionally taking into account prior knowledge.

The main steps of the selection pipeline are:

  1. PCA based selection of variation recovering genes.

  2. Selection of DE genes.

  3. Train decision trees on the DE genes (including an iterative optimization with additional DE tests).

  4. Train decision trees on the PCA genes (and optionally on pre-selected and prioritized genes).

  5. Enhancement of the PCA trees by adding beneficial DE genes.

  6. Rank genes, add missing marker genes if necessary, and compile the probe set.

The result of the selection is given in ProbesetSelector.probeset.
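A minimal usage sketch of the steps above, assuming spapros is installed and importable as sp, and that probeset is a pandas DataFrame with a boolean selection column. The helper names selector_kwargs and run_selection, the obs key "celltype" and n=50 are illustrative, not part of the library:

```python
def selector_kwargs(celltype_key, n=None, marker_list=None):
    """Collect keyword arguments for ProbesetSelector; everything else keeps its default."""
    kwargs = {"celltype_key": celltype_key, "n": n, "verbosity": 1}
    if marker_list is not None:
        kwargs["marker_list"] = marker_list
    return kwargs


def run_selection(adata, **kwargs):
    """Run the full selection pipeline and return the rows of the selected genes."""
    # Imported here so the helper above can be used without spapros installed.
    import spapros as sp

    selector = sp.se.ProbesetSelector(adata, **kwargs)
    selector.select_probeset()
    probeset = selector.probeset
    # Assumption: 'selection' is a boolean column of the probeset table.
    return probeset[probeset["selection"]]


kwargs = selector_kwargs("celltype", n=50)
# selected = run_selection(adata, **kwargs)  # adata: AnnData with log normalised counts in adata.X
```

Leaving n=None instead lets the selector infer the minimal recommended number of genes, as described for the parameter n.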

Genes are ranked as follows (sorry it’s a bit complicated):

  • First the following groups are built
    1. preselected genes (optional, see parameter preselected_genes)

    2. genes that occur in the best decision trees of each cell type

    3. genes that are needed to achieve the minimal number of markers per cell type that occurs in ProbesetSelector.marker_list but not in ProbesetSelector.adata_celltypes (optional, see parameter n_list_markers). This group is separated from 4. because genes of 2. take care of classifying cell types in ProbesetSelector.adata_celltypes.

    4. genes that are needed to achieve the minimal number of markers per cell type in ProbesetSelector.adata_celltypes. (optional, see parameter n_min_markers)

    5. all other genes

  • Afterwards within each “rank” group genes are further ranked by
    1. the marker_rank: first the best markers of celltypes, then 2nd best markers of celltypes, …, then the n_min_markers-th best markers of celltypes, then genes that are not identified as required markers.

    2. the tree_rank: for each cell type the genes that occur in cell type classification trees with 2nd best performance, then 3rd best performance, and so on. Genes that don’t occur in trees have the worst tree_rank.

    3. the importance_score from the best cell type classification tree of each gene. Genes that don’t occur in any tree score worst.

    4. the pca_score which scores how much variation of the dataset each gene captures.
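The two-level ranking above amounts to a lexicographic sort. The following sketch illustrates the ordering only and is not the spapros implementation. It assumes that each gene carries a group number (1 to 5, as listed above), that None marks a missing marker_rank, tree_rank or importance_score, and that higher importance_score and pca_score are better. The gene names and values are made up:

```python
import math


def ranking_key(gene):
    """Sort key: group first, then marker_rank, tree_rank, importance_score, pca_score."""
    marker_rank = gene["marker_rank"] if gene["marker_rank"] is not None else math.inf
    tree_rank = gene["tree_rank"] if gene["tree_rank"] is not None else math.inf
    # Genes that occur in no tree score worst, so a missing importance maps to -inf.
    importance = gene["importance_score"] if gene["importance_score"] is not None else -math.inf
    # Ranks sort ascending; scores are negated so that higher scores come first.
    return (gene["group"], marker_rank, tree_rank, -importance, -gene["pca_score"])


genes = [
    {"name": "G1", "group": 5, "marker_rank": None, "tree_rank": None, "importance_score": None, "pca_score": 0.9},
    {"name": "G2", "group": 2, "marker_rank": 1, "tree_rank": 1, "importance_score": 0.4, "pca_score": 0.1},
    {"name": "G3", "group": 1, "marker_rank": None, "tree_rank": None, "importance_score": None, "pca_score": 0.0},
    {"name": "G4", "group": 2, "marker_rank": 1, "tree_rank": 1, "importance_score": 0.7, "pca_score": 0.05},
    {"name": "G5", "group": 5, "marker_rank": None, "tree_rank": 2, "importance_score": 0.2, "pca_score": 0.3},
]

ranked = [gene["name"] for gene in sorted(genes, key=ranking_key)]
print(ranked)
```

In this toy input the preselected gene G3 comes first, G4 beats G2 on importance_score, and G5 precedes G1 because it occurs in a tree.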

Parameters:
  • adata (AnnData) – Data with log normalised counts in adata.X. The selection runs with an adata subsetted on fewer genes. It might be helpful though to keep all genes (when a marker_list and penalties are provided). The genes can be subsetted for selection via genes_key.

  • celltype_key (str) – Key in adata.obs with celltype annotations.

  • genes_key (str) – Key in adata.var for preselected genes (typically ‘highly_variable’).

  • n (Optional[int]) – Optionally set the number of finally selected genes. Note that when n is None we automatically infer n as the minimal number of recommended genes. This includes all preselected genes, genes in the best decision tree of each celltype, and the minimal number of identified and added markers defined by n_min_markers and n_list_markers. Also note that setting n might change the gene ranking since the final added list_markers are added based on the theoretically added genes without list_markers.

  • preselected_genes (List[str]) – Pre selected genes (these will also have the highest ranking in the final list).

  • prior_genes (List[str]) – Prioritized genes.

  • n_pca_genes (int) – Optionally set the number of preselected pca genes. If not set or set <1, this step will be skipped.

  • min_mean_difference (float) – Minimal difference of mean expression between at least one celltype and the background. In this test only cell types from celltypes are taken into account (also for the background). This minimal difference is applied as an additional binary penalty in pca_penalties, DE_penalties and m_penalties_adata_celltypes.

  • n_min_markers (int) – The minimal number of identified and added markers.

  • celltypes (Union[List[str], str]) –

    Cell types for which trees are trained.

    • The probeset is optimised to be able to distinguish each of these cell types from all other cells occurring in the dataset.

    • The pca selection is based on all cell types in the dataset (not only on celltypes).

    • The optionally provided marker list can include additional cell types not listed in celltypes (and adata.obs[celltype_key]).

  • marker_list (Union[str, Dict[str, List[str]]]) –

    List of marker genes. Can either be a dictionary like this:

    {
    "celltype_1": ["S100A8", "S100A9", "LYZ", "BLVRB"],
    "celltype_2": ["BIRC3", "TMEM116"],
    "celltype_4": ["CD74", "CD79B", "MS4A1"],
    "celltype_3": ["C5AR1"],
    }
    

    Or the path to a csv-file containing one column of markers for each celltype. The column names need to be the celltype identifiers used in adata.obs[celltype_key].

  • n_list_markers (Union[int, Dict[str, int]]) – Minimal number of markers per celltype that must be selected. Selected means either selecting genes from the marker list or having correlated genes in the already selected panel (set the correlation threshold with marker_corr_th). The correlation based check only applies to cell types that also occur in adata.obs[celltype_key], while for cell types that only occur in the marker_list the markers are just added. If you want to select a different number of markers for celltypes in adata and celltypes only in the marker list, set e.g.: n_list_markers = {'adata_celltypes':2,'list_celltypes':3}.

  • marker_corr_th (float) – Minimal correlation to consider a gene as captured.

  • pca_penalties (list) – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for PCA based gene selection.

  • DE_penalties (list) – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for DE based gene selection.

  • m_penalties_adata_celltypes (list) – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes in adata.

  • m_penalties_list_celltypes (list) – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes not in adata.

  • pca_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the PCA based gene selection.

  • DE_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the DE based gene selection.

  • forest_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the forest based gene selection.

  • forest_DE_baseline_hparams (Dict[str, Any]) – Dictionary with hyperparameters for adding DE genes to decision trees.

  • add_forest_genes_hparams (Dict[str, Any]) – Dictionary with hyperparameters for adding marker genes to decision trees.

  • marker_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the marker selection. So far it only contains penalty_threshold, the threshold for the penalty filtering of marker genes: a marker gene is filtered out if its penalty < threshold.

  • verbosity (int) – Verbosity level.

  • seed (int) – Random number seed.

  • save_dir (Optional[str]) –

    Directory path where all results are saved and loaded from if results already exist. Note for the case that results already exist:

    • if self.select_probeset() was fully run through and all results exist: then the initialization arguments don’t matter much

    • if only partial results were generated, make sure that the initialization arguments are the same as before!

  • n_jobs (int) – Number of cpus for multi processing computations. Set to -1 to use all available cpus.
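The csv layout accepted for marker_list (one column of markers per cell type, column names equal to the cell type identifiers) corresponds to the dictionary form as sketched below. This is an illustration of the two equivalent layouts, not the loader used by spapros. The helper name read_marker_csv is hypothetical, and empty cells are assumed to pad columns of unequal length:

```python
import csv
import io


def read_marker_csv(handle):
    """Turn a csv with one marker column per cell type into {celltype: [markers]}."""
    reader = csv.DictReader(handle)
    markers = {celltype: [] for celltype in reader.fieldnames}
    for row in reader:
        for celltype, gene in row.items():
            if celltype is None:
                continue  # surplus cells beyond the header are ignored
            gene = (gene or "").strip()
            if gene:  # skip padding cells of shorter columns
                markers[celltype].append(gene)
    return markers


csv_text = (
    "celltype_1,celltype_2,celltype_3\n"
    "S100A8,BIRC3,C5AR1\n"
    "S100A9,TMEM116,\n"
    "LYZ,,\n"
)
marker_list = read_marker_csv(io.StringIO(csv_text))

# Different minimal marker counts for cell types in adata and list-only cell types,
# using the dictionary form described for n_list_markers.
n_list_markers = {"adata_celltypes": 2, "list_celltypes": 3}
```

Either the resulting dictionary or the path to the csv file itself can be passed as marker_list.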

Attributes:
  • adata – Data with log normalised counts in adata.X.

  • ct_key – Key in adata.obs with celltype annotations.

  • g_key – Key in adata.var for preselected genes (typically ‘highly_variable’).

  • n – Number of finally selected genes.

  • genes – Pre selected genes (these will also have the highest ranking in the final list).

  • selection – Dictionary with the final and several other gene set selections.

  • n_pca_genes – The number of preselected pca genes. If None or <1, this step is skipped.

  • min_mean_difference – Minimal difference of mean expression between at least one celltype and the background.

  • n_min_markers – The minimal number of identified and added markers for cell types of adata.obs[ct_key].

  • celltypes – Cell types for which trees are trained.

  • adata_celltypes – List of all celltypes occurring in adata.obs[ct_key].

  • obs – Keys of adata.obs on which most of the selections are run.

  • marker_list – Dictionary of the form {'celltype': list of markers of celltype}.

  • n_list_markers – Minimal number of markers from the marker_list that are at least selected per cell type. Note that for those cell types in the marker_list that also occur in adata.obs[ct_key] genes that are correlated with the markers might be selected (see marker_corr_th).

  • marker_corr_th – Minimal correlation to consider a gene as captured.

  • pca_penalties – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for PCA based gene selection.

  • DE_penalties – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for DE based gene selection.

  • m_penalties_adata_celltypes – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes in adata.

  • m_penalties_list_celltypes – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes not in adata.

  • pca_selection_hparams – Dictionary with hyperparameters for the PCA based gene selection.

  • DE_selection_hparams – Dictionary with hyperparameters for the DE based gene selection.

  • forest_hparams – Dictionary with hyperparameters for the forest based gene selection.

  • forest_DE_baseline_hparams – Dictionary with hyperparameters for adding DE genes to decision trees.

  • add_forest_genes_hparams – Dictionary with hyperparameters for adding marker genes to decision trees.

  • m_selection_hparams – Dictionary with hyperparameters for the marker selection. So far it only contains the threshold for the penalty filtering of marker genes: a marker gene is filtered out if its penalty < threshold.

  • verbosity – Verbosity level.

  • seed – Random number seed.

  • save_dir – Directory path where all results are saved and loaded from if results already exist.

  • n_jobs – Number of cpus for multi processing computations. Set to -1 to use all available cpus.

  • forest_results – Forest results.

  • forest_clfs – Forest classifier.

  • min_test_n – Minimal number of samples in each celltype’s test set.

  • loaded_attributes – List of which results were loaded from disk.

  • disable_pbars – Disable progress bars.

  • probeset – The final probeset list. Available only after calling select_probeset(). The table contains the following columns:

    • index

      Gene symbol.

    • gene_nr

      Integer assigned to each gene.

    • selection

      Whether a gene was selected.

    • rank

      Gene ranking as described in Notes above.

    • marker_rank

      Rank of the required markers per cell type. The best marker per cell type has marker_rank 1, the second best 2, and so on. Required markers are ranked till n_min_markers or n_list_markers depending on the cell type.

    • tree_rank

      Ranking of the best tree the gene occurred in. Per cell type multiple decision trees are trained and the best one is selected. To extend the ranking of genes in the probeset list, the 2nd, 3rd, … best performing trees are considered.

    • importance_score

      Highest importance score of a gene in the highest ranked trees that the gene occurred in.

    • pca_score

      Score from PCA-based selection. Genes with high scores capture high amounts of general transcriptomic variation.

    • pre_selected

      Whether a gene was in the list of pre-selected genes.

    • prior_selected

      Whether a gene was in the list of prioritized genes.

    • pca_selected

      Whether a gene was in the list of n_pca_genes of PCA selected genes.

    • celltypes_DE_1vsall

      Cell type in which a given gene is up-regulated (compared to all other cell types as background, identified via differential expression tests during the selection).

    • celltypes_DE_specific

      Like celltypes_DE_1vsall but for DE tests that use a subset of the background (typically genes that distinguish similar cell types).

    • celltypes_DE

      celltypes_DE_1vsall and celltypes_DE_specific combined.

    • celltypes_marker

      celltypes_DE_1vsall combined with celltypes_DE_specific and the cell type of marker_list if the gene was listed as a marker there.

    • list_only_ct_marker

      Whether a gene is listed as a marker in marker_list.

    • required_marker

      Whether a gene was required to reach the minimal number of markers per cell type (n_min_markers, n_list_markers).

    • required_list_marker

      Whether a gene was required to reach the minimal number of markers for cell types that only occur in marker_list but not in adata_celltypes.

  • genes_of_primary_trees – The genes of the best tree of each cell type. Available only after calling select_probeset(). The table contains the following columns:

    • gene

      Gene symbol.

    • celltype

      Cell type in which the tree occurs.

    • importance

      Importance score of the gene for the given cell type.

    • nr_of_celltypes

      Number of primary trees i.e. cell types in which the gene occurs.
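To make the nr_of_celltypes column concrete, the sketch below derives it for a toy version of the genes_of_primary_trees table, written as a plain list of rows. The genes, cell types and importance values are made up, and each gene is assumed to appear at most once per cell type. The real attribute is filled by select_probeset():

```python
from collections import Counter

# Toy rows with the columns gene, celltype and importance.
rows = [
    {"gene": "CD74", "celltype": "B cells", "importance": 0.6},
    {"gene": "MS4A1", "celltype": "B cells", "importance": 0.4},
    {"gene": "CD74", "celltype": "Dendritic cells", "importance": 0.3},
    {"gene": "LYZ", "celltype": "Monocytes", "importance": 1.0},
]

# A gene's nr_of_celltypes is the number of primary trees (cell types) it occurs in.
tree_counts = Counter(row["gene"] for row in rows)
table = [dict(row, nr_of_celltypes=tree_counts[row["gene"]]) for row in rows]

for row in table:
    print(row["gene"], row["celltype"], row["importance"], row["nr_of_celltypes"])
```

Here CD74 occurs in the primary trees of two cell types and therefore gets nr_of_celltypes 2 in both of its rows.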

Methods

info()

Print info.

plot_clf_genes([basis, celltypes, ...])

Plot umaps of selected genes needed for cell type classification of each cell type.

plot_coexpression([selections])

Plot correlation matrix of selected genes.

plot_explore_constraint([selection_method, ...])

Plot histogram of quantiles for selected genes for different penalty kernels.

plot_gene_overlap([origins])

Plot the overlap of origins for the selected genes.

plot_histogram([x_axis_keys, selections, ...])

Plot histograms of (basic) selections under given penalties.

select_probeset()

Run full selection procedure.