spapros.se.ProbesetSelector
- class spapros.se.ProbesetSelector(adata, celltype_key, genes_key='highly_variable', n=None, preselected_genes=[], prior_genes=[], n_pca_genes=100, min_mean_difference=None, n_min_markers=2, celltypes='all', marker_list=None, n_list_markers=2, marker_corr_th=0.5, pca_penalties=[], DE_penalties=[], m_penalties_adata_celltypes=[], m_penalties_list_celltypes=[], pca_selection_hparams={}, DE_selection_hparams={'n': 3, 'per_group': True}, forest_hparams={'n_trees': 50, 'subsample': 1000, 'test_subsample': 3000}, forest_DE_baseline_hparams={'max_step': 3, 'min_outlier_dif': 0.02, 'min_score': 0.9, 'n_DE': 1, 'n_stds': 1.0, 'n_terminal_repeats': 3}, add_forest_genes_hparams={'importance_th': 0, 'n_max_per_it': 5, 'performance_th': 0.02}, marker_selection_hparams={'penalty_threshold': 1}, verbosity=2, seed=0, save_dir=None, n_jobs=-1)
General class for probeset selection.
Notes
The selector creates a probeset which identifies the celltypes of interest and captures transcriptomic variation beyond cell type labels.
The Spapros selection pipeline combines basic feature selection building blocks while optionally taking into account prior knowledge.
The main steps of the selection pipeline are:
PCA based selection of variation recovering genes.
Selection of DE genes.
Train decision trees on the DE genes (including an iterative optimization with additional DE tests).
Train decision trees on the PCA genes (and optionally on pre-selected and prioritized genes).
Enhancement of the PCA trees by adding beneficial DE genes.
Rank genes, add missing marker genes if needed, and compile the probe set.
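A minimal usage sketch, assuming spapros is installed and that adata holds log-normalised counts together with a cell type column. The wrapper function run_selection, the column name "celltype" and the value n=50 are placeholders chosen for illustration, not defaults or recommendations. Only the constructor arguments, select_probeset() and the probeset table documented on this page are used.

```python
def run_selection(adata, celltype_key="celltype", n=50):
    """Run the selection pipeline on `adata` and return the selected genes.

    Hypothetical helper: `celltype_key` and `n` are placeholder values.
    """
    # Imported lazily so the sketch can be read without spapros installed.
    from spapros.se import ProbesetSelector

    selector = ProbesetSelector(
        adata,
        celltype_key=celltype_key,
        genes_key="highly_variable",  # default shown in the signature above
        n=n,
        verbosity=1,
        save_dir=None,  # no results directory is given
        n_jobs=-1,  # use all available CPUs
    )
    selector.select_probeset()  # runs the full selection procedure

    # `probeset` is indexed by gene symbol; `selection` flags selected genes.
    probeset = selector.probeset
    return probeset.index[probeset["selection"]].tolist()
```

Calling run_selection(adata) on a suitable dataset would return the gene symbols flagged in the selection column.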
The result of the selection is given in ProbesetSelector.probeset.
Genes are ranked as follows (sorry, it’s a bit complicated):
- First the following groups are built:
preselected genes (optional, see parameter preselected_genes)
genes that occur in the best decision trees of each cell type
genes that are needed to achieve the minimal number of markers per cell type that occurs in ProbesetSelector.marker_list but not in ProbesetSelector.adata_celltypes (optional, see parameter n_list_markers). This group is kept separate from the marker group for ProbesetSelector.adata_celltypes below because the genes from the best decision trees already take care of classifying the cell types in ProbesetSelector.adata_celltypes.
genes that are needed to achieve the minimal number of markers per cell type in ProbesetSelector.adata_celltypes (optional, see parameter n_min_markers)
all other genes
- Afterwards within each “rank” group genes are further ranked by
the marker_rank: first the best markers of celltypes, then 2nd best markers of celltypes, …, then n_min_markers th best marker of celltypes, then genes that are not identified as required markers.
the tree_rank: for each cell type the genes that occur in cell type classification trees with 2nd best performance, then 3rd best performance, and so on. Genes that don’t occur in trees have the worst tree_rank.
the importance_score from the best cell type classification tree of each gene. Genes that don’t occur in any tree score worst.
the pca_score which scores how much variation of the dataset each gene captures.
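The ranking described above can be read as a multi-key sort. The following is a simplified sketch, not the library's implementation: it assumes the rank groups are numbered 1 to 5 in the order listed, that the four criteria act as successive tie-breakers, and that a missing marker_rank or tree_rank is represented by infinity. The gene names and values are made up.

```python
import math

# Hypothetical per-gene records; all values are invented for illustration.
genes = [
    {"gene": "G1", "group": 2, "marker_rank": math.inf, "tree_rank": 1,
     "importance_score": 0.40, "pca_score": 0.10},
    {"gene": "G2", "group": 1, "marker_rank": math.inf, "tree_rank": math.inf,
     "importance_score": 0.00, "pca_score": 0.05},
    {"gene": "G3", "group": 2, "marker_rank": 1, "tree_rank": 1,
     "importance_score": 0.20, "pca_score": 0.30},
    {"gene": "G4", "group": 5, "marker_rank": math.inf, "tree_rank": 2,
     "importance_score": 0.10, "pca_score": 0.90},
    {"gene": "G5", "group": 5, "marker_rank": math.inf, "tree_rank": 2,
     "importance_score": 0.10, "pca_score": 0.95},
]


def rank_key(g):
    """Lower tuples rank first; scores are negated so higher scores win."""
    return (
        g["group"],
        g["marker_rank"],
        g["tree_rank"],
        -g["importance_score"],
        -g["pca_score"],
    )


ranked = [g["gene"] for g in sorted(genes, key=rank_key)]
print(ranked)
```

In this toy data the preselected gene comes first, the required marker precedes the other tree gene within its group, and the final tie between G4 and G5 is decided by pca_score.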
- Parameters:
adata (AnnData) – Data with log normalised counts in adata.X. The selection runs on an adata subset to fewer genes. It can still be helpful to keep all genes in adata (when a marker_list and penalties are provided); the genes used for selection can be subset via genes_key.
celltype_key (str) – Key in adata.obs with celltype annotations.
genes_key (str) – Key in adata.var for preselected genes (typically ‘highly_variable_genes’).
n (Optional[int]) – Optionally set the number of finally selected genes. Note that when n is None we automatically infer n as the minimal number of recommended genes. This includes all preselected genes, genes in the best decision tree of each celltype, and the minimal number of identified and added markers defined by n_min_markers and n_list_markers. Also note that setting n might change the gene ranking since the final added list_markers are added based on the theoretically added genes without list_markers.
preselected_genes (List[str]) – Pre selected genes (these will also have the highest ranking in the final list).
prior_genes (List[str]) – Prioritized genes.
n_pca_genes (int) – Optionally set the number of preselected pca genes. If not set or set <1, this step will be skipped.
min_mean_difference (float) – Minimal difference of mean expression between at least one celltype and the background. In this test only cell types from celltypes are taken into account (also for the background). This minimal difference is applied as an additional binary penalty in pca_penalties, DE_penalties and m_penalties_adata_celltypes.
n_min_markers (int) – The minimal number of identified and added markers.
celltypes (Union[List[str], str]) –
Cell types for which trees are trained.
The probeset is optimised to be able to distinguish each of these cell types from all other cells occurring in the dataset.
The pca selection is based on all cell types in the dataset (not only on celltypes).
The optionally provided marker list can include additional cell types not listed in celltypes (and adata.obs[celltype_key]).
marker_list (Union[str, Dict[str, List[str]]]) –
List of marker genes. Can either be a dictionary like this:
    {
        "celltype_1": ["S100A8", "S100A9", "LYZ", "BLVRB"],
        "celltype_2": ["BIRC3", "TMEM116"],
        "celltype_4": ["CD74", "CD79B", "MS4A1"],
        "celltype_3": ["C5AR1"],
    }
Or the path to a csv-file containing one column of markers for each celltype. The column names need to be the celltype identifiers used in adata.obs[celltype_key].
n_list_markers (Union[int, Dict[str, int]]) – Minimal number of markers per celltype that are at least selected. Selected means either selecting genes from the marker list or having correlated genes in the already selected panel. (Set the correlation threshold with marker_selection_hparams['penalty_threshold'].) The correlation based check only applies to cell types that also occur in adata.obs[celltype_key], while for cell types that only occur in the marker_list the markers are just added. If you want to select a different number of markers for celltypes in adata and celltypes only in the marker list, set e.g.: n_list_markers = {'adata_celltypes':2,'list_celltypes':3}.
marker_corr_th (float) – Minimal correlation to consider a gene as captured.
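The idea of a marker being "captured" (either selected itself or correlated with a gene already in the panel) can be sketched as follows. This is an illustration under assumptions, not the library's implementation: it uses a plain Pearson correlation on toy values, and the helper names pearson and n_captured are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def n_captured(markers, panel, expr, corr_th=0.5):
    """Count markers that are in `panel` or correlate with a panel gene."""
    count = 0
    for m in markers:
        if m in panel or any(pearson(expr[m], expr[g]) >= corr_th for g in panel):
            count += 1
    return count


# Toy expression values per gene across five cells (made up).
expr = {
    "LYZ": [0.0, 1.0, 2.0, 3.0, 4.0],
    "S100A8": [0.1, 1.1, 1.9, 3.2, 3.9],  # closely follows LYZ
    "MS4A1": [4.0, 0.0, 3.0, 1.0, 2.0],  # weakly anti-correlated with LYZ
}
panel = {"LYZ"}
captured = n_captured(["S100A8", "MS4A1"], panel, expr, corr_th=0.5)

# With a requirement of 2 markers, one more marker would have to be added.
missing = max(0, 2 - captured)
```

Here S100A8 counts as captured through its correlation with LYZ, while MS4A1 does not, so one marker is still missing for a requirement of two.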
pca_penalties (list) – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for PCA based gene selection.
DE_penalties (list) – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for DE based gene selection.
m_penalties_adata_celltypes (list) – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes in adata.
m_penalties_list_celltypes (list) – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes not in adata.
pca_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the PCA based gene selection.
DE_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the DE based gene selection.
forest_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the forest based gene selection.
forest_DE_baseline_hparams (Dict[str, Any]) – Dictionary with hyperparameters for adding DE genes to decision trees.
add_forest_genes_hparams (Dict[str, Any]) – Dictionary with hyperparameters for adding marker genes to decision trees.
marker_selection_hparams (Dict[str, Any]) – Dictionary with hyperparameters for the marker selection. So far it only contains penalty_threshold: marker genes whose penalty is below this threshold are filtered out.
verbosity (int) – Verbosity level.
seed (int) – Random number seed.
save_dir (Optional[str]) –
Directory path where all results are saved and loaded from if results already exist. Note for the case that results already exist:
if self.select_probeset() was fully run through and all results exist: then the initialization arguments don’t matter much
if only partial results were generated, make sure that the initialization arguments are the same as before!
n_jobs (int) – Number of cpus for multi processing computations. Set to -1 to use all available cpus.
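How the penalty factors referenced by pca_penalties and DE_penalties act on selection scores can be illustrated with a small sketch. It assumes that each penalty column holds one factor per gene and that all listed columns are multiplied into the score; plain dictionaries stand in for adata.var, and the column names, gene names and values are hypothetical.

```python
# Hypothetical selection scores per gene.
scores = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.5}

# Stand-in for penalty columns in adata.var (one factor per gene).
var = {
    "expression_penalty_upper": {"GENE_A": 0.1, "GENE_B": 1.0, "GENE_C": 1.0},
    "expression_penalty_lower": {"GENE_A": 1.0, "GENE_B": 0.5, "GENE_C": 1.0},
}
pca_penalties = ["expression_penalty_upper", "expression_penalty_lower"]


def apply_penalties(scores, var, penalty_keys):
    """Multiply each gene's score with its factor from every penalty column."""
    penalised = dict(scores)
    for key in penalty_keys:
        for gene in penalised:
            penalised[gene] *= var[key][gene]
    return penalised


penalised = apply_penalties(scores, var, pca_penalties)
best = max(penalised, key=penalised.get)
```

In this toy case the gene with the highest raw score (GENE_A) drops behind the others once its penalty factor of 0.1 is applied.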
- Attributes:
adata – Data with log normalised counts in adata.X.
ct_key – Key in adata.obs with celltype annotations.
g_key – Key in adata.var for preselected genes (typically ‘highly_variable_genes’).
n – Number of finally selected genes.
genes – Pre selected genes (these will also have the highest ranking in the final list).
selection – Dictionary with the final and several other gene set selections.
n_pca_genes – The number of preselected pca genes. If None or <1, this step is skipped.
min_mean_difference – Minimal difference of mean expression between at least one celltype and the background.
n_min_markers – The minimal number of identified and added markers for cell types of adata.obs[ct_key].
celltypes – Cell types for which trees are trained.
adata_celltypes – List of all celltypes occurring in adata.obs[ct_key].
obs – Keys of adata.obs on which most of the selections are run.
marker_list – Dictionary of the form {'celltype': list of markers of celltype}.
n_list_markers – Minimal number of markers from the marker_list that are at least selected per cell type. Note that for those cell types in the marker_list that also occur in adata.obs[ct_key], genes that are correlated with the markers might be selected (see marker_corr_th).
marker_corr_th – Minimal correlation to consider a gene as captured.
pca_penalties – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for PCA based gene selection.
DE_penalties – List of keys for columns in adata.var containing penalty factors that are multiplied with the scores for DE based gene selection.
m_penalties_adata_celltypes – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes in adata.
m_penalties_list_celltypes – List of keys for columns in adata.var containing penalty factors to filter out marker genes if a gene’s penalty < threshold for celltypes not in adata.
pca_selection_hparams – Dictionary with hyperparameters for the PCA based gene selection.
DE_selection_hparams – Dictionary with hyperparameters for the DE based gene selection.
forest_hparams – Dictionary with hyperparameters for the forest based gene selection.
forest_DE_baseline_hparams – Dictionary with hyperparameters for adding DE genes to decision trees.
add_forest_genes_hparams – Dictionary with hyperparameters for adding marker genes to decision trees.
m_selection_hparams – Dictionary with hyperparameters for the marker selection. So far it only contains the threshold for the penalty filtering: marker genes whose penalty is below the threshold are filtered out.
verbosity – Verbosity level.
seed – Random number seed.
save_dir – Directory path where all results are saved and loaded from if results already exist.
n_jobs – Number of cpus for multi processing computations. Set to -1 to use all available cpus.
forest_results – Forest results.
forest_clfs – Forest classifier.
min_test_n – Minimal number of samples in each celltype’s test set
loaded_attributes – List of which results were loaded from disc.
disable_pbars – Disable progress bars.
probeset – The final probeset list. Available only after calling select_probeset(). The table contains the following columns:
- index
Gene symbol.
- gene_nr
Integer assigned to each gene.
- selection
Whether a gene was selected.
- rank
Gene ranking as described in Notes above.
- marker_rank
Rank of the required markers per cell type. The best marker per cell type has marker_rank 1, the second best 2, and so on. Required markers are ranked up to n_min_markers or n_list_markers, depending on the cell type.
- tree_rank
Ranking of the best tree the gene occurred in. Per cell type multiple decision trees are trained and the best one is selected. To extend the ranking of genes in the probeset list, the 2nd, 3rd, … best performing trees are considered.
- importance_score
Highest importance score of a gene in the highest ranked trees that the gene occurred in. (see TODO: reference tree training fct and there the description of the output)
- pca_score
Score from PCA-based selection (see TODO: document pca based selection and reference procedure here). Genes with high scores capture high amounts of general transcriptomic variation.
- pre_selected
Whether a gene was in the list of pre-selected genes.
- prior_selected
Whether a gene was in the list of prioritized genes.
- pca_selected
Whether a gene was in the list of n_pca_genes of PCA selected genes.
- celltypes_DE_1vsall
Cell type in which a given gene is up-regulated (compared to all other cell types as background, identified via differential expression tests during the selection).
- celltypes_DE_specific
Like celltypes_DE_1vsall but for DE tests that use a subset of the background (typically genes that distinguish similar cell types).
- celltypes_DE
celltypes_DE_1vsall and celltypes_DE_specific combined.
- celltypes_marker
celltypes_DE_1vsall combined with celltypes_DE_specific and the cell type of marker_list if the gene was listed as a marker there.
- list_only_ct_marker
Whether a gene is listed as a marker in marker_list.
- required_marker
Whether a gene was required to reach the minimal number of markers per cell type (n_min_markers, n_list_markers).
- required_list_marker
Whether a gene was required to reach the minimal number of markers for cell types that only occur in marker_list but not in adata_celltypes.
genes_of_primary_trees – The genes of the best tree of each cell type. Available only after calling select_probeset(). The table contains the following columns:
- gene
Gene symbol.
- celltype
Cell type in which the tree occurs.
- importance
Importance score of the gene for the given cell type.
- nr_of_celltypes
Number of primary trees i.e. cell types in which the gene occurs.
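The nr_of_celltypes column can be understood as a per-gene count over the rows of this table. A small sketch with made-up rows, assuming each (gene, celltype) pair appears once; plain dictionaries stand in for the table rows:

```python
from collections import Counter

# Hypothetical rows of a genes_of_primary_trees-like table.
rows = [
    {"gene": "CD74", "celltype": "B cells", "importance": 0.6},
    {"gene": "LYZ", "celltype": "Monocytes", "importance": 0.7},
    {"gene": "CD74", "celltype": "Dendritic cells", "importance": 0.3},
]

# Count in how many primary trees (i.e. cell types) each gene occurs.
counts = Counter(row["gene"] for row in rows)
for row in rows:
    row["nr_of_celltypes"] = counts[row["gene"]]
```

In this toy table CD74 occurs in the primary trees of two cell types and LYZ in one.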
Methods
info() – Print info.
plot_clf_genes([basis, celltypes, ...]) – Plot umaps of selected genes needed for cell type classification of each cell type.
plot_coexpression([selections]) – Plot correlation matrix of selected genes.
plot_explore_constraint([selection_method, ...]) – Plot histogram of quantiles for selected genes for different penalty kernels.
plot_gene_overlap([origins]) – Plot the overlap of origins for the selected genes.
plot_histogram([x_axis_keys, selections, ...]) – Plot histograms of (basic) selections under given penalties.
Run full selection procedure.