pyviper package
pyviper.aREA
- pyviper.aREA(gex_data, interactome, layer=None, eset_filter=False, min_targets=30, mvws=1, device='cpu', rank_ordinal=False, verbose=True)
Infers normalized enrichment scores (NES) from gene expression data using the Analytical Ranked Enrichment Analysis (aREA)[1] function.
It is the original basis of the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm.
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unpruned network and then prune, so that the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data. A consistent number of targets allows regulators to have NES scores that are comparable to one another. A regulator that has more targets than others will have “boosted” NES scores that cannot be compared to those of regulators with fewer targets.
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy) or in a pd.DataFrame.
interactome – An object of class Interactome.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: False) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). The reason users may choose to use this threshold is because adequate targets are needed to accurately predict enrichment.
mvws (default: 1) – One of the following: (A) A number giving the exponent for the metaVIPER weights. These weights are only applicable when enrichment = ‘area’ and are not used when enrichment = ‘narnea’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same celltype with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different celltypes with different epigenetics). (B) The name of a column in gex_data that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
device (default: 'cpu') – Whether to use the CPU or GPU on your device for the calculation of the aREA function. Using a GPU can improve the speed of the function. Using ‘mps’ or ‘cuda’ will produce slight differences (mean difference in NES around 1E-6), while Pearson and Spearman correlations remain >0.999.
rank_ordinal (default: False) – (A) Whether to use ordinal ranking from PyTorch instead of averaged ranking from SciPy. Setting to False will use averaged ranking, which is slower but more stable/consistent. (B) Using the ranks, aREA then assigns each gene a score based on the inverse CDF of a standard distribution (a z-like score), so some genes can receive different values than under averaged ranking. The sign of the NES is based solely on the sign of the dES. Therefore, if the dES was already close to 0, this small difference can flip the sign of some protein NES scores (roughly 1 in every 1.5 million NES scores). The mean difference in NES relative to averaged ranking is around 1E-6 in magnitude, with Pearson and Spearman correlations remaining >0.999.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
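The two ranking modes described for rank_ordinal can be illustrated with a small standard-library sketch. This is not pyviper's implementation: the helper names (average_ranks, ordinal_ranks, rank_to_score) and the rank / (n + 1) quantile mapping are assumptions, chosen only to show how tied expression values share one score under averaged ranking but receive different scores under ordinal ranking.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def ordinal_ranks(values):
    """Rank values from 1..n, breaking ties by original position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_to_score(rank, n):
    """Map a rank to a z-like score with the standard normal inverse CDF.

    The rank / (n + 1) quantile is an assumed mapping for illustration.
    """
    return NormalDist().inv_cdf(rank / (n + 1))


# Genes 0 and 2 have identical expression (a tie).
expression = [0.0, 2.5, 0.0, 1.0]
avg = average_ranks(expression)
ordn = ordinal_ranks(expression)
avg_scores = [rank_to_score(r, len(expression)) for r in avg]
ord_scores = [rank_to_score(r, len(expression)) for r in ordn]
```

Under averaged ranking the two tied genes share rank 1.5 and therefore the same score; under ordinal ranking they get ranks 1 and 2 and slightly different scores, which is the kind of small difference the parameter description refers to.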
- Return type:
A pd.DataFrame containing NES values.
References
[1] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics, 48(8), 838-847.
pyviper.NaRnEA
- pyviper.NaRnEA(gex_data, interactome, layer=None, eset_filter=False, min_targets=30, verbose=True)
Infers normalized enrichment scores (NES) and proportional enrichment scores (PES) from gene expression data using the Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA)[1] function. NaRnEA is an updated basis for the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm.
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unpruned network and then prune to ensure the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data. A regulator that has more targets than others will have “boosted” NES scores, such that they cannot be compared to those with fewer targets.
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy) or in a pd.DataFrame.
interactome – An object of class Interactome.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: False) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). The reason users may choose to use this threshold is because adequate targets are needed to accurately predict enrichment.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
- Returns:
A dictionary of numpy.ndarray objects containing NES values (key: ‘nes’) and PES values (key: ‘pes’).
References
[1] Griffin, A. T., Vlahos, L. J., Chiuzan, C., & Califano, A. (2023). NaRnEA: An Information Theoretic Framework for Gene Set Analysis. Entropy, 25(3), 542.
pyviper.config
- pyviper.config.set_regulators_filepath(group, species, new_filepath)
Allows the user to use a custom list of regulatory proteins instead of the default ones within pyVIPER’s data folder.
- Parameters:
group – The group of regulatory proteins, one of: “tfs”, “cotfs”, “sig” or “surf”.
species – The species to which the group of proteins belongs: “human” or “mouse”.
new_filepath – The new filepath that should be used to retrieve these sets of proteins.
- Return type:
None
- pyviper.config.set_regulators_species_to_use(species)
Allows the user to specify which species they are currently studying, so the correct sets of regulatory proteins will be used during analysis.
- Parameters:
species – The species to which the group of proteins belongs: “human” or “mouse”.
- Return type:
None
pyviper.Interactome
- class pyviper.Interactome(name, net_table=None, input_type=None)
Bases: object
Create an Interactome object to contain the results of ARACNe. This object describes the relationship between regulator proteins (e.g., TFs and CoTFs) and their downstream target genes, with mor (Mode Of Regulation, e.g., Spearman correlation) indicating directionality and likelihood (e.g., mutual information) indicating weight of association. An Interactome object can be given to pyviper.viper along with a gene expression signature to generate a protein activity matrix with the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm [1].
- Parameters:
name (str) – A filepath on disk to store the Interactome.
net_table (pd.DataFrame or str, default: None) –
Either:
1. A pd.DataFrame containing four columns in this order: “regulator”, “target”, “mor”, “likelihood”.
2. A filepath to this pd.DataFrame stored as .csv, .tsv, .parquet, .parquet.gzip, .pkl, or .loom.
3. A filepath to an Interactome object stored as a .pkl.
input_type (str, default: None) – Only relevant when net_table is a filepath. If None, the input_type will be inferred from net_table. Otherwise, specify “csv”, “tsv”, or “pkl”.
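As a rough illustration of the four-column layout expected for net_table, the sketch below writes a tiny table to CSV text with the standard library. The regulator and target names and all values are made up, and whether pyviper expects a header row exactly like this is an assumption; only the column order comes from the description above.

```python
import csv
import io

# Column order as documented for net_table.
columns = ["regulator", "target", "mor", "likelihood"]

# Hypothetical interactions; names and values are invented for illustration.
rows = [
    {"regulator": "TF1", "target": "GENE_A", "mor": 0.82, "likelihood": 0.95},
    {"regulator": "TF1", "target": "GENE_B", "mor": -0.40, "likelihood": 0.60},
    {"regulator": "TF2", "target": "GENE_A", "mor": 0.15, "likelihood": 0.30},
]

# Write the table as CSV text; a real workflow would write to a .csv file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Each row is one regulator-target interaction, so a regulon is simply the set of rows sharing the same regulator.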
References
[1] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics, 48(8), 838-847.
- copy()
Create a copy of this Interactome object.
- Return type:
An object of class Interactome.
- filter_regulators(regulators_keep=None, regulators_remove=None, verbose=True)
Filter regulators by choosing by name or by group which ones you intend to keep and which ones you intend to remove from this Interactome.
Note that the names of regulators that belong to the groups “tfs”, “cotfs”, “sig” and “surf” will be sourced via the paths specified in pyviper.config. To update these paths, use the pyviper.config.set_regulators_filepath function.
- Parameters:
regulators_keep (default: None) – This should be either: (1) An array or list containing the names of specific regulators you wish to keep in the network. When left as None, this parameter is not used to filter. (2) An array or list containing a group or groups of regulators that you wish to keep in the network. These groups should be one of the following: “tfs”, “cotfs”, “sig”, “surf”.
regulators_remove (default: None) – This should be either: (1) An array or list containing the names of specific regulators you wish to remove from the network. When left as None, this parameter is not used to filter. (2) An array or list containing a group or groups of regulators that you wish to remove from the network. These groups should be one of the following: “tfs”, “cotfs”, “sig”, “surf”.
verbose (default: True) – Report the number of regulators removed during filtering
- filter_targets(targets_keep=None, targets_remove=None, verbose=True)
Filter targets by specifying which ones to keep or remove from this Interactome.
When working with an anndata object or a gene expression array, it is highly recommended to filter the unpruned network before pruning. This ensures the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data. A regulator with more targets than others will have “boosted” NES scores that cannot be compared to regulators with fewer targets.
For example, with an anndata object named gex_data, it is suggested to do:
interactome.filter_targets(gex_data.var_names)
- Parameters:
targets_keep (array-like, default: None) – Names of targets to keep in the network. If None, no targets are kept explicitly.
targets_remove (array-like, default: None) – Names of targets to remove from the network. If None, no targets are removed explicitly.
verbose (bool, default: True) – Report the number of targets removed during filtering.
- get_reg(regName)
Get the rows of the net_table where the regulator is regName.
- Parameters:
regName – The name of a regulator in this Interactome.
- Return type:
A pd.DataFrame.
- get_reg_names()
Get an array of all unique regulators in this Interactome.
- Return type:
An array of strings (numpy.ndarray).
- get_regulon_from_loom(file_path)
- get_target_names()
Get an array of all unique targets in this Interactome.
- Return type:
A 1D NumPy array.
- ic_mat()
Get the DataFrame of all the likelihood values. Targets are in the rows, while Regulators are in the columns.
- Return type:
A pd.DataFrame.
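The targets-by-regulators layout returned by ic_mat can be pictured with a minimal sketch that pivots a long-format table into a nested dictionary. This is only an illustration of the shape, not pyviper's code: the interaction rows are invented, and filling absent regulator-target pairs with 0.0 is an assumption.

```python
# Hypothetical rows: (regulator, target, mor, likelihood).
net_table = [
    ("TF1", "GENE_A", 0.8, 0.9),
    ("TF1", "GENE_B", -0.5, 0.4),
    ("TF2", "GENE_A", 0.3, 0.7),
]

regulators = sorted({row[0] for row in net_table})
targets = sorted({row[1] for row in net_table})

# Rows are targets, columns are regulators.
# Pairs absent from the network are filled with 0.0 (an assumption).
ic = {target: {reg: 0.0 for reg in regulators} for target in targets}
for regulator, target, mor, likelihood in net_table:
    ic[target][regulator] = likelihood
```

The same pivot applied to the mor column instead of the likelihood column gives the layout described for mor_mat.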
- icp_vec()
Get the vector containing the proportion of the “Interaction Confidence” (IC) score for each interaction in a network, relative to the maximum IC score in the network. This vector is generated by taking each individual regulon in the network and calculating the likelihood index proportion to all interactions.
- Return type:
A numpy.ndarray.
- integrate(network_list, network_weights=None, normalize_likelihoods=False, verbose=False)
Integrate this Interactome object with one or more other Interactome objects to create a consensus network. This operation modifies the current Interactome in place; no new object is returned. In general, this should be done when interactome objects have the same epigenetics (e.g. due to being made from different datasets of same celltype). MetaVIPER should be used instead when you have multiple interactomes with different epigenetics (e.g. due to being made of data with different celltypes).
- Parameters:
network_list – A single object or a list of objects of class Interactome.
network_weights (default: None) – An array containing weights for each network being integrated. The first weight corresponds to this network, while the others correspond to those in the network list in order. If None, equal weights are used.
normalize_likelihoods (default: False) – An extra operation that can be performed after the integration operation where within each regulator, likelihood values are ranked and scaled from 0 to 1.
verbose (default: False) – If True, print progress messages.
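One plausible reading of the normalize_likelihoods option (ranking likelihoods within each regulator and scaling the ranks to the 0 to 1 range) is sketched below. The function name, the rank / n scaling, and the ordinal tie handling are all assumptions made for illustration; pyviper's actual scaling may differ.

```python
def normalize_likelihoods(network):
    """Replace each likelihood with rank / n within its regulator.

    `network` maps a regulator name to a list of (target, likelihood)
    tuples. The rank / n scaling is an assumed convention.
    """
    normalized = {}
    for regulator, edges in network.items():
        n = len(edges)
        # Ordinal ranks: 1 for the smallest likelihood, n for the largest.
        order = sorted(range(n), key=lambda i: edges[i][1])
        scaled = [None] * n
        for rank, idx in enumerate(order, start=1):
            scaled[idx] = (edges[idx][0], rank / n)
        normalized[regulator] = scaled
    return normalized


# Hypothetical regulon with four targets.
network = {"TF1": [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7)]}
result = normalize_likelihoods(network)
```

After this step the likelihoods of every regulator span the same range regardless of their original scale, which is the point of applying it after networks with different likelihood distributions have been combined.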
- mor_mat()
Get the DataFrame of all the correlation values. Targets are in the rows, while Regulators are in the columns.
- Return type:
A pd.DataFrame.
- prune(max_targets=50, min_targets=None, eliminate=True, verbose=True)
Prune the Interactome by eliminating extra targets from regulators and, with eliminate = True, removing regulators with too few targets from the network. Note that ensuring the pruned network contains the same number of targets for each regulator makes NES scores comparable. If one regulator has more targets than another, then its NES score will be “boosted” and the two cannot be compared against each other.
- Parameters:
max_targets (default: 50) – The maximum number of targets that each regulon is allowed.
min_targets (default: None) – The minimum number of targets that each regulon is required.
eliminate (default: True) – If eliminate = True, then any regulators with fewer targets than max_targets will be removed from the network. In other words, after pruning, all regulators will have exactly max_targets number of targets. This essentially sets min_targets equal to max_targets and ensures all NES scores are comparable with aREA.
verbose (default: True) – Report the number of targets and regulators removed during pruning
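The reason for filtering targets before pruning can be seen in a small sketch. The two helpers below are simplified stand-ins written for this example, not pyviper's methods: the network is a plain dictionary, and keeping the highest-likelihood targets during pruning is an assumption about the selection criterion.

```python
def filter_targets(network, genes):
    """Keep only interactions whose target is in `genes`."""
    genes = set(genes)
    return {
        reg: [(t, lik) for t, lik in edges if t in genes]
        for reg, edges in network.items()
    }


def prune(network, max_targets, eliminate=True):
    """Keep the max_targets highest-likelihood targets per regulator.

    With eliminate=True, regulators left with fewer than max_targets
    targets are dropped. The likelihood criterion is an assumption.
    """
    pruned = {}
    for reg, edges in network.items():
        top = sorted(edges, key=lambda e: e[1], reverse=True)[:max_targets]
        if eliminate and len(top) < max_targets:
            continue
        pruned[reg] = top
    return pruned


# Hypothetical network: regulator -> list of (target, likelihood).
network = {
    "TF1": [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6)],
    "TF2": [("A", 0.5), ("E", 0.9), ("F", 0.8), ("B", 0.4)],
    "TF3": [("E", 0.9), ("F", 0.7)],
}
measured_genes = ["A", "B", "C", "D"]  # genes present in the expression data

# Recommended order: filter, then prune.
filter_then_prune = prune(filter_targets(network, measured_genes), max_targets=2)
# Reverse order: prune, then filter.
prune_then_filter = filter_targets(prune(network, max_targets=2), measured_genes)

counts_good = {reg: len(edges) for reg, edges in filter_then_prune.items()}
counts_bad = {reg: len(edges) for reg, edges in prune_then_filter.items()}
```

Filtering first leaves every surviving regulator with exactly two measured targets, while pruning first leaves TF2 and TF3 with none, because their top-ranked targets are not in the expression data.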
- remove_duplicate_pairs(keep='likelihood')
If you have the same pairs of regulator and target across multiple rows, you can remove these duplicates.
- Parameters:
keep (default: 'likelihood') – Determines which duplicates (if any) to keep. ‘first’ : Drop duplicates except for the first occurrence. ‘last’ : Drop duplicates except for the last occurrence. ‘likelihood’: Drop duplicates except for the occurrence with the highest likelihood value. False : Drop all duplicates.
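The four keep options can be summarized with a plain-Python sketch operating on a list of row dictionaries. It mirrors the behavior described above but is not pyviper's implementation; the function body and the sample rows are written for this illustration only.

```python
def remove_duplicate_pairs(rows, keep="likelihood"):
    """Deduplicate (regulator, target) pairs in a list of row dicts.

    Output follows the order in which each pair first appears.
    """
    groups = {}
    for row in rows:
        groups.setdefault((row["regulator"], row["target"]), []).append(row)

    result = []
    for dupes in groups.values():
        if len(dupes) == 1:
            result.append(dupes[0])  # not duplicated: always kept
        elif keep == "first":
            result.append(dupes[0])
        elif keep == "last":
            result.append(dupes[-1])
        elif keep == "likelihood":
            result.append(max(dupes, key=lambda r: r["likelihood"]))
        elif keep is False:
            continue  # drop every occurrence of a duplicated pair
        else:
            raise ValueError(f"unknown keep option: {keep!r}")
    return result


# Hypothetical rows: the TF1/GENE_A pair appears twice.
rows = [
    {"regulator": "TF1", "target": "GENE_A", "likelihood": 0.2},
    {"regulator": "TF1", "target": "GENE_A", "likelihood": 0.9},
    {"regulator": "TF1", "target": "GENE_B", "likelihood": 0.5},
]

by_likelihood = remove_duplicate_pairs(rows, keep="likelihood")
by_first = remove_duplicate_pairs(rows, keep="first")
drop_all = remove_duplicate_pairs(rows, keep=False)
```

With keep='likelihood' the stronger of the two TF1/GENE_A rows survives, whereas keep=False removes the pair entirely and leaves only TF1/GENE_B.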
- save(file_path, output_type=None)
Save the Interactome object to one’s disk. If saved as “csv” or “tsv”, just the interactome.net_table will be saved. If saved as “pkl”, the whole interactome object will be saved.
- Parameters:
file_path – A filepath to one’s disk to store the Interactome.
output_type (default: None) – If None, the output_type will be inferred from the file_path. Otherwise, specify “csv”, “tsv”, “pkl”, “parquet”, or “parquet.gzip”.
- Return type:
None
- size()
Get the number of regulators in this Interactome.
- Return type:
An int
- targets_per_regulon()
- translate_regulators(desired_format, verbose=True)
Translate the regulators of the Interactome. The current name format of the regulators should be one of the following:
mouse_symbol
mouse_ensembl
mouse_entrez
human_symbol
human_ensembl
human_entrez
- Parameters:
desired_format (str) – Desired format can be one of six strings: “mouse_symbol”, “mouse_ensembl”, “mouse_entrez”, “human_symbol”, “human_ensembl” or “human_entrez”.
verbose (bool, default: True) – Report the number of regulators successfully and unsuccessfully translated.
- translate_targets(desired_format, verbose=True)
Translate the targets of the Interactome. The current name format of the targets should be one of the following:
mouse_symbol, mouse_ensembl, mouse_entrez, human_symbol, human_ensembl or human_entrez
It is recommended to do this before pruning. Targets without a translation are deleted, so translating an already pruned interactome can leave regulators with different numbers of targets even if the counts were consistent before.
- Parameters:
desired_format – Desired format can be one of six strings: “mouse_symbol”, “mouse_ensembl”, “mouse_entrez”, “human_symbol”, “human_ensembl” or “human_entrez”.
verbose (default: True) – Report the number of targets successfully and unsuccessfully translated.
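The effect of untranslatable targets on a pruned network can be shown with a toy example. The translation table, identifiers, and helper function below are all hypothetical; the sketch only demonstrates why translating after pruning can break a consistent target count.

```python
# Hypothetical translation table; all identifiers are made up.
translation = {"GENE_A": "Gene_a", "GENE_B": "Gene_b", "GENE_C": "Gene_c"}

# A pruned network with exactly two targets per regulator.
pruned = {
    "TF1": ["GENE_A", "GENE_B"],
    "TF2": ["GENE_C", "GENE_X"],  # GENE_X has no entry in the table
}


def translate_targets(network, table):
    """Translate target names, dropping targets without a translation."""
    return {
        reg: [table[t] for t in targets if t in table]
        for reg, targets in network.items()
    }


translated = translate_targets(pruned, translation)
counts_before = {reg: len(targets) for reg, targets in pruned.items()}
counts_after = {reg: len(targets) for reg, targets in translated.items()}
```

Here TF2 drops from two targets to one after translation, so its NES would no longer be directly comparable to that of TF1.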
pyviper.load
- pyviper.load.TFs(species=None, path_to_tfs=None)
Retrieves a list of transcription factors (TFs).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_tfs (default: None) – When left as None, the path to TFs setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing TFs, one on each line.
- Return type:
A list containing transcription factors.
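A custom regulator list is described above as a .txt file with one name per line. The sketch below writes and reads such a file using only the standard library. The file name, the regulator names, and the read_regulator_file helper are hypothetical, and skipping blank lines is an assumption about how such files are best handled.

```python
import os
import tempfile


def read_regulator_file(path):
    """Read one regulator name per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical custom list; names are placeholders.
    path = os.path.join(tmp_dir, "custom_tfs.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("TF1\nTF2\n\nTF3\n")
    custom_tfs = read_regulator_file(path)
```

A file in this format is what the path_to_tfs argument and pyviper.config.set_regulators_filepath are documented to accept.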
- pyviper.load.coTFs(species=None, path_to_cotfs=None)
Retrieves a list of co-transcription factors (coTFs).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_cotfs (default: None) – When left as None, the path to coTFs setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing coTFs, one on each line.
- Return type:
A list containing co-transcription factors.
- pyviper.load.human2mouse()
Retrieves the human to mouse translation pd.DataFrame from pyVIPER’s data folder. This dataframe contains six columns: human_symbol, mouse_symbol, human_ensembl, mouse_ensembl, human_entrez, mouse_entrez
- Return type:
A pd.DataFrame.
- pyviper.load.msigdb_regulon(collection)
Retrieves an object or a list of objects of class Interactome from pyviper’s data folder containing a set of pathways from the Molecular Signatures Database (MSigDB), downloaded from https://www.gsea-msigdb.org/gsea/msigdb.
Collections can be one of the following:
- ‘h’ for Hallmark gene sets. Coherently expressed signatures derived by
aggregating many MSigDB gene sets to represent well-defined biological states or processes.
- ‘c2’ for curated gene sets. From online pathway databases, publications
in PubMed, and knowledge of domain experts.
- ‘c5’ for ontology gene sets. Consists of genes annotated by the same
ontology term.
- ‘c6’ for oncogenic signature gene sets. Defined directly from microarray
gene expression data from cancer gene perturbations.
- ‘c7’ for immunologic signature gene sets. Represents cell states and
perturbations within the immune system.
- Parameters:
collection (str or list of str) – An individual string or a list of strings from the following: [“h”, “c2”, “c5”, “c6”, “c7”], corresponding to the collections above.
- Return type:
An individual object or list of objects of class pyviper.interactome.Interactome.
- pyviper.load.sig(species=None, path_to_sig=None)
Retrieves a list of signaling proteins (sig).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_sig (default: None) – When left as None, the path to sig setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing signaling proteins, one on each line.
- Return type:
A list containing signaling proteins.
- pyviper.load.surf(species=None, path_to_surf=None)
Retrieves a list of surface proteins (surf).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_surf (default: None) – When left as None, the path to surf setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing surface proteins, one on each line.
- Return type:
A list containing surface proteins.
- pyviper.load.pisces_network(tissue, celltype, desired_format='human_ensembl')
The pipeline for Protein Activity Inference in Single Cells (PISCES) is a regulatory-network-based methodology for the analysis of single cell gene expression profiles. PISCES leverages the assembly of lineage-specific gene regulatory networks to accurately measure the activity of each protein based on the expression of its transcriptional targets (regulon), using the ARACNe and metaVIPER algorithms, respectively. These networks can be loaded from the GitHub repository with this function. When using them, cite the reference below.
- Parameters:
tissue – The tissue type of the network. See options below.
celltype – The celltype of the network. See options below.
desired_format (default: human_ensembl) – The gene name format of the regulators and targets in the network. Choose from: human_ensembl (default), human_symbol, human_entrez, mouse_ensembl, mouse_symbol, or mouse_entrez.
- Returns:
An individual object of class pyviper.interactome.Interactome.
Overview of celltypes
Choose one of the following networks, listed by tissue with its available celltypes:
- adipose_tissue
adipocytes dendritic fibroblasts lymphoid myeloid smc
- bone_marrow
lymphoid
- brain
neuron
- breast
adipocytes dendritic endothelial epithelial fibroblasts lymphoid myeloid smc
- bronchus
epithelial keratinocytes
- colon
enterocyte goblet lymphoid
- endometrium
adipocytes endothelial epithelial fibroblasts lymphoid
- esophagus
epithelial fibroblasts
- eye
bipolar muller-glia photoreceptor
- heart_muscle
cardiomyocyte endothelial fibroblasts smc
- kidney
lymphoid myeloid tubular
- liver
hepatocytes kupffer lymphoid
- lung
alveolar macrophages
- lymph_node
lymphoid
- ovary
endothelial fibroblasts granulosa macrophages smc theca
- pancreas
ductal endocrine exocrine-glandular
- pbmc
lymphoid myeloid
- placenta
fibroblasts hofbauer-cells trophoblasts
- prostate
basal-prostatic endothelial fibroblasts prostatic-glandular smc urothelial
- rectum
enterocytes goblet-cells paneth
- skeletal_muscle
endothelial fibroblasts lymphoid myeloid myocytes smc
- skin
endothelial fibroblasts keratinocytes langerhans-cells lymphoid smc
- small_intestine
enterocytes
- spleen
lymphoid
- stomach
lymphoid
- testis
endothelial leydig-cells monocyte peritubular-cells spermatids spermatocytes spermatogonia
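Because each tissue only offers certain celltypes, it can help to check a combination before requesting a network. The sketch below copies a small subset of the list above into a dictionary and validates a (tissue, celltype) pair. The PISCES_OPTIONS name and the is_listed helper are hypothetical and not part of pyviper.

```python
# Subset of the tissue/celltype options listed above.
PISCES_OPTIONS = {
    "brain": {"neuron"},
    "kidney": {"lymphoid", "myeloid", "tubular"},
    "liver": {"hepatocytes", "kupffer", "lymphoid"},
    "pbmc": {"lymphoid", "myeloid"},
}


def is_listed(tissue, celltype):
    """Return True if the tissue/celltype pair appears in the subset."""
    return celltype in PISCES_OPTIONS.get(tissue, set())


valid = is_listed("pbmc", "myeloid")
invalid_celltype = is_listed("brain", "myeloid")
invalid_tissue = is_listed("tongue", "lymphoid")
```

A pair that passes this check could then be supplied as the tissue and celltype arguments of pyviper.load.pisces_network.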
References
[1] Obradovic, A., Vlahos, L., Laise, P., Worley, J., Tan, X., Wang, A., & Califano, A. (2021). PISCES: A pipeline for the systematic, protein activity-based analysis of single cell RNA sequencing data. Biorxiv, 6, 22.
pyviper.pl
- pyviper.pl.__get_stored_uns_data_and_prep_to_plot(adata, uns_data_slot, obsm_slot=None, uns_slot=None)
- pyviper.pl.pca(adata, *, plot_pax=True, plot_gex=False, cmap_pax='RdBu_r', cmap_gex='viridis', cmap_obs='inferno', **kwargs)
A wrapper for the scanpy function sc.pl.pca.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata on adata.obsm[‘X_pca’].
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’] on adata.obsm[‘X_pca’].
cmap_pax (default: "RdBu_r") – cmap to use for visualizing VIPER proteins.
cmap_gex (default: "viridis") – cmap to use for visualizing stored gExpr.
cmap_obs (default: "inferno") – cmap to use for visualizing stored numeric obs.
**kwargs – Arguments to provide to the sc.pl.pca function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.umap(adata, *, plot_pax=True, plot_gex=False, cmap_pax='RdBu_r', cmap_gex='viridis', cmap_obs='inferno', **kwargs)
A wrapper for the scanpy function sc.pl.umap.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata on adata.obsm[‘X_umap’].
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’] on adata.obsm[‘X_umap’].
cmap_pax (default: "RdBu_r") – cmap to use for visualizing VIPER proteins.
cmap_gex (default: "viridis") – cmap to use for visualizing stored gExpr.
cmap_obs (default: "inferno") – cmap to use for visualizing stored numeric obs.
**kwargs – Arguments to provide to the sc.pl.umap function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.tsne(adata, *, plot_pax=True, plot_gex=False, cmap_pax='RdBu_r', cmap_gex='viridis', cmap_obs='inferno', **kwargs)
A wrapper for the scanpy function sc.pl.tsne.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata on adata.obsm[‘X_tsne’].
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’] on adata.obsm[‘X_tsne’].
cmap_pax (default: "RdBu_r") – cmap to use for visualizing VIPER proteins.
cmap_gex (default: "viridis") – cmap to use for visualizing stored gExpr.
cmap_obs (default: "inferno") – cmap to use for visualizing stored numeric obs.
**kwargs – Arguments to provide to the sc.pl.tsne function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.diffmap(adata, *, plot_pax=True, plot_gex=False, cmap_pax='RdBu_r', cmap_gex='viridis', cmap_obs='inferno', **kwargs)
A wrapper for the scanpy function sc.pl.diffmap.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata on adata.obsm[‘X_diffmap’].
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’] on adata.obsm[‘X_diffmap’].
cmap_pax (default: "RdBu_r") – cmap to use for visualizing VIPER proteins.
cmap_gex (default: "viridis") – cmap to use for visualizing stored gExpr.
cmap_obs (default: "inferno") – cmap to use for visualizing stored numeric obs.
**kwargs – Arguments to provide to the sc.pl.diffmap function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.draw_graph(adata, *, plot_pax=True, plot_gex=False, cmap_pax='RdBu_r', **kwargs)
A wrapper for the scanpy function sc.pl.draw_graph.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata on adata.obsm[‘X_draw_graph_fa’] or adata.obsm[‘X_draw_graph_fr’].
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’] on adata.obsm[‘X_draw_graph_fa’] or adata.obsm[‘X_draw_graph_fr’].
**kwargs – Arguments to provide to the sc.pl.draw_graph function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.spatial(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.spatial.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata on adata.uns[‘spatial’].
plot_gex (default: False) – Plot adata.uns[‘gex_data’] on adata.uns[‘spatial’].
**kwargs – Arguments to provide to the sc.pl.spatial function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.embedding(adata, *, basis, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.embedding.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
basis – The name of the representation in adata.obsm that should be used for plotting.
plot_pax (default: True) – Plot adata on adata.obsm[basis].
plot_gex (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[basis].
**kwargs – Arguments to provide to the sc.pl.embedding function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.embedding_density(adata, *, basis='umap', plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.embedding_density.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
basis (default: 'umap') – The name of the representation in adata.obsm that should be used for plotting.
plot_pax (default: True) – Plot adata on adata.obsm[basis].
plot_gex (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[basis].
**kwargs – Arguments to provide to the sc.pl.embedding_density function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.scatter(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.scatter.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.scatter function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.heatmap(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.heatmap.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.heatmap function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.dotplot(adata, *, plot_pax=True, plot_gex=False, cmap_pax='Reds', cmap_gex='Greens', spacing_factor=1, **kwargs)
A wrapper for the scanpy function sc.pl.dotplot.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot VIPER stored in adata.
plot_gex (default: False) – Plot gExpr stored in adata.uns[‘gex_data’].
cmap_pax (default: "Reds") – cmap to use for visualizing VIPER proteins.
cmap_gex (default: "Greens") – cmap to use for visualizing stored gExpr.
spacing_factor (default: 1) – When plotting both pax and gex, adjust the size of the plot.
**kwargs – Arguments to provide to the sc.pl.dotplot function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.tracksplot(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.tracksplot.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.tracksplot function.
- Return type:
A plot (matplotlib Axes).
- pyviper.pl.violin(adata, *, plot_pax=True, plot_gex=False, n_cols=4, w_spacing_factor=1, h_spacing_factor=1, **kwargs)
A wrapper for the scanpy function sc.pl.violin.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
n_cols (default: 4) – Number of columns in the violin plot.
w_spacing_factor (default: 1) – Multiply the current width by this factor to stretch the plot.
h_spacing_factor (default: 1) – Multiply the current height by this factor to stretch the plot.
**kwargs – Arguments to provide to the sc.pl.violin function.
- Return type:
A plot of
Axes.
- pyviper.pl.stacked_violin(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.stacked_violin.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.stacked_violin function.
- Return type:
A plot of
Axes.
- pyviper.pl.matrixplot(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.matrixplot.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.matrixplot function.
- Return type:
A plot of
Axes.
- pyviper.pl.clustermap(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.clustermap.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.clustermap function.
- Return type:
A plot of
Axes.
- pyviper.pl.ranking(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.ranking.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.ranking function.
- Return type:
A plot of
Axes.
- pyviper.pl.dendrogram(adata, *, plot_pax=True, plot_gex=False, **kwargs)
A wrapper for the scanpy function sc.pl.dendrogram.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
plot_pax (default: True) – Plot adata.
plot_gex (default: False) – Plot adata.uns[‘gex_data’].
**kwargs – Arguments to provide to the sc.pl.dendrogram function.
- Return type:
A plot of
Axes.
- pyviper.pl.ss_heatmap(adata, var_names, cluster_column=None, show_gex_heatmap=False, plot_gex_norm=False, obs_metadata=None, gex_metadata=None, pax_metadata=None, h_clust_rows=True, h_clust_cols=False)
A function to display VIPER and gExpr along with multiple rows of metadata with samples organized by silhouette score within each cluster.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
var_names – A list of genes/proteins to be visualized in the heatmap.
cluster_column (default: None) – A column in adata.obs to sort the samples based on. Samples will be arranged from highest silhouette score to lowest silhouette score within each cluster.
show_gex_heatmap (default: False) – Whether to plot var_names on an additional gExpr heatmap.
plot_gex_norm (default: False) – Whether gExpr in the gex_heatmap and gex_metadata is normalized gExpr instead of scaled. Requires adata.uns[‘gex_data’].raw to contain unnormalized counts.
obs_metadata (default: None) – A str or list of columns in adata.obs to visualize as an annotation above the heatmap.
gex_metadata (default: None) – A str or list of vars in adata.uns[‘gex_data’] to visualize as an annotation above the heatmap.
pax_metadata (default: None) – A str or list of vars in adata to visualize as an annotation above the heatmap.
h_clust_rows (default: True) – Whether to hierarchically cluster the rows.
h_clust_cols (default: False) – Whether to hierarchically cluster the columns. With False and cluster_column supplied, samples will be ordered by silhouette score.
- Return type:
A plot of
ClusterGrid.
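The sample ordering described for cluster_column can be sketched as follows. This is a minimal illustration in plain Python, assuming the silhouette scores have already been computed; the helper name order_by_silhouette is hypothetical and not part of pyviper.

```python
from collections import defaultdict


def order_by_silhouette(clusters, silhouette):
    """Return sample indices grouped by cluster, highest silhouette first.

    Hypothetical helper: `clusters` holds one label per sample and
    `silhouette` holds one precomputed score per sample.
    """
    grouped = defaultdict(list)
    for idx, label in enumerate(clusters):
        grouped[label].append(idx)
    order = []
    for label in sorted(grouped):
        # Within a cluster, the best-fitting samples (highest score) come first.
        order.extend(sorted(grouped[label], key=lambda i: -silhouette[i]))
    return order


clusters = ["a", "b", "a", "b", "a"]
silhouette = [0.2, 0.9, 0.7, 0.1, 0.5]
print(order_by_silhouette(clusters, silhouette))  # [2, 4, 0, 1, 3]
```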
- pyviper.pl.mrs(adata, cluster_column, show_gex_heatmap=False, plot_gex_norm=False, obs_metadata=None, gex_metadata=None, pax_metadata=None, h_clust_rows=True, h_clust_cols=False, n_top_mrs=10, method='stouffer', top_mrs_list=None, mr_col=None)
A wrapper around pyviper.pl.ss_heatmap to visualize the top master regulators (MRs) from each cluster using the VIPER anndata object.
- Parameters:
adata – Protein activity stored in an anndata object. Gene expression stored in adata.uns[‘gex_data’].
var_names – A list of genes/proteins to be visualized in the heatmap.
cluster_column (default: None) – A column in adata.obs to sort the samples based on. Samples will be arranged from highest silhouette score to lowest silhouette score within each cluster.
show_gex_heatmap (default: False) – Whether to plot var_names on an additional gExpr heatmap.
plot_gex_norm (default: False) – Whether gExpr in the gex_heatmap and gex_metadata is normalized gExpr instead of scaled. Requires adata.uns[‘gex_data’].raw to contain unnormalized counts.
obs_metadata (default: None) – A str or list of columns in adata.obs to visualize as an annotation above the heatmap.
gex_metadata (default: None) – A str or list of vars in adata.uns[‘gex_data’] to visualize as an annotation above the heatmap.
pax_metadata (default: None) – A str or list of vars in adata to visualize as an annotation above the heatmap.
h_clust_rows (default: True) – Whether to hierarchically cluster the rows.
h_clust_cols (default: False) – Whether to hierarchically cluster the columns. With False and cluster_column supplied, samples will be ordered by silhouette score.
n_top_mrs (default: 10) – If top_mrs_list is None and mr_col is None, how many MRs per cluster to identify.
method (default: 'stouffer') – If top_mrs_list is None and mr_col is None, what method should be used to calculate the top MRs. Options include ‘stouffer’ (Stouffer signature), “mwu” (Mann-Whitney U-Test), ‘spearman’ (correlation of proteins with proximity to each cluster’s center).
top_mrs_list (default: None) – Whether to manually supply a list of MRs to be visualized.
mr_col (default: None) – Whether to select a column from adata.var where MRs are marked, e.g. by running pyviper.tl.find_top_mrs.
- pyviper.pl.vis_net(net_Pruned, pax_data, gex_data, mr_list, cluster_labels=None, layout_alg='davidson_harel', seed=0, figsize=(15, 15), size_target=2)
Creates an igraph to visualize the relationship between regulators and targets of an Interactome object.
- Parameters:
net_Pruned – An object of class Interactome or a list of Interactome objects.
pax_data – Protein activity stored in an anndata object. Used to color the regulators.
gex_data – Gene expression stored in an anndata object. Used to color the targets.
mr_list – A list of proteins in net_Pruned to be visualized in the plot.
cluster_labels (default: None) – Provide if you would like to create cluster-specific graphs.
layout_alg (default: 'davidson_harel') – Algorithm for the layout parameter from igraph.
seed (default: 0) – Random seed for graph construction.
figsize (default: (15, 15)) – figure dimensions for Matplotlib.
size_target (default: 2) – Scaling factor for target node sizes.
pyviper.pp
- pyviper.pp.rank_norm(adata, NUM_FUN=<function _median>, DEM_FUN=<function _mad_from_R>, layer=None, key_added=None, copy=False)
Compute a double rank normalization on an anndata, np.array, or pd.DataFrame.
- Parameters:
adata – Data stored in an anndata object, np.array or pd.DataFrame.
NUM_FUN (default: np.median) – The first function to be applied across each column.
DEM_FUN (default: _mad_from_R) – The second function to be applied across each column.
layer (default: None) – For an anndata input, the layer to use. When None, the input layer is anndata.X.
key_added (default: None) – For an anndata input, the name of the layer in which to store the result. When None, the result is stored in anndata.X.
copy (default: False) – Whether to return a rank-transformed copy (True) or to instead transform the original input (False).
- Returns:
When copy = False, replaces the input data in place with its double rank transformed version.
When copy = True, returns a double rank transformed copy of the input data.
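The roles of NUM_FUN and DEM_FUN can be illustrated with a small sketch. It assumes (this is not stated above) that NUM_FUN supplies a per-column center and DEM_FUN a per-column scale, and that _mad_from_R follows the convention of R's mad function, which multiplies the median absolute deviation by 1.4826. The ranking steps themselves are omitted, and all helper names are hypothetical.

```python
import statistics


def mad_like_r(values, constant=1.4826):
    """Median absolute deviation scaled as in R's mad() (assumed convention)."""
    med = statistics.median(values)
    return constant * statistics.median(abs(v - med) for v in values)


def center_and_scale(column):
    """Subtract the median (assumed NUM_FUN) and divide by the MAD (assumed DEM_FUN)."""
    med = statistics.median(column)
    mad = mad_like_r(column)
    return [(v - med) / mad for v in column]


column = [1.0, 2.0, 3.0, 4.0, 5.0]
print(center_and_scale(column))
```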
- pyviper.pp.stouffer(adata, groupby=None, layer=None, filter_by_feature_groups=None, key_added='stouffer', compute_pvals=True, null_iters=1000, verbose=True, return_as_df=False, copy=False)
Compute a Stouffer signature for each cluster in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
groupby – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the signatures.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'stouffer') – The slot in adata.var to store the stouffer signatures.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
null_iters (default: 1000) – The number of iterations to use to compute a null model to assess the p-values of each of the stouffer scores.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the stouffer signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster stouffer signatures to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
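For orientation, Stouffer's method combines n z-scores as their sum divided by the square root of n. The sketch below applies the unweighted form per cluster and per feature, assuming the input values behave like z-scores (for example NES). The function name and the list-of-lists layout are illustrative only; pyviper's implementation may differ, for instance in how the null model and p-values are computed.

```python
import math


def stouffer_signature(scores, labels):
    """Unweighted Stouffer integration per cluster.

    scores: samples x features (list of lists of z-like scores).
    labels: one cluster label per sample.
    Returns {cluster: [integrated score per feature]}.
    """
    n_features = len(scores[0])
    sums, counts = {}, {}
    for row, label in zip(scores, labels):
        acc = sums.setdefault(label, [0.0] * n_features)
        for j, z in enumerate(row):
            acc[j] += z
        counts[label] = counts.get(label, 0) + 1
    # Divide each per-cluster sum by sqrt(number of samples in the cluster).
    return {c: [s / math.sqrt(counts[c]) for s in sums[c]] for c in sums}


nes = [[1.0, -2.0], [3.0, 0.0], [-1.0, 4.0], [2.0, 2.0]]
labels = ["c1", "c1", "c2", "c2"]
signature = stouffer_signature(nes, labels)
print(signature)
```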
- pyviper.pp.mwu(adata, groupby=None, layer=None, filter_by_feature_groups=None, key_added='mwu', compute_pvals=True, verbose=True, return_as_df=False, copy=False)
Compute a Mann-Whitney U-Test signature on each of your clusters in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
groupby – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the signatures.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'mwu') – The slot in adata.var to store the MWU signatures.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the MWU signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster MWU signatures to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
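The statistic underlying this signature can be sketched for a single feature by comparing the samples in one cluster against all remaining samples. This is a plain-Python illustration of the Mann-Whitney U statistic with average ranks for ties; the helper names are hypothetical, and how pyviper turns the statistic into a signature or p-value is not shown.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mwu_statistic(in_cluster, rest):
    """U statistic for `in_cluster` versus `rest` (one feature)."""
    ranks = average_ranks(list(in_cluster) + list(rest))
    n1 = len(in_cluster)
    rank_sum = sum(ranks[:n1])
    return rank_sum - n1 * (n1 + 1) / 2


print(mwu_statistic([3.0, 4.0, 5.0], [1.0, 2.0, 3.5]))  # 8.0
```

A U value close to len(in_cluster) * len(rest) indicates that the feature is consistently higher inside the cluster than outside it.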
- pyviper.pp.spearman(adata, pca_slot='X_pca', groupby=None, layer=None, filter_by_feature_groups=None, key_added='spearman', compute_pvals=True, null_iters=1000, verbose=True, return_as_df=False, copy=False)
Compute the Spearman correlation between each gene product and the cluster centroids, along with its statistical significance, for each cluster in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
pca_slot – The slot in adata.obsm where a PCA is stored.
groupby – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the correlation.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'spearman') – The slot in adata.var to store the spearman correlation.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
null_iters (default: 1000) – The number of iterations to use to compute a null model to assess the p-values of each of the spearman scores.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the spearman signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster spearman correlation to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
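One way to read this computation is sketched below, under stated assumptions: proximity to a cluster is taken as the negative Euclidean distance to the cluster centroid in PCA space, and a feature is scored by the Spearman correlation between its values and that proximity. The helper names are hypothetical and the exact definition used by pyviper may differ.

```python
import math


def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def proximity_to_centroid(pca_coords, members):
    """Negative distance of every sample to the centroid of `members` (assumed definition)."""
    dims = len(pca_coords[0])
    centroid = [sum(pca_coords[i][d] for i in members) / len(members) for d in range(dims)]
    return [-math.dist(point, centroid) for point in pca_coords]


pca = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
activity = [3.0, 2.5, 1.0, 0.5]  # one protein across the four samples
rho = spearman(activity, proximity_to_centroid(pca, members=[0, 1]))
print(rho)
```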
- pyviper.pp.viper_similarity(nes: DataFrame | AnnData, nn: int | None = None, ws: Tuple[float, ...] = (4.0, 2.0), method: Literal['two-sided', 'greater', 'less'] = 'two.sided', random_state: int = 0, store_in_adata: bool = False, key_added: str = 'viper_similarity')
Compute the similarity between the columns of a VIPER-predicted activity or gene expression matrix. While following the same concept as the two-tail Gene Set Enrichment Analysis (GSEA)[1], it is based on the aREA algorithm[2].
If ws is a single number, weighting is performed using an exponential function. If ws is a vector of two numbers, weighting is performed with a symmetric sigmoid function using the first element as the inflection point and the second as the trend.
- Parameters:
nes (pandas.DataFrame or anndata.AnnData) – Matrix of normalized enrichment scores (NES) from VIPER, where rows are regulators (e.g., transcription factors) and columns are samples. If an AnnData is provided, the .to_df() representation is used.
nn (int, optional) – Number of top regulators to consider per sample. If provided, only the nn most extreme regulators (according to method) are retained; all others are set to zero. If None (default), continuous weighting is applied instead.
ws (tuple of float, default (4.0, 2.0)) – Weighting parameters. If a single value is given, the exponent applied to regulator weights. If two values are given, they define a symmetric sigmoid weighting function where the first is the inflection point and the second corresponds to the input value giving a weight ≈ 0.1.
method ({'two.sided', 'greater', 'less'}, default 'two.sided') – Specifies which tail(s) of the signature to use when computing similarity: - ‘greater’: only positive regulators are used, - ‘less’: only negative regulators are used, - ‘two.sided’: both tails are used. The alias ‘two-sided’ is also accepted.
random_state (int, default 0) – Random seed used for breaking ties in rank-based operations.
store_in_adata (bool, default False) – If True and nes is an AnnData object, store the resulting similarity matrix in nes.obsp[key_added].
key_added (str, default "viper_similarity") – Key name under which to store the result in AnnData.obsp if store_in_adata is True.
- Returns:
A symmetric sample-by-sample similarity matrix where each element (i,j) reflects the weighted concordance of regulator activity between samples i and j.
- Return type:
pandas.DataFrame
Notes
This implementation mirrors the R viperSimilarity function and follows the principles of the aREA (analytic rank-based enrichment analysis) algorithm used in VIPER. It supports both continuous and discrete (top-N) similarity computation.
References
- [1] Julio M. K.-d. et al. Regulation of extra-embryonic endoderm stem cell differentiation by Nodal and Cripto signaling. Development, 138, 3885–3895 (2011).
- [2] Alvarez M. J. et al. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics, 48(8), 838–847 (2016).
- pyviper.pp.aracne3_to_regulon(net_file, net_df=None, anno=None, MI_thres=0, regul_size=50, normalize_MI_per_regulon=True)
Process an output from ARACNe3 to return a pd.DataFrame describing a gene regulatory network with suitable columns for conversion to an object of the Interactome class.
- Parameters:
net_file – A string containing the path to the ARACNe3 output
net_df (default: None) – A pd.DataFrame to pass instead of the path.
anno (default: None) – Gene ID annotation
MI_thres (default: 0) – Threshold on Mutual Information (MI) to select the regulators and target pairs
regul_size (default: 50) – Number of (top) targets to include in each regulon
normalize_MI_per_regulon (default: True) – Whether to normalize the MI values within each regulon by the regulon's maximum value
- Returns:
A pd.DataFrame containing an ARACNe3-inferred gene regulatory network with the following 4 columns: “regulator”, “target”, “mor” (mode of regulation) and “likelihood”.
- Return type:
pd.DataFrame
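The roles of MI_thres, regul_size and normalize_MI_per_regulon can be sketched on a toy edge list. This is only an illustration: the tuple layout, the strict "greater than" threshold, and the function name are assumptions, and the sketch does not show how the resulting weights map onto the “mor” and “likelihood” columns.

```python
def build_regulons(edges, mi_thres=0.0, regul_size=50, normalize=True):
    """Keep the top targets per regulator by mutual information (MI).

    edges: iterable of (regulator, target, mi) tuples (assumed layout).
    Returns {regulator: [(target, weight), ...]} sorted by decreasing MI.
    """
    by_regulator = {}
    for regulator, target, mi in edges:
        if mi > mi_thres:  # assumption: strict threshold
            by_regulator.setdefault(regulator, []).append((target, mi))
    regulons = {}
    for regulator, targets in by_regulator.items():
        top = sorted(targets, key=lambda t: t[1], reverse=True)[:regul_size]
        max_mi = top[0][1]
        # Optionally rescale so the strongest target of each regulon has weight 1.
        scale = max_mi if normalize and max_mi > 0 else 1.0
        regulons[regulator] = [(target, mi / scale) for target, mi in top]
    return regulons


edges = [
    ("TF1", "g1", 0.8), ("TF1", "g2", 0.4), ("TF1", "g3", 0.2),
    ("TF2", "g1", 0.5), ("TF2", "g4", 0.0),
]
print(build_regulons(edges, regul_size=2))
```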
- pyviper.pp.nes_to_pval(adata, layer=None, key_added=None, lower_tail=True, adjust=True, axs=1, neg_log=False, pseudocount=1e-300, copy=False)
Transform VIPER-computed NES into p-values.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data, where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
layer (default: None) – Entry of layers to transform.
key_added (default: None) – Name of layer to save result in a new layer instead of adata.X.
lower_tail (default: True) – If True (default), returns two-tailed p-values P(|X| > |x|). If False, returns upper-tail probabilities P(X > x). Note: unlike R’s pnorm, here lower_tail=True computes two-tailed probabilities.
adjust (default: True) – If True, returns adjusted p values using FDR Benjamini-Hochberg procedure. If False, does not adjust p values
axs (default: 1) – axis along which to perform the p-value correction (Used only if the input is a pd.DataFrame). Possible values are 0 or 1.
neg_log (default: False) – Whether to transform VIPER-computed NES into -log10(p-value).
pseudocount (default: 1e-300) – When neg_log is True, add a small value to pvals to avoid log(0).
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Returns:
Saves the input data as a transformed version. If key_added is specified, saves the results in adata.layers[key_added].
References
Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300. http://www.jstor.org/stable/2346101
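The transformation can be sketched with the standard library, assuming NES follow a standard normal distribution under the null. The function names are illustrative, not pyviper's API; the sketch covers the two-tailed and upper-tail p-values, the Benjamini-Hochberg adjustment, and the -log10 transform with a pseudocount.

```python
import math


def two_tailed_pvalue(nes):
    """P(|X| > |x|) for a standard normal X."""
    return math.erfc(abs(nes) / math.sqrt(2))


def upper_tail_pvalue(nes):
    """P(X > x) for a standard normal X."""
    return 0.5 * math.erfc(nes / math.sqrt(2))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(pval, pseudocount=1e-300):
    """-log10(p), with a small pseudocount to avoid log(0)."""
    return -math.log10(pval + pseudocount)


pvals = [two_tailed_pvalue(x) for x in (0.5, -2.0, 3.1)]
print(benjamini_hochberg(pvals))
```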
- pyviper.pp.repr_subsample(adata, pca_slot='X_pca', size=1000, seed=0, key_added='repr_subsample', eliminate=False, verbose=True, njobs=1, copy=False)
A tool for creating a subsample of the input data that is representative of all the populations within the input data, rather than a random sample. This is accomplished by pairing samples together in an iterative fashion until the desired sample size is reached.
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
pca_slot (default: "X_pca") – The slot in adata.obsm where the PCA object is stored. One way of generating this object is with sc.pp.pca.
size (default: 1000) – The size of the representative subsample
eliminate (default: False) – Whether to trim down adata to the subsample (True) or leave the subsample as an annotation in adata.obs[key_added].
seed (default: 0) – The random seed used when taking samples of the data.
verbose (default: True) – Whether to provide runtime information.
njobs (default: 1) – The number of cores to use for the analysis. Using more than 1 core (multicore) speeds up the analysis.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Returns:
When copy is False, saves the subsample annotation in adata.obs[key_added].
When copy is True, returns an anndata with this annotation.
When eliminate is True, modifies the adata by subsetting it down to the subsample.
- pyviper.pp.repr_metacells(adata, counts=None, pca_slot='X_pca', dist_slot='corr_dist', clusters_slot=None, score_slot=None, score_min_thresh=None, minimize_replacement=True, size=500, n_cells_per_metacell=None, min_median_depth=10000, perc_data_to_use=None, perc_incl_data_reused=None, seed=0, key_added='metacells', verbose=True, njobs=1, copy=False)
A tool to create a representative selection of metacells from the data that aims to maximize reusing samples from the data, while simultaneously ensuring that all neighbors are close to the metacell they construct. When using this function, exactly two of the following parameters must be set: size, min_median_depth or n_cells_per_metacell, perc_data_to_use or perc_incl_data_reused. Note that min_median_depth and n_cells_per_metacell cannot both be set at the same time, since they directly relate (e.g. higher n_cells_per_metacell means more neighbors are used to construct a single metacell, meaning each metacell will have more counts, resulting in a higher median depth). Note that perc_data_to_use and perc_incl_data_reused cannot both be set at the same time, since they directly relate (e.g. higher perc_data_to_use means you include more data, which means it’s more likely to reuse more data, resulting in a higher perc_incl_data_reused).
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
counts (default: None) – A pandas DataFrame or AnnData object of unnormalized gene expression counts that has the same samples in the same order as that of adata. If counts are left as None, adata must have counts stored in adata.raw.
pca_slot (default: "X_pca") – The slot in adata.obsm where the PCA object is stored. One way of generating this object is with sc.pp.pca.
dist_slot (default: "corr_dist") – The slot in adata.obsp where the distance object is stored. One way of generating this object is with pyviper.pp.corr_distance.
clusters_slot (default: None) – The slot in adata.obs where cluster labels are stored. Cluster-specific metacells will be generated using the same parameters with the results for each cluster being stored separately in adata.uns.
score_slot (default: None) – The slot in adata.obs where a score used to determine and filter cell quality are stored (e.g. silhouette score).
score_min_thresh (default: None) – The score from adata.obs[score_slot] that a cell must have at minimum to be used for metacell construction (e.g. 0.25 is the rule of thumb for silhouette score).
minimize_replacement (default: True) – If True, then identify a subsample that minimizes overlap between the KNN of the metacells. If False, use an entirely random subsample, but have faster runtime.
size (default: 500) – A specific number of metacells to generate. If set to None, perc_data_to_use or perc_incl_data_reused can be used to specify the size when n_cells_per_metacell or min_median_depth is given.
n_cells_per_metacell (default: None) – The number of cells that should be used to generate a single metacell. Note that this parameter and min_median_depth cannot both be set as they directly relate: e.g. higher n_cells_per_metacell leads to higher min_median_depth. If left as None, perc_data_to_use or perc_incl_data_reused can be used to specify n_cells_per_metacell when size is given.
min_median_depth (default: 10000) – The desired minimum median depth for the metacells (indirectly specifies n_cells_per_metacell). The default is set to 10000 as this is recommended by PISCES[1]. Note that this parameter and n_cells_per_metacell cannot both be set as they directly relate: e.g. higher min_median_depth leads to higher n_cells_per_metacell.
perc_data_to_use (default: None) – The percent of the total amount of provided samples that will be used in the creation of metacells. Note that this parameter and perc_incl_data_reused cannot both be set as they directly relate: e.g. higher perc_data_to_use leads to higher perc_incl_data_reused.
perc_incl_data_reused (default: None) – The percent of samples that are included in the creation of metacells that will be reused (i.e. used in more than one metacell). Note that this parameter and perc_data_to_use cannot both be set as they directly relate: e.g. higher perc_incl_data_reused leads to higher perc_data_to_use.
seed (default: 0) – The random seed used when taking samples of the data.
key_added (default: "metacells") – The name of the slot in the adata.uns to store the output.
verbose (default: True) – Whether to provide runtime information and quality statistics.
njobs (default: 1) – The number of cores to use for the analysis. Using more than 1 core (multicore) speeds up the analysis.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
Saves the metacells as a pandas dataframe in adata.uns[key_added]. Attributes that contain parameters for and statistics about the construction of the metacells are stored in adata.uns[key_added].attrs. Set copy = True to return a new AnnData object.
References
Obradovic, A., Vlahos, L., Laise, P., Worley, J., Tan, X., Wang, A., & Califano, A. (2021). PISCES: A pipeline for the systematic, protein activity -based analysis of single cell RNA sequencing data. bioRxiv, 6, 22.
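The parameter rule stated above (exactly two of the three groups set, and never both members of a group) can be expressed as a small check. The helper below is hypothetical and only mirrors the rule as documented; it is not how pyviper validates its arguments.

```python
def check_metacell_args(size=None, n_cells_per_metacell=None, min_median_depth=None,
                        perc_data_to_use=None, perc_incl_data_reused=None):
    """Raise ValueError unless exactly two of the three parameter groups are set."""
    if n_cells_per_metacell is not None and min_median_depth is not None:
        raise ValueError("Set only one of n_cells_per_metacell and min_median_depth.")
    if perc_data_to_use is not None and perc_incl_data_reused is not None:
        raise ValueError("Set only one of perc_data_to_use and perc_incl_data_reused.")
    groups_set = sum([
        size is not None,
        n_cells_per_metacell is not None or min_median_depth is not None,
        perc_data_to_use is not None or perc_incl_data_reused is not None,
    ])
    if groups_set != 2:
        raise ValueError("Exactly two parameter groups must be set, got %d." % groups_set)
    return True


# The documented defaults (size=500, min_median_depth=10000) satisfy the rule.
print(check_metacell_args(size=500, min_median_depth=10000))  # True
```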
pyviper.tl
- pyviper.tl.pca(adata, *, layer=None, filter_by_feature_groups=None, **kwargs)
A wrapper for the scanpy function sc.tl.pca.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
layer (default: None) – The layer to use as input data.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
**kwargs – Arguments to provide to the sc.tl.pca function.
- pyviper.tl.dendrogram(adata, *, groupby, key_added=None, layer=None, filter_by_feature_groups=None, **kwargs)
A wrapper for the scanpy function sc.tl.dendrogram.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
groupby – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
key_added (default: None) – The key in adata.uns where the dendrogram should be stored.
layer (default: None) – The layer to use as input data.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
**kwargs – Arguments to provide to the sc.tl.dendrogram function.
- pyviper.tl.oncomatch(pax_data_to_test, pax_data_for_cMRs, tcm_size=50, both_ways=False, om_max_NES_threshold=30, om_min_logp_threshold=0, enrichment='aREA', key_added='om', return_as_df=False, copy=False)
The OncoMatch algorithm[1] assesses the overlap in differentially active MR proteins between two sets of samples (e.g. to validate GEMMs as effective models of human tumor samples). It does so by computing -log10 p-values for each sample in pax_data_to_test of the MRs of each sample in pax_data_for_cMRs.
- Parameters:
pax_data_to_test – An anndata.AnnData or pd.DataFrame containing protein activity (NES), where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
pax_data_for_cMRs – An anndata.AnnData or pd.DataFrame containing protein activity (NES), where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
tcm_size (default: 50) – Number of top MRs from each sample to use to compute regulators.
both_ways (default: False) – Whether to also use the candidate MRs of pax_data_to_test to compute NES for the samples in pax_data_for_cMRs, and then average.
om_max_NES_threshold (default: 30) – The maximum NES scores before using a cutoff.
om_min_logp_threshold (default: 0) – The minimum logp value threshold, such that all logp values smaller than this value are set to 0.
enrichment (default: 'aREA') – The method used to compute enrichment: ‘aREA’ or ‘NaRnEA’.
key_added (default: 'om') – The slot in pax_data_to_test.obsm to store the oncomatch results.
return_as_df (default: False) – Instead of adding the OncoMatch DataFrame to pax_data_to_test.obsm, return it directly.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When copy is False, stores a pd.DataFrame of -log10 p-values with shape (n_samples in pax_data_to_test, n_samples in pax_data_for_cMRs) in pax_data_to_test.obsm[key_added]. When copy is True, a copy of the AnnData is returned with this pd.DataFrame stored. When return_as_df is True, the OncoMatch DataFrame alone is directly returned by the function.
References
[1] Alvarez, M. J. et al. A precision oncology approach to the pharmacological targeting of mechanistic dependencies in neuroendocrine tumors. Nat Genet 50, 979–989, doi:10.1038/s41588-018-0138-4 (2018).
[2] Alvarez, M. J. et al. Reply to ’H-STS, L-STS and KRJ-I are not authentic GEPNET cell lines’. Nat Genet 51, 1427–1428, doi:10.1038/s41588-019-0509-5 (2019).
- pyviper.tl.find_top_mrs(adata, pca_slot='X_pca', groupby=None, layer=None, N=50, both=True, method='stouffer', key_added='mr', filter_by_feature_groups=None, rank=False, filter_by_top_mrs=False, return_as_df=False, copy=False, verbose=True)
Identify the top N master regulator proteins in a VIPER AnnData object.
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
pca_slot – The slot in adata.obsm where a PCA is stored. Only required when method is “spearman”.
groupby – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations. Required when method is “mwu” or “spearman”.
N (default: 50) – The number of MRs to return
both (default: True) – Whether to return both the top N and bottom N MRs (True) or just the top N (False).
method (default: "stouffer") – The method used to compute a signature to identify the top candidate master regulators (MRs). The options come from functions in pyviper.pp. Choose between “stouffer”, “mwu”, or “spearman”.
key_added (default: "mr") – The name of the slot in the adata.var to store the output.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
rank (default: False) – When False, a column is added to var with identified MRs labeled as “True”, while all other proteins are labeled as “False”. When True, top MRs are labeled N, N-1, N-2, …, 1, bottom MRs are labeled -N, -(N-1), -(N-2), …, -1, and all other proteins are labeled 0. Higher rank means greater activity, while lower rank means less.
filter_by_top_mrs (default: False) – Whether to filter var to only the top MRs in adata
return_as_df (default: False) – Returns a pd.DataFrame of the top MRs per cluster
copy (default: False) – Determines whether a copy of the input AnnData is returned.
verbose (default: True) – Whether extended output about the progress of the algorithm is given.
- Return type:
Adds a column adata.var[key_added] or, when clusters are given, adds multiple columns (e.g. key_added_clust1name, key_added_clust2name, etc.) to adata.var. If copy is True, returns a new adata transformed by this function. If return_as_df is True, returns a DataFrame.
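The labeling scheme used when rank is True can be sketched as follows, assuming a precomputed signature with one score per protein and at least 2N proteins. The function name and dictionary layout are illustrative, not part of pyviper.

```python
def label_top_mrs(signature, n, both=True):
    """Label the n most active proteins n..1 and, optionally, the n least active -n..-1.

    signature: {protein: score}, higher score meaning greater activity.
    Assumes len(signature) >= 2 * n so the two tails do not overlap.
    """
    ordered = sorted(signature, key=signature.get, reverse=True)
    labels = {protein: 0 for protein in signature}
    for position, protein in enumerate(ordered[:n]):
        labels[protein] = n - position  # most active protein gets n
    if both:
        for position, protein in enumerate(reversed(ordered[-n:])):
            labels[protein] = -(n - position)  # least active protein gets -n
    return labels


signature = {"A": 5.0, "B": 3.0, "C": 0.1, "D": -0.2, "E": -2.0, "F": -4.0}
print(label_top_mrs(signature, n=2))
# {'A': 2, 'B': 1, 'C': 0, 'D': 0, 'E': -1, 'F': -2}
```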
- pyviper.tl.path_enrich(adata, interactome, layer=None, eset_filter=True, method=None, enrichment='aREA', mvws=1, njobs=1, batch_size=10000, verbose=True)
Run the variation of VIPER that is specific to pathway enrichment analysis: a single interactome and min_targets is set to 0.
- Parameters:
adata – An anndata object (e.g. from Scanpy).
interactome – An object of class Interactome or one of the following strings that corresponds to msigdb regulons: “c2”, “c5”, “c6”, “c7”, “h”.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: True) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
method (default: None) – A method used to create a gene expression signature from adata.X. The default of None is used when adata.X is already a gene expression signature. Alternative inputs include “scale”, “rank”, “doublerank”, “mad”, and “ttest”.
enrichment (default: 'aREA') – The algorithm to use to calculate the enrichment. Choose between Analytical Ranked Enrichment Analysis (aREA) and Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA). Default = ‘aREA’, alternative = ‘NaRnEA’.
mvws (default: 1) – (A) A number giving the exponent for the metaViper weights. These weights are only applicable when enrichment = ‘aREA’ and are not used when enrichment = ‘NaRnEA’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same cell type with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different cell types with different epigenetics). (B) The name of a column in adata that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
njobs (default: 1) – Number of cores to distribute sample batches into.
batch_size (default: 10000) – Maximum number of samples to process at once. Set to None to split all samples across provided njobs.
verbose (default: True) – Whether extended output about the progress of the algorithm is given.
- Returns:
Stores the resulting enrichment of each of the gene sets in the interactome in adata.obs.
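The note that eset_filter “will affect gene rankings” can be made concrete with a small sketch. It is only an illustration under stated assumptions: the signature values, the gene names, and the helper rank_genes are hypothetical, and the ranking is a plain sort rather than pyviper's own procedure.

```python
def rank_genes(signature):
    """Return {gene: rank}, where rank 1 is the lowest value (no ties assumed)."""
    ordered = sorted(signature, key=signature.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


# Hypothetical signature for one sample and hypothetical interactome genes.
signature = {"G1": -2.1, "G2": -0.4, "G3": 0.3, "G4": 1.2, "G5": 2.5}
interactome_genes = {"G2", "G4", "G5"}

# eset_filter=False: rank each gene against every measured gene.
ranks_all = rank_genes(signature)

# eset_filter=True: drop genes absent from the interactome, then rank.
filtered = {g: v for g, v in signature.items() if g in interactome_genes}
ranks_filtered = rank_genes(filtered)

print(ranks_all["G2"], "of", len(ranks_all))            # 2 of 5
print(ranks_filtered["G2"], "of", len(ranks_filtered))  # 1 of 3
```

The same gene sits at a different relative position in the two rankings, which is why the choice of eset_filter can change the resulting enrichment.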
pyviper.viper
- pyviper.viper(gex_data, interactome, layer=None, eset_filter=True, method=None, enrichment='aREA', mvws=1, min_targets=30, njobs=1, batch_size=10000, verbose=True, return_as_df=False, transfer_obs=True, store_input_data=True, device='cpu', rank_ordinal=False, pleiotropy: bool = False)
The VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm[1] allows individuals to compute protein activity using a gene expression signature and an Interactome object that describes the relationship between regulators and their downstream targets. Users can infer normalized enrichment scores (NES) using Analytical Ranked Enrichment Analysis (aREA)[1] or Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA)[2]. NaRnEA also computes proportional enrichment scores (PES).
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unPruned network and then prune to ensure the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data.
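Why the order matters can be seen in a toy sketch. It does not use pyviper's Interactome class: the dict-based network, the weights, and the helpers filter_targets_toy and prune_toy are all hypothetical stand-ins, and pruning is assumed to keep the highest-weight targets.

```python
def filter_targets_toy(network, measured_genes):
    """Keep only targets that appear in the expression features."""
    return {reg: [(t, w) for t, w in targets if t in measured_genes]
            for reg, targets in network.items()}


def prune_toy(network, max_targets):
    """Keep the max_targets highest-weight targets of each regulator."""
    return {reg: sorted(targets, key=lambda tw: tw[1], reverse=True)[:max_targets]
            for reg, targets in network.items()}


# Hypothetical unpruned network: regulator -> [(target, weight), ...].
network = {
    "R1": [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)],
    "R2": [("a", 0.9), ("x", 0.8), ("y", 0.7), ("c", 0.6)],
}
measured = {"a", "b", "c", "d"}  # stands in for gex_data.var_names

filter_then_prune = prune_toy(filter_targets_toy(network, measured), 2)
prune_then_filter = filter_targets_toy(prune_toy(network, 2), measured)

counts_recommended = {r: len(t) for r, t in filter_then_prune.items()}
counts_reversed = {r: len(t) for r, t in prune_then_filter.items()}
print(counts_recommended)  # {'R1': 2, 'R2': 2}
print(counts_reversed)     # {'R1': 2, 'R2': 1}
```

Filtering first leaves every regulator with the same number of measured targets after pruning, whereas pruning first can leave some regulators with fewer.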
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy).
interactome – An object of class Interactome or a list of Interactome objects.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: True) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
method (default: None) – A method used to create a gene expression signature from gex_data.X. The default of None is used when gex_data.X is already a gene expression signature. Alternative inputs include “scale”, “rank”, “doublerank”, “mad”, and “ttest”.
enrichment (default: 'aREA') – The algorithm to use to calculate the enrichment. Choose between Analytical Ranked Enrichment Analysis (aREA) and Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA). Default = ‘aREA’, alternative = ‘NaRnEA’.
mvws (default: 1) – (A) A number giving the exponent for the metaViper weights. These weights are only applicable when enrichment = ‘aREA’ and are not used when enrichment = ‘NaRnEA’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same cell type with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different cell types with different epigenetics). (B) The name of a column in gex_data that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). This threshold is useful because an adequate number of targets is needed to accurately predict enrichment.
njobs (default: 1) – Number of cores to distribute sample batches into.
batch_size (default: 10000) – Maximum number of samples to process at once. Set to None to split all samples across provided njobs.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
return_as_df (default: False) – Way of delivering output. If True, return as pd.DataFrame. If False, return as anndata.AnnData.
transfer_obs (default: True) – Whether to transfer the observation metadata from the input anndata to the output anndata. Thus, not applicable when return_as_df==True.
store_input_data (default: True) – Whether to store the input anndata in an unstructured data slot (.uns) of the output anndata. Thus, not applicable when return_as_df==True. If the input anndata already contains ‘gex_data’ in .uns, the input will be assumed to be protein activity and will be stored in .uns as ‘pax_data’. Otherwise, the data will be stored as ‘gex_data’ in .uns.
device (default: 'cpu') – Whether to use the cpu or gpu on your device for the calculation of the aREA function. Using a gpu can improve the speed of the function. Using ‘mps’ or ‘cuda’ will produce slight differences (mean difference in NES around 1E-6), while Pearson and Spearman correlations remain >0.999.
rank_ordinal (default: False) – (A) Whether to use ordinal ranking from PyTorch instead of averaged ranking from SciPy. Setting to False will use averaged ranking, which is slower but more stable and consistent. (B) Using the ranks, each gene is then assigned a score from the inverse CDF of a standard normal distribution (a z-like score), so some genes can receive different values under the two ranking methods. The sign of the NES is based solely on the sign of the dES. Therefore, if the dES is already close to 0, this small difference can flip the sign of some protein NES scores (around 1 in every 1.5 million NES scores). The mean difference in NES from averaged ranking is on the order of 1E-6, with Pearson and Spearman correlations remaining >0.999.
pleiotropy (default: False) – Whether to apply correction for pleiotropic regulation with aREA. This typically impacts a small percentage of regulators.
- Returns:
A dictionary of numpy.ndarray objects containing NES values (key: ‘nes’) and PES values (key: ‘pes’) when return_as_df=True and enrichment = “NaRnEA”.
A DataFrame containing NES values when return_as_df=True and enrichment = “aREA”.
An anndata object containing NES values in .X when return_as_df=False (default). Will contain PES values in the layer ‘pes’ when enrichment = ‘NaRnEA’. Will contain .gex_data and/or .pax_data in the unstructured data slot (.uns) when store_input_data = True. Will contain identical .obs to the input anndata when transfer_obs = True.
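The difference between ordinal and averaged ranking described for rank_ordinal can be sketched with the standard library alone. This is an illustration, not pyviper's code: the helper names are hypothetical, ties in the ordinal case are broken by input position purely for determinism, and the rank / (n + 1) scaling before the inverse CDF is an assumption.

```python
from statistics import NormalDist


def ordinal_ranks(values):
    """Ranks 1..n; tied values get distinct ranks (here by input position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_to_z(ranks):
    """Map ranks to z-like scores via the standard normal inverse CDF."""
    n = len(ranks)
    return [NormalDist().inv_cdf(r / (n + 1)) for r in ranks]


expression = [0.0, 0.0, 1.3, 2.7]  # hypothetical values containing one tie
print(ordinal_ranks(expression))   # [1, 2, 3, 4]
print(average_ranks(expression))   # [1.5, 1.5, 3.0, 4.0]
z_ordinal = rank_to_z(ordinal_ranks(expression))
z_average = rank_to_z(average_ranks(expression))
```

Under averaged ranking the two tied genes receive the same z-like score, while ordinal ranking gives them different scores. Small discrepancies of this kind are the source of the slight NES differences noted for rank_ordinal.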
References
- [1] Alvarez, M. J., et al. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics, 48(8), 838–847.
- [2] Griffin, A. T., et al. (2023). NaRnEA: An Information Theoretic Framework for Gene Set Analysis. Entropy, 25(3), 542.