-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scVelo doesn't select the n_top_genes when doing scv.pp.filter_and_normalize() #775
Comments
@hyjforesight, I suppose this is caused by multiple cells with the same value in Also, a quick note: In general, the data in |
Hello @WeilerP , adata = sc.read('C:/Users/Park_Lab/Documents/adata.h5ad')
adata
AnnData object with n_obs × n_vars = 2636 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
adata.var['dispersions_norm'].duplicated().sum()
0 scv.pl.proportions(adata, groupby='leiden')
scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
adata
Filtered out 1562 genes that are detected 30 counts (shared).
WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.
Extracted 2845 highly variable genes.
WARNING: Did not modify X as it looks preprocessed already.
AnnData object with n_obs × n_vars = 2636 × 2845
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances' For the adata.X issue, our data was UMAPed by Scanpy and saved as h5ad by For using the same embedding between Scanpy and scVelo, we need the adata.ubsm['X_umap'] and adata.obs['leiden'], which should be generated by Scanpy after lognormalization, HVGs and scaling. In this case, how shall we use the raw adata.X because it has been normalized by Scanpy? The Scanpy coding is as below: sc.pl.highest_expr_genes(adata, n_top=20)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=25)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
adata.var['rpl'] = adata.var_names.str.startswith('Rpl')
adata.var['rps'] = adata.var_names.str.startswith('Rps')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :]
adata = adata[adata.obs.pct_counts_rpl < 50, :]
adata = adata[adata.obs.pct_counts_rps < 50, :]
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
print(sum(adata.var.highly_variable))
adata.raw=adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.3)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data', frameon=False, title='', use_raw=False)
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon', use_raw=False)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:3]: result[key][group]
for group in groups for key in ['names', 'scores', 'logfoldchanges', pvals_adj']}).head(5000).to_csv("C:/Users/Park_Lab/Documents/adata.csv")
new_cluster_names = ['…']
adata.rename_categories('leiden', new_cluster_names)
adata.write('C:/Users/Park_Lab/Documents/adata.h5ad', compression='gzip') Thanks a lot! |
@hyjforesight, would you please run the standard scvelo preprocessing pipeline, i.e., with raw counts in About the preprocessing: The problem with your approach is that the data used to calculate the neighbor graph and the data used to infer RNA velocity are not the same. This could lead to all kinds of unexpected results, and I am not sure how you'd interpret a velocity stream, for example, plotted on a 2D UMAP embedding generated from different data. |
Hello @WeilerP , adata = sc.read_loom(filename='C:/Users/Park_Lab/Documents/sample.loom')
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 13499 × 32285
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
Filtered out 21387 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
computing PCA
on highly variable genes
with n_comps=30
finished (0:00:00)
computing neighbors
finished (0:00:01) --> added
'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
scv.tl.recover_dynamics(adata, n_jobs=16)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata, n_jobs=16)
adata
recovering dynamics (using 16/16 cores)
100%
1219/1219 [04:55<00:00, 2.18gene/s]
finished (0:05:02) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
computing velocities
finished (0:00:15) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 16/16 cores)
100%
13499/13499 [00:09<00:00, 429.80cells/s]
finished (0:00:12) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
AnnData object with n_obs × n_vars = 13499 × 2000
obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'fit_r2', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'velocity_genes'
uns: 'pca', 'neighbors', 'recover_dynamics', 'velocity_params', 'velocity_graph', 'velocity_graph_neg'
obsm: 'X_pca'
varm: 'PCs', 'loss'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'velocity', 'velocity_u'
obsp: 'distances', 'connectivities'
scv.pl.velocity_embedding_stream(adata, basis='umap')
KeyError: 'X_umap' About the preprocessing, thanks for the information. We understand that ACT_sub2 = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad') # Scanpy proceeded data
ACT_sub2
AnnData object with n_obs × n_vars = 2636 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
raw = sc.read_loom(filename='C:/Users/Park_Lab/Documents/sample.loom') # raw data
raw.var_names_make_unique()
raw
AnnData object with n_obs × n_vars = 13499 × 32285
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
raw.obs['leiden']=ACT_sub2.obs['leiden']
adata=raw[raw.obs['leiden'].isin(['0', '1', '2', '3', '4'])]
adata.obsm['X_umap'] = scv.load('C:/Users/Park_Lab/Documents/ACT_sub2/uns/umap.csv')
ValueError: Lengths must match to compare` Thanks! |
Sorry, ATM, I'm not sure what's causing this problem. I'll try to replicate it using the Pancreas dataset.
You first need to calculate the UMAP embedding as you did in your Scanpy workflow.
The PCA embedding and neighbor graph are not necessarily recalculated. You can always run sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None) to use Scanpy and its results.
There's a mismatch in dimensions. Based on your code snippet, |
Hello @WeilerP, For using Scanpy and its results for scVelo, based on your explanation, my understanding is that I can proceed my data with Scanpy to generate the pca, neighbors, clusters and umap. Then run a new scVelo pipeline, loading the raw loom file as a new AnnData object and copy these parameters Based on this understanding, I tried again. My aim is to do scVelo on
Could you please also help me to check whether Thanks and happy holiday! ACT_sub2 = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad') # the subset after Scanpy proceeding.
ACT_sub2
AnnData object with n_obs × n_vars = 2636 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
adata = sc.read_loom(filename='C:/Users/Park_Lab/Documents/raw_adata.loom') # the raw data
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 13499 × 32285
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
adata.obs['leiden']=ACT_sub2.obs['leiden']
adata=adata[adata.obs['leiden'].isin(['0', '1', '2', '3'])] # slice into the wanted obs
adata
View of AnnData object with n_obs × n_vars = 2636 × 32285
obs: 'leiden'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
adata.var['highly_variable']=ACT_sub2.var['highly_variable']
adata = adata[:, adata.var.highly_variable]
# I used HVG to calculate the UMAP, so I slice HVG to match the dimensions, but get errors
adata
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12812/4098982234.py in <module>
----> 1 adata = adata[:, adata.var.highly_variable]
2 adata
~\anaconda3\envs\Python3812\lib\site-packages\anndata\_core\anndata.py in __getitem__(self, index)
1114 def __getitem__(self, index: Index) -> "AnnData":
1115 """Returns a sliced view of the object."""
-> 1116 oidx, vidx = self._normalize_indices(index)
1117 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1118
~\anaconda3\envs\Python3812\lib\site-packages\anndata\_core\anndata.py in _normalize_indices(self, index)
1095
1096 def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1097 return _normalize_indices(index, self.obs_names, self.var_names)
1098
1099 # TODO: this is not quite complete...
~\anaconda3\envs\Python3812\lib\site-packages\anndata\_core\index.py in _normalize_indices(index, names0, names1)
34 ax0, ax1 = unpack_index(index)
35 ax0 = _normalize_index(ax0, names0)
---> 36 ax1 = _normalize_index(ax1, names1)
37 return ax0, ax1
38
~\anaconda3\envs\Python3812\lib\site-packages\anndata\_core\index.py in _normalize_index(indexer, index)
99 if np.any(positions < 0):
100 not_found = indexer[positions < 0]
--> 101 raise KeyError(
102 f"Values {list(not_found)}, from {list(indexer)}, "
103 "are not valid obs/ var names or indices."
KeyError: 'Values [nan, nan, nan, nan, nan, nan, True,... nan, nan, True], are not valid obs/ var names or indices.'
adata.obsm['X_umap'] = scv.load('C:/Users/Park_Lab/Documents/ACT_sub2/uns/umap.csv').values # Is this coding for copying 'X_umap' to adata right? |
@hyjforesight, I propose/suggest to run (with adapted arguments) adata = sc.read_loom(filename='C:/Users/Park_Lab/Documents/raw_adata.loom') # the raw data
adata.var_names_make_unique()
scv.pp.filter_and_normalizer(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
sc.tl.umap(adata)
sc.tl.leiden(adata)
You cannot just copy and past data from one object to another. You need to ensure that the observations and variables in both objects are the same. In |
cell number is matched by adata=adata[adata.obs['leiden'].isin(['0', '1', '2', '3'])]. to match the gene numbers, check scverse/scanpy#2095. Then we can copy UMAP to the raw subseted data and do scVelo on it. adata.uns['leiden_colors']=ACI_sec.uns['leiden_colors']
adata.obsm['X_umap']=ACI_sec.obsm['X_umap']
adata |
...
Hello scVelo,
My dataset has 5000 genes, and I set n_top_genes=2000 to do scv.pp.filter_and_normalize(). But it only filter out the genes with min_shared_counts and doesnt't select the 2000 top genes. I still get more than 2000 genes for the downstream analysis. Please see the picture.
Could you please help me with this issue?
Thanks!
Best,
YJ
# paste your code here, if applicable
Error output
Versions
The text was updated successfully, but these errors were encountered: