highly_variable_genes - issue #391

Closed
ltosti opened this issue Dec 6, 2018 · 21 comments

@ltosti

ltosti commented Dec 6, 2018

Hi there,

While running sc.pp.highly_variable_genes(adata.X) I got the following error:

AttributeError: X not found

I then ran sc.pp.highly_variable_genes(adata) and got the following:

ValueError: Bin edges must be unique: array([nan, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,inf, inf, inf, inf, inf, inf, inf, inf]). You can drop duplicate edges by setting the duplicates kwarg

The older sc.pp.filter_genes_dispersion(adata.X) works fine.

Do you know how to fix this?

Thank you!

Info: scanpy==1.3.4 anndata==0.6.13 numpy==1.15.3 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.20.0 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

@fidelram
Collaborator

fidelram commented Dec 6, 2018 via email

@ltosti
Author

ltosti commented Dec 6, 2018

When I run adata.X I get
<14636x24181 sparse matrix of type '<class 'numpy.float32'>' with 16866605 stored elements in Compressed Sparse Row format>

That looks fine?

@Koncopd
Member

Koncopd commented Dec 6, 2018

Hi,
could you please try

sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

highly_variable_genes expects logarithmized data.

@ltosti
Author

ltosti commented Dec 7, 2018

Hi @Koncopd, my data are indeed already normalised.

@fidelram I generated the data by merging a few datasets using bbknn. But when I tried it on a single sample, I got the same error.

@Koncopd
Member

Koncopd commented Dec 7, 2018

Hm, it's very hard to say anything without looking at the dataset. Are there any negative values in the dataset?
Or could you run the following:

X = adata.X
X = np.log1p(X)
X = np.expm1(X)
mean = X.mean(axis=0).A1
mean[mean == 0] = 1e-12
mean = np.log1p(mean)
np.any(mean == np.inf)

Does it show True?

@ltosti
Author

ltosti commented Dec 7, 2018

When I run this on the single sample I get False.

When I run this on the merged (batch-removed) sample I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-23-4cda28f741b3> in <module>
      2 X = np.log1p(X)
      3 X = np.expm1(X)
----> 4 mean = X.mean(axis=0).A1
      5 mean[mean == 0] = 1e-12
      6 mean = np.log1p(mean)

AttributeError: 'numpy.ndarray' object has no attribute 'A1'
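
The `.A1` attribute exists only on `numpy.matrix`, which is what a SciPy sparse matrix returns from `.mean(axis=0)`; a dense `numpy.ndarray` returns a plain array, hence the AttributeError above. A minimal sketch of a per-gene mean that accepts either kind of input (the helper name `column_means` is hypothetical and not part of scanpy):

```python
import numpy as np
import scipy.sparse as sp


def column_means(X):
    """Return per-column (per-gene) means as a flat 1-D float array.

    Works for SciPy sparse matrices (whose .mean returns numpy.matrix)
    and for dense numpy arrays (whose .mean returns numpy.ndarray).
    """
    return np.asarray(X.mean(axis=0), dtype=np.float64).ravel()


# Toy data: 2 cells x 3 genes, once dense and once sparse.
dense = np.array([[1.0, 0.0, 3.0],
                  [3.0, 0.0, 5.0]])
sparse = sp.csr_matrix(dense)

means_dense = column_means(dense)
means_sparse = column_means(sparse)
```

Both calls give the same flat array, so the rest of the check can run regardless of how `.X` is stored.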

@Koncopd
Member

Koncopd commented Dec 7, 2018

I meant on the non-normalized dataset, which is a sparse matrix.

@ltosti
Author

ltosti commented Dec 7, 2018

On the non-normalized dataset I get False.

@falexwolf
Member

The initial problem arises because the new 'highly_variable_genes' function no longer takes numpy arrays: https://github.com/theislab/scanpy/blob/master/scanpy/preprocessing/highly_variable_genes.py

It's also mentioned in the docs, but we should, of course, have thrown a clear error message. Now it does: a578ced

To return the annotation, one can set inplace=False. But the updated plotting function also takes the full AnnData object.

@sophiamaedler

I am experiencing a similar issue with a dataset I am using.

This runs fine:

variable_genes_min_mean = 0.01
variable_genes_max_mean = 5
variable_genes_min_disp = 0.5

sc.pp.filter_genes_dispersion(adata_gex,
                              min_mean=variable_genes_min_mean,
                              max_mean=variable_genes_max_mean,
                              min_disp=variable_genes_min_disp,
                              flavor='seurat',
                              log=True)

But this:

variable_genes_min_mean = 0.01
variable_genes_max_mean = 5
variable_genes_min_disp = 0.5

sc.pp.highly_variable_genes(adata_gex, 
                            min_mean=variable_genes_min_mean, 
                            max_mean=variable_genes_max_mean, 
                            min_disp=variable_genes_min_disp,
                            flavor = 'seurat') 

Throws the following error:

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scipy/sparse/data.py:135: RuntimeWarning: overflow encountered in expm1
  result = op(self._deduped_data())
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_utils.py:18: RuntimeWarning: overflow encountered in square
  var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_utils.py:18: RuntimeWarning: invalid value encountered in subtract
  var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py:85: RuntimeWarning: overflow encountered in log
  dispersion = np.log(dispersion)
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py:85: RuntimeWarning: invalid value encountered in log
  dispersion = np.log(dispersion)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-69d6424effb2> in <module>
      3                             max_mean=variable_genes_max_mean,
      4                             min_disp=variable_genes_min_disp,
----> 5                             flavor = 'seurat') 

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace, batch_key)
    255                                                  n_top_genes=n_top_genes,
    256                                                  n_bins=n_bins,
--> 257                                                  flavor=flavor)
    258     else:
    259         sanitize_anndata(adata)

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_single_batch(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor)
     90     df['dispersions'] = dispersion
     91     if flavor == 'seurat':
---> 92         df['mean_bin'] = pd.cut(df['means'], bins=n_bins)
     93         disp_grouped = df.groupby('mean_bin')['dispersions']
     94         disp_mean_bin = disp_grouped.mean()

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/pandas/core/reshape/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
    226             # GH 24314
    227             raise ValueError(
--> 228                 "cannot specify integer `bins` when input data contains infinity"
    229             )
    230         elif mn == mx:  # adjust end points before binning

ValueError: cannot specify integer `bins` when input data contains infinity

I am assuming it's something wrong with the dataset (it's a publicly available one which I needed to convert from a Seurat Object), but I can't figure out what.

I have checked if there are any Inf values included in adata.X or adata.raw.X but there are not. Also both adata.X and adata.raw.X are sparse matrices. Any ideas would be greatly appreciated.
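
One possibility consistent with the `overflow encountered in expm1` warning: every value in `.X` is finite, but the function applies `expm1` internally, which assumes log-transformed input. In float32, `expm1` overflows to `inf` for inputs above roughly 88.7, so a single raw count of 100 is enough. A small sketch of a check for this (the helper name `expm1_overflow_mask` and the sample values are illustrative assumptions, not taken from the dataset in question):

```python
import numpy as np


def expm1_overflow_mask(values):
    """Flag finite entries whose expm1 is no longer finite in the array's dtype."""
    values = np.asarray(values)
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        back_transformed = np.expm1(values)
    return np.isfinite(values) & ~np.isfinite(back_transformed)


# Log-scale values are small; raw counts can easily exceed the float32 limit.
log_scale = np.array([0.0, 2.3, 6.9], dtype=np.float32)
raw_counts = np.array([0.0, 12.0, 100.0], dtype=np.float32)

log_scale_flags = expm1_overflow_mask(log_scale).tolist()
raw_count_flags = expm1_overflow_mask(raw_counts).tolist()
print(log_scale_flags)   # [False, False, False]
print(raw_count_flags)   # [False, False, True]
```

For a sparse matrix, the same check could be run on its stored values (the `.data` attribute) rather than on the matrix itself.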

[Screenshot: Screen Shot 2020-03-13 at 6 09 35 PM]

@LuckyMD
Contributor

LuckyMD commented Mar 13, 2020

Hi!
Sorry, I don't really have the time to get into this atm, but I have an idea... I think the default for expecting logarithmized data vs non-logarithmized data changed between the two functions for the flavor='seurat' case.

@massonix

I am experiencing the same problem, and it also comes from a Seurat object that I converted to anndata with SeuratDisk.

@rpeys

rpeys commented Oct 29, 2020

I am also getting the error RuntimeWarning: invalid value encountered in log dispersion = np.log(dispersion) when running sc.pp.highly_variable_genes(adata, min_mean=1.7, max_mean=5, min_disp=0.5, flavor='seurat') on log scale data in the adata.X slot with mean=0 and max=16.336065. Any ideas?

Update: I just noticed that my adata.X contains a numpy array instead of a sparse matrix. Perhaps that's the issue? Will try updating to a sparse matrix and will report back

@rpeys

rpeys commented Oct 29, 2020

FIXED: Updating adata.X to a scipy csr sparse matrix using adata.X = scipy.sparse.csr_matrix(adata.X) fixed this error.

I still get RuntimeWarning: invalid value encountered in sqrt std = np.sqrt(var) when running sc.pp.scale(adata, max_value=10) even after forcing to a csr matrix, but it doesn't seem to affect downstream results...
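
The conversion reported above can be wrapped so that it leaves already-sparse data alone. A minimal sketch; `to_csr` is a hypothetical helper, and with an AnnData object the result would be assigned back with `adata.X = to_csr(adata.X)`:

```python
import numpy as np
import scipy.sparse as sp


def to_csr(X):
    """Return X as a CSR sparse matrix, converting dense input if needed."""
    if sp.issparse(X):
        return X.tocsr()  # other sparse formats are converted to CSR
    return sp.csr_matrix(np.asarray(X))


# Toy dense matrix standing in for a dense adata.X.
dense = np.array([[0.0, 1.5],
                  [2.0, 0.0]], dtype=np.float32)
X_csr = to_csr(dense)
```

Whether this resolves the error for a given dataset is not guaranteed; a later comment in this thread reports that it did not help in their case.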

@massonix

Thanks for your update @rpeys, I will try to convert to scipy csr sparse matrix :)

@kanefos

kanefos commented Dec 27, 2020

I have an AnnData object whose .X matrix has been transformed by size factor division, +1 and log. Subsequent sc.pp.highly_variable_genes(dataset, flavor='cell_ranger', n_top_genes=1000) yields the ValueError: Bin edges must be unique: ... You can drop duplicate edges by setting the 'duplicates' kwarg error discussed above. Transformation to a sparse matrix did not alleviate the error, and neither did any other solutions suggested.

Edit: However! While I could not get flavor='cell_ranger' to work on the data I normalised myself, flavor='seurat' has worked okay. Therefore, I recommend people also encountering this error to stick with this second flavour, because as I understand it they utilise a similar methodology.
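
The `Bin edges must be unique` message comes from `pandas.cut`, which rejects repeated bin edges by default. As an assumption about what happens here (not verified against the scanpy source for this version): if the bin edges are taken from percentiles of the per-gene means and many genes have a mean of exactly zero, several percentiles coincide. The sketch below reproduces that situation with made-up data; `bin_by_percentile` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def bin_by_percentile(means, duplicates="raise"):
    """Bin values using edges at the 10th, 15th, ..., 100th percentiles."""
    edges = np.r_[-np.inf, np.percentile(means, np.arange(10, 105, 5)), np.inf]
    return pd.cut(pd.Series(means), bins=edges, duplicates=duplicates)


# Half of the "genes" have a mean of exactly zero, so the low percentiles repeat.
means = np.concatenate([np.zeros(50), np.linspace(0.1, 5.0, 50)])

try:
    bin_by_percentile(means)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Dropping repeated edges lets the binning go through.
binned = bin_by_percentile(means, duplicates="drop")
```

Here `error_message` holds the same kind of message quoted above, and `binned` assigns every value to a bin. Removing all-zero genes before binning is another way to avoid the repeated edges.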

@LisaSikkema
Contributor

LisaSikkema commented Jun 29, 2021

For me this was solved by filtering out genes that were not expressed in any cell!
sc.pp.filter_genes(adata, min_cells=1)
If I include a batch_key in the hvg function, I still get the error. I guess in that case you have to ensure that every gene is expressed in every batch? Seems like a bug to fix
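
In plain NumPy terms, dropping genes that are not expressed in any cell amounts to building a boolean mask over columns. The sketch below is not scanpy's implementation; `expressed_gene_mask` and the toy matrix are assumptions for illustration. It also shows the per-batch check speculated about above: a gene can pass the global filter and still be all-zero within one batch.

```python
import numpy as np


def expressed_gene_mask(X, min_cells=1):
    """True for genes (columns) with a nonzero value in at least min_cells cells."""
    return (np.asarray(X) > 0).sum(axis=0) >= min_cells


# 4 cells x 3 genes; gene 1 is never expressed, gene 2 only in batch "b".
X = np.array([[1, 0, 0],
              [2, 0, 0],
              [0, 0, 3],
              [4, 0, 1]])
batches = np.array(["a", "a", "b", "b"])

# Genes expressed somewhere in the whole dataset.
global_mask = expressed_gene_mask(X)

# Genes expressed in every batch separately.
per_batch = {b: expressed_gene_mask(X[batches == b]) for b in np.unique(batches)}
in_every_batch = np.logical_and.reduce(list(per_batch.values()))

print(global_mask.tolist())     # [True, False, True]
print(in_every_batch.tolist())  # [True, False, False]
```

Gene 2 survives the global filter but is absent from batch "a", which is the situation described for the `batch_key` case.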

@ivirshup
Member

@LisaSikkema, could you please open a new issue for that? It'd be helpful if you could include a reproducible example as well.

@sky1ove

sky1ove commented May 15, 2022

This one works! thanks!!

@Harender04

> Thanks for your update @rpeys, I will try to convert to scipy csr sparse matrix :)

Hello, Massonix, was the problem resolved?

@Harender04

> FIXED: Updating adata.X to a scipy csr sparse matrix using adata.X = scipy.sparse.csr_matrix(adata.X) fixed this error.
>
> I still get RuntimeWarning: invalid value encountered in sqrt std = np.sqrt(var) when running sc.pp.scale(adata, max_value=10) even after forcing to a csr matrix, but doesn't seem to affect downstream results...

Hi Rebecca, I have been trying to process scRNA-seq data (a Seurat object converted to h5ad format) in Python (QC, normalisation, scaling, highly variable genes, clustering, etc.) and keep getting stuck at the highly variable genes step. Could you please help me out with it?
