highly_variable_genes - issue #391

ltosti · 2018-12-06T16:56:23Z

Hi there,

While running sc.pp.highly_variable_genes(adata.X) I got the following error:

AttributeError: X not found

I then ran sc.pp.highly_variable_genes(adata) and got the following:

ValueError: Bin edges must be unique: array([nan, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,inf, inf, inf, inf, inf, inf, inf, inf]). You can drop duplicate edges by setting the duplicates kwarg

The older sc.pp.filter_genes_dispersion(adata.X) works fine.

Do you know how to fix this?

Thank you!

Info: scanpy==1.3.4 anndata==0.6.13 numpy==1.15.3 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.20.0 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1


fidelram · 2018-12-06T17:09:40Z

It looks like your adata object is corrupted. You should be able to type `adata.X` to get the matrix. How are you generating the adata object?


ltosti · 2018-12-06T17:12:50Z

When I run adata.X I get
<14636x24181 sparse matrix of type '<class 'numpy.float32'>' with 16866605 stored elements in Compressed Sparse Row format>

That looks fine?

Koncopd · 2018-12-06T21:38:44Z

Hi,
could you please try

sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

This is because highly_variable_genes expects logarithmized data.

ltosti · 2018-12-07T07:51:36Z

Hi @Koncopd, my data are indeed already normalised.

@fidelram I generated the data merging a few datasets using bbknn. But when I tried on a single sample, I got the same error.

Koncopd · 2018-12-07T13:52:14Z

Hm, it is very hard to say anything without looking at the dataset. Are there any negative values in it?
Or could you run the following:

X = adata.X
X = np.log1p(X)
X = np.expm1(X)
mean = X.mean(axis=0).A1
mean[mean == 0] = 1e-12
mean = np.log1p(mean)
np.any(mean == np.inf)

Does it show True?

ltosti · 2018-12-07T14:40:00Z

When I run this on the single sample I get False.

When I run this on the merged (batch-removed) sample I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-23-4cda28f741b3> in <module>
      2 X = np.log1p(X)
      3 X = np.expm1(X)
----> 4 mean = X.mean(axis=0).A1
      5 mean[mean == 0] = 1e-12
      6 mean = np.log1p(mean)

AttributeError: 'numpy.ndarray' object has no attribute 'A1'
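
The AttributeError arises because `.A1` exists only on `np.matrix` (which is what a SciPy sparse matrix returns from `.mean`), not on a plain `numpy.ndarray`. Below is a minimal sketch of the same check that accepts either a dense or a sparse matrix. The helper name `log_mean_has_inf` is hypothetical and not part of scanpy.

```python
import numpy as np
from scipy import sparse


def log_mean_has_inf(X):
    """Return True if log1p of any per-gene mean is infinite.

    Hypothetical helper mirroring the diagnostic above; it accepts
    dense arrays and SciPy sparse matrices alike.
    """
    if sparse.issparse(X):
        X = X.copy()
        X.data = np.expm1(np.log1p(X.data))  # same round trip as above
    else:
        X = np.expm1(np.log1p(np.asarray(X)))
    # np.asarray(...).ravel() replaces .A1 and also works on ndarrays
    mean = np.asarray(X.mean(axis=0)).ravel()
    mean[mean == 0] = 1e-12
    return bool(np.any(np.isinf(np.log1p(mean))))


# Made-up toy data: 2 cells x 2 genes
dense = np.array([[0.0, 1.0], [0.0, 3.0]], dtype=np.float32)
print(log_mean_has_inf(dense))                     # False
print(log_mean_has_inf(sparse.csr_matrix(dense)))  # False
```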

Koncopd · 2018-12-07T16:15:21Z

I meant on the non-normalized dataset, which is a sparse matrix.

ltosti · 2018-12-07T16:39:42Z

On the non-normalized dataset I get False.

falexwolf · 2018-12-09T06:48:07Z

The initial problem is due to the fact that the new 'highly_variable_genes' function does not take numpy arrays anymore: https://github.com/theislab/scanpy/blob/master/scanpy/preprocessing/highly_variable_genes.py

It's also mentioned in the docs, but we should, of course, have thrown a clear error message. Now it does: a578ced

To return the annotation, one can set inplace=False. But the updated plotting function also takes the full AnnData object.
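
The first error can be reproduced without scanpy, assuming the function simply accesses `.X` on whatever it receives: a sparse matrix has no `.X` attribute. The sketch also shows one way a clearer message could be raised. `require_anndata_like` is a hypothetical helper, not scanpy's actual check.

```python
import numpy as np
from scipy import sparse


def require_anndata_like(obj):
    """Hypothetical guard: reject bare matrices with a readable message."""
    if isinstance(obj, np.ndarray) or sparse.issparse(obj):
        raise TypeError(
            "expected an AnnData object but got a bare matrix; "
            "pass `adata`, not `adata.X`"
        )
    return obj


X = sparse.csr_matrix(np.eye(3, dtype=np.float32))

# Accessing .X on the matrix itself fails, as in the original report.
try:
    X.X
    attribute_error_raised = False
except AttributeError:
    attribute_error_raised = True

# The guard turns the same mistake into an explicit TypeError.
try:
    require_anndata_like(X)
    guard_message = None
except TypeError as err:
    guard_message = str(err)

print(attribute_error_raised, guard_message)
```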

sophiamaedler · 2020-03-13T17:11:32Z

I am experiencing a similar issue with a dataset I am using.

This runs fine:

variable_genes_min_mean = 0.01
variable_genes_max_mean = 5
variable_genes_min_disp = 0.5

sc.pp.filter_genes_dispersion(adata_gex,
                              min_mean=variable_genes_min_mean,
                              max_mean=variable_genes_max_mean,
                              min_disp=variable_genes_min_disp,
                              flavor='seurat',
                              log=True)

But this:

variable_genes_min_mean = 0.01
variable_genes_max_mean = 5
variable_genes_min_disp = 0.5

sc.pp.highly_variable_genes(adata_gex, 
                            min_mean=variable_genes_min_mean, 
                            max_mean=variable_genes_max_mean, 
                            min_disp=variable_genes_min_disp,
                            flavor = 'seurat')

Throws the following error:

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scipy/sparse/data.py:135: RuntimeWarning: overflow encountered in expm1
  result = op(self._deduped_data())
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_utils.py:18: RuntimeWarning: overflow encountered in square
  var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_utils.py:18: RuntimeWarning: invalid value encountered in subtract
  var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py:85: RuntimeWarning: overflow encountered in log
  dispersion = np.log(dispersion)
/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py:85: RuntimeWarning: invalid value encountered in log
  dispersion = np.log(dispersion)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-69d6424effb2> in <module>
      3                             max_mean=variable_genes_max_mean,
      4                             min_disp=variable_genes_min_disp,
----> 5                             flavor = 'seurat') 

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace, batch_key)
    255                                                  n_top_genes=n_top_genes,
    256                                                  n_bins=n_bins,
--> 257                                                  flavor=flavor)
    258     else:
    259         sanitize_anndata(adata)

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_single_batch(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor)
     90     df['dispersions'] = dispersion
     91     if flavor == 'seurat':
---> 92         df['mean_bin'] = pd.cut(df['means'], bins=n_bins)
     93         disp_grouped = df.groupby('mean_bin')['dispersions']
     94         disp_mean_bin = disp_grouped.mean()

/usr/local/anaconda3/envs/pySCENIC/lib/python3.6/site-packages/pandas/core/reshape/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
    226             # GH 24314
    227             raise ValueError(
--> 228                 "cannot specify integer `bins` when input data contains infinity"
    229             )
    230         elif mn == mx:  # adjust end points before binning

ValueError: cannot specify integer `bins` when input data contains infinity

I am assuming it's something wrong with the dataset (it's a publicly available one which I needed to convert from a Seurat object), but I can't figure out what.

I have checked if there are any Inf values included in adata.X or adata.raw.X but there are not. Also both adata.X and adata.raw.X are sparse matrices. Any ideas would be greatly appreciated.

LuckyMD · 2020-03-13T18:25:32Z

Hi!
Sorry, I don't really have the time to get into this atm, but I have an idea... I think the default for expecting logarithmized data vs non-logarithmized data changed between the two functions for the flavor='seurat' case.
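
A sketch of how non-logarithmized float32 values could produce the warnings and the final ValueError shown in the traceback, assuming (as the `overflow encountered in expm1` warning suggests) that the seurat flavor applies expm1 to `adata.X` before computing per-gene means. The numbers are made up for illustration.

```python
import warnings

import numpy as np
import pandas as pd

# Made-up matrix (cells x genes); the second gene holds raw-count-sized values.
X = np.array([[0.5, 120.0], [1.0, 95.0], [0.0, 150.0]], dtype=np.float32)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silence overflow warnings
    unlogged = np.expm1(X)  # exp(95) and above do not fit in float32 -> inf
    means = np.log1p(unlogged.mean(axis=0))

print(np.isinf(means))  # only the second gene has an infinite mean

# Binning with an integer number of bins cannot handle infinite values.
try:
    pd.cut(pd.Series(means), bins=20)
    cut_failed = False
except ValueError as err:
    cut_failed = True
    print(err)
```

In float32, expm1 overflows for inputs above roughly 88.7, so values that large in `adata.X` would suggest the matrix holds counts rather than log-transformed data.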

massonix · 2020-10-28T14:15:28Z

I am experiencing the same problem, and it also comes from a Seurat object that I converted to anndata with SeuratDisk.

rpeys · 2020-10-29T01:03:50Z

I am also getting the error RuntimeWarning: invalid value encountered in log dispersion = np.log(dispersion) when running sc.pp.highly_variable_genes(adata, min_mean=1.7, max_mean=5, min_disp=0.5, flavor='seurat') on log scale data in the adata.X slot with mean=0 and max=16.336065. Any ideas?

Update: I just noticed that my adata.X contains a numpy array instead of a sparse matrix. Perhaps that's the issue? I will try converting it to a sparse matrix and report back.

rpeys · 2020-10-29T02:26:38Z

FIXED: Updating adata.X to a scipy csr sparse matrix using adata.X = scipy.sparse.csr_matrix(adata.X) fixed this error.

I still get RuntimeWarning: invalid value encountered in sqrt std = np.sqrt(var) when running sc.pp.scale(adata, max_value=10) even after forcing to a csr matrix, but doesn't seem to affect downstream results...
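
A minimal sketch of the conversion described above, using a made-up dense matrix as a stand-in for `adata.X`. Note that this was reported to help in one case in this thread; it is not a confirmed general fix.

```python
import numpy as np
from scipy import sparse

# Stand-in for a dense adata.X (cells x genes); values are made up.
dense_X = np.array([[0.0, 1.2, 0.0], [3.4, 0.0, 0.0]], dtype=np.float32)

# Convert only if needed, so an already-sparse matrix is left untouched.
sparse_X = dense_X if sparse.issparse(dense_X) else sparse.csr_matrix(dense_X)

print(type(sparse_X).__name__, sparse_X.nnz)
```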

massonix · 2020-10-29T11:07:03Z

Thanks for your update @rpeys, I will try to convert to scipy csr sparse matrix :)

kanefos · 2020-12-27T17:37:36Z

I have an AnnData object whose .X matrix has been transformed by size factor division, +1 and log. Subsequent sc.pp.highly_variable_genes(dataset, flavor='cell_ranger', n_top_genes=1000) yields the ValueError: Bin edges must be unique: ... You can drop duplicate edges by setting the 'duplicates' kwarg error discussed above. Transformation to a sparse matrix did not alleviate the error, and neither did any other solutions suggested.

Edit: However! While I could not get flavor='cell_ranger' to work on the data I normalised myself, flavor='seurat' has worked okay. Therefore, I recommend that people encountering this error stick with this second flavour because, as I understand it, the two use a similar methodology.
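
One possible cause of the `Bin edges must be unique` error with `flavor='cell_ranger'`, sketched under the assumption that this flavor bins genes by percentiles of their mean expression: if many genes share the same mean (for example, genes with no counts at all), several percentile edges coincide. The data and the helper `percentile_edges` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up per-gene means: 30% of the genes are never expressed (mean 0).
means = pd.Series(np.concatenate([np.zeros(300), rng.gamma(2.0, 1.0, size=700)]))


def percentile_edges(values):
    """Bin edges at the 10th, 15th, ..., 100th percentiles, padded with -inf/inf."""
    return np.r_[-np.inf, np.percentile(values, np.arange(10, 105, 5)), np.inf]


edges = percentile_edges(means)
print(np.unique(edges).size < edges.size)  # True: several edges equal 0

try:
    pd.cut(means, edges)
    cut_failed = False
except ValueError as err:
    cut_failed = True
    print(err)

# Dropping the never-expressed genes first yields unique edges.
expressed = means[means > 0]
binned = pd.cut(expressed, percentile_edges(expressed))
```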

LisaSikkema · 2021-06-29T08:19:38Z

For me this was solved by filtering out genes that were not expressed in any cell!
sc.pp.filter_genes(adata, min_cells=1)
If I include a batch_key in the hvg function, I still get the error. I guess in that case you have to ensure that every gene is expressed in every batch? Seems like a bug to fix
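
A NumPy/SciPy sketch of what such filtering amounts to, including the per-batch variant guessed at above. The helper names and toy data are hypothetical; in scanpy itself, `sc.pp.filter_genes(adata, min_cells=1)` performs the global filter.

```python
import numpy as np
from scipy import sparse


def cells_per_gene(X):
    """Number of cells with a nonzero value for each gene (column)."""
    return np.asarray((X != 0).sum(axis=0)).ravel()


def expressed_in_every_batch(X, batches):
    """Boolean mask of genes detected in at least one cell of every batch."""
    batches = np.asarray(batches)
    mask = np.ones(X.shape[1], dtype=bool)
    for batch in np.unique(batches):
        mask &= cells_per_gene(X[batches == batch]) > 0
    return mask


# Toy matrix: 4 cells x 3 genes. Gene 1 is never expressed;
# gene 2 is expressed only in batch "b".
X = sparse.csr_matrix(np.array([[1, 0, 0], [2, 0, 0], [0, 0, 3], [4, 0, 5]]))
batches = ["a", "a", "b", "b"]

print(cells_per_gene(X) > 0)                 # global filter, like min_cells=1
print(expressed_in_every_batch(X, batches))  # stricter per-batch filter
```

Whether the per-batch requirement is actually what `batch_key` needs is the commenter's guess, not something confirmed in this thread.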

ivirshup · 2021-06-30T04:03:01Z

@LisaSikkema, could you please open a new issue for that? It'd be helpful if you could include a reproducible example as well.

sky1ove · 2022-05-15T19:33:06Z

This one works! thanks!!

Harender04 · 2022-06-08T13:17:46Z

> Thanks for your update @rpeys, I will try to convert to scipy csr sparse matrix :)

Hello, Massonix, was the problem resolved?

Harender04 · 2022-06-08T13:24:13Z

> FIXED: Updating adata.X to a scipy csr sparse matrix using adata.X = scipy.sparse.csr_matrix(adata.X) fixed this error.
>
> I still get RuntimeWarning: invalid value encountered in sqrt std = np.sqrt(var) when running sc.pp.scale(adata, max_value=10) even after forcing to a csr matrix, but doesn't seem to affect downstream results...

Hi Rebecca, I have been trying to process scRNA-seq data (converted from Seurat to h5ad format) in Python (QC, normalisation, scaling, highly variable genes, clustering, etc.) and keep getting stuck at the highly variable genes step. Could you please help me out with it?

falexwolf closed this as completed Dec 9, 2018

ivirshup mentioned this issue Jun 30, 2021

Handle non-expressed genes in all highly_variable_genes variations #1910

Closed

RuiqiaoHe mentioned this issue May 9, 2024

ValueError: Bin edges must be unique: bioinfo-biols/SEVtras#18

Closed
