BLD: Clang error while installing snappy in requirements-dev.txt #32417

Closed
kendricng opened this issue Mar 3, 2020 · 26 comments · Fixed by #54633
Labels
Build (Library building on various platforms) · Dependencies (Required and optional dependencies) · Needs Discussion (Requires discussion from core team before further action)

Comments

@kendricng

Problem description

I was going through the instructions for setting up the dev environment on a Mac here:

Contributing to pandas

While trying to execute the code below:

# Install the build dependencies
python -m pip install -r requirements-dev.txt

I got the following clang error (redacted):

Running setup.py install for python-snappy ... error
    ERROR: Command errored out with exit status 1:
     …
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/Users/xxx/virtualenvs/pandas-dev/include -I/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c snappy/snappymodule.cc -o build/temp.macosx-10.15-x86_64-3.7/snappy/snappymodule.o
    snappy/snappymodule.cc:31:10: fatal error: 'snappy-c.h' file not found
    #include <snappy-c.h>
             ^~~~~~~~~~~~
    1 error generated.
    error: command 'clang' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: …

Apart from this issue, the rest of the virtual environment setup went fine.

@WillAyd
Member

WillAyd commented Mar 3, 2020

Looks like this was added in #26657 and has a comment on master saying that it is required for pyarrow. @datapythonista, any chance you know what this is?

WillAyd added the Dependencies (Required and optional dependencies) label Mar 3, 2020
@datapythonista
Member

That PR was just to get the missing dependencies from the other builds into environment.yml so we had all the dependencies locally.

This library is used to connect to gbq, afaik.

But this error is expected, I think. Libraries like this have binary dependencies, pip is not good at handling those, and you need to install them manually, including their C headers. That's why conda exists: to avoid having to install binary dependencies manually.

So, @kendricng, please use conda, or install snappy, including its dependency files, manually.

Closing this issue; let me know if I missed anything and it needs to be reopened.

@WillAyd
Member

WillAyd commented Mar 4, 2020

Yeah, I figured it was something Google-related, but I didn't see it in the pandas-gbq requirements; are we sure we even need it?

@datapythonista
Member

Not sure if we need it. But it's a compression library; I guess it's an optional gbq dependency.

@WillAyd
Member

WillAyd commented Mar 4, 2020

Gotcha. Yeah, if we don't actually need it I would be OK with removing it from the requirements files.

@kendricng, is that something you would be interested in submitting a PR for?

WillAyd reopened this Mar 4, 2020
@jbrockmendel
Member

is this fixed by brew install snappy?

@simonjayhawkins
Member

Gotcha. Yeah, if we don't actually need it I would be OK with removing it from the requirements files.

Doesn't that defeat the purpose of having "the missing dependencies from the other builds into environment.yml so we had all the dependencies locally" (#32417 (comment)) for testing/development?

Also see #32327 for another snappy issue.

@TomAugspurger
Contributor

snappy is also an optional dependency for fastparquet and pyarrow.

Regardless, I don't think there's anything for pandas to do. If the library doesn't provide a wheel, users will need to install from source. And to do that they need the C library (built from source or through the system package manager).

If we deem that extra step too large a burden for pip users, we could find where in the docs snappy is used and avoid it.

@kendricng
Author

Gotcha. Yeah, if we don't actually need it I would be OK with removing it from the requirements files.

@kendricng, is that something you would be interested in submitting a PR for?

I would like to. I tried submitting one this morning, but it got closed. What would the proper procedure for submitting one be?

@kendricng
Author

is this fixed by brew install snappy?

Yes, this is the fix. Thank you for pointing this out; I've been able to set up my dev environment!

My bigger concern is that this type of error should not be happening when setting up a dev environment from scratch.

@TomAugspurger
Contributor

TomAugspurger commented Mar 4, 2020

It doesn’t if you use conda. But how that library is packaged is out of our control.

What we can control is avoiding its use in the docs. Does anyone know where it’s used?

@WillAyd
Member

WillAyd commented Mar 5, 2020

@kendricng if you're interested, can you try building the docs from a clean environment that doesn't have python-snappy? That may clue us in on the actual requirement.

@datapythonista
Member

datapythonista commented Mar 5, 2020

@TomAugspurger looks like what fails without snappy is user_guide/io, in this code:

df.to_parquet('example_fp.parquet', engine='fastparquet')

The default argument for the compression parameter is snappy; I guess if we specify gzip compression there, no extra dependency is required. Do you think we should do that?
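
For illustration, a rough sketch of what that change could look like (the DataFrame and file name here are just placeholders, not taken from the docs); passing an explicit codec means fastparquet does not need python-snappy:

# hypothetical doc snippet: avoid the default snappy codec
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_parquet('example_fp.parquet', engine='fastparquet', compression='gzip')
# or skip compression entirely
df.to_parquet('example_fp.parquet', engine='fastparquet', compression=None)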

@TomAugspurger
Contributor

Thanks for tracking that down! Yes, I think it'd be best to specify a different compression (or just none, via compression=None).

@datapythonista
Member

Just realized that this will also affect the result in the performance comparison: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations

Other than that, there are 4 calls to to_parquet that would need to be updated.

@jorisvandenbossche
Member

Personally, I would leave the docs as is.
But to smooth development with pip, can we just remove snappy during the conversion from environment.yml to the requirements file? Contributors would only see some errors locally in the parquet examples when they do a full doc build. I don't think that's too bad.

@datapythonista
Member

My preference is to close this issue as invalid and leave everything as it is. pip not being able to easily install non-pure-Python dependencies is a known issue and not pandas related. That's why conda exists, and why we recommend it. Users who still want to use pip will have to deal with these problems. Today it's snappy, but tomorrow it may be something else that we can't consider removing from the docs dependencies.

If anything, what I would do is stop providing the requirements-dev.txt file.

@WillAyd
Member

WillAyd commented Mar 10, 2020

@datapythonista so I think we disagree, but probably because we are coming at this from different angles. I am generally not clear on why we put optional dependencies of optional dependencies in the requirements file, so if you have insights into that, it would be helpful.

IMO the downside of doing that manifests in an error here, but even if conda doesn't throw an error, it still adds complexity to the solvability of our environments that might not be necessary and could still cause issues down the road.

@datapythonista
Member

I think we have three options in this case:

  1. Leave snappy in both the conda and pip requirements; pip will fail on snappy if the user doesn't have the right system dependencies. This is what we have now, and my favourite option.
  2. Leave snappy for conda and remove it for pip, which adds some complexity to keeping the dependencies synchronized, and will cause pip users to get errors when building the docs. Joris's favourite option.
  3. Remove snappy from both conda and pip, which requires not using snappy in the docs (using compression=None in all to_parquet calls) and causes a bit of trouble in the benchmarks. We can't show to_parquet with snappy there, so we could add a note and show to_parquet without compression.

None of them is perfect. I think 1 is the simplest, and the drawback is IMHO quite reasonable.

3 is also quite OK for me, except we're limited in what we show in the docs, and they will become trickier. We expect users to use snappy (that's why it's the default), but we would always show how to use no compression. I don't think it's great to have notes in the docs saying "we remove compression because we don't have snappy installed" and "we show to_parquet without compression, but with compression it would be much faster". But it's a reasonable option.

Having snappy only for conda is the best of both worlds, but at the cost of a hacky solution IMO. Things become trickier in the dependency synchronization, and if the same happens with another library, things will start to get out of control, as this doesn't scale well I think.

In any case, I'm ok with whatever the rest decide. But my preference is not to complicate things more for what IMO is a small problem (I don't think we should support pip for pandas development).

@jorisvandenbossche
Member

Note that you always get snappy if you install pyarrow with conda (at least for recent versions).
So another option is to just leave it out: with conda we still get it automatically as a dependency of pyarrow, and with pip it is not installed (since it is not listed as an install requirement in setup.py). That gives no maintenance overhead from tracking it in the conversion script.

(I suppose this was the reason it was added originally: not because it was needed for the conda env, but to make the converted requirements file equivalent)

@datapythonista
Member

We've got pyarrow>=0.13.1 in environment.yml, and when I removed snappy in #32454, it wasn't installed and the doc build failed.

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 10, 2020

Ah yes, I forgot that.
So we mixed up python-snappy and snappy: snappy is required for pyarrow (with conda it is installed by default, and I suppose the wheels include it as well), while python-snappy is required for fastparquet to support snappy compression.
(So the comment in the environment.yml file is wrong: it's not for pyarrow but for fastparquet.)

@jorisvandenbossche
Member

So an option could also be to only put compression=None in the fastparquet examples, and keep the normal compression (e.g. in the benchmark in io.rst) where pyarrow is used.
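
A minimal sketch of that option (file names are placeholders; this assumes pyarrow ships with its own snappy support, as discussed above):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# fastparquet example: no compression, so python-snappy is not required
df.to_parquet('example_fp.parquet', engine='fastparquet', compression=None)

# pyarrow example (e.g. the io.rst benchmark): keep the default snappy codec
df.to_parquet('example_pa.parquet', engine='pyarrow')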

jbrockmendel added the Build (Library building on various platforms) label Sep 21, 2020
mroeschke added the Needs Discussion (Requires discussion from core team before further action) label Jul 29, 2021
@rhshadrach
Member

rhshadrach commented Aug 18, 2023

We just ran into this again during the sprint at EuroSciPy. It is quite confusing for new contributors, and I would like to make setting up a new environment as frictionless as possible. With the suggestion in #32417 (comment), we can remove it from both the conda environment and requirements-dev.txt, and not have any errors building the documentation.

cc @datapythonista

@jorisvandenbossche
Member

One of the things mentioned above is that python-snappy is needed for building the IO user guide. From a quick test after removing it from my dev env, the page still seems to build fine.

Also, fastparquet no longer depends on python-snappy (according to https://fastparquet.readthedocs.io/en/latest/install.html#requirements it now uses cramjam for all different compression codecs). So we can actually remove some skips from the parquet tests.
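
A quick way to sanity-check that (file name is a placeholder; this assumes pandas and a recent fastparquet are installed, with python-snappy absent from the environment):

# snappy compression should now go through cramjam rather than python-snappy
import pandas as pd

df = pd.DataFrame({"a": range(10)})
df.to_parquet('check_fp.parquet', engine='fastparquet', compression='snappy')
print(pd.read_parquet('check_fp.parquet', engine='fastparquet').head())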

So I think we can actually just remove this altogether.

@jorisvandenbossche
Member

The only other mention of snappy in our codebase is in the context of HDF IO, but there pytables includes it (I assume through blosc; in any case it doesn't list python-snappy as a dependency).

So I am removing python-snappy completely (not only from the pip dev requirements, but from all CI envs and the installation docs etc as well): #54633
