
ci-conda builds failing with GLIBCXX errors #185

Closed
jameslamb opened this issue Sep 3, 2024 · 13 comments · Fixed by #186

@jameslamb
Member

jameslamb commented Sep 3, 2024

Description

48 of the ci-conda image build jobs are deterministically failing with GLIBCXX errors like this:

ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /opt/conda/lib/python3.11/site-packages/libmambapy/bindings.cpython-311-x86_64-linux-gnu.so)

Reproducible Example

Observed this on multiple unrelated PRs, e.g. #179 and #183.

Example build link: https://github.com/rapidsai/ci-imgs/actions/runs/10686297019/job/29621203550?pr=183

Notes

N/A

@jameslamb jameslamb added the bug Something isn't working label Sep 3, 2024
@jameslamb jameslamb self-assigned this Sep 3, 2024
@jameslamb
Member Author

I'm using #183 to investigate this.

@jameslamb
Member Author

Worth noting that there was a new release of libmambapy 3 days ago.

I don't see anything obviously-relevant in those release notes, though.

@jameslamb
Member Author

jameslamb commented Sep 3, 2024

Still debugging, dumping some notes.

Here's a recent full re-run: https://github.com/rapidsai/ci-imgs/actions/runs/10687166451?pr=183


What's succeeding:

✅ all Python 3.10 builds
✅ all x86_64 Ubuntu 22.04 builds

What's failing:

❌ all Python 3.11 and 3.12 aarch64 builds
❌ Python 3.11/3.12 on x86_64 rockylinux8
❌ Python 3.11/3.12 on x86_64 ubuntu20.04


The failures are all happening at this step:

ci-imgs/ci-conda.Dockerfile

Lines 178 to 183 in d63e1aa

# Install prereq for envsubst
RUN <<EOF
rapids-mamba-retry install -y \
    gettext
conda clean -aipty
EOF

@jameslamb
Member Author

I truncated the ci-conda.Dockerfile down to stop just short of the line that's failing.

ci-conda-truncated.Dockerfile (click me)
ARG CUDA_VER=notset
ARG LINUX_VER=notset
ARG PYTHON_VER=notset
ARG YQ_VER=notset
ARG AWS_CLI_VER=notset

FROM nvidia/cuda:${CUDA_VER}-base-${LINUX_VER} AS miniforge-cuda

ARG LINUX_VER
ARG PYTHON_VER
ARG DEBIAN_FRONTEND=noninteractive
ENV PATH=/opt/conda/bin:$PATH
ENV PYTHON_VERSION=${PYTHON_VER}

SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

# Create a conda group and assign it as root's primary group
RUN <<EOF
groupadd conda
usermod -g conda root
EOF

# Ownership & permissions based on https://docs.anaconda.com/anaconda/install/multi-user/#multi-user-anaconda-installation-on-linux
COPY --from=condaforge/miniforge3:24.3.0-0 --chown=root:conda --chmod=770 /opt/conda /opt/conda

# Ensure new files are created with group write access & setgid. See https://unix.stackexchange.com/a/12845
RUN chmod g+ws /opt/conda

RUN <<EOF
# Ensure new files/dirs have group write permissions
umask 002
# install expected Python version
conda install -y -n base "python~=${PYTHON_VERSION}.0=*_cpython"
conda update --all -y -n base
if [[ "$LINUX_VER" == "rockylinux"* ]]; then
  yum install -y findutils
  yum clean all
fi
find /opt/conda -follow -type f -name '*.a' -delete
find /opt/conda -follow -type f -name '*.pyc' -delete
conda clean -afy
EOF

# Reassign root's primary group to root
RUN usermod -g root root

RUN <<EOF
# ensure conda environment is always activated
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh
echo ". /opt/conda/etc/profile.d/conda.sh; conda activate base" >> /etc/skel/.bashrc
echo ". /opt/conda/etc/profile.d/conda.sh; conda activate base" >> ~/.bashrc
EOF

# tzdata is needed by the ORC library used by pyarrow, because it provides /etc/localtime
RUN <<EOF
case "${LINUX_VER}" in
  "ubuntu"*)
    apt-get update
    apt-get upgrade -y
    apt-get install -y --no-install-recommends \
      tzdata
    rm -rf "/var/lib/apt/lists/*"
    ;;
  "rockylinux"*)
    yum update -y
    yum clean all
    ;;
  *)
    echo "Unsupported LINUX_VER: ${LINUX_VER}" && exit 1
    ;;
esac
EOF

FROM mikefarah/yq:${YQ_VER} AS yq

FROM amazon/aws-cli:${AWS_CLI_VER} AS aws-cli

FROM miniforge-cuda

ARG TARGETPLATFORM=notset
ARG CUDA_VER=notset
ARG LINUX_VER=notset
ARG PYTHON_VER=notset

ARG DEBIAN_FRONTEND

# Set RAPIDS versions env variables
ENV RAPIDS_CUDA_VERSION="${CUDA_VER}"
ENV RAPIDS_PY_VERSION="${PYTHON_VER}"

SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

# Install system packages depending on the LINUX_VER
RUN <<EOF
case "${LINUX_VER}" in
  "ubuntu"*)
    echo 'APT::Update::Error-Mode "any";' > /etc/apt/apt.conf.d/warnings-as-errors
    apt-get update
    apt-get upgrade -y
    apt-get install -y --no-install-recommends \
      curl \
      file \
      unzip \
      wget \
      gcc \
      g++
    rm -rf "/var/lib/apt/lists/*"
    ;;
  "rockylinux"*)
    yum -y update
    yum -y install --setopt=install_weak_deps=False \
      file \
      unzip \
      wget \
      which \
      yum-utils \
      gcc \
      gcc-c++
    yum clean all
    ;;
  *)
    echo "Unsupported LINUX_VER: ${LINUX_VER}"
    exit 1
    ;;
esac
EOF

# Install CUDA packages, only for CUDA 11 (CUDA 12+ should fetch from conda)
RUN <<EOF
case "${CUDA_VER}" in
  "11"*)
    PKG_CUDA_VER="$(echo ${CUDA_VER} | cut -d '.' -f1,2 | tr '.' '-')"
    echo "Attempting to install CUDA Toolkit ${PKG_CUDA_VER}"
    case "${LINUX_VER}" in
      "ubuntu"*)
        apt-get update
        apt-get upgrade -y
        apt-get install -y --no-install-recommends \
          cuda-gdb-${PKG_CUDA_VER} \
          cuda-cudart-dev-${PKG_CUDA_VER} \
          cuda-cupti-dev-${PKG_CUDA_VER}
        # ignore the build-essential package since it installs dependencies like gcc/g++
        # we don't need them since we use conda compilers, so this keeps our images smaller
        apt-get download cuda-nvcc-${PKG_CUDA_VER}
        dpkg -i --ignore-depends="build-essential" ./cuda-nvcc-*.deb
        rm ./cuda-nvcc-*.deb
        # apt will not work correctly if it thinks it needs the build-essential dependency
        # so we patch it out with a sed command
        sed -i 's/, build-essential//g' /var/lib/dpkg/status
        rm -rf "/var/lib/apt/lists/*"
        ;;
      "rockylinux"*)
        yum -y update
        yum -y install --setopt=install_weak_deps=False \
          cuda-cudart-devel-${PKG_CUDA_VER} \
          cuda-driver-devel-${PKG_CUDA_VER} \
          cuda-gdb-${PKG_CUDA_VER} \
          cuda-cupti-${PKG_CUDA_VER}
        rpm -Uvh --nodeps $(repoquery --location cuda-nvcc-${PKG_CUDA_VER})
        yum clean all
        ;;
      *)
        echo "Unsupported LINUX_VER: ${LINUX_VER}"
        exit 1
        ;;
    esac
    ;;
  *)
    echo "Skipping CUDA Toolkit installation for CUDA ${CUDA_VER}"
    ;;
esac
EOF

# Install gha-tools
RUN wget https://github.com/rapidsai/gha-tools/releases/latest/download/tools.tar.gz -O - \
  | tar -xz -C /usr/local/bin

That's sufficient to reproduce the error locally on my Mac (aarch64), with the latest versions of Ubuntu, Python, and CUDA supported in this repo.

docker buildx build \
    --build-arg SCCACHE_VER=0.7.7 \
    --build-arg GH_CLI_VER=2.54.0 \
    --build-arg CODECOV_VER=0.7.3 \
    --build-arg YQ_VER=4.44.2 \
    --build-arg AWS_CLI_VER=2.17.20 \
    --build-arg CUDA_VER=12.5.1 \
    --build-arg LINUX_VER=ubuntu22.04 \
    --build-arg PYTHON_VER=3.12 \
    --file ci-conda-truncated.Dockerfile \
    --tag delete-me:ci-conda-py3.12 \
    ./context

docker run \
    --rm \
    -it delete-me:ci-conda-py3.12 \
    conda clean --yes --all
# Error while loading conda entry point: conda-libmamba-solver (/lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so))

Checked that library's GLIBCXX symbols:

docker run \
    --rm \
    --env LIB_FILE=/opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so \
    -it delete-me:ci-conda-py3.12 \
    bash -c 'objdump -T ${LIB_FILE} | grep -oP "(?<=GLIBCXX_)([0-9.]+)" | sort -u'

And sure enough, I see GLIBCXX_3.4.32 symbols there:

3.4
3.4.11
3.4.14
3.4.18
3.4.20
3.4.21
3.4.26
3.4.29
3.4.32
3.4.9

So it does look like the root cause of the CI failures here is something like "libmambapy is shipping libraries compiled against a too-new GLIBCXX".
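To make that failure mode concrete, here is a small illustrative Python sketch (not part of the repo; the version lists mimic the objdump output above, and the comparison logic is hypothetical, not conda's or the loader's actual implementation) showing why one libstdc++ satisfies the binding module and the other does not:

```python
# Illustrative sketch: why resolving the *system* libstdc++ instead of conda's
# copy produces "version `GLIBCXX_3.4.32' not found". An ELF symbol version is
# satisfied only if the library exports that exact named version set.
def as_tuple(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def provides(lib_versions, required):
    """True if the library exports the required GLIBCXX symbol version."""
    return any(as_tuple(v) == as_tuple(required) for v in lib_versions)

# /opt/conda/lib/libstdc++.so.6.0.33 (what the RPATH points at)
conda_libstdcxx = ["3.4", "3.4.9", "3.4.29", "3.4.32", "3.4.33"]
# the system libstdc++ that the loader actually picked up
system_libstdcxx = ["3.4", "3.4.9", "3.4.29", "3.4.30"]

required = "3.4.32"  # needed by libmambapy's bindings module
print(provides(conda_libstdcxx, required))   # True
print(provides(system_libstdcxx, required))  # False -> ImportError at load time
```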

In this case, that image got the following versions of mamba things:

docker run \
    --rm \
    -it delete-me:ci-conda-py3.12 \
    bash -c 'conda env export | grep -i mamba'
  - conda-libmamba-solver=24.7.0=pyhd8ed1ab_0
  - libmamba=1.5.9=hee7cc92_0
  - libmambapy=1.5.9=py312hc6280c9_0
  - mamba=1.5.9=py312hd80a4d2_0

Linking one related issue: conda-forge/mamba-feedstock#201.

Next, I'll try to reproduce this with a more minimal example.

@jameslamb jameslamb changed the title from "ci-conda builds failing with GLIBC errors" to "ci-conda builds failing with GLIBCXX errors" Sep 3, 2024
@jameslamb
Member Author

Getting a lot closer!

Short summary

In successfully-building environments, there exists a symlink in /opt/conda/lib from libstdc++.so.6 -> libstdc++.so.6.0.33.

In failing environments, that symlink is missing.

bindings.cpython-312-aarch64-linux-gnu.so has a DT_NEEDED entry specifying libstdc++.so.6. Because of the missing symlink, the loader does not find what it needs in /opt/conda/lib and instead loads the system-installed libstdc++.so.6... which is slightly older, and therefore doesn't have new-enough GLIBCXX symbols.

Details

expand for details (click me)

It looks like on Python 3.12, libmambapy has a DT_NEEDED entry of libstdc++.so.6.

readelf -d /opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so \
|  grep -E 'NEEDED|RPATH'
 0x0000000000000001 (NEEDED)             Shared library: [libmamba.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libfmt.so.10]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-aarch64.so.1]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../../..]

At loading time it's finding the one in /lib

/lib/aarch64-linux-gnu/libstdc++.so.6

Which only has GLIBCXX symbols up to 3.4.30

objdump -T /lib/aarch64-linux-gnu/libstdc++.so.6 \
| grep -oP '(?<=GLIBCXX_)([0-9.]+)' \
| sort -u

INSTEAD OF the one provided by conda and pointed to by that RPATH.

/opt/conda/lib/libstdc++.so.6.0.33

Which has GLIBCXX symbols up to 3.4.33.

objdump -T /opt/conda/lib/libstdc++.so.6.0.33 \
| grep -oP '(?<=GLIBCXX_)([0-9.]+)' \
| sort -u

Maybe conda is missing a symlink from libstdc++.so.6 -> libstdc++.so.6.0.33?

I rebuilt a Python 3.10 image using that "truncated" Dockerfile, and confirmed that I saw the symlink there.

docker buildx build \
    --build-arg SCCACHE_VER=0.7.7 \
    --build-arg GH_CLI_VER=2.54.0 \
    --build-arg CODECOV_VER=0.7.3 \
    --build-arg YQ_VER=4.44.2 \
    --build-arg AWS_CLI_VER=2.17.20 \
    --build-arg CUDA_VER=12.5.1 \
    --build-arg LINUX_VER=ubuntu22.04 \
    --build-arg PYTHON_VER=3.10 \
    --file ci-conda-truncated.Dockerfile \
    --tag delete-me:ci-conda-py3.10 \
    ./context

docker run \
    --rm \
    -it delete-me:ci-conda-py3.10 \
    bash -c 'ls /opt/conda/lib/libstdc++*'
# /opt/conda/lib/libstdc++.so
# /opt/conda/lib/libstdc++.so.6
# /opt/conda/lib/libstdc++.so.6.0.33

docker run \
    --rm \
    -it delete-me:ci-conda-py3.10 \
    bash -c 'stat /opt/conda/lib/libstdc++.so.6'
#  File: /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.33
#  Size: 19              Blocks: 0          IO Block: 4096   symbolic link
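For reference, the manually-create-the-symlinks hack mentioned later in this thread can be sketched like so. This is purely illustrative: it runs against a scratch directory with a stand-in file, where in the real image the directory would be /opt/conda/lib and the target the actual library.

```python
# Sketch of the manual-symlink workaround: recreate the SONAME links that the
# update removed. Demonstrated in a temp dir so it is safe to run anywhere.
import os
import tempfile

libdir = tempfile.mkdtemp()
real = "libstdc++.so.6.0.33"
open(os.path.join(libdir, real), "w").close()  # stand-in for the actual library

for link in ("libstdc++.so.6", "libstdc++.so"):
    path = os.path.join(libdir, link)
    if not os.path.lexists(path):              # only create links that are missing
        os.symlink(real, path)                 # relative target, like conda's own links

print(os.readlink(os.path.join(libdir, "libstdc++.so.6")))  # libstdc++.so.6.0.33
```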

@jameslamb
Member Author

I think it's worth noting that we set up conda by copying the contents of /opt/conda from a condaforge/miniforge3 image.

COPY --from=condaforge/miniforge3:24.3.0-0 --chown=root:conda --chmod=770 /opt/conda /opt/conda

And then update Python and then all other dependencies in the base environment.

conda install -y -n base "python~=${PYTHON_VERSION}.0=*_cpython"
conda update --all -y -n base

The base environment in that upstream condaforge/miniforge3 image uses Python 3.10, so this updates fewer packages there than for the Python 3.11 and 3.12 images. That might explain why the libstdc++ symlinking works differently in Python 3.10 images compared to the others.

@jameslamb
Copy link
Member Author

jameslamb commented Sep 3, 2024

Ahhhh yes this is totally what's happening!!!

For Python 3.11 / 3.12 environments, that conda update --all -y -n base is destroying the libstdc++.so and libstdc++.so.6 symlinks 😬

Starting from the base image, the library and links are there.

docker run --rm \
    -it condaforge/miniforge3:24.3.0-0 \
    bash

ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so -> libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6.0.32

After updating to Python 3.12, a new libstdc++.so.6.0.33 is pulled in, and the links are updated to point to it.

conda install -y -n base "python~=3.12.0=*_cpython"

ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so -> libstdc++.so.6.0.33
# /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.33
# /opt/conda/lib/libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6.0.33
summary of upgrades, downgrades, installs, removals (click me)
The following NEW packages will be INSTALLED:

    frozendict         conda-forge/linux-aarch64::frozendict-2.4.4-py312h396f95a_0 
    libexpat           conda-forge/linux-aarch64::libexpat-2.6.2-h2f0025b_0 
    libgcc             conda-forge/linux-aarch64::libgcc-14.1.0-he277a41_1 
    libstdcxx          conda-forge/linux-aarch64::libstdcxx-14.1.0-h3f4de04_1 
  
  The following packages will be UPDATED:
  
    brotli-python                       1.1.0-py310hbb3657e_1 --> 1.1.0-py312h6f74592_2 
    ca-certificates                       2024.2.2-hcefe29a_0 --> 2024.8.30-hcefe29a_0 
    certifi                             2024.2.2-pyhd8ed1ab_0 --> 2024.8.30-pyhd8ed1ab_0 
    cffi                               1.16.0-py310hce94938_0 --> 1.17.0-py312hac81daf_1 
    conda                              24.3.0-py310h4c7bcd0_0 --> 24.7.1-py312h996f985_0 
    jsonpointer                           2.4-py310h4c7bcd0_3 --> 3.0.0-py312h996f985_1 
    libgcc-ng                               13.2.0-hf8544c7_5 --> 14.1.0-he9431aa_1 
    libgomp                                 13.2.0-hf8544c7_5 --> 14.1.0-he277a41_1 
    menuinst                            2.0.2-py310h4c7bcd0_0 --> 2.1.2-py312h996f985_1 
    openssl                                  3.2.1-h31becfc_1 --> 3.3.1-h86ecc28_3 
    python                         3.10.14-hbbe8eec_0_cpython --> 3.12.3-h43d1f9e_0_cpython 
    python_abi                                   3.10-4_cp310 --> 3.12-5_cp312 
    zstandard                          0.22.0-py310h468e293_0 --> 0.23.0-py312hb698573_1 
    zstd                                     1.5.5-h4c53e97_0 --> 1.5.6-h02f22dd_0 
  
  The following packages will be DOWNGRADED:
  
    libmambapy                          1.5.8-py310h5938bc3_0 --> 1.5.8-py312h1e39527_0 
    mamba                               1.5.8-py310hcbdc16a_0 --> 1.5.8-py312hd80a4d2_0 
    pycosat                             0.6.6-py310hb299538_0 --> 0.6.6-py312hdd3e373_0 
    ruamel.yaml                        0.18.6-py310hb299538_0 --> 0.18.6-py312hdd3e373_0 
    ruamel.yaml.clib                    0.2.8-py310hb299538_0 --> 0.2.8-py312hdd3e373_0

But the conda update --all -y -n base destroys them.

conda update --all -y -n base

ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so.6.0.33
summary of upgrades, downgrades, installs, removals (click me)
The following NEW packages will be INSTALLED:

  h2                 conda-forge/noarch::h2-4.1.0-pyhd8ed1ab_0 
  hpack              conda-forge/noarch::hpack-4.0.0-pyh9f0ad1d_0 
  hyperframe         conda-forge/noarch::hyperframe-6.0.1-pyhd8ed1ab_0 

The following packages will be UPDATED:

  bzip2                                    1.0.8-h31becfc_5 --> 1.0.8-h68df207_7 
  c-ares                                  1.28.1-h31becfc_0 --> 1.33.1-ha64f414_0 
  conda-libmamba-so~                    24.1.0-pyhd8ed1ab_0 --> 24.7.0-pyhd8ed1ab_0 
  conda-package-han~                     2.2.0-pyh38be061_0 --> 2.3.0-pyh7900ff3_0 
  conda-package-str~                     0.9.0-pyhd8ed1ab_0 --> 0.10.0-pyhd8ed1ab_0 
  icu                                       73.2-h787c7f5_0 --> 75.1-hf9b3779_0 
  idna                                     3.7-pyhd8ed1ab_0 --> 3.8-pyhd8ed1ab_0 
  krb5                                    1.21.2-hc419048_0 --> 1.21.3-h50a48e9_0 
  ld_impl_linux-aar~                        2.40-h2d8c526_0 --> 2.40-h9fc2d93_7 
  libarchive                               3.7.2-hd2f85e0_1 --> 3.7.4-h2c0effa_0 
  libcurl                                  8.7.1-h4e8248e_0 --> 8.9.1-hfa30633_0 
  libmamba                                 1.5.8-hea3be6c_0 --> 1.5.9-hee7cc92_0 
  libmambapy                          1.5.8-py312h1e39527_0 --> 1.5.9-py312hc6280c9_0 
  libsolv                                 0.7.28-hd84c7bf_2 --> 0.7.30-h62756fc_0 
  libsqlite                               3.45.2-h194ca79_0 --> 3.46.1-hc4a20ef_0 
  libstdcxx-ng                            13.2.0-h9a76618_5 --> 14.1.0-hf1166c9_1 
  libxml2                                 2.12.6-h3091e33_1 --> 2.12.7-h00a45b3_4 
  libzlib                                 1.2.13-h31becfc_5 --> 1.3.1-h68df207_1 
  lzo                                    2.10-h516909a_1000 --> 2.10-h31becfc_1001 
  mamba                               1.5.8-py312hd80a4d2_0 --> 1.5.9-py312hd80a4d2_0 
  ncurses                           6.4.20240210-h0425590_0 --> 6.5-hcccb83c_1 
  packaging                               24.0-pyhd8ed1ab_0 --> 24.1-pyhd8ed1ab_0 
  pip                                     24.0-pyhd8ed1ab_0 --> 24.2-pyh8b19718_1 
  platformdirs                           4.2.0-pyhd8ed1ab_0 --> 4.2.2-pyhd8ed1ab_0 
  pluggy                                 1.4.0-pyhd8ed1ab_0 --> 1.5.0-pyhd8ed1ab_0 
  requests                              2.31.0-pyhd8ed1ab_0 --> 2.32.3-pyhd8ed1ab_0 
  setuptools                            69.5.1-pyhd8ed1ab_0 --> 73.0.1-pyhd8ed1ab_0 
  tqdm                                  4.66.2-pyhd8ed1ab_0 --> 4.66.5-pyhd8ed1ab_0 
  truststore                             0.8.0-pyhd8ed1ab_0 --> 0.9.2-pyhd8ed1ab_0 
  tzdata                                   2024a-h0c530f3_0 --> 2024a-h8827d51_1 
  urllib3                                2.2.1-pyhd8ed1ab_0 --> 2.2.2-pyhd8ed1ab_1 
  wheel                                 0.43.0-pyhd8ed1ab_1 --> 0.44.0-pyhd8ed1ab_0

@hcho3
Contributor

hcho3 commented Sep 3, 2024

Really appreciate the detailed write-up. Just approved the PR.

@vyasr
Contributor

vyasr commented Sep 4, 2024

This is an awesome investigation James, thanks!

I don't know exactly what is happening, but I can say with reasonable confidence that we are running afoul of conda-forge/ctng-compilers-feedstock#148 introducing incompatibilities with the old packages our images are copying from a version of miniforge prior to those changes. conda-forge/ctng-compilers-feedstock#148 introduced the unsuffixed libstdcxx package in order to allow conda-forge to eventually drop the suffixed version, libstdcxx-ng. When you install Python 3.12 with conda install -y -n base "python~=3.12.0=*_cpython", your conda output shows this:

The following NEW packages will be INSTALLED:
...
    libstdcxx          conda-forge/linux-aarch64::libstdcxx-14.1.0-h3f4de04_1 

But then, when you run the update, we see this:

The following packages will be UPDATED:
  libstdcxx-ng                            13.2.0-h9a76618_5 --> 14.1.0-hf1166c9_1 

Note that in the second case the package has the -ng, and that it somehow is still on version 13.2 despite libstdcxx version 14.1 being installed in the previous step. If we look at the recipe for these, we see that with current packages this should be impossible because libstdcxx has a strict run constraint on libstdcxx-ng to keep these in sync.

Here's my best guess for what is happening, although it has some pretty clear gaps that need to be filled in.

  • In the initial copy from miniforge we are copying over a version of libstdcxx-ng from prior to this PR merging.
  • All the packages that need libstdcxx that are installed as part of the install of Python 3.12 have had their recipes updated to point to the new libstdcxx package instead of libstdcxx-ng, so the new package is installed cleanly. Somehow because the old libstdcxx-ng package comes from before the updates introducing libstdcxx, it is not being properly updated when libstdcxx is installed despite the run_constrained that I linked above. Perhaps there is some repodata patching that could be done to support this.
  • The blanket update triggers an update of libstdcxx-ng, which now pulls in a version of the package from after the suffix-removing PR was merged. That should, in theory, have also triggered installation of a compatible version of libstdcxx, but conda sees that you already have one. Therefore, the removal of the old libstdcxx-ng triggers deletion of files that were previously owned by that package, but the installation of the new version does not replace them because those files are now owned by libstdcxx.
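If that guess is right, the mechanism can be restated with a deliberately tiny toy model (hypothetical names and logic, purely to illustrate the bullet points above; this is not how conda's transaction code actually works):

```python
# Toy model of the ownership hypothesis: the old libstdcxx-ng owns the symlink,
# so removing it deletes the link, and the new libstdcxx-ng does not restore it
# because that file is no longer part of its payload (libstdcxx ships the real
# library, and no package re-ships the link).
files = {
    "libstdc++.so.6":      "libstdcxx-ng-13.2",  # symlink owned by the old package
    "libstdc++.so.6.0.33": "libstdcxx-14.1",     # real file owned by the new package
}

def remove_package(files, pkg):
    # conda removes every file the outgoing package owns
    return {f: owner for f, owner in files.items() if owner != pkg}

def install_package(files, pkg, payload):
    # install only the files the incoming package actually ships
    for f in payload:
        files.setdefault(f, pkg)
    return files

files = remove_package(files, "libstdcxx-ng-13.2")
files = install_package(files, "libstdcxx-ng-14.1", payload=[])  # empty payload

print("libstdc++.so.6" in files)  # False -> the missing symlink
```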

If I am correct, then what can we do to fix this? I would suggest updating libstdcxx-ng immediately after copying over the files from miniforge, and before we attempt to install anything else. For good measure, you probably want to also update libgcc-ng and libgfortran-ng. You could also simply run a conda update --all, which I would expect to work as well.

@msarahan

msarahan commented Sep 4, 2024

I agree with Vyas' analysis. One possible alternative solution is to use micromamba to populate /opt/conda instead of copying from miniforge. This has the advantage of installing the desired python version directly. Here are some docs: https://micromamba-docker.readthedocs.io/en/latest/quick_start.html

@vyasr what do you think about this alternative?

@jameslamb
Member Author

Thank you all so much!

I pushed @vyasr's recommendation of doing an earlier conda update --all to #186, and that looks to be working well! I'm much happier with that... way more sustainable than my manually-create-the-symlinks hack.

what do you think about this alternative?

We do also need the conda executable to make it in there somewhere, as there are plenty of CI scripts around RAPIDS using conda mambabuild and other conda commands: https://github.com/search?q=org%3Arapidsai+%22rapids-conda-retry%22+AND+NOT+is%3Aarchived&type=code

It looks to me from those docs like using micromamba might not get you the conda executable, and downstream scripts would have to be changed to use e.g. micromamba install, micromamba update, etc.? If I'm right about all that, I wouldn't support switching to micromamba just for this case.

@msarahan

msarahan commented Sep 4, 2024

If you want conda to be available in the environment afterwards, you just include it as one of the packages to install.

The idea is to use micromamba as just a provisioner for the environment, and it doesn't stick around. In other words, instead of

COPY --from=miniforge3 /opt/conda/ /opt/conda/
RUN conda update --all

you can have:

FROM micromamba AS conda_env
<create the env at /opt/conda>
FROM <some lightweight base image>
COPY --from=conda_env /opt/conda /opt/conda
<no update necessary>

It's pretty similar either way, and it looks to me like neither is obviously advantageous over the other. The conda history will be simpler with the latter, which might reduce the chance of weird issues.

@vyasr
Contributor

vyasr commented Sep 4, 2024

I do think we should switch over to micromamba, see rapidsai/build-planning#50 🙂 but I would suggest we do that as a follow-up since that will require more rigorous testing to get right. It would be good to update shared workflows to support using custom images so that we could change the images to use micromamba and then run test workflows in a couple of repos to be sure that everything works as expected.

rapids-bot bot pushed a commit to rapidsai/docker that referenced this issue Sep 6, 2024
Nightly builds of `rapidsai/raft-ann-bench` failed like this:

> ImportError: /lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/conda/lib/python3.11/site-packages/libmambapy/bindings.cpython-311-aarch64-linux-gnu.so)

([build link](https://github.com/rapidsai/docker/actions/runs/10739898324/job/29789780257))

I suspect that's because those images use the same pattern for initializing a conda environment that led to the issues described in rapidsai/ci-imgs#185.

This proposes the same fix that we applied in `ci-imgs` (rapidsai/ci-imgs#186).

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Mike Sarahan (https://github.com/msarahan)

URL: #710