Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove old full-index-in-memory loading and conda.plan-related solver handling #5154

Open
mbargull opened this issue Jan 26, 2024 · 5 comments
Labels
source::contributor created by a frequent contributor

Comments

@mbargull
Copy link
Member

mbargull commented Jan 26, 2024

In gh-5074 we removed the conda.models.dist.Dist usage from conda-build.

This included copying over the install_actions/execute_actions/display_actions functions from conda.plan
(which only continued to exist to serve these deprecated functions to conda-build).

These functions, which are as of gh-5074 now part of conda_build.environ, are reduced to have only the code actually used in conda-build.
But their behavior still remained the same, i.e., they still use the same old solver invocations that loads the whole repodata index into memory, not benefiting from the reduced resource usage via conda/conda#12050 .

To be able to work with the reduced/only partially loaded index, we need to deprecate/remove some functionality that relies on iterating over the full index.
gh-5152 marks parts related to that as deprecated.

The major part to look out for when we'll replace the current install_actions code is that the index has to stay in a consistent state throughout a recipe build.


Closes #4961

@github-project-automation github-project-automation bot moved this to 🆕 New in 🧭 Planning Jan 26, 2024
@mbargull mbargull added the source::contributor created by a frequent contributor label Jan 26, 2024
@mbargull
Copy link
Member Author

I'll look into this during the next couple of weeks.
In preparation of it, to get a sense on what memory/CPU consumption is currently happening, I stumbled upon the fact that peak memory usage (with libmamba solver) is actually dominated by a bug in conda>=4.12: conda/conda#13541 .
So, when doing comparisons of resources used before/after changes in conda-build, one should factor out the surplus coming from that bug (patching conda when testing -- or, if lucky, maybe we get a patch release in the meantime).

@jaimergp
Copy link
Contributor

jaimergp commented Apr 11, 2024

xref #4961. Copied here:


In #4431 I was investigating some differences in timings between CONDA_SOLVER=libmamba conda-build and conda mambabuild, which should be similar. However, conda-build spends a few minutes before the solver kicks in, while mambabuild does not.

My research led me to finding out that those extra minutes are spent creating the build index, which is the aggregation of all the source channels (e.g. defaults, conda-forge) and their platforms (noarch is collapsed into e.g. linux-64 🤷), plus the local channels (the CONDA_BLD_PATH cache and/or the chosen output folder).

The slow part is not the repodata fetching, but creating the million+ PackageRecord objects greedily, instead of letting conda do it lazily as needed (introduced in conda/conda#12050). This actually happens in conda, in two places, but conda-build is the sole consumer of those endpoints AFAIK.

  • In index.py, in the exports module, where an identity dict[PackageRecord, PackageRecord] is built by aggregating all the SubdirData.iter_records() instances.
  • In exports.py, where that map is processed again just to convert the keys into Dist objects.

It feels like we could do this better, as @dholth was saying #4431 (comment). The interface with the Solver could also be better; it currently overwrites Solver._index with this greedily built object, and conda-libmamba-solver won't even use that 😬

@jaimergp
Copy link
Contributor

jaimergp commented Apr 11, 2024

@zklaus, summarizing our conversation earlier today:

To note:

  • conda-libmamba-solver detects whether Solver._index or not has been messed with, and extracts the local channels from the package records there defined. This is how we estimate which ones need to be reloaded on each call.
  • conda solver=classic uses the whole index for the solve, I think. Not sure if they reduce it later or not.

Approaches:

  • The easy way out: detect context.solver == libmamba and only populate the local part of the _index (no need to build PackageRecords for the whole thing) so we can still get the local channels.
  • Same but a bit more elegant, add a class attribute to Solver to announce whether a full index is needed in memory or not.
  • Better even: change the solver API to have a standardized, companion Index API that handles the interfacing better (with and without conda-build) instead of relying on Solver._index manipulation. This API should mention which channels need to be reloaded after an artifact is built and how.

@dholth
Copy link
Contributor

dholth commented Apr 11, 2024

It would also be a great idea to make PackageRecord instantiate more quickly.

@jaimergp
Copy link
Contributor

It would also be a great idea to make PackageRecord instantiate more quickly.

I'll create an issue for this if it doesn't exist already.

Besides that, I think this is ready to close after #5488 and conda/conda#13880, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source::contributor created by a frequent contributor
Projects
Status: 🆕 New
Development

Successfully merging a pull request may close this issue.

3 participants