Files-to-artifacts database / API / mapping #54

Closed
2 tasks done
jaimergp opened this issue Feb 22, 2024 · 4 comments

Comments

jaimergp commented Feb 22, 2024


Provide a way for users to find which package(s) provide a certain file (e.g. a header, or a library, or an executable), similar to what portals like pkgs.org do.

We do have the info in the database designed in https://github.com/Quansight-Labs/conda-forge-db, but we need to serve it somewhere, preferably serverless or close-to-zero maintenance (e.g. one-click deployment). This is tricky because populating the database from scratch has a non-negligible overhead.

Tasks

  1. area: data 🔢 area: devops 🏗 funding: czi mission: infra 🛠 team: quansight-labs type: task
    zklaus
  2. area: data 🔢 area: devops 🏗 funding: czi mission: infra 🛠 team: quansight-labs type: task
    jaimergp

jaimergp commented Mar 5, 2024

We talked with Matt last week and we may be able to unblock this. The main issue is the deployment and maintenance of infrastructure. We have several avenues to explore:


jaimergp commented Mar 5, 2024

@zklaus shared some progress about the git-db prototype in today's mgmt call. Can you add some summary here? 🙏

Also some numbers to give an idea of the scale we are dealing with:

  • 1,602,023 artifacts
  • 18,390,176 unique paths
  • 618,908,726 path-to-artifact relationships
  • A naive dump into a json-path-to-data table in SQLite takes 61 GB uncompressed, down to 2.1 GB with zstd compression.
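For a sense of why the relationship count dominates: dividing the numbers above, each artifact ships roughly 386 paths on average and each unique path appears in about 34 artifacts, so storing every path string once and linking by integer ID avoids repeating the same text hundreds of millions of times. The following is only a minimal sketch of such a normalized layout using Python's built-in `sqlite3`. The table names (`artifacts`, `paths`, `path_to_artifact`), the helper functions and the example artifact names are all hypothetical and are not taken from conda-forge-db.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical normalized schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artifacts (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE paths (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        -- One row per (path, artifact) relationship, stored as two integers.
        CREATE TABLE path_to_artifact (
            path_id INTEGER NOT NULL REFERENCES paths(id),
            artifact_id INTEGER NOT NULL REFERENCES artifacts(id),
            PRIMARY KEY (path_id, artifact_id)
        ) WITHOUT ROWID;
        """
    )
    return conn


def add_artifact(conn: sqlite3.Connection, name: str, paths: list[str]) -> None:
    """Register an artifact and every path it contains."""
    conn.execute("INSERT OR IGNORE INTO artifacts (name) VALUES (?)", (name,))
    artifact_id = conn.execute(
        "SELECT id FROM artifacts WHERE name = ?", (name,)
    ).fetchone()[0]
    for path in paths:
        # Each distinct path string is stored only once.
        conn.execute("INSERT OR IGNORE INTO paths (path) VALUES (?)", (path,))
        path_id = conn.execute(
            "SELECT id FROM paths WHERE path = ?", (path,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO path_to_artifact (path_id, artifact_id) VALUES (?, ?)",
            (path_id, artifact_id),
        )


def find_artifacts(conn: sqlite3.Connection, path: str) -> list[str]:
    """Return the names of all artifacts that provide the given path."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM path_to_artifact AS r
        JOIN paths AS p ON p.id = r.path_id
        JOIN artifacts AS a ON a.id = r.artifact_id
        WHERE p.path = ?
        ORDER BY a.name
        """,
        (path,),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
# Made-up artifact names and paths, for illustration only.
add_artifact(conn, "demo-pkg-1.0-h0_0", ["bin/demo", "include/demo.h"])
add_artifact(conn, "demo-pkg-1.1-h0_0", ["bin/demo", "lib/libdemo.so"])
print(find_artifacts(conn, "bin/demo"))
```

Whether this layout actually beats the compressed naive dump at the scale above is untested here; it only illustrates the shape of the lookup.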


zklaus commented Mar 6, 2024

The main idea is to store the mapping in a bare git repository. By using libgit2 via its Python binding pygit2, we avoid the need to create a huge tree on the filesystem. I have created a prototype at https://github.com/zklaus/cfgraphman which is able to add individual artifacts from their json info to the Git odb. It remains to be seen how this scales; I will investigate that over this week and next, at most.
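To illustrate why a Git object database can be attractive for this kind of data, here is a small sketch of Git's content addressing for blobs (with the default SHA-1 object format, the ID is the hash of a `blob <size>\0` header followed by the content). It deliberately uses only `hashlib` rather than pygit2, and it is not the cfgraphman code: the `ToyObjectStore` class, the `add_artifact` helper and the `files` key of the info dictionary are hypothetical stand-ins.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the ID Git assigns to a blob (SHA-1 object format)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class ToyObjectStore:
    """In-memory stand-in for an object database keyed by content hash."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        oid = git_blob_id(data)
        # Identical content maps to the same ID, so it is stored only once.
        self.objects.setdefault(oid, data)
        return oid


def add_artifact(store: ToyObjectStore, info: dict) -> str:
    """Store an artifact's sorted file listing as a single blob."""
    listing = "\n".join(sorted(info["files"])).encode("utf-8")
    return store.add(listing)


store = ToyObjectStore()
# Two hypothetical builds with the same file listing share one object.
first = add_artifact(store, {"files": ["bin/demo", "include/demo.h"]})
second = add_artifact(store, {"files": ["include/demo.h", "bin/demo"]})
print(first == second, len(store.objects))
```

This says nothing about how the real prototype lays out trees and blobs; it only shows the deduplication property that any content-addressed store provides.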

@jaimergp jaimergp moved this from ✋ On hold to 🏗 In progress in czi-conda-forge 📦 Mar 6, 2024
@jaimergp jaimergp modified the milestones: 18 months, 24 months Apr 30, 2024
@jaimergp

https://github.com/jaimergp/conda-forge-paths is now ready as a self-updating sqlite releaser, which is then queried by a VM in the GPU CI server. This VM has a systemd config and a crontab that downloads the latest sqlite dump every Tuesday and restarts the datasette instance, and voilà. A bit barebones but I think it will work.
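For anyone wanting to query such an instance programmatically, Datasette exposes a JSON endpoint per database that accepts a `sql` parameter, named SQL parameters passed as extra query-string keys, and `_shape=array` for a plain list of rows. The sketch below only builds such a URL. The host (`example.invalid`), the database name (`paths`) and the table and column names in the SQL are placeholders, since the real ones are not stated in this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, database: str, path: str) -> str:
    """Build a Datasette JSON query URL for a path lookup.

    The SQL below assumes a hypothetical table `files` with columns
    `path` and `artifact`; adjust it to the actual schema.
    """
    sql = "select artifact from files where path = :path"
    query = urlencode({"sql": sql, "path": path, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{query}"


url = build_query_url("https://example.invalid", "paths", "bin/python")
print(url)
# Round-trip check: the path survives URL encoding unchanged.
print(parse_qs(urlsplit(url).query)["path"])
```

Passing the path as a named parameter instead of interpolating it into the SQL string keeps quoting issues out of the query.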

https://conda-metadata-app.streamlit.app/Search_by_file_path has the UI-friendly prototype :)

@github-project-automation github-project-automation bot moved this from 🏗 In progress to 💪🏾 Done in czi-conda-forge 📦 Jun 12, 2024
@jaimergp jaimergp moved this from 💪🏾 Done to ✅ Done all-time in czi-conda-forge 📦 Jul 18, 2024