
Integrate Artifact's tables #5

Merged

merged 25 commits into main from update-data-source-harversting on Aug 25, 2023

Conversation

Contributor

@viniciusdc viniciusdc commented Jul 7, 2023

This PR adds the following:

  • Datasette docker-compose for a web-browser interface to the DB, which also allows API interfacing with the database;
  • Artifacts, ArtifactsFilePaths, and RelationsMapFilePaths schemas;
  • An artifact populate function responsible for mapping the corresponding artifact data, and its respective associated files, into the DB;
  • Unit test coverage;
  • Download and upload of the compressed database as a CI/CD artifact;
  • A new harvesting module to generate the artifact information based on the Anaconda mirror, decoupling from libcfgraph.
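The three schemas listed above could look roughly like this minimal sketch using the stdlib sqlite3 module. The column names and types are illustrative assumptions, not the PR's exact schema:

```python
# Sketch of the three tables described above. Column names/types are
# assumptions for illustration, not the actual schema from this PR.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,          -- e.g. "numpy-1.25.0-py311_0.conda"
    platform TEXT NOT NULL       -- e.g. "linux-64"
);
CREATE TABLE artifacts_file_paths (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE    -- a file shipped by one or more artifacts
);
CREATE TABLE relations_map_file_paths (
    artifact_id INTEGER REFERENCES artifacts(id),
    file_path_id INTEGER REFERENCES artifacts_file_paths(id)
);
""")
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```

The many-to-many relations table is what lets a single file path (shared by many package builds) be stored once and linked to every artifact that contains it.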

What's missing (from the scoped milestone):

Good to have, though out of scope:

Important! Fix issue with artifact download, where the DB file is not found during download: Error: Unable to find any artifacts for the associated workflow. This is extremely important, as the downloaded DB is iteratively updated across the pipeline and uploaded at the end, completing a cycle. https://github.com/viniciusdc/czi-conda-forge-db/blob/cba8c1a33c1848bbdd243d37d428935c8eaedb8f/.github/workflows/build_db.yaml#L55-L60
If no solution seems available, we can just push the latest version of the DB to the repo instead of relying on artifacts for storage, though we would not comply with GitHub's file size limit policy (compressed, ~65 MB).
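One possible workaround, sketched below under assumptions (the artifact name "database" and the use of the third-party dawidd6/action-download-artifact action are illustrative, not what build_db.yaml currently does): the stock actions/download-artifact only sees artifacts from the current workflow run, so a cross-run download needs a different action, and the step must tolerate the very first run, when no artifact exists yet.

```yaml
# Sketch only: fetch the DB artifact from a previous successful run of
# build_db.yaml, without failing the job when no artifact exists yet.
- name: Download previous database
  uses: dawidd6/action-download-artifact@v2
  continue-on-error: true          # first run has nothing to download
  with:
    workflow: build_db.yaml
    name: database                 # assumed artifact name
    if_no_artifact_found: warn
```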

@viniciusdc viniciusdc changed the title WIP - Integrate Artifact's tables Integrate Artifact's tables Jul 28, 2023
small test

Fix unitests

Fix unitests

test default db location

Fix unitests

Add custom path location for db

Update path to db for build workflow

use URL class for handling engine connection

use URL class for handling engine connection

attempt direct import for URL

fix condition for workflow run when missing artifact

test artifact download and decompression

update download/upload artifacts actions

Update path to db for build workflow
@viniciusdc viniciusdc force-pushed the update-data-source-harversting branch from 9fb653d to cba8c1a on July 29, 2023 02:16
Contributor

Simplified logging into the cfdb.log module. All we need is an initialization function that is called from the CLI layer (not the library, which is only concerned with emitting messages).

# Create a file handler in non-interactive sessions
# Set the formatter for the file handler
if logging_dir is None:
    logging_dir = os.environ.get('CFDB_LOGGING_DIR', Path.cwd() / '.logs')
Contributor

This default value allows us to mock the default from conftest.py.

logger = logging.getLogger('cfdb')
logger.setLevel(logging.DEBUG)

if logger.handlers:
Contributor

Return early if our logger already has handlers; that means someone else has already configured logging and we shouldn't interfere.
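Putting the two snippets above together, the initialization function might look roughly like this. This is a sketch under assumptions: the function name setup_logging, the log file name, and the formatter string are illustrative, not the PR's exact code.

```python
# Sketch of a CLI-layer logging initializer combining the snippets above:
# an env-var-overridable default directory (mockable from conftest.py) and
# an early return when handlers already exist. Details are illustrative.
import logging
import os
from pathlib import Path

def setup_logging(logging_dir=None):
    """Called from the CLI layer only; the library just emits messages."""
    logger = logging.getLogger("cfdb")
    logger.setLevel(logging.DEBUG)

    if logger.handlers:
        # Someone else already configured logging; don't interfere.
        return logger

    # Default is overridable so tests can mock it via the environment.
    if logging_dir is None:
        logging_dir = os.environ.get("CFDB_LOGGING_DIR", Path.cwd() / ".logs")

    logging_dir = Path(logging_dir)
    logging_dir.mkdir(parents=True, exist_ok=True)

    handler = logging.FileHandler(logging_dir / "cfdb.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Because logging.getLogger("cfdb") always returns the same logger instance, calling this twice is safe: the second call hits the early return and does not attach a duplicate handler.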

@@ -0,0 +1,134 @@
### Code extracted from https://github.com/regro/libcflib/blob/master/libcflib/harvester.py
Contributor

@jaimergp jaimergp Aug 16, 2023

When vendoring (and/or modifying) code, we need to bring the license file too, along with the copyright notice. If possible, though, I'd prefer we import and wrap/patch as needed; otherwise it's one more piece of code we are maintaining.

@@ -119,17 +128,20 @@ def traverse_files(path: Path, output_dir: Path = None) -> List[Path]:

stored_files = []

with concurrent.futures.ThreadPoolExecutor() as executor:
with ThreadPoolExecutor() as executor:
Contributor

Threads are tricky to debug. I invested a few hours hunting down an error because a package named something.json was being globbed as a file when it was actually a directory, which broke the population code. Always provide an easy-to-debug alternative in these cases.

@@ -93,7 +37,7 @@ def update_feedstock_outputs(
To update the feedstock outputs, use the following command:
$ cfdb update_feedstock_outputs --path /path/to/feedstock-outputs/outputs
"""
db_handler = CFDBHandler("sqlite:///cf-database.db")
db_handler = CFDBHandler()
db_handler.update_feedstock_outputs(path)
Contributor

update-feedstock-outputs was able to ingest libcfgraph/artifacts, populating tables for hours without complaint... but it shouldn't have. Maybe we need some sanitization / checking of the inputs.
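A cheap sanity check along these lines would fail fast instead of ingesting the wrong directory for hours. This is a sketch: validate_outputs_dir is a hypothetical helper, not code from this PR, and the "one JSON file per feedstock" layout is the assumption it checks.

```python
# Sketch: reject an input directory that doesn't look like feedstock-outputs
# before spending hours populating tables. `validate_outputs_dir` is a
# hypothetical helper, not part of this PR.
from pathlib import Path

def validate_outputs_dir(path):
    path = Path(path)
    if not path.is_dir():
        raise ValueError(f"{path} is not a directory")
    # feedstock-outputs stores one small JSON file per feedstock; also
    # filter out *directories* named like JSON files (see the thread bug).
    json_files = [p for p in sorted(path.rglob("*.json")) if p.is_file()]
    if not json_files:
        raise ValueError(f"{path} contains no *.json files; wrong input?")
    return json_files
```

Note the is_file() filter: it guards against exactly the something.json-is-a-directory case mentioned in the review comment above.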

Contributor

@jaimergp jaimergp left a comment

Thanks Vini. I changed a few things and left a couple of comments, but this is good enough to go for now. We can prototype APIs with this setup, for example.

A few general notes for the future:

  • Performance is not great for artifacts. It'll take many hours (~40 h) to populate the database from scratch. There might be something inherently slow in the population approach (e.g. the JSON payloads, the CSV records, etc.). Hashing is slow at this scale. I guess we should have a "no checks" mode that just dumps everything blindly and assumes no duplicates, or we keep a "seen" set of hashes already processed.
  • What's the story for "once reasonably up-to-date, how do I keep the database up-to-date?" We need better docs for that.
  • I know I am saying the opposite of what I once said, but we are going to need to add conda-forge/ to the artifact identifiers, or come up with a convention to signify they come from conda-forge (and not, let's say, bioconda).
  • It would be useful to have an "initialize database from scratch using a copy of libcfgraph and feedstock-outputs" command, but right now it involves several commands, and it's not clear in which order they should run or whether the order matters. That's maybe just a documentation issue.
  • It would be interesting to see a harvester of artifacts that feeds on the OCI mirror metadata (via conda-forge-metadata?) instead of the (possibly heavy) Anaconda.org packages.
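The "seen set of hashes" idea from the first point could be as simple as the sketch below. The names (maybe_process, process_artifact) are illustrative, and process_artifact stands in for the real, expensive population code:

```python
# Sketch of the "seen set" idea: skip artifacts whose content hash was
# already processed, so re-runs avoid redundant (slow) DB work.
# `process_artifact` is a stand-in for the real population code.
import hashlib

seen = set()

def maybe_process(payload, process_artifact):
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        return False  # duplicate; skip the expensive work
    seen.add(digest)
    process_artifact(payload)
    return True

processed = []
maybe_process(b"pkg-1.0", processed.append)  # new: processed
maybe_process(b"pkg-1.0", processed.append)  # duplicate: skipped
```

The trade-off versus the "no checks" mode is one SHA-256 per payload plus the memory for the digest set, in exchange for idempotent re-runs.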

Let me know what you think. I'll probably merge now and see what happens with the self-propagating artifact. We will need issues to cover the pending points above too.

@jaimergp jaimergp merged commit ac70a4c into main Aug 25, 2023
2 checks passed
@jaimergp jaimergp deleted the update-data-source-harversting branch August 25, 2023 07:50