
Integrate Artifact's tables #5

Merged

merged 25 commits into main from update-data-source-harversting on Aug 25, 2023

Conversation

Contributor

@viniciusdc viniciusdc commented Jul 7, 2023

This PR adds the following:

  • Datasette docker-compose for a web-browser interface to the DB, which also allows API interfacing with the database;
  • Artifacts, ArtifactsFilePaths, and RelationsMapFilePaths schemas;
  • An artifact populate function responsible for mapping the corresponding artifact data, and its respective associated files, into the DB;
  • Unit test coverage;
  • Download and upload of the compressed database as a CI/CD artifact;
  • A new harvesting module to generate the artifact information based on the Anaconda mirror, decoupling from libcfgraph.
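The three schemas listed above could look roughly like this minimal sketch using the stdlib sqlite3 module. The column names and types are illustrative assumptions, not the PR's exact schema:

```python
# Sketch of the three tables described above. Column names/types are
# assumptions for illustration, not the actual schema from this PR.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,          -- e.g. "numpy-1.25.0-py311_0.conda"
    platform TEXT NOT NULL       -- e.g. "linux-64"
);
CREATE TABLE artifacts_file_paths (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE    -- a file shipped by one or more artifacts
);
CREATE TABLE relations_map_file_paths (
    artifact_id INTEGER REFERENCES artifacts(id),
    file_path_id INTEGER REFERENCES artifacts_file_paths(id)
);
""")
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```

The many-to-many relations table is what lets a single file path (shared by many package builds) be stored once and linked to every artifact that contains it.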

What's missing (from the scoped milestone):

Good to have, though out of scope:

Important! Fix issue with artifact download, where the DB file is not found during download: Error: Unable to find any artifacts for the associated workflow. This is extremely important, as the downloaded DB is iteratively updated across the pipeline and uploaded at the end, completing a cycle. https://github.com/viniciusdc/czi-conda-forge-db/blob/cba8c1a33c1848bbdd243d37d428935c8eaedb8f/.github/workflows/build_db.yaml#L55-L60
If no solution seems available, we can just push the latest version of the DB to the repo instead of relying on artifacts for storage, though we would not comply with GitHub's file size limit policy (compressed, ~65 MB).
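One possible workaround, sketched below under assumptions (the artifact name "database" and the use of the third-party dawidd6/action-download-artifact action are illustrative, not what build_db.yaml currently does): the stock actions/download-artifact only sees artifacts from the current workflow run, so a cross-run download needs a different action, and the step must tolerate the very first run, when no artifact exists yet.

```yaml
# Sketch only: fetch the DB artifact from a previous successful run of
# build_db.yaml, without failing the job when no artifact exists yet.
- name: Download previous database
  uses: dawidd6/action-download-artifact@v2
  continue-on-error: true          # first run has nothing to download
  with:
    workflow: build_db.yaml
    name: database                 # assumed artifact name
    if_no_artifact_found: warn
```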

@viniciusdc viniciusdc changed the title WIP - Integrate Artifact's tables Integrate Artifact's tables Jul 28, 2023
small test

Fix unitests

Fix unitests

test default db location

Fix unitests

Add custom path location for db

Update path to db for build workflow

use URL class for handling engine connection

use URL class for handling engine connection

attempt direct import for URL

fix condition for workflow run when missing artifact

test artifact download and decompression

update download/upload artifacts actions

Update path to db for build workflow
@viniciusdc viniciusdc force-pushed the update-data-source-harversting branch from 9fb653d to cba8c1a on July 29, 2023 02:16
Contributor

Simplified logging into the cfdb.log module. All we need is an initialization function that is called from the CLI layer (not the library, which is only concerned with emitting messages).

# Create a file handler in non-interactive sessions
# Set the formatter for the file handler
if logging_dir is None:
    logging_dir = os.environ.get('CFDB_LOGGING_DIR', Path.cwd() / '.logs')
Contributor

This default value allows us to mock the default from conftest.py.

logger = logging.getLogger('cfdb')
logger.setLevel(logging.DEBUG)

if logger.handlers:
Contributor

Return early if our logger already has handlers; that means someone else has already configured logging and we shouldn't interfere.
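Putting the two snippets above together, the initialization function might look roughly like this. This is a sketch under assumptions: the function name setup_logging, the log file name, and the formatter string are illustrative, not the PR's exact code.

```python
# Sketch of a CLI-layer logging initializer combining the snippets above:
# an env-var-overridable default directory (mockable from conftest.py) and
# an early return when handlers already exist. Details are illustrative.
import logging
import os
from pathlib import Path

def setup_logging(logging_dir=None):
    """Called from the CLI layer only; the library just emits messages."""
    logger = logging.getLogger("cfdb")
    logger.setLevel(logging.DEBUG)

    if logger.handlers:
        # Someone else already configured logging; don't interfere.
        return logger

    # Default is overridable so tests can mock it via the environment.
    if logging_dir is None:
        logging_dir = os.environ.get("CFDB_LOGGING_DIR", Path.cwd() / ".logs")

    logging_dir = Path(logging_dir)
    logging_dir.mkdir(parents=True, exist_ok=True)

    handler = logging.FileHandler(logging_dir / "cfdb.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Because logging.getLogger("cfdb") always returns the same logger instance, calling this twice is safe: the second call hits the early return and does not attach a duplicate handler.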

@@ -0,0 +1,134 @@
### Code extracted from https://github.com/regro/libcflib/blob/master/libcflib/harvester.py
Contributor

@jaimergp jaimergp Aug 16, 2023

When vendoring (and/or modifying) code, we need to bring the license file too, along with the copyright notice. If possible, though, I'd prefer we import and wrap/patch as needed; otherwise it's one more piece of code we are maintaining.

@@ -119,17 +128,20 @@ def traverse_files(path: Path, output_dir: Path = None) -> List[Path]:

stored_files = []

with concurrent.futures.ThreadPoolExecutor() as executor:
with ThreadPoolExecutor() as executor:
Contributor

Threads are tricky to debug. I invested a few hours hunting down an error because a package named something.json was being globbed as a file when it was actually a directory, which broke the population code. Always provide an easy-to-debug alternative in these cases.

@@ -93,7 +37,7 @@ def update_feedstock_outputs(
To update the feedstock outputs, use the following command:
$ cfdb update_feedstock_outputs --path /path/to/feedstock-outputs/outputs
"""
db_handler = CFDBHandler("sqlite:///cf-database.db")
db_handler = CFDBHandler()
db_handler.update_feedstock_outputs(path)
Contributor

update-feedstock-outputs was able to ingest libcfgraph/artifacts, populating tables for hours without complaint... but it shouldn't have. Maybe we need some sanitization / checking of the inputs.
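A cheap sanity check along these lines would fail fast instead of ingesting the wrong directory for hours. This is a sketch: validate_outputs_dir is a hypothetical helper, not code from this PR, and the "one JSON file per feedstock" layout is the assumption it checks.

```python
# Sketch: reject an input directory that doesn't look like feedstock-outputs
# before spending hours populating tables. `validate_outputs_dir` is a
# hypothetical helper, not part of this PR.
from pathlib import Path

def validate_outputs_dir(path):
    path = Path(path)
    if not path.is_dir():
        raise ValueError(f"{path} is not a directory")
    # feedstock-outputs stores one small JSON file per feedstock; also
    # filter out *directories* named like JSON files (see the thread bug).
    json_files = [p for p in sorted(path.rglob("*.json")) if p.is_file()]
    if not json_files:
        raise ValueError(f"{path} contains no *.json files; wrong input?")
    return json_files
```

Note the is_file() filter: it guards against exactly the something.json-is-a-directory case mentioned in the review comment above.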

Contributor

@jaimergp jaimergp left a comment

Thanks Vini. I changed a few things and left a couple of comments, but this is good enough to go for now. We can prototype APIs with this setup, for example.

A few general notes for the future:

  • Performance is not great for artifacts. It'll take many hours (~40 h) to populate the database from scratch. There might be something inherently slow in the population approach (e.g. the JSON payloads, the CSV records, etc.). Hashing is slow at this scale. I guess we should have a "no checks" mode that just dumps everything blindly and assumes no duplicates, or we keep a "seen" set of hashes already processed.
  • What's the story for "once reasonably up-to-date, how do I keep the database up-to-date?" We need better docs for that.
  • I know I am saying the opposite of what I once said, but we are going to need to add conda-forge/ to the artifact identifiers, or come up with a convention to signify they come from conda-forge (and not, let's say, bioconda).
  • It would be useful to have an "initialize database from scratch using a copy of libcfgraph and feedstock-outputs" command, but right now it involves several commands, and it's not clear in which order they should run or whether the order matters. That's maybe just a documentation issue.
  • It would be interesting to see a harvester of artifacts that feeds on the OCI mirror metadata (via conda-forge-metadata?) instead of the (possibly heavy) Anaconda.org packages.
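The "seen set of hashes" idea from the first point could be as simple as the sketch below. The names (maybe_process, process_artifact) are illustrative, and process_artifact stands in for the real, expensive population code:

```python
# Sketch of the "seen set" idea: skip artifacts whose content hash was
# already processed, so re-runs avoid redundant (slow) DB work.
# `process_artifact` is a stand-in for the real population code.
import hashlib

seen = set()

def maybe_process(payload, process_artifact):
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        return False  # duplicate; skip the expensive work
    seen.add(digest)
    process_artifact(payload)
    return True

processed = []
maybe_process(b"pkg-1.0", processed.append)  # new: processed
maybe_process(b"pkg-1.0", processed.append)  # duplicate: skipped
```

The trade-off versus the "no checks" mode is one SHA-256 per payload plus the memory for the digest set, in exchange for idempotent re-runs.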

Let me know what you think. I'll probably merge now and see what happens with the self-propagating artifact. We will need issues to cover the pending points above too.

@jaimergp jaimergp merged commit ac70a4c into main Aug 25, 2023
2 checks passed
@jaimergp jaimergp deleted the update-data-source-harversting branch August 25, 2023 07:50