Integrate Artifact's tables #5
Conversation
Commits 9fb653d to cba8c1a:

- small test
- Fix unitests
- Fix unitests
- test default db location
- Fix unitests
- Add custom path location for db
- Update path to db for build workflow
- use URL class for handling engine connection
- use URL class for handling engine connection
- attempt direct import for URL
- fix condition for workflow run when missing artifact
- test artifact download and decompression
- update download/upload artifacts actions
- Update path to db for build workflow
Simplified logging to the `cfdb.log` module. All we need is an initialization function that will be called from the CLI layer (not the library, which is only concerned with emitting messages).
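A minimal sketch of that split, assuming the initialization helper lives in `cfdb.log` (module path and function name are hypothetical, not confirmed by this PR): library modules only grab a named logger and emit; the CLI entry point is the one place that configures handlers.

```python
import logging

# Library layer: a module only grabs a named logger and emits messages;
# it never attaches handlers itself.
logger = logging.getLogger("cfdb.feedstocks")  # hypothetical sub-logger name

def do_work():
    logger.debug("doing work")  # harmless even before handlers are configured

# CLI layer: the single place that configures logging for the process.
def main():
    from cfdb.log import init_logging  # hypothetical import path for the init function
    init_logging()
    do_work()
```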
# Create a file handler in non-interactive sessions
# Set the formatter for the file handler
if logging_dir is None:
    logging_dir = os.environ.get('CFDB_LOGGING_DIR', Path.cwd() / '.logs')
This default value allows us to mock the default from `conftest.py`.
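For illustration, one hedged way a test suite could override that default from `conftest.py` (the fixture name is hypothetical; only the `CFDB_LOGGING_DIR` variable comes from the code above):

```python
# conftest.py -- redirect the default logging directory during tests
import pytest

@pytest.fixture(autouse=True)
def cfdb_logging_dir(tmp_path, monkeypatch):
    # Point CFDB_LOGGING_DIR at a throwaway directory so tests never
    # write a .logs/ folder into the working tree.
    log_dir = tmp_path / "logs"
    monkeypatch.setenv("CFDB_LOGGING_DIR", str(log_dir))
    return log_dir
```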
logger = logging.getLogger('cfdb')
logger.setLevel(logging.DEBUG)

if logger.handlers:
Return early if our logger already has handlers. That means someone else already configured logging and we shouldn't interfere.
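Putting the reviewed pieces together, a rough sketch of what such an initialization function could look like (the console handler and the exact signature are assumptions; only the `cfdb` logger name, the debug level, the early return, the `CFDB_LOGGING_DIR` default, and the file handler for non-interactive sessions come from the review above):

```python
import logging
import os
import sys
from pathlib import Path

def init_logging(logging_dir=None):
    logger = logging.getLogger("cfdb")
    logger.setLevel(logging.DEBUG)

    # Someone else (an embedding application, pytest, ...) already configured
    # handlers; return early and don't touch their setup.
    if logger.handlers:
        return logger

    if logging_dir is None:
        logging_dir = os.environ.get("CFDB_LOGGING_DIR", Path.cwd() / ".logs")

    # Assumption: always emit to the console as well.
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.INFO)
    logger.addHandler(stream_handler)

    # Create a file handler in non-interactive sessions.
    if not sys.stdout.isatty():
        logging_dir = Path(logging_dir)
        logging_dir.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(logging_dir / "cfdb.log")
        # Set the formatter for the file handler.
        file_handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(file_handler)

    return logger
```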
@@ -0,0 +1,134 @@
### Code extracted from https://github.com/regro/libcflib/blob/master/libcflib/harvester.py
When vendoring (and/or modifying) code, we need to bring the license file too, along with the copyright. I'd prefer we import and wrap / patch as needed, if possible, though. Otherwise it's one more piece of code we are maintaining.
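For comparison, a hypothetical sketch of the wrap/patch alternative. It assumes `libcflib` stays importable; the upstream entry point and the normalized field below are illustrative, not verified against the real `libcflib` API:

```python
# cfdb/harvest/_libcflib_compat.py -- hypothetical thin wrapper instead of a vendored copy
import libcflib.harvester as _upstream  # assumes libcflib is installed/importable

def harvest_artifact(path):
    """Call the upstream harvester, then adapt its output to cfdb's needs."""
    data = _upstream.harvest(path)  # assumption: upstream entry point; check the real name
    # Patch / normalize fields here instead of editing a vendored copy of the code.
    data.setdefault("channel", "conda-forge")  # illustrative adjustment only
    return data
```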
@@ -119,17 +128,20 @@ def traverse_files(path: Path, output_dir: Path = None) -> List[Path]:

    stored_files = []

-    with concurrent.futures.ThreadPoolExecutor() as executor:
+    with ThreadPoolExecutor() as executor:
Threads are tricky to debug. I invested a few hours hunting down an error because a package named `something.json` was being globbed as a file, but it was actually a directory, which broke the population code. Always provide an easy-to-debug alternative in these cases.
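As an illustration of that "easy to debug alternative" idea (the flag name and function shape are hypothetical, not part of this PR): keep a plain serial path next to the threaded one, so a failure reproduces with a normal traceback and a working debugger.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_all(paths, worker, serial=None):
    """Run `worker` over `paths`, optionally without threads for debugging."""
    if serial is None:
        # Hypothetical escape hatch: CFDB_SERIAL=1 disables the thread pool.
        serial = os.environ.get("CFDB_SERIAL") == "1"

    if serial:
        # Straight loop: exceptions surface immediately with a shallow, readable
        # stack, and breakpoints/pdb behave as expected.
        return [worker(p) for p in paths]

    with ThreadPoolExecutor() as executor:
        return list(executor.map(worker, paths))
```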
@@ -93,7 +37,7 @@ def update_feedstock_outputs(
    To update the feedstock outputs, use the following command:
    $ cfdb update_feedstock_outputs --path /path/to/feedstock-outputs/outputs
    """
-    db_handler = CFDBHandler("sqlite:///cf-database.db")
+    db_handler = CFDBHandler()
    db_handler.update_feedstock_outputs(path)
`update-feedstock-outputs` was able to ingest `libcfgraph/artifacts`, populating tables for hours without complaining... but it shouldn't. Maybe we need some sanitization / checking of the inputs.
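One possible shape for that check, as a hedged sketch: fail fast when the path clearly doesn't look like a feedstock-outputs checkout. The layout assumption in the comment below is mine, not something established by this PR:

```python
from pathlib import Path

def validate_feedstock_outputs_path(path):
    """Fail fast when `path` clearly isn't a feedstock-outputs directory."""
    path = Path(path)
    if not path.is_dir():
        raise ValueError(f"{path} is not a directory")
    # Assumption about the layout: feedstock-outputs keeps small package JSON
    # files near the top of the tree, while libcfgraph/artifacts nests its JSON
    # several directories deep (channel/package/version/...).
    shallow_json = list(path.glob("*.json")) + list(path.glob("*/*.json"))
    if not shallow_json:
        raise ValueError(
            f"{path} has no package JSON files near its top level; "
            "is this really a feedstock-outputs checkout?"
        )
    return path
```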
Thanks Vini. I changed a few things and left a couple comments, but this is good enough to go for now. We can prototype APIs with this setup, for example.
A few general notes for the future:
- Performance is not great for artifacts. It'll take a few hours (~40h) to populate the database from scratch. There might be something inherently slow in the population approach (e.g. the JSON payloads, the CSV records, etc.). Hashing is slow at this scale. I guess we should have a "no checks" mode where it just dumps everything blindly and assumes no duplicates, or we keep a "seen" set of hashes already processed (see the sketch after this list).
- What's the story for "once reasonably up-to-date, how do I keep the database up-to-date?" We need better docs for that.
- I know I am saying the opposite of what I once said, but we are going to need to add `conda-forge/` to the artifact identifiers, or come up with a convention to signify they come from conda-forge (and not, let's say, bioconda).
- It would be useful to have an "initialize database from scratch using a copy of libcfgraph and feedstock-outputs", but right now it involves several commands, and it's not clear in which order they should run or if it matters. That's maybe just a documentation issue.
- It would be interesting to see a harvester of artifacts that feeds on the OCI mirror metadata (via `conda-forge-metadata`?) instead of the (possibly heavy) Anaconda.org packages.
Let me know what you think. I'll probably merge now and see what happens with the self-propagating artifact. We will need issues to cover the pending points above too.
This PR adds the following:

- `Artifacts`, `ArtifactsFilePaths`, and `RelationsMapFilePaths` schemas;
- `libcfgraph`;

What's missing (from scoped milestone):

- `libcfgraph`;
- `sorted_artifacts` list limit of `1000` entries in the multiprocess block (see the chunking sketch below): https://github.com/viniciusdc/czi-conda-forge-db/blob/cba8c1a33c1848bbdd243d37d428935c8eaedb8f/cfdb/harvest/core.py#L125-L127
- `import_to_package_maps` still depends on `libcfgraph`; this needs to be modified to use the stored files in the DB;

Good to have, though out of scope:

- Important! Fix the issue with artifact download, where the DB file is not found during download: `Error: Unable to find any artifacts for the associated workflow`. This is extremely important, as the downloaded DB is iteratively updated across the pipeline and uploaded at the end, completing a cycle. https://github.com/viniciusdc/czi-conda-forge-db/blob/cba8c1a33c1848bbdd243d37d428935c8eaedb8f/.github/workflows/build_db.yaml#L55-L60 If no solution seems available, we can just push the latest version of the DB to the repo instead of relying on artifacts for storage, though we would not comply with GitHub's file size limit policy (compressed ~65 MB).
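Regarding the `sorted_artifacts` limit above, a hedged sketch of one way to drop the hard cap by batching the full list instead (the function and variable names are hypothetical, not the PR's actual code at the linked lines):

```python
from concurrent.futures import ProcessPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def harvest_all(sorted_artifacts, harvest_batch, batch_size=1000):
    """Process every artifact in batches rather than truncating the list at 1000."""
    results = []
    with ProcessPoolExecutor() as executor:
        # harvest_batch must be a top-level (picklable) function taking one batch.
        for batch_result in executor.map(harvest_batch, chunked(sorted_artifacts, batch_size)):
            results.extend(batch_result)
    return results
```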