Design datalad remake-capture or remake-sink #13

Open
mih opened this issue May 2, 2024 · 0 comments

mih commented May 2, 2024

This is about the second half of #10 -- a datalad-based data sink or data capture helper. See #12 for the other half.

Purpose

Accept data (files) from some source or location and inject them into a datalad dataset as a new commit, on some branch, under some file name(s). Optionally, push the dataset modification to a remote or service that accepts a (serialized) dataset update.

Target use cases

  • Point to a (local) repository clone, and capture all, or any subset, of the modifications in it
  • Obtain a dataset from some datalad-compatible source, populate it with content from some other location (e.g., workflow output) under some given name(s), and optionally push back to the remote

Provenance capture

It would be useful to be able to ingest provenance information on the dataset modifications:

  • where is the content coming from
  • what created it (e.g., workflow execution)
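To make the two provenance items more concrete, here is a minimal sketch of one way such a record could travel with the commit. Everything in it is an assumption for illustration only: the `CaptureProvenance` class, its field names, and the choice of embedding JSON in the commit message body are hypothetical and not an agreed design.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CaptureProvenance:
    """Hypothetical provenance record for a captured modification."""
    source: str      # where the content is coming from
    created_by: str  # what created it, e.g. a workflow execution identifier


def commit_message(summary: str, prov: CaptureProvenance) -> str:
    """Render a commit message: summary line, blank line, JSON record."""
    record = json.dumps(asdict(prov), indent=1, sort_keys=True)
    return f"{summary}\n\n{record}\n"


def parse_provenance(message: str) -> CaptureProvenance:
    """Recover the record from a message produced by commit_message()."""
    _, _, body = message.partition("\n\n")
    return CaptureProvenance(**json.loads(body))
```

A record written this way can be read back from the commit message alone, without any additional storage.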

API

  • (1) dataset to add changes to (URL or identifier, similar to remake-provision)
  • (2) branch to commit modifications to
  • (3) some mode switch to define the nature of the modification
    • incremental: add new, replace existing
    • explicit: the incoming content is the sole content for the new version, all absent previous content is removed
  • (4) options to declare where/how to deposit the update at a remote
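The difference between the two modes in (3) can be sketched on a plain mapping of file names to content. This is only an illustration of the intended semantics, assuming a dataset version can be modeled as a dict; the function name `apply_capture` is hypothetical and no datalad or git calls are involved.

```python
from typing import Dict


def apply_capture(previous: Dict[str, str],
                  incoming: Dict[str, str],
                  mode: str) -> Dict[str, str]:
    """Return the file-name -> content mapping of the new dataset version."""
    if mode == "incremental":
        # add new files and replace existing ones; untouched files are kept
        return {**previous, **incoming}
    if mode == "explicit":
        # the incoming content is the sole content of the new version;
        # previous files that are absent from it are removed
        return dict(incoming)
    raise ValueError(f"unknown mode: {mode!r}")


previous = {"a.txt": "old-a", "b.txt": "old-b"}
incoming = {"b.txt": "new-b", "c.txt": "new-c"}
```

With these inputs, incremental keeps `a.txt`, replaces `b.txt`, and adds `c.txt`, while explicit yields only `b.txt` and `c.txt`.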

remake-capture will likely use remake-provision whenever it is not operating on an already existing local repository, so (1) and (2) would need to be aligned between the two commands.
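One way to keep (1) and (2) aligned would be to define them once and share them between both commands. The sketch below uses plain `argparse` parent parsers for that. All option names (`dataset`, `--branch`, `--mode`, `--push-to`) are assumptions made for illustration; they do not reflect the actual remake-provision interface.

```python
import argparse


def common_options() -> argparse.ArgumentParser:
    """Options shared by both commands: (1) dataset and (2) branch."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("dataset", help="(1) URL or identifier of the dataset")
    parser.add_argument("--branch", help="(2) branch to work with")
    return parser


def provision_parser() -> argparse.ArgumentParser:
    return argparse.ArgumentParser(
        prog="remake-provision", parents=[common_options()])


def capture_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="remake-capture", parents=[common_options()])
    parser.add_argument(
        "--mode", choices=["incremental", "explicit"], default="incremental",
        help="(3) nature of the modification")
    parser.add_argument(
        "--push-to", help="(4) remote to deposit the update at (optional)")
    return parser
```

Both parsers then accept the same dataset and branch arguments, and only remake-capture adds the mode switch and the push target.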

(4) is included so that remake-provision and remake-capture are the only two datalad "nodes" needed to make an arbitrary workflow system datalad-compatible. Of course, (4) could also be a dedicated execution of datalad push -- a different trade-off, subject to further discussion.

Projects
Status: discussion needed