Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yadage: allow passing initdir configuration #309

Merged

Conversation

mvidalgarcia
Copy link
Member

@mvidalgarcia mvidalgarcia commented Mar 23, 2020

closes reanahub/reana-client#396

reana.yaml could be like:

version: 0.6.0
[...]
workflow:
  type: yadage
  file: workflow.yaml
  options: 
    initdir: ./myinitdir
[...]

@@ -186,6 +187,8 @@ def _workflow_engine_command(self, overwrite_input_parameters=None,
self.workflow.get_specification()).encode()),
workflow_file=self.workflow.reana_specification.get(
'workflow').get('file'),
workflow_initdir=self.workflow.reana_specification.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In R-W-E-CWL we are passing --operational-options. Could we do the same here? The number of customisable options users might ask for will likely grow, and if that happens we would always need to come to this file to update it with glue code for the new option. On the other hand, if we send everything, options are added in reana.yaml by the user, and only changes in the specific workflow engine are needed to parse them. Let me know what you think :)

P.S. Similar pattern to the one described in reanahub/reana#277

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While elaborating this answer and re-reading it I have one question. Is the new options field in reana.yaml workflow engine specific? I think the answer is yes. Then I would rename along the lines of:

  • engine_options: for workflow engine specific options
  • options: (or better name for a "general options" field) which would have common options across engines, ideally one day CACHE=off|on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree about your first comment, I'll look into it.
WRT the second comment, I'm not sure, for now, it is true that is engine specific but it could have more general options as the CACHE you mentioned. We could parse the options and grab specific keywords to handle them properly and send the rest to the workflow engine. But we could have different fields for this as well...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking how the operational parameters are passed, I see that they shall be passed as part of inputs.options in reana.yaml (BTW this is missing in the docs I think). Should we include here the options from workflow.options, merge both in one, ...? This goes in line with Diego's second comment discussion.
We can discuss this in the next standup cc/ @reanahub/developers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will just illustrate what we are discussing to make it easier for me to follow:

  1. Right now what works (not saying that is how it should be):

    • Including options in reana.yaml (this was undocumented and missing from the reana.yaml JSONSchema as pointed out by @mvidalgarcia)
    version: 0.3.0
    inputs:
      files:
        - code/helloworld.py
        - data/names.txt
      parameters:
        helloworld: code/helloworld.py
        inputfile: data/names.txt
        outputfile: results/greetings.txt
        sleeptime: 0
    + options:
    +   CACHE: off
    +   TARGET: gendata
    workflow:
      type: serial
      specification:
        steps:
          - environment: 'python:2.7-slim'
            commands:
              - python "${helloworld}"
                  --inputfile "${inputfile}"
                  --outputfile "${outputfile}"
                  --sleeptime ${sleeptime}
    outputs:
      files:
      - results/greetings.txt
    • Including options when invoking the CLI (documented in the REANA CLI reference)
    $ reana-client start -o TARGET=gendata -o CACHE=off
  2. Now we have the following with global: allow operational options in reana.yaml reana-client#400, because we didn't remember we were having the options field in reana.yaml (was missing from the JSONSChema)

    • On the reana.yaml
    version: 0.3.0
    inputs:
      files:
        - code/helloworld.py
        - data/names.txt
      parameters:
        helloworld: code/helloworld.py
        inputfile: data/names.txt
        outputfile: results/greetings.txt
        sleeptime: 0
    workflow:
      type: serial
    + options:
    +   CACHE: off
    +   TARGET: gendata
      specification:
        steps:
          - environment: 'python:2.7-slim'
            commands:
              - python "${helloworld}"
                  --inputfile "${inputfile}"
                  --outputfile "${outputfile}"
                  --sleeptime ${sleeptime}
    outputs:
      files:
      - results/greetings.txt

The question is whether the options belong to:

  • (1) the inputs of a workflow (inputs)
  • (2) configuration for the workflow (workflow.options).

(2) Seems more natural to me since inputs is a word loaded with a greater meaning (remember the four questions) for the analysis and seems that everything we add to the workflow options is not directly affecting the science behind it. In the future I don't see someone seraching for inputs: CACHE off, because probably this was just a transient parameter passed to avoid a caching issue and make the build pass. However a search inputs: events 20000 would have direct impact in the search results from the physics point of view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. We have to distinguish operational options being in reana.yaml and operational options passed via CLI.

Consider the following pseudo example:

inputs:
   parameters:
     myparam: myparamvalue
   options:
     myoption: myoptionvalue

and users running the workflow as:

$ reana-client run -p myparam=42.1 -p mynewparam=42.2 -o myoption=42.3 -o mynewoption=42.4

That is we have to be ready to manage (1) params as they are coming from reana.yaml, (2) params as they are coming from CLI, (3) options as they are coming from reana.yaml (once you will introduce this feature! we don't have it yet) and (4) options as they are coming from CLI.

Some complexities: same variables may be present only in reana.yaml some only in CLI, some in both. We have to handle both their existence and which value overrides which value. (E.g. the variable values from no.3 and no.4, if they are present, will override variable values from no.1 and no.2.)

Ideally, we should store all of them separately in REANA DB identically to what user has passed. I.e. the expansion should not really happen on the reana-client side, and the "source-of-truth" storage should be what the user has passed. Only after that we can think of merging options into one (i.e. overriding ones in reana.yaml by ones passed via CLI).

2. In addition, we have a complication that some options are generic, and some options are workflow-engine specific. As @diegodelemos mentions, we could store them in two different places; however since we already define workflow.type property, we can also for simplicity use only one options property, and the client will validate and will reply?

Example:

inputs:
   parameters:
     myparam: myparamvalue
   options:
     myoption: myoptionvalue
     toplevel: foobar
workflow:
   type: cwl
   file: myworkflow.cwl

would reply *"Sorry, you have specified both toplevel option and cwl workflow engine, which does not support it."

It'll be easier for users to specify it in one place... We should take the trouble of properly validating which feature is supported by which engine. (See the old musings about feature matrix for each engine; we have started to have it in docs, we should also make it programmer-friendly e.g. in r-commons so that r-client could take advantage of that knowledge.)

3. If this sounds overwhelming, we could simplify usage scenario by not allowing to specify -o and -p in CLI, reducing the number of cases mentioned above from four to two. I'm not sure it would be a good idea to always ask people to edit reana.yaml for even simplest of "quick runs", though.

Another alternative could be to allow only -o and -p and "generate" reana.yaml based on that.

Currently we are a bit in the mixture of both approaches, hence the additional complexity... but it should not be that hard to manage either, so I guess with a few rules about what to store as the source-of-truth and about what-overrides-what, it can work nicely!

P.S. See also some past musings on this "science parameters" + "global REANA options" +"workflow-specific options" topic e.g. here:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that everything we add to the workflow options is not directly affecting the science behind it.

This is true for operational options like CACHE=off, but for workflow engine options like toplevel the science would not even run properly... So these latter ones are more like inputs actually.

@lukasheinrich
Copy link
Member

lukasheinrich commented Mar 25, 2020

yadage can read in an overall coniguration file instead of command line parameters

e.g.

$> cat yadage.yml
dataarg: workdir
initdata:
  nevents: 100
workflow: madgraph_delphes.yml
toplevel: from-github/phenochain
visualize: true
verbosity: INFO

which can be executed with

yadage-run -f yadage.yml

so probably one could pass in the same format, and R-W-E-Y, would only set e.g. workdir

@mvidalgarcia mvidalgarcia force-pushed the rc/396-initdir-configurable branch from 0f86284 to 8244b5d Compare April 3, 2020 14:42
@mvidalgarcia mvidalgarcia force-pushed the rc/396-initdir-configurable branch from 8244b5d to b08d617 Compare April 3, 2020 16:58
@diegodelemos diegodelemos merged commit b08d617 into reanahub:master Apr 7, 2020
@mvidalgarcia mvidalgarcia deleted the rc/396-initdir-configurable branch April 7, 2020 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

yadage: make initdir configurable
4 participants