Templating within a remakefile #2

Open · richfitz opened this issue Sep 26, 2014 · 9 comments

@richfitz (Owner)

This is something suggested by @cboettig, and which I immediately ran into on dfalster/tree-p#9 (private repo currently). We have a makerfile that includes targets like:

  output/leaf_traits.csv:
    depends: leaf_traits
    rule: export_csv
    target_argument_name: filename

  output/growth_mortality_traits.csv:
    depends: growth_mortality_traits
    rule: export_csv
    target_argument_name: filename

It might be nice to be able to remove the duplication in a couple of places.

Most simply, the output directory could be factored out (ideally there won't be that many file targets in a maker workflow, but repetition is bad form). The simplest option I can think of there is to use whisker, so that we'd have:

  {{output_dir}}/leaf_traits.csv:
    depends: leaf_traits
    rule: export_csv
    target_argument_name: filename

  {{output_dir}}/growth_mortality_traits.csv:
    depends: growth_mortality_traits
    rule: export_csv
    target_argument_name: filename

and then another section in the makerfile:

variables:
  output_dir: output

The only real sticking point here is that this will fail miserably if a whisker variable is missing, because the mustache spec says that by default missing variables render as the empty string. This issue suggests that throwing errors might be a possibility.

A more complicated form of templating (which is then more prone to odd corner cases) would be to define whole template rules. So we'd have:

  output/{{filename}}.csv:
    depends: {{object}}
    rule: export_csv
    target_argument_name: filename

and then somehow fill that in for the two cases above.

Of course, it should be fairly easy for users to manually template their own files prior to running maker, so the simplest solution might be to trial some forms of templating outside the package and incorporate what works.
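For concreteness, here is a minimal sketch of that external approach, assuming only the whisker package; the file names and the render_remakefile() helper are illustrative, not part of maker. It checks for missing variables explicitly, since mustache would otherwise render them silently as the empty string:

# Sketch only: pre-render a templated remakefile outside the package.
# "remake_template.yml" and render_remakefile() are illustrative names.
library(whisker)

render_remakefile <- function(template_file, out_file, data) {
  text <- paste(readLines(template_file), collapse = "\n")
  # Find {{variable}} tags and insist that each one has a value,
  # since the mustache default is to render missing ones as "".
  tags <- regmatches(text, gregexpr("\\{\\{[^}]+\\}\\}", text))[[1]]
  vars <- unique(gsub("[{} ]", "", tags))
  missing <- setdiff(vars, names(data))
  if (length(missing) > 0) {
    stop("missing template variables: ", paste(missing, collapse = ", "))
  }
  writeLines(whisker.render(text, data), out_file)
}

render_remakefile("remake_template.yml", "remake.yml",
                  data = list(output_dir = "output"))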

richfitz changed the title from "Templating within a makerfile" to "Templating within a remakefile" on Feb 6, 2015
@krlmlr (Collaborator) commented Nov 25, 2016

The make syntax would be:

  output/%.csv:
    depends: %
    rule: export_csv
    target_argument_name: filename

Would that be an option?
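As a rough sketch of what that could mean mechanically (not remake functionality), the `%` pattern could be expanded into concrete yaml entries before writing the file; expand_pattern() below is an illustrative helper:

# Sketch only: expand a make-style pattern rule into concrete targets.
library(yaml)

expand_pattern <- function(stems) {
  targets <- lapply(stems, function(stem) {
    list(depends = stem,
         rule = "export_csv",
         target_argument_name = "filename")
  })
  names(targets) <- sprintf("output/%s.csv", stems)
  targets
}

cat(as.yaml(expand_pattern(c("leaf_traits", "growth_mortality_traits"))))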

@richfitz (Owner, Author)

That's really nice, yeah. And I like the connection to the make rules.

The use case that triggered this was slightly more complicated, though (and done manually). The repo is at dfalster/baad:

Not very pretty, but it worked.

Getting the wildcard bit (#70) in could work for file-based patterns, but for looping over sets of objects it could all get a bit nastier.

@krlmlr (Collaborator) commented Nov 26, 2016

At this point we might want to consider supporting a DSL in addition to the .yml format. It could be as simple as:

packages:
- magrittr
- tibble
- dplyr
- ggplot2

sources:
- src/

targets:
  target:
    command: command_to_create_target(dep1, dep2)

which in the proposed DSL might look like:

remake() %>%
  add_library(c("magrittr", "tibble", "dplyr", "ggplot2")) %>%
  add_sources("src/") %>%
  add_target("target", command = ~command_to_create_target(dep1, dep2))

If we give the user a bit more flexibility when specifying the rules, we don't need to add all the logic ourselves. (Thanks @hadley for suggesting a DSL here.)

@hadley (Contributor) commented Nov 26, 2016

I think it would be better to make remake() implicit (i.e. make it the last argument and default to remake()), i.e.

remake_needs_package("magrittr", "tibble", "dplyr", "ggplot2")
remake_uses_source("src/")

target <- remake_target(~ command_to_create_target(dep1, dep2))

(I'm speculating wildly on the rest of the DSL. I'm happy to give more feedback)

@krlmlr (Collaborator) commented Nov 26, 2016

Thanks. It seems that using the assignment operator will make it difficult to create similar rules programmatically. How about:

remake_needs_package("magrittr", "tibble", "dplyr", "ggplot2")
remake_uses_source("src/")

remake_target(~ target, ~ command_to_create_target(dep1, dep2))
remake_target("file/target.txt", ~ command_to_create_file_target(dep3, dep4))

@wlandau commented Nov 26, 2016

How did I not see this issue before? I wrote a whole package to deal with it!
remakeGenerator does the templating inside R: it works with data frames of remake commands and writes remake.yml (plus an overarching Makefile via parallelRemake, as in #84).

# install_github("wlandau/remakeGenerator")
> library(remakeGenerator)

> df = commands(data = simulate(center = MU, scale = SIGMA))
> df
  target                              command
1   data simulate(center = MU, scale = SIGMA)

Add multiple reps:

> df = expand(df, values = c("rep1", "rep2"))
> df
     target                              command
1 data_rep1 simulate(center = MU, scale = SIGMA)
2 data_rep2 simulate(center = MU, scale = SIGMA)

Evaluate wildcard patterns:

> evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
     target                           command
1 data_rep1 simulate(center = 1, scale = 0.1)
2 data_rep2   simulate(center = 2, scale = 1)

Expand over wildcard patterns:

> evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = TRUE)
           target                           command
1 data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
2   data_rep1_1_1   simulate(center = 1, scale = 1)
3 data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
4   data_rep1_2_1   simulate(center = 2, scale = 1)
5 data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
6   data_rep2_1_1   simulate(center = 1, scale = 1)
7 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
8   data_rep2_2_1   simulate(center = 2, scale = 1)

Write remake.yml and the Makefile:

targ = targets(stage1 = df, some_other_stage = similar_data_frame)
workflow(targets = targ, sources = my_sources, packages = my_packages, ...)

Functions analyses() and summaries() work this way internally. Also, you can add fields like plot and knitr to data frames of commands, and they will appear in remake.yml and the Makefile. The linked vignette and example_remakeGenerator() have more.

@krlmlr (Collaborator) commented Nov 27, 2016

Thanks @wlandau, will try! I still think we need to be able to specify rules without the roundtrip of a .yml file.

@richfitz (Owner, Author)

Thanks for the thoughts all; I have been thinking on them for a while. There are a number of things here that might be better broken up into a series of separate issues. This is going to be a bit of a wall of text, I'm afraid.

Templating vs DSL

I think these things are actually fairly orthogonal: while a DSL might remove some of the need for templating, it won't cover everything. I think it could work for the case I linked above, where files "appear" in a set of directories independently of remake. But if there is a target that generates a set of files, then some concept of wildcards is needed in the underlying machinery. Whatever the solution is, it definitely should not involve a yaml roundtrip; that should really not be needed.

A DSL

I think that the idea of adding a DSL is interesting and worth pursuing.

@hadley emailed me about remake almost 2 years ago making suggestions along similar lines 😀, and my position on this is largely unchanged:

  • There is nothing in remake that fundamentally needs yaml; it's just a convenient vehicle to get a set of nested data into R. I can rework the internals to decouple things further, and then whatever interface is useful to generate the structure can be used (be it the generative approaches that @wlandau has tried, a pipe-and-shiny approach that Hadley mooted a couple of years ago, or whatever).
  • The package is already too big (I've already pulled storr out of it, and I'm working on factoring out some other bits at the moment) - I think it would probably be nice to allow things like the DSL to be built on top of the underlying engine. My focus with the package (and really my interest in this area) is in getting the underlying machinery robust. Interface design is something that others are probably better at than I am, and giving people freer rein there would, I suspect, be an advantage to everyone.
  • As an old-timer who is not the biggest fan in the universe of the pipe operator, I would love it if use of the pipe were optional.
  • Whatever happens with the DSL, I think we need to be very careful not to create something that works with mtcars and iris but not with real research problems. remake was created because we hit the wall on reproducibility trying to do the right thing with knitr and caching (the blog post I wrote for rOpenSci was my aha moment). I honestly believe that, like mustache, there's a big advantage in a logic-free approach to this problem - I'm well aware that not everyone believes me, though. In my previous job we used remake to scale projects that ran to many days of CPU time, so I feel it is at least a sufficient (if not necessary) approach. If things can be made modular, though, it really won't matter - especially if it's possible to translate between one approach and the other. My concern is that if the DSL looks R-ish, pretty soon people will want loops, conditionals, etc., and the whole thing will balloon out of control and you'll end up with something awkward to use. OTOH, the success of dplyr and ggplot2 shows that a well-designed DSL can be very powerful, so who knows.

My previous approach to the DSL looked like

m <- maker({
  library(testthat)
  source("code.R")

  file("data.csv", cleanup_level="purge") <- download_data(target_name)
  processed <- process_data("data.csv")
  plot("plot.pdf", width=8, height=4) <- myplot(processed)
})

see here -- I think this followed directly from thinking about Hadley's email.
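For what it's worth, here is a minimal sketch (assuming nothing about maker's real internals) of how a constructor like that could capture the braced block unevaluated and walk its expressions; the file()/plot() targets arrive as calls to `<-` whose left-hand side is itself a call. maker_sketch() is purely illustrative:

# Sketch only: maker_sketch() illustrates the capture mechanism, not maker itself.
maker_sketch <- function(block) {
  exprs <- as.list(substitute(block))[-1]  # drop the enclosing `{`
  for (e in exprs) {
    if (is.call(e) && identical(e[[1]], as.name("<-"))) {
      # e[[2]] is the target (possibly a file()/plot() call), e[[3]] the command
      message("target:  ", paste(deparse(e[[2]]), collapse = " "))
      message("command: ", paste(deparse(e[[3]]), collapse = " "))
    } else {
      message("setup:   ", paste(deparse(e), collapse = " "))
    }
  }
  invisible(exprs)
}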

In the current sources (though I think it's deleted in the refactor branch and I need to work out where it stands at the moment), there's an alternative bit of experimentation that looks like:

  m <- remake()
  m$add <- "package:testthat"
  m$add <- "code.R"
  m$add <- target("data.csv", download_data(target_name),
                  cleanup_level="purge")
  m$add <- target("processed", process_data("data.csv"))
  m$add <- target("plot.pdf", myplot(processed),
                  plot=list(width=8, height=4))
  m$make("plot.pdf")

This is probably not that far from what @krlmlr is imagining above, though differing in implementation. Making the remake object part implicit would be easy and there's already a cache of remake objects.

The trick there, from memory, was building the object up and then doing the validation at the last minute, just before anything uses the remake object to build something.

Done right, that approach (sequentially adding things to a remake object, validating, running) could be used by the yaml interface, at which point things are properly decoupled.
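As a sketch of that decoupling, using the hypothetical remake()/target()/m$add interface from the experimental snippet above (not an existing API) plus the yaml package, the yaml front end could simply feed the same add mechanism:

# Sketch only: remake(), target(), and the m$add <- ... interface are the
# hypothetical experimental API quoted above, not an existing one.
library(yaml)

remake_from_yaml <- function(path = "remake.yml") {
  dat <- yaml.load_file(path)
  m <- remake()
  for (p in dat$packages) m$add <- paste0("package:", p)
  for (s in dat$sources)  m$add <- s
  for (nm in names(dat$targets)) {
    cmd <- parse(text = dat$targets[[nm]]$command)[[1]]
    m$add <- target(nm, cmd)
  }
  m  # validation would then happen lazily, just before the first build
}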

@hadley (Contributor) commented Nov 30, 2016

I think in the two years since, my preference for an internal DSL (i.e. something pipe-y) over an external DSL (e.g. YAML) has grown stronger. remake is fundamentally about running code, which, IMO, means that you should be in R scripts as much as possible (and then templates just become functions and for loops, etc.). I think yaml is better for "string-y" operations (like pkgdown) with straightforward hierarchies (not graphs like remake models).

The advantage of a pipe-based DSL is that the pipe is optional - you can use it with your preferred style of function invocation. It sounds like you're arguing more for a non-functional (i.e. mutable object) approach. I think that's generally sub-optimal because it's different from the majority of R code that most R users will see.
