Clean input files arguement #494

jennysjaarda · 2021-06-03T13:10:22Z

jennysjaarda
Jun 3, 2021

I have a project with several input files and I was thinking it would be quite clean to have just one target for input files and then call to that target to read those input files as necessary. Right now I essentially have a target for each file and then antoher target for the data itself which reads quite messy.

For example, borrowing from the minimal_example.


### Here create a list of input files with names so that you can easily pull the appropriate file in downstream targets. 
### This function would be updated if new input files are needed for the project, of course normally there would be multiple different files not one file with different names. 
create_input_files_list <- function(){
  c(raw_data_file = "data/raw_data.csv", raw_data2_file = "data/raw_data.csv", raw_data3_file = "data/raw_data.csv")
}

### Target plan: 
list(
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    raw_data2_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    input_files,
    create_input_files_list() #, format = "file" ## If format is set to "file" then the names are removed for some reason?
  ),
  # Check input files in this target with `format = "file"`
  tar_target(
    input_files_check,
    input_files,
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(input_files[["raw_data_file"]], col_types = cols())
  )

The problem with the above is that input_files_check does not feed into raw_data. Also if I later update the input files function with more files:

create_input_files_list <- function(){
  c(raw_data_file = "data/raw_data.csv", raw_data2_file = "data/raw_data.csv", raw_data3_file = "data/raw_data.csv", new_data = "data/new_data.txt")
}

Then of course raw_data becomes outdated. I assume once raw_data is re-run targets sees that nothing has change and its downstream targets are not run, but is even this computation expensive for largish files? Or is it just seeing that arguement has not changed so it doesn't even need to run the read/fread function (in that case, not expensive at all). Do you have any suggestions for this type of workflow? Ideally I want one nice clean target for all input files that I can easily refer to. But if it changes late in the project perhaps it's resource expensive to be changing it all the time. What do you think?

wlandau · 2021-06-04T14:22:03Z

wlandau
Jun 4, 2021
Maintainer

It depends on what kind of branching you need. If you want to dynamically branch over individual files, tarchetypes::tar_files_input() can help. Otherwise, you can combine the pattern from the minimal example with static branching, either with tar_map() or tar_eval() from tarchetypes.

0 replies

jennysjaarda · 2021-06-10T21:06:49Z

jennysjaarda
Jun 10, 2021
Author

Thanks for the tips. I've taken a look and I'm not sure what option, if any, would provide the best solution for what I'm looking for. I don't really want to branch over the files individually, at least not indefinitely in the pipeline. Rather I just want to have a clean argument of input files because at the beginning of the pipeline I consistently follow this pattern of defining the file and then reading it in, and then apply different functions to different data sets before eventually combining the data sets. I think the suggestions you put forward are to help with applying the identical workflow to different files? Perhaps the network visualization below helps.

1 reply

wlandau Jun 11, 2021
Maintainer

I still feel like I am missing something, so please correct me if I am off base. It seems like you currently have a workflow with one target per file and you process the file with something like this:

list(
  tar_target(path, "file-path.csv", format = "file"),
  tar_target(data, read.csv(path)),
  tar_target(analysis, run_analysis(data))
)

I agree that's a lot of code if you have a lot of files. If you were to apply the identical workflow to each file, tar_files_input() might help, but it sounds like you process different files differently. In that case, you could use static branching to apply different functions to different datasets.

library(targets)
  
tar_script({
  library(tarchetypes)
  tar_map(
    values = list(
      files = c("file1.csv", "file2.xlsx"),
      fun = rlang::syms(c("process_file1", "process_file2"))
    ),
    names = files,
    tar_target(path, files, format = "file"),
    tar_target(output, fun(path))
  )
})

tar_manifest()
#> # A tibble: 4 x 3
#>   name              command                          pattern
#>   <chr>             <chr>                            <chr>  
#> 1 path_file1.csv    "\"file1.csv\""                  <NA>   
#> 2 path_file2.xlsx   "\"file2.xlsx\""                 <NA>   
#> 3 output_file1.csv  "process_file1(path_file1.csv)"  <NA>   
#> 4 output_file2.xlsx "process_file2(path_file2.xlsx)" <NA>

^{Created on 2021-06-10 by the reprex package (v2.0.0)}

I recommend this kind of pattern if process_file1() and/or process_file2() are long computations. If you were to put all the input files in a single target and not branch over them, then all targets downstream of any file would invalidate if any file changed, and that often adds up to a lot of unnecessary runtime.

jennysjaarda · 2021-06-11T07:26:53Z

jennysjaarda
Jun 11, 2021
Author

Thanks a lot, and sorry for the unclear example and explanation! Yes you are correct, I have many input files at the start of my project, I do some processing and then eventually end up with some clean data where I really take advantage of the dynamic branching functionality in drake.

The beginning targets are simply:


list(

  ##################################
  ### READ IN INPUTS ####
  ##################################
  tar_target(
    household_info_file,
    paste0(UKBB_dir,"/pheno/ukb6881.csv"),
    format = "file"
  ),
  tar_target(
    household_info,
    read.table(household_info_file, header=T),
  ),
  tar_target(
    relatives_file,
    paste0(UKBB_dir,"/geno/","ukb1638_rel_s488366.dat", header=T),
    format = "file"
  ),
  tar_target(
    relatives,
    read.table(relatives_file),
  ),
 tar_target(input3_file, "file3-path.csv", format = "file"),
  tar_target(input3, read.csv(input3_file)),

 tar_target(input4_file, "file3-path.csv", format = "file"),
  tar_target(input4, read.csv(path)),

#... do this with about 10 input files

  ##################################
  ### PROCESS INPUTS ####
  ##################################

 tar_target(process1, process1(household_info, input3)),
 tar_target(process2, process2(relatives)),
 tar_target(process3, process3(process1, input4)),
 tar_target(final_process, process4(process2, process3)) #now all inputs have been used and I use this final_process data in typical 
 targets fashion

)

I guess I was just wondering if there was a cleaner way to read in a whole bunch of input files because it seemed like a lot of very similar code and one of the benefits of targets is to reduce repetitive code and make things more efficient. The problem is I want to be able to easily access the data by some easily identifiable name in the processing scripts.

So in your example, I could do something like this:

list(
  tar_map(
    values = list(
      files = c("file1.csv", "file2.xlsx"), 
      fun = rlang::syms(c("process_file1", "process_file2"))
    ),
    names = files,
    tar_target(path, files, format = "file"),
    tar_target(data, read.csv(path)),
    tar_target(output, fun(path))

  ),
  tar_target(output_joint, processx(data_file2.xlsx , data_file1.csv))

)

tar_visnetwork() # as expected

But I was wondering if there was functionality for files to have names so that the targets can be easily accessed? for example:

list(
  tar_map(
    values = list(
      files = c(file1 = "file1.csv", file2 = "file2.xlsx"),  # add names to files argument (or define above target pipeline; eg. file1 <- "file1.csv"; file2 <- "file2.csv")
      fun = rlang::syms(c("process_file1", "process_file2"))
    ),
    names = files,
    tar_target(path, files, format = "file"),
    tar_target(data, read.csv(path)),
    tar_target(output, fun(path))

  ),
  tar_target(output_joint, processx(data_file2 , data_file1)) ## then for long path names we can easily pull the data needed for downstream targets. 

tar_visnetwork() # now disjointed
)

Hopefully that makes more sense.

0 replies

jennysjaarda · 2021-06-11T08:08:46Z

jennysjaarda
Jun 11, 2021
Author

Here's a better example with a potential solution using the get command. Let me know what you think.


library(targets)
library(tarchetypes)

file1 = "data/file1.csv"
file2 = "data/file2.csv"

name <- c("Jon", "Bill", "Maria")
age <- c(23, 41, 32)

df <- data.frame(name, age)

write.csv(df, file1)
write.csv(df, file2)

process_file1 <- function(df){
  df$age_sq <- df$age^2
  return(df)
}

process_file2 <- function(df){
  df$age_dbl <- df$age*2
  return(df)
}

list(
  tar_map(
    values = list(
      files = c("file1", "file2"),
      fun = rlang::syms(c("process_file1", "process_file2"))
    ),
    names = files,

    tar_target(get_files, get(files)),
    tar_target(path, get_files, format = "file"),

    tar_target(data, read.csv(path)),
    tar_target(output, fun(data))

  ),
  tar_target(output_joint, merge(output_file1, output_file2))

)

1 reply

wlandau Jun 11, 2021
Maintainer

Re #494 (comment), is the purpose of the get_files target just to identify all the file paths? If so, I would recommend listing the files as a global object in _targets.R. Path identification is really cheap and needs to rerun every time in order to have the latest set of targets. In addition, you can use tarchetypes::tar_combine() to automatically create output_joint to merge all the outputs.

The problem is I want to be able to easily access the data by some easily identifiable name in the processing scripts.

So then the issue is friendly target names? You can create custom names with a custom element of the values list.

library(targets)
tar_script({
  library(targets)
  library(tarchetypes)
  upstream <- tar_map(
    values = list(
      custom_names = c("file1", "file2"),
      files = c("file1.csv", "file2.xlsx"),
      fun = rlang::syms(c("process_file1", "process_file2"))
    ),
    names = custom_names,
    unlist = FALSE,
    tar_target(path, files, format = "file"),
    tar_target(data, read.csv(path)),
    tar_target(output, fun(data))
  )
  list(
    upstream,
    tar_combine(
      name = output_joint,
      command = process_output(!!!.x),
      upstream$output
    )
  )
})

tar_visnetwork(exclude = "upstream")

^{Created on 2021-06-11 by the reprex package (v2.0.0)}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean input files arguement #494

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Clean input files arguement #494

jennysjaarda Jun 3, 2021

Replies: 4 comments · 2 replies

wlandau Jun 4, 2021 Maintainer

jennysjaarda Jun 10, 2021 Author

wlandau Jun 11, 2021 Maintainer

jennysjaarda Jun 11, 2021 Author

jennysjaarda Jun 11, 2021 Author

wlandau Jun 11, 2021 Maintainer

jennysjaarda
Jun 3, 2021

Replies: 4 comments 2 replies

wlandau
Jun 4, 2021
Maintainer

jennysjaarda
Jun 10, 2021
Author

wlandau Jun 11, 2021
Maintainer

jennysjaarda
Jun 11, 2021
Author

jennysjaarda
Jun 11, 2021
Author

wlandau Jun 11, 2021
Maintainer