[help] How does format = "file" work for nested directories? #1257

rsangole · 2024-03-18T23:05:18Z


Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

Hello,

How does format = 'file' work when data is embedded within directories?

For example, I have some data like so (partitioned Parquet files created by arrow):

Would I define a target like this, where I simply track the top-level directory (and targets tracks changes to any files contained within that directory)?

list(
    tar_target(
        fake_data_loc, {
            out_path <- "some_location/FAKE_arrow10"
            create_data() |>
                group_by(group) |>
                arrow::write_dataset(out_path)
            
            out_path # <---- just the root directory
        },
        format = "file"
    ),
    tar_target(
        downstream,{
            fake_data_loc
            
            ...
            ...
        }
    ),
    ...
)

Or would I have to define a target like so, where the target is a vector of all the files within?

list(
    tar_target(
        fake_data_loc,{
            out_path <- "some_location/FAKE_arrow10"
            create_data() |>
                group_by(group) |>
                arrow::write_dataset(out_path)
            
            fs::dir_ls(out_path, recurse = TRUE, type = "file") # <---- a vector of all the files
        },
        format = "file"
    ),
    tar_target(
        downstream,{
            fake_data_loc
            
            ...
            ...
        }
    ),
    ...
)

Thanks!

Answered by rsangole

Mar 19, 2024

I suppose I answered my own question with an experiment...

library(targets)
list(
  tar_target(
    mtcars_out_1,{
      tibble::as_tibble(mtcars) |>
        dplyr::group_by(cyl) |>
        arrow::write_dataset("folder_out")
      here::here("folder_out")
    },
    format = "file"
  ),
  tar_target(
    mtcars_out_2,{
      tibble::as_tibble(mtcars) |>
        dplyr::group_by(cyl) |>
        arrow::write_dataset("file_out")
      fs::dir_ls("file_out", recurse = TRUE, type = "file")
    },
    format = "file"
  )
)

Both approaches seem to work.

Is one approach more performant than the other, especially for very large datasets?


wlandau Mar 20, 2024
Maintainer

It actually comes out to the same thing because internally targets uses list.files(recursive = TRUE) to find all the files in a directory. mtcars_out_1 may be ever so slightly faster because it stores a shorter character string in the metadata.
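To make the equivalence concrete, here is a minimal base R sketch, not the internals of targets. It assumes a hypothetical demo directory under tempdir(), placeholder files in place of real Parquet output, and tools::md5sum() as a stand-in for whatever hashing targets actually performs. It only illustrates that expanding a directory with list.files(recursive = TRUE) yields the same set of files as listing them explicitly, and that a change to one nested file shows up in per-file hashes.

```r
# Hypothetical demo directory under tempdir(); not from the original post.
root <- file.path(tempdir(), "folder_out_demo")
unlink(root, recursive = TRUE)

# Mimic the shape of a partitioned dataset with placeholder files.
parts <- c("cyl=4", "cyl=6", "cyl=8")
for (part in parts) {
  dir.create(file.path(root, part), recursive = TRUE)
  writeLines("placeholder", file.path(root, part, "part-0.parquet"))
}

# Approach 1: return only the root directory and expand it recursively,
# as the reply above describes.
from_directory <- list.files(root, recursive = TRUE, full.names = TRUE)

# Approach 2: return every file path explicitly.
explicit_files <- file.path(root, parts, "part-0.parquet")

# Both approaches refer to the same set of files.
same_files <- setequal(from_directory, explicit_files)

# Stand-in for change detection (an assumption, not targets' own hashing):
# hash each file, modify one nested file, and hash again.
hash_before <- tools::md5sum(from_directory)
writeLines("changed", file.path(root, "cyl=6", "part-0.parquet"))
hash_after <- tools::md5sum(from_directory)

# Paths of the files whose contents changed.
changed <- names(hash_before)[hash_before != hash_after]
```

In this sketch the only practical difference between the two approaches is the length of the value the target returns: one directory path versus a vector of file paths.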
