
write_fst Seems To Skip Small Tables When Writing In A for Loop #280

Open
drag05 opened this issue Dec 8, 2023 · 5 comments
@drag05

drag05 commented Dec 8, 2023

I am using fst version 0.9.8 with R-4.3.2.

I am writing 'data.table' class data frames in a for loop. The tables have different numbers of rows, as they are the result of in-silico chemical modification of a list of peptides (protein fragments).

When writing these data tables to csv format using data.table::fwrite (with append = TRUE), all modified peptides are written correctly and all are present.

When writing them to fst format with compress = 50 or compress = 100, data tables with 10-11 rows (resulting from peptides with 1-2 modifications) are skipped, while the bigger ones are written as expected.

Unfortunately, proprietary rights do not allow me to present a full example, only the call to write_fst, with the uniform_encoding argument left at its default (for speed, as there are millions of such tables to be written):

write_fst(dt1, paste0(fname, '-',  i, '.fst'), compress = 100)

Here, dt1 is a 'data.table' class data frame residing in memory, fname is a character vector of length 1, i is the current iteration index, and the arguments of paste0 form the unique name of the fst file written to disk. The fst files that do get written have names formatted as expected.

The upstream code is the same; the only difference at this point is the output file format, decided by an if control selected by the user: if (compress == yes) write "fst", else write "csv".
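
For context, a minimal sketch of the kind of loop described above; peptide_tables and the compress flag are hypothetical stand-ins, not the actual (proprietary) code:

library(data.table)
library(fst)

# peptide_tables: a hypothetical list of per-peptide 'data.table' objects
# compress: a hypothetical user setting, "yes" for fst output, anything else for csv
for (i in seq_along(peptide_tables)) {

  dt1 <- peptide_tables[[i]]

  if (compress == "yes") {
    # one uniquely named fst file per table
    write_fst(dt1, paste0(fname, "-", i, ".fst"), compress = 100)
  } else {
    # a single csv file, appended to on every iteration
    fwrite(dt1, paste0(fname, ".csv"), append = TRUE)
  }
}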
Thank you!

@AndyTwo2
Contributor

I have personally never seen this behaviour, and am unable to replicate it based on the information given. A few questions whose answers might enable you to narrow down the issue without de-anonymising your data include:

  • If the compress argument is set to 0, are the 10-11 row tables successfully written? (If so, this may be an issue with the compression fst applies.)
  • If you reduce the number of columns of dt1 you are trying to write, or limit the write to certain column classes (e.g. just the numeric columns via dt1[, .SD, .SDcols = is.numeric]; see the sketch below), are the 10-11 row tables successfully written? (If so, this may be an issue with the fst package being unable to write certain column types, or a certain number of columns.)
  • If you change the order of your loop so that all the larger tables are written first and the 10-11 row tables afterwards, are the 10-11 row tables successfully written? (If so, I'm not sure what could be causing the issue.)

Without a reproducible example, I doubt much more help can be provided, but following the advice in this article may help you create one: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
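
A minimal sketch of the first two checks, assuming dt1, fname and i exist as in the original post (this is not part of the actual pipeline):

library(data.table)
library(fst)

# check 1: write the same small table with compression switched off
write_fst(dt1, paste0(fname, "-", i, "-nocompress.fst"), compress = 0)

# check 2: write only the numeric columns of the same table
dt_num <- dt1[, .SD, .SDcols = is.numeric]
write_fst(dt_num, paste0(fname, "-", i, "-numeric.fst"), compress = 100)

# confirm both files actually landed on disk
file.exists(paste0(fname, "-", i, c("-nocompress.fst", "-numeric.fst")))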

@drag05
Author

drag05 commented Dec 12, 2023

@AndyTwo2 My answers bullet-by-bullet:

  • Frankly, I haven't tried zero compression, and the job is now running, expected to produce an over 2 billion-row table written in chunks as fst files. I did think compression might play a role and tried the default and 100 values, but not zero. I am not sure I have mentioned this before, but the screen message confirms that the small tables are being written to disk the same as the long ones; however, only the long ones are on disk. Could it be a disk bus/buffer issue?

  • Neither of these suggestions is possible with the current data. Each table contains character and numeric columns. Separating them is not possible because this is an intermediate process: somebody else is transferring the files to BigQuery. We would have sent them directly to BigQuery, but then we would have been concerned with connection drops and other events, as these are very long jobs.

  • This one I had tried, to no avail: same outcome.

When the big job is complete, I will go back through all the archives containing fst files and check which peptides have been written and which have not, then iterate again over the left-out ones. Tables of comparable length produced in the same loop are written as they should be.
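
A minimal sketch of that post-run check, assuming the files are named fname-i.fst as in the original post and that n_tables iterations were expected (n_tables and peptide_tables are hypothetical names):

library(fst)

# expected file names, one per iteration
expected <- paste0(fname, "-", seq_len(n_tables), ".fst")

# iterations whose fst file never made it to disk
missing_idx <- which(!file.exists(expected))

# re-run only those iterations through the writing step, e.g.
# for (i in missing_idx) write_fst(peptide_tables[[i]], expected[i], compress = 100)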

Thanks for the suggestions!

@drag05
Author

drag05 commented Dec 13, 2023

@AndyTwo2 I should probably have mentioned that, although different in number of rows, all tables contain the same columns (in reference to the second bullet in your reply).
Also, I have tried both options for uniform_encoding, with the same result.
Thank you!

@MarcusKlik
Collaborator

Hi @drag05, thanks for reporting your issue!

I've done a small test using a table with 9 rows, but cannot reproduce your finding:

nr_of_rows <- 9
test_dir <- tempdir()

df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

# write using various compression settings
df |>
  fst::write_fst(paste0(test_dir, "/compress_0.fst"),   compress = 0  ) |>
  fst::write_fst(paste0(test_dir, "/compress_1.fst"),   compress = 1  ) |>
  fst::write_fst(paste0(test_dir, "/compress_50.fst"),  compress = 50 ) |>
  fst::write_fst(paste0(test_dir, "/compress_75.fst"),  compress = 75 ) |>
  fst::write_fst(paste0(test_dir, "/compress_100.fst"), compress = 100)

# test roundtrip against source table  
fst::read_fst(paste0(test_dir, "/compress_0.fst"))   |> testthat::expect_equal(df)
fst::read_fst(paste0(test_dir, "/compress_1.fst"))   |> testthat::expect_equal(df)
fst::read_fst(paste0(test_dir, "/compress_50.fst"))  |> testthat::expect_equal(df)
fst::read_fst(paste0(test_dir, "/compress_75.fst"))  |> testthat::expect_equal(df)
fst::read_fst(paste0(test_dir, "/compress_100.fst")) |> testthat::expect_equal(df)

Is there a way you can adapt the example above to reflect the type of data you are using and reproduce the issue? Thanks!

@drag05
Author

drag05 commented Sep 26, 2024

@MarcusKlik A "small test" does not replicate the error. A large dataset, from which table chunks of different nrow() compete for disk writes to another location, is needed to replicate this.

As I mentioned to @AndyTwo2 above, the source data being transferred to another location on disk has millions of rows and tens of thousands of table chunks of different nrow(). When competing for disk writes, the small tables, although formed correctly and displayed as such, appear to be left in cache while the larger table chunks are actually written to the target location on disk.

I am no longer sure this is a package issue; it could be a disk cache or transfer issue. Thank you!
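
One way to narrow down whether the small files are lost by write_fst or somewhere downstream, sketched here using the dt1, fname and i names from the original post, is to verify each file immediately after writing and retry when it is missing:

library(fst)

path <- paste0(fname, "-", i, ".fst")
write_fst(dt1, path, compress = 100)

# confirm the file exists and reads back with the expected number of rows;
# for very large tables, metadata_fst(path) is a cheaper check than read_fst()
ok <- file.exists(path) && nrow(read_fst(path)) == nrow(dt1)

if (!ok) {
  warning("fst file missing or incomplete for iteration ", i, "; retrying write")
  write_fst(dt1, path, compress = 100)
}

If the small files still disappear even though this check passes right after the write, that would point away from write_fst itself and toward something downstream, such as the transfer step or the filesystem.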
