
[R] Permission error on Windows when deleting file previously accessed with open_dataset #31796

Open
asfimport opened this issue Apr 29, 2022 · 25 comments


asfimport commented Apr 29, 2022

On Windows this fails:

library(arrow)

write_dataset(iris, "test_dataset")

# Original example was with DuckDB, but that's not necessarily the issue
# con <- open_dataset("test_dataset") |> to_duckdb()
con <- open_dataset("test_dataset")$NewScan()$Finish()$ToRecordBatchReader()

file.remove("test_dataset/part-0.parquet")
#> Warning in file.remove("test_dataset/part-0.parquet"): cannot remove file
#> 'test_dataset/part-0.parquet', reason 'Permission denied'
#> [1] FALSE

But on macOS it does not:

library(arrow)

write_dataset(iris, "test_dataset")

# Original example was with DuckDB, but that's not necessarily the issue
# con <- open_dataset("test_dataset") |> to_duckdb()
con <- open_dataset("test_dataset")$NewScan()$Finish()$ToRecordBatchReader()

file.remove("test_dataset/part-0.parquet")
#> [1] TRUE

Reporter: Will Jones / @wjones127

Related issues:

Note: This issue was originally created as ARROW-16421. Please see the migration documentation for further details.


Will Jones / @wjones127:
Possibly related


Will Jones / @wjones127:
Actually, if I call gc() prior to file.remove, this works fine.
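Based on the first reprex above, that workaround would look like the following (a sketch; it assumes the same `test_dataset` layout as the original example):

```r
library(arrow)

write_dataset(iris, "test_dataset")
con <- open_dataset("test_dataset")$NewScan()$Finish()$ToRecordBatchReader()

# Drop the R reference and force a garbage collection so the C++
# destructors run and release the open file handle before the delete.
rm(con)
gc()

file.remove("test_dataset/part-0.parquet")
```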


Weston Pace / @westonpace:
Windows is notoriously stubborn about deleting files that have any kind of open handle so that makes sense. Another possibility that is often frustrating when deleting recently created files is that the file is picked up by a search indexer or an antivirus scanner of some sort. I suspect that is why MinIO/S3 tests sporadically fail on Windows as well.


Dewey Dunnington / @paleolimbot:
There doesn't seem to be a close_dataset() or dataset$close() (or maybe it's the Scanner object?). The deleters for the C++ objects behind the R6 objects usually (but not always) get run on gc(), so we may be relying on some implicit cleanup where we should be encouraging explicit closing (e.g., R warns when you forget to close() a connection to a file; DBI warns when you forget to close a database connection).


Weston Pace / @westonpace:
The scanner does need to close its files; it takes care of this itself as it finishes scanning. One issue we have today (which may be related) is that the scanner reports that it is finished (and the exec plan too, if you are using one) before it has completely finished all close/cleanup tasks, so there is some possibility that the close happens after the scanner is finished. That behavior will change in a month or two as part of ARROW-15732.


Will Jones / @wjones127:
Yeah, this was inspired by an issue I was told about in a private channel, but the above reproduction doesn't seem to match the issue they are seeing, so I'm waiting to see if they can make a reproducible example.


Will Jones / @wjones127:
It seems this issue isn't resolved by rm() and gc() when there is a large number of partitions:

library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.3
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(readr)

# warning: 5 MB
test_data <- read_csv("https://cn.dataone.org/cn/v2/resolve/knb.92098.1")
#> Rows: 104497 Columns: 9
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (7): Y_CNAREA, CITY, A_RESID, local, Y_Rural, P_FSHY, P_TYPE
#> dbl (2): YEAR, Permit_Count
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

write_dataset(test_data, "test_data", partitioning = "CITY")

# Original example was with DuckDB, but that's not necessarily the issue
# con <- open_dataset("test_dataset") |> to_duckdb()
con <- open_dataset("test_data")$NewScan()$Finish()$ToRecordBatchReader()

files <- dir("test_data", full.names = TRUE, recursive = TRUE)

rm(con)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 1052314 56.2    2227298  119  1220182 65.2
#> Vcells 2795421 21.4    8388608   64  3736313 28.6

lapply(files, file.remove)
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=CORNER BAY/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=COUNCIL/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=CRAIG/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=CROOKED CREEK/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=DEEP BAY/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=DEERING/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=DELTA JUNCTION/
#> part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file 'test_data/CITY=DENALI PARK/
#> part-0.parquet', reason 'Permission denied'
#> [[1]]
#> [1] TRUE
#> 
#> [[2]]
#> [1] TRUE
#> 
#> [[3]]
#> [1] TRUE
#> 
#> ...
#> 
#> [[343]]
#> [1] TRUE
#> 
#> [[344]]
#> [1] TRUE


Will Jones / @wjones127:

> Windows is notoriously stubborn about deleting files that have any kind of open handle so that makes sense. Another possibility that is often frustrating when deleting recently created files is that the file is picked up by a search indexer or an antivirus scanner of some sort. I suspect that is why MinIO/S3 tests sporadically fail on Windows as well.

I tested the above with a 20 second sleep in between garbage collection and the deletion, and got the exact same result, so I don't think it's this.

> The scanner does need to close its files. It takes care of this itself as it finishes scanning.

@westonpace Could you point me to the code where that happens? I'm having trouble finding it.


Weston Pace / @westonpace:
It's rather spread out and not at all obvious. For example, if we are scanning parquet files, then we will call `parquet::arrow::FileReader::GetRecordBatchGenerator` in file_parquet.cc. This creates a generator which is an instance of `parquet::arrow::RowGroupGenerator`. When the `RowGroupGenerator` is destroyed, its members are destroyed. One of them is a `shared_ptr<parquet::arrow::<unnamed>::FileReaderImpl>`. This in turn has an owned reference to a `parquet::ParquetFileReader`, and the destructor of `parquet::ParquetFileReader` closes the file.


Weston Pace / @westonpace:
Right now the destruction of the record batch generator likely happens on a CPU thread task and the exec plan is not (yet) configured to wait for that CPU thread task. Changing that is the heart of ARROW-15732.


Miles Granger / @milesgranger:
@westonpace While working on ARROW-13763, it appeared that closing files on the C++ side doesn't happen in the places one might initially expect. For example, `IpcFileFormat::Inspect` in file_ipc.cc could close the reader before returning the schema, but doesn't. Also, `SerializedFile::Close` (in `ParquetFileReader::Contents`) doesn't close the file; it only deals with decryption keys. It is possible, however, to add a `source_.get()->Close()` immediately after to close the RandomAccessFile.

Is my understanding correct? If so, it seems like it could be related to this issue, but I suppose there are reasons for not doing it that way?


Weston Pace / @westonpace:
@milesgranger reader should go out of scope when that method returns. I would expect the destructor of reader to close any open files.

(I'm specifically referring to IpcFileFormat::Inspect here)


Joris Van den Bossche / @jorisvandenbossche:
So in general in C++ we rely on objects going out of scope to ensure files are closed (in contrast to explicitly calling Close() on the InputStream)?

But given the reported issues on Windows, that might not work sufficiently in all cases?
( @milesgranger from testing on Linux, I think you couldn't observe any files being kept open after a pq.read_table or ds.dataset(..)?)


Miles Granger / @milesgranger:
That's right; I couldn't see that any files were left open when using those methods. Running a loop and watching lsof, I can spot the file being opened and closed immediately during ds.dataset(...).read(), and assigning to a variable also does not leave the file open from what I could see.


Antoine Pitrou / @pitrou:
(deleted the comment I just posted because that was a slightly erroneous take IMHO)


Weston Pace / @westonpace:

> So in general in C++ we rely on objects going out of scope to ensure files are closed (in contrast to explicitly calling Close() on the InputStream)?

Yes. In C++ there is no "with" or "using" statement like there is in other languages, so relying on scope is the safe alternative. Explicitly calling close usually leaves open the risk of missing a close on a failure path, especially if an exception is thrown.

> But given the reported issues on Windows, that might not work sufficiently in all cases?

I'm not convinced this is the case. I'm pretty sure we are calling close on Windows too. How are we confirming the file is open?


Weston Pace / @westonpace:
My belief is that, on Windows, closing a file handle does not guarantee that file can immediately be deleted. For example, consider this page: https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/irp-mj-cleanup

> It is important to note that when all handles to a file object have been closed, this does not necessarily mean that the file object is no longer being used. System components, such as the Cache Manager and the Memory Manager, might hold outstanding references to the file object. These components can still read to or write from a file, even after an IRP_MJ_CLEANUP request is received.


Joris Van den Bossche / @jorisvandenbossche:
So if that's the case, this issue is basically a "can't fix"?

And the workaround we can give to users: put some kind of try-except around the delete, and try again after some pause?
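A user-side retry along those lines might look like this (a sketch; `remove_with_retry`, the retry count, and the delay are all made up for illustration):

```r
# Try to delete a file, retrying with a pause in case the OS (or an
# indexer/antivirus) still holds a reference after the handle closes.
remove_with_retry <- function(path, retries = 5, delay = 1) {
  for (i in seq_len(retries)) {
    if (suppressWarnings(file.remove(path))) {
      return(TRUE)
    }
    Sys.sleep(delay)
  }
  FALSE
}

# remove_with_retry("test_dataset/part-0.parquet")
```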


Antoine Pitrou / @pitrou:
We should first ensure we understand what exactly is happening here.


Antoine Pitrou / @pitrou:
File deletion is usually quite reliable on Windows as long as you don't have aggressive virus scanners etc. running in the background.

For example, our filesystem tests create, write and delete files and we don't seem to get spurious deletion errors on them.


Antoine Pitrou / @pitrou:
And at least it should be easy for an R developer to dump logs on C++ file opens and closes, and see what happens exactly here.


Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@sbashevkin:

Hello! This issue has reemerged with Arrow v11.0.0.3. It was previously fixed in an older arrow version but we are now encountering it again. Please let me know if you'd prefer I open a new issue.

The issue is a bit stranger now in that it seems to require multiple dplyr and base-R functions to trigger it. I played around with it a little and was only able to trigger the issue with the combination of a join, head, and collect call that you'll see in the reprex below. If I delete any of those function calls, the issue disappears. As with its earlier iteration, it only occurs with larger datasets and only on Windows. Lastly, this issue occurs with Arrow v11.0.0.3 but not with Arrow v10.0.1. Please find a reprex below.

library(arrow)
#> Warning: package 'arrow' was built under R version 4.2.3
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
test_data1 <- expand.grid(A=1:150, B=1:10, C=1:50)
test_data2 <- data.frame(A=1:100, D=1:100)
write_dataset(test_data1, "test_data1", partitioning = "A")
write_dataset(test_data2, "test_data2", partitioning = "A")
test <- open_dataset("test_data1") %>%
  dplyr::inner_join(open_dataset("test_data2")) %>%
  head() %>%
  collect()
rm(list = ls())
gc()
#>           used (Mb) gc trigger  (Mb) max used (Mb)
#> Ncells 1184245 63.3    2479418 132.5  1365350 73.0
#> Vcells 2047220 15.7    8388608  64.0  3304441 25.3
files1 <- dir("test_data1", full.names = TRUE, recursive = TRUE)
files2 <- dir("test_data2", full.names = TRUE, recursive = TRUE)
files1_leftover <- lapply(files1, file.remove)
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=96/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=97/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=98/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=99/part-0.parquet', reason 'Permission denied'
which(!unlist(files1_leftover))
#> [1] 147 148 149 150
files1[!unlist(files1_leftover)]
#> [1] "test_data1/A=96/part-0.parquet" "test_data1/A=97/part-0.parquet"
#> [3] "test_data1/A=98/part-0.parquet" "test_data1/A=99/part-0.parquet"
files2_leftover <- lapply(files2, file.remove)
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=96/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=97/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=98/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=99/part-0.parquet', reason 'Permission denied'
which(!unlist(files2_leftover))
#> [1]  97  98  99 100
files2[!unlist(files2_leftover)]
#> [1] "test_data2/A=96/part-0.parquet" "test_data2/A=97/part-0.parquet"
#> [3] "test_data2/A=98/part-0.parquet" "test_data2/A=99/part-0.parquet"

Created on 2023-03-29 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Los_Angeles
#>  date     2023-03-29
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 11.0.0.3 2023-03-08 [1] CRAN (R 4.2.3)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.2.1)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.2.2)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.2.2)
#>  cli           3.6.1    2023-03-23 [1] CRAN (R 4.2.3)
#>  digest        0.6.31   2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr       * 1.1.1    2023-03-22 [1] CRAN (R 4.2.3)
#>  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi         1.0.4    2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0    2021-01-25 [1] CRAN (R 4.2.1)
#>  fs            1.6.0    2023-01-23 [1] CRAN (R 4.2.2)
#>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.2.2)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4    2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.2.2)
#>  purrr         1.0.1    2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.2.3)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.2.2)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.2.2)
#>  R.utils       2.12.2   2022-11-11 [1] CRAN (R 4.2.3)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.1.0    2023-03-14 [1] CRAN (R 4.2.3)
#>  rmarkdown     2.20     2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.2.1)
#>  styler        1.9.1    2023-03-04 [1] CRAN (R 4.2.3)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.2.3)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.2.2)
#>  tzdb          0.3.0    2022-03-28 [1] CRAN (R 4.2.2)
#>  utf8          1.2.3    2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs         0.6.1    2023-03-22 [1] CRAN (R 4.2.3)
#>  withr         2.5.0    2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.37     2023-01-31 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/sbashevkin/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@sbashevkin:

The error does seem to be related to the `head()` function. When I replace

head()

with

`[`(1:6,)

the error disappears.
