-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[r] Port resume-mode to R #2405
[r] Port resume-mode to R #2405
Conversation
19eab3f
to
32e45a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mojaveazure !!
On #2245 (review) I attached sample scripts with sample invocations for local disk, S3, and TileDB Cloud URIs. (I also gave some narrative there around why these currently must be run outside of our package CI, and why our unit-test coverage is not sufficient for cloud testing.)
I ran those scripts as-is, with this PR checked out and built, with two mods:
- updated the timestamped basenames;
- replaced
mode="resume"
withingest_mode="resume"
which is appropriate since you called thisingest_mode
which is a good thing.
Following the scenarios as given above, I found:
- For local disk:
- ✅ initial ingest succeeds
- ✅ re-ingest without
ingest_mode="resume"
fails with an already-exists error, as intended - ✅ re-ingest with
ingest_mode="resume"
succeeds, and quickly
- For S3:
- ✅ initial ingest succeeds
- ✅ re-ingest without
ingest_mode="resume"
fails with an already-exists error, as intended - ✅ re-ingest with
ingest_mode="resume"
succeeds, and quickly
- For TileDB-Cloud
- ✅ initial ingest succeeds
- ✅ re-ingest without
ingest_mode="resume"
fails with an already-exists error, as intended - ❌ re-ingest with
ingest_mode="resume"
fails with an already-exists error, which is not as intended
INPUT /var/s/d/seurat_pbmc3k.rds
OUTPUT tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625
[2024-04-11 14:39:16.970] [default] [Process: 61302] [debug] [TileDBObject] initialize SOMAExperiment with 'tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625'
[2024-04-11 14:39:23.756] [default] [Process: 61302] [info] Creating new SOMAExperiment at 'tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625'
Error: [TileDB::REST] Error: Error in libcurl POST operation: libcurl error message 'CURLE_OK'; HTTP code 500; server response data '{"code":903,"message":"error creating the group, please check your inputs and try again: failed to create tiledb group: Error in creating group: [TileDB::StorageManager] Error: Cannot create group; Group 's3://tiledb-johnkerl/scratch/ring-20240312-134625' already exists","request_id":"9a8bf500-f3d6-4c57-b8f2-8e1351be0ddb"}'.
Execution halted
@@ -29,6 +29,8 @@ test_that("Factory re-creation", { | |||
expect_error(fxn(uri)) | |||
obj$close() | |||
} | |||
|
|||
gc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW there was never conclusive proof these help. There is IIRC a decade+ old SO post of mine arguing to chain several calls to gc()
... but at the end of the day the R memory management layer is (as for any other managed language) a black box. So 🤷♂️ -- this may not do harm but it most likely won't help much.
caea782
to
09909b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-ran #2405 (review) -- all looks amazing -- thanks @mojaveazure !!
Implement resume-mode ingestion in the R API This PR parallels #664; it adds support for resume-mode in factory functions, which allows `SOMA*Create()` to check for an already existing TileDB object at `uri` and if so, simply connect to it rather than try to re-create It also adds support for resume-mode in `write_soma()` methods; these methods check the `soma_joinids` that are present in the input data and that already exist on disk, and only writes data for `soma_joinids` that are missing from disk The following SOMA functions have been modified to with an `ingest_mode` parameter: - `SOMADataFrameCreate()` - `SOMASparseNDarrayCreate()` - `SOMACollectionCreate()` - `SOMAMeasurementCreate()` - `SOMAExperimentCreate()` The following methods of `write_soma()` have been modified with an `ingest_mode` parameter: - `write_soma.character()`/`write_soma.data.frame()`/`write_soma.Dataframe()` - `write_soma.matrix()`/`write_soma.TsparseMatrix()` - `write_soma.Seurat()` and other Seurat-subobject methods - `write_soma.SummarizedExperiment()`/`write_soma.SingleCellExperiment()` resolves #1399
09909b3
to
521bd79
Compare
Bump develop version [ci skip]
Implement resume-mode ingestion in the R API
This PR parallels #664; it adds support for resume-mode in factory functions, which allows
SOMA*Create()
to check for an already existing TileDB object aturi
and if so, simply connect to it rather than try to re-createIt also adds support for resume-mode in
write_soma()
methods; these methods check thesoma_joinids
that are present in the input data and that already exist on disk, and only writes data forsoma_joinids
that are missing from diskThe following SOMA functions have been modified to with an
ingest_mode
parameter:SOMADataFrameCreate()
SOMASparseNDarrayCreate()
SOMACollectionCreate()
SOMAMeasurementCreate()
SOMAExperimentCreate()
The following methods of
write_soma()
have been modified with aningest_mode
parameter:write_soma.character()
/write_soma.data.frame()
/write_soma.Dataframe()
write_soma.matrix()
/write_soma.TsparseMatrix()
write_soma.Seurat()
and other Seurat-subobject methodswrite_soma.SummarizedExperiment()
/write_soma.SingleCellExperiment()
resolves #1399