Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Port resume-mode to R #2405

Merged

Conversation

mojaveazure
Copy link
Member

Implement resume-mode ingestion in the R API

This PR parallels #664; it adds support for resume-mode in factory functions, which allows SOMA*Create() to check for an already existing TileDB object at uri and if so, simply connect to it rather than try to re-create

It also adds support for resume-mode in write_soma() methods; these methods check the soma_joinids that are present in the input data and that already exist on disk, and only writes data for soma_joinids that are missing from disk

The following SOMA functions have been modified to with an ingest_mode parameter:

  • SOMADataFrameCreate()
  • SOMASparseNDarrayCreate()
  • SOMACollectionCreate()
  • SOMAMeasurementCreate()
  • SOMAExperimentCreate()

The following methods of write_soma() have been modified with an ingest_mode parameter:

  • write_soma.character()/write_soma.data.frame()/write_soma.Dataframe()
  • write_soma.matrix()/write_soma.TsparseMatrix()
  • write_soma.Seurat() and other Seurat-subobject methods
  • write_soma.SummarizedExperiment()/write_soma.SingleCellExperiment()

resolves #1399

@mojaveazure mojaveazure force-pushed the paulhoffman/sc-42217/r-port-resume-mode-processing-from-python branch from 19eab3f to 32e45a6 Compare April 10, 2024 21:49
@mojaveazure mojaveazure marked this pull request as ready for review April 10, 2024 22:04
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mojaveazure !!

On #2245 (review) I attached sample scripts with sample invocations for local disk, S3, and TileDB Cloud URIs. (I also gave some narrative there around why these currently must be run outside of our package CI, and why our unit-test coverage is not sufficient for cloud testing.)

I ran those scripts as-is, with this PR checked out and built, with two mods:

  • updated the timestamped basenames;
  • replaced mode="resume" with ingest_mode="resume" which is appropriate since you called this ingest_mode which is a good thing.

Following the scenarios as given above, I found:

  • For local disk:
    • ✅ initial ingest succeeds
    • ✅ re-ingest without ingest_mode="resume" fails with an already-exists error, as intended
    • ✅ re-ingest with ingest_mode="resume" succeeds, and quickly
  • For S3:
    • ✅ initial ingest succeeds
    • ✅ re-ingest without ingest_mode="resume" fails with an already-exists error, as intended
    • ✅ re-ingest with ingest_mode="resume" succeeds, and quickly
  • For TileDB-Cloud
    • ✅ initial ingest succeeds
    • ✅ re-ingest without ingest_mode="resume" fails with an already-exists error, as intended
    • ❌ re-ingest with ingest_mode="resume" fails with an already-exists error, which is not as intended

INPUT   /var/s/d/seurat_pbmc3k.rds
OUTPUT  tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625
[2024-04-11 14:39:16.970] [default] [Process: 61302] [debug] [TileDBObject] initialize SOMAExperiment with 'tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625'
[2024-04-11 14:39:23.756] [default] [Process: 61302] [info] Creating new SOMAExperiment at 'tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/scratch/ring-20240312-134625'
Error: [TileDB::REST] Error: Error in libcurl POST operation: libcurl error message 'CURLE_OK'; HTTP code 500; server response data '{"code":903,"message":"error creating the group, please check your inputs and try again: failed to create tiledb group: Error in creating group: [TileDB::StorageManager] Error: Cannot create group; Group 's3://tiledb-johnkerl/scratch/ring-20240312-134625' already exists","request_id":"9a8bf500-f3d6-4c57-b8f2-8e1351be0ddb"}'.
Execution halted

@@ -29,6 +29,8 @@ test_that("Factory re-creation", {
expect_error(fxn(uri))
obj$close()
}

gc()
Copy link
Contributor

@eddelbuettel eddelbuettel Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW there was never conclusive proof these help. There is IIRC a decade+ old SO post of mine arguing to chain several calls to gc() ... but at the end of the day the R memory management layer is (as for any other managed language) a black box. So 🤷‍♂️ -- this may not do harm but it most likely won't help much.

@mojaveazure mojaveazure force-pushed the paulhoffman/sc-42217/r-port-resume-mode-processing-from-python branch from caea782 to 09909b3 Compare April 24, 2024 00:28
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-ran #2405 (review) -- all looks amazing -- thanks @mojaveazure !!

Implement resume-mode ingestion in the R API

This PR parallels #664; it adds support for resume-mode in factory
functions, which allows `SOMA*Create()` to check for an already existing
TileDB object at `uri` and if so, simply connect to it rather than try
to re-create

It also adds support for resume-mode in `write_soma()` methods; these
methods check the `soma_joinids` that are present in the input data and
that already exist on disk, and only writes data for `soma_joinids` that
are missing from disk

The following SOMA functions have been modified to with an `ingest_mode`
parameter:
 - `SOMADataFrameCreate()`
 - `SOMASparseNDarrayCreate()`
 - `SOMACollectionCreate()`
 - `SOMAMeasurementCreate()`
 - `SOMAExperimentCreate()`

The following methods of `write_soma()` have been modified with an
`ingest_mode` parameter:
 - `write_soma.character()`/`write_soma.data.frame()`/`write_soma.Dataframe()`
 - `write_soma.matrix()`/`write_soma.TsparseMatrix()`
 - `write_soma.Seurat()` and other Seurat-subobject methods
 - `write_soma.SummarizedExperiment()`/`write_soma.SingleCellExperiment()`

resolves #1399
@mojaveazure mojaveazure force-pushed the paulhoffman/sc-42217/r-port-resume-mode-processing-from-python branch from 09909b3 to 521bd79 Compare April 24, 2024 17:32
Bump develop version

[ci skip]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[r] Port resume-mode processing from Python to R
3 participants