Increase documentation on cloud-based data sources #138
Looks great! A couple minor suggestions to improve clarity.
SQLite databases stored on cloud services are downloaded locally before other CytoTable work is performed.
This is done to account for differences in how [SQLite's virtual file system (VFS)](https://www.sqlite.org/vfs.html) operates in context with cloud service object storage.
Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (`**kwarg`) passed through CytoTable to [`cloudpathlib`: `local_cache_dir`](https://cloudpathlib.drivendata.org/~latest/caching/#keeping-the-cache-around).
This argument helps ensure constraints surrounding temporary local file storage locations do not impede the ability to download or work with the data (for example, file size limitations and periodic deletions outside of CytoTable might be encountered within default OS temporary file storage locations).
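As a rough sketch of how this might look in practice, the snippet below prepares a dedicated cache directory and assembles keyword arguments for a conversion call. The helper `build_convert_kwargs`, the bucket path, and the output filename are hypothetical. The `dest_datatype="parquet"` value and the shape of the `cytotable.convert` call are assumptions not stated above, so check them against the CytoTable API reference for your version.

```python
from pathlib import Path


def build_convert_kwargs(source_path: str, dest_path: str, cache_dir: str) -> dict:
    """Assemble keyword arguments for a CytoTable conversion (hypothetical helper)."""
    cache = Path(cache_dir)
    # Create the cache directory up front so the downloaded SQLite file is
    # not written to the default OS temporary location.
    cache.mkdir(parents=True, exist_ok=True)
    return {
        "source_path": source_path,
        "dest_path": dest_path,
        "dest_datatype": "parquet",  # assumption: Parquet output
        # Extra keyword argument intended to be passed through to cloudpathlib.
        "local_cache_dir": str(cache),
    }


kwargs = build_convert_kwargs(
    source_path="s3://example-bucket/single-cells.sqlite",  # hypothetical bucket
    dest_path="single-cells.parquet",  # hypothetical output file
    cache_dir="./tmpdata",
)

# With CytoTable installed and cloud credentials configured, the call would
# look something like (assumed call shape):
#   import cytotable
#   cytotable.convert(**kwargs)
```

Pointing the cache at a directory on a volume with plenty of free space avoids the size limits and periodic cleanup that can affect default temporary locations.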
Why isn't this the default for CytoTable SQLite files? Is it too difficult or niche to implement?
This is a great question, thank you for raising it! This is possible to implement as a default but comes with some assumptions. The "quickest" path here would be to use the `target_path` argument location with a subdirectory to house these temporary files. This would theoretically at least double the storage space required for CytoTable output while CytoTable completes its work. For example, if there were two 40 GB SQLite files being used as source data from the cloud, I'd estimate that 160 GB would be needed at a minimum to complete processing (probably with a buffer of +20 GB extra or more). While discovering this challenge I wasn't sure if this would be surprising to users or if there was a better path to take. Either way, worth an issue to explore this a bit more!
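The estimate above can be sketched as a quick pre-flight check. This is only an illustration: the helper names `estimate_required_gb` and `has_enough_space` are hypothetical, and the sketch assumes that CytoTable output takes roughly as much space as the downloaded source files, which may not hold for every dataset.

```python
import shutil


def estimate_required_gb(source_sizes_gb, buffer_gb=0):
    """Rough space estimate: local copies of the sources plus similar-sized output."""
    downloaded = sum(source_sizes_gb)
    # Assumption: output needs about as much space as the downloaded sources,
    # so the total is double the source size plus an optional buffer.
    return 2 * downloaded + buffer_gb


def has_enough_space(path, required_gb):
    """Check free space (in decimal GB) on the volume that holds `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Two 40 GB SQLite sources, without and with a 20 GB buffer.
minimum_gb = estimate_required_gb([40, 40])
with_buffer_gb = estimate_required_gb([40, 40], buffer_gb=20)
print(minimum_gb, with_buffer_gb)
```

Running `has_enough_space` against the intended cache location before starting a conversion would surface a shortfall early rather than partway through a download.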
Created new issue to delve into possible default here: #140
Co-authored-by: Gregory Way <[email protected]>
Thank you @gwaybio for your review! Merging this in after applying updates and creating the new issue for a local cache default.
Description
This PR increases documentation surrounding cloud-based data sources which may be used through CytoTable. I tried to be as thorough as I could without delving too deeply into non-standard capabilities or user-specific cloud configurations (many unknowns here).
Additional unrelated changes:
- `pycytominer-transform` still existed within the documentation as an earlier reference to the title of the project. I corrected these as I went.
- `h5` tags received no visual display tweaks through the `alabaster` Sphinx theme. I added a custom stylesheet and minor changes to `conf.py` to adjust this and hopefully make reading a bit easier for users.

Thanks in advance for any feedback you may have!
Closes #62
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.