Increase documentation on cloud-based data sources #138

d33bs · 2024-01-08T17:49:09Z

Description

This PR increases documentation surrounding cloud-based data sources which may be used through CytoTable. I tried to be as thorough as I could without delving too deeply into non-standard capabilities or user-specific cloud configurations (many unknowns here).

Additional unrelated changes:

I discovered that pycytominer-transform, an earlier title of the project, was still referenced within the documentation. I corrected these references as I went.
I found that h5 tags received no visual display tweaks through the alabaster Sphinx theme. I added a custom stylesheet and minor changes to conf.py to adjust this and hopefully make reading a bit easier for users.

Thanks in advance for any feedback you may have!

Closes #62

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

gwaybio

Looks great! A couple minor suggestions to improve clarity.

docs/source/overview.md

gwaybio · 2024-01-08T18:16:56Z

+SQLite databases stored on cloud services are downloaded locally before other CytoTable work is performed.
+This is done to account for differences in how [SQLite's virtual file system (VFS)](https://www.sqlite.org/vfs.html) operates in context with cloud service object storage.
+Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (`**kwarg`) passed through CytoTable to [`cloudpathlib`: `local_cache_dir`](https://cloudpathlib.drivendata.org/~latest/caching/#keeping-the-cache-around).
+This argument helps ensure constraints surrounding temporary local file storage locations do not impede the ability to download or work with the data (for example, file size limitations and periodic deletions outside of CytoTable might be encountered within default OS temporary file storage locations).
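A minimal sketch of how a user might vet a cache location before passing it as `local_cache_dir`, assuming only the Python standard library. The helper names (`default_temp_free_bytes`, `choose_cache_dir`) and the `./cytotable_cache` path are hypothetical and are not part of CytoTable or `cloudpathlib`.

```python
import shutil
import tempfile
from pathlib import Path


def default_temp_free_bytes() -> int:
    """Free bytes on the filesystem that holds the OS default temp directory."""
    return shutil.disk_usage(tempfile.gettempdir()).free


def choose_cache_dir(preferred: str, required_bytes: int) -> Path:
    """Create `preferred` and return it if its filesystem can hold the download.

    Hypothetical helper: raises OSError when free space is insufficient.
    """
    path = Path(preferred)
    path.mkdir(parents=True, exist_ok=True)
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise OSError(
            f"{path} has {free} bytes free, but {required_bytes} are required"
        )
    return path
```

For example, `choose_cache_dir("./cytotable_cache", 40 * 10**9)` would confirm room for a 40 GB SQLite file; the returned path could then be supplied as the `local_cache_dir` keyword argument described above.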


Why isn't this default for CytoTable sqlite files? Is it too difficult or niche to implement?

This is a great question, thank you for raising it! It is possible to implement this as a default, but doing so comes with some assumptions. The "quickest" path would be to use a subdirectory of the target_path argument location to house these temporary files. In theory, this would at least double the storage space CytoTable requires while it completes its work. For example, if two 40 GB SQLite files from the cloud were used as source data, I'd estimate that at least 160 GB would be needed to complete processing (probably with a buffer of 20 GB or more on top). When I discovered this challenge, I wasn't sure whether this would surprise users or whether there was a better path to take. Either way, it's worth an issue to explore this further!
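The storage estimate in the reply above can be sketched as simple arithmetic. This is an illustration only: it assumes the converted output is roughly the same size as the cached source files (the "at least double" reasoning), and the function name and parameters are hypothetical.

```python
def estimate_required_gb(source_sizes_gb, output_ratio=1.0, buffer_gb=20.0):
    """Rough disk estimate for cloud SQLite sources cached next to the output.

    Assumption: output size ~= output_ratio * total source size.
    """
    cached_gb = sum(source_sizes_gb)       # local copies of the cloud files
    output_gb = cached_gb * output_ratio   # converted output written alongside
    return cached_gb + output_gb + buffer_gb


# Two 40 GB SQLite files: 80 GB cached + 80 GB output = 160 GB minimum,
# or 180 GB with the suggested 20 GB buffer.
minimum_gb = estimate_required_gb([40, 40], buffer_gb=0)
with_buffer_gb = estimate_required_gb([40, 40])
```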

Created new issue to delve into possible default here: #140

Co-authored-by: Gregory Way <[email protected]>

d33bs · 2024-01-09T13:57:49Z

Thank you @gwaybio for your review! Merging this in after applying updates and creating the new issue for a local cache default.

d33bs added 5 commits January 8, 2024 08:47

update pycytominer-transform mentions

9bd2890

update docs on cloud sources

1124971

add custom css for h5 header display

cbc2877

add issue link

24b0454

enhance cloud auth mentions

ce27b1c

d33bs requested review from gwaybio, kenibrewer and falquaddoomi January 8, 2024 17:49

gwaybio approved these changes Jan 8, 2024

d33bs mentioned this pull request Jan 9, 2024

Enable automatic local_cache_dir specification for cloud-based SQLite data sources #140

Open

d33bs and others added 2 commits January 9, 2024 06:45

Apply suggestions from code review

d1890b8

Co-authored-by: Gregory Way <[email protected]>

simplify subheader for cloud data sources

c7d665e

d33bs merged commit 85f447b into cytomining:main Jan 9, 2024
7 checks passed

d33bs deleted the document-cloud-sources branch January 9, 2024 13:58
