Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase documentation on cloud-based data sources #138

Merged
merged 7 commits into from
Jan 9, 2024

Conversation

d33bs
Copy link
Member

@d33bs d33bs commented Jan 8, 2024

Description

This PR increases documentation surrounding cloud-based data sources which may be used through CytoTable. I tried to be as thorough as I could without delving too deeply into non-standard capabilities or user-specific cloud configurations (many unknowns here).

Additional unrelated changes:

  • I discovered pycytominer-transform still existed within the documentation as an earlier reference to the title of the project. I corrected these as I went.
  • I found that h5 tags received no visual display tweaks through the alabaster Sphinx theme. I added a custom stylesheet and minor changes to conf.py to adjust this and hopefully make reading a bit easier for users.

Thanks in advance for any feedback you may have!

Closes #62

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple minor suggestions to improve clarity.

docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
docs/source/overview.md Outdated Show resolved Hide resolved
SQLite databases stored on cloud services are downloaded locally before other CytoTable work is performed.
This is done to account for differences in how [SQLite's virtual file system (VFS)](https://www.sqlite.org/vfs.html) operates in context with cloud service object storage.
Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (`**kwarg`) passed through CytoTable to [`cloudpathlib`: `local_cache_dir`](https://cloudpathlib.drivendata.org/~latest/caching/#keeping-the-cache-around).
This argument helps ensure constraints surrounding temporary local file storage locations do not impede the ability to download or work with the data (for example, file size limitations and periodic deletions outside of CytoTable might be encountered within default OS temporary file storage locations).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this default for CytoTable sqlite files? Is it too difficult or niche to implement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great question, thank you for raising it! This is possible to implement as a default but comes with some assumptions. The "quickest" path here would be to use the target_path argument location with a subdirectory to house these temporary files. This would theoretically at least double the storage space required for CytoTable output while CytoTable completes its work. For example, if there were two 40GB SQLite files being used as source data from the cloud, I'd estimate that 160 GB would be needed at a minimum to complete processing (probably with a buffer of +20 GB extra or more). While discovering this challenge I wasn't sure if this would be astounding to users or if there was a better path to take. Either way, worth an issue to explore this a bit more!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created new issue to delve into possible default here: #140

@d33bs
Copy link
Member Author

d33bs commented Jan 9, 2024

Thank you @gwaybio for your review! Merging this in after applying updates and creating the new issue for a local cache default.

@d33bs d33bs merged commit 85f447b into cytomining:main Jan 9, 2024
7 checks passed
@d33bs d33bs deleted the document-cloud-sources branch January 9, 2024 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Increase documentation on cloud-based sources
2 participants