Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

Merged
merged 8 commits into from
Aug 26, 2024

Conversation

heispv
Copy link
Contributor

@heispv heispv commented Aug 16, 2024

Description:

This PR enhances the EmbeddingService by optimizing memory usage during embedding computations, refactoring core logic, and adding unit tests for robustness.

Key Changes:

  1. Dynamic Memory Management:

    • Automatically estimate and manage RAM usage by saving embeddings to disk when memory limits are reached.
  2. Refactor EmbeddingService:

    • Extracted core logic into a dedicated method.
    • Handled ultra-long reads by saving embeddings immediately and preventing additional sequence loading.
    • Removed SAVE_AFTER_N_EMBEDDINGS.
  3. Unit Testing:

    • Added tests for computing embeddings on long, short, and mixed sequences.
    • Validated dynamic memory management and embedding computation accuracy.
    • Ensured correct result paths and file existence in the tests.

Issue Reference:

Closes [#98]

heispv added 4 commits August 16, 2024 14:48
Enhanced the embedding computation function to dynamically manage RAM by estimating the maximum embeddings that can fit and automatically saving them to optimize memory usage
- Extract core embedding service logic into embedding_service method
- Add special case handling for ultra-long reads:
  - Immediately save ultra-long read embeddings to disk
  - Avoid loading additional sequences when ultra-long read detected
- Dynamically calculate max embeddings that fit in available memory
  - Use this to determine when to flush embeddings to disk
- Code cleanup:
  - Use type hints for improved readability
  - Docstrings for key methods
@SebieF SebieF added enhancement New feature or request maintenance Code or example maintainance labels Aug 19, 2024
@SebieF SebieF self-requested a review August 19, 2024 07:31
Copy link
Collaborator

@SebieF SebieF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work, thank you for looking into this quite complex problem! :)

Most of my comments are minor suggestions for improvements, the only major change that I would suggest is to unify the handling of the ultra-long reads and normal reads if at all possible, and removing the while loops altogether.

Other than that, the approach seems to work fine, I ran the tests locally and executed a run with ProtT5 for 2377 sequences successfully.

tests/test_embedding_service.py Outdated Show resolved Hide resolved
tests/test_embedding_service.py Outdated Show resolved Hide resolved
tests/test_embedding_service.py Outdated Show resolved Hide resolved
tests/test_embedding_service.py Outdated Show resolved Hide resolved
tests/test_embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
heispv added 2 commits August 25, 2024 12:04
…ogging

- Renamed embedding_service to _do_embeddings_computation for clarity.
- Replaced while loop with for loop to prevent infinite loops and simplify processing.
- Combined handling of ultra-long and normal reads into a single loop to reduce complexity.
- Updated logging levels for better clarity and reduced unnecessary logs.
- Optimized garbage collection by consolidating gc.collect() calls.
…gement

- Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management.
- Adjusting long_length calculation to use a fixed value for local testing and a memory-based value for CI environments.
- Enhancing _run_embedding_test to support both sequence_to_class and residue_to_class protocols.
- Updating _verify_result to validate output based on the protocol used.
- Adding new tests for comprehensive coverage of both embedding protocols.
@heispv
Copy link
Contributor Author

heispv commented Aug 25, 2024

Thank you for the suggestions on improving the changes I made. I just updated the code based on your feedback! :)

Copy link
Collaborator

@SebieF SebieF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the embedding calculation is much easier to understand now with your solution to use only one loop! 👍

I only have a few small suggestions left, then we are ready to rebase. 😄

biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved
biotrainer/embedders/embedding_service.py Show resolved Hide resolved
tests/test_embedding_service.py Show resolved Hide resolved
heispv added 2 commits August 26, 2024 11:44
- Moving progress bar initialization to ensure visibility during the first embedding calculation and updated its description.
- Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls.
- Updating docstring in _max_embedding_fit to explain constants (0.75 and 18).
@heispv
Copy link
Contributor Author

heispv commented Aug 26, 2024

Sebastian, I just updated the code as you suggested 😉

Copy link
Collaborator

@SebieF SebieF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much! :)

@SebieF SebieF merged commit 35519a6 into sacdallago:develop Aug 26, 2024
1 check passed
@SebieF SebieF mentioned this pull request Aug 26, 2024
@heispv heispv deleted the feature/embedding-saving-strategy branch October 29, 2024 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request maintenance Code or example maintainance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants