Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

heispv · 2024-08-16T13:03:17Z

Description:

This PR enhances the EmbeddingService by optimizing memory usage during embedding computations, refactoring core logic, and adding unit tests for robustness.

Key Changes:

Dynamic Memory Management:
- Automatically estimate and manage RAM usage by saving embeddings to disk when memory limits are reached.
Refactor EmbeddingService:
- Extracted core logic into a dedicated method.
- Handled ultra-long reads by saving embeddings immediately and preventing additional sequence loading.
- Removed SAVE_AFTER_N_EMBEDDINGS.
Unit Testing:
- Added tests for computing embeddings on long, short, and mixed sequences.
- Validated dynamic memory management and embedding computation accuracy.
- Ensured correct result paths and file existence in the tests.

Issue Reference:

Closes [#98]

Enhanced the embedding computation function to dynamically manage RAM by estimating the maximum embeddings that can fit and automatically saving them to optimize memory usage

- Extract core embedding service logic into embedding_service method - Add special case handling for ultra-long reads: - Immediately save ultra-long read embeddings to disk - Avoid loading additional sequences when ultra-long read detected - Dynamically calculate max embeddings that fit in available memory - Use this to determine when to flush embeddings to disk - Code cleanup: - Use type hints for improved readability - Docstrings for key methods

SebieF

Very nice work, thank you for looking into this quite complex problem! :)

Most of my comments are minor suggestions for improvements, the only major change that I would suggest is to unify the handling of the ultra-long reads and normal reads if at all possible, and removing the while loops altogether.

Other than that, the approach seems to work fine, I ran the tests locally and executed a run with ProtT5 for 2377 sequences successfully.

tests/test_embedding_service.py

biotrainer/embedders/embedding_service.py

…ogging - Renamed embedding_service to _do_embeddings_computation for clarity. - Replaced while loop with for loop to prevent infinite loops and simplify processing. - Combined handling of ultra-long and normal reads into a single loop to reduce complexity. - Updated logging levels for better clarity and reduced unnecessary logs. - Optimized garbage collection by consolidating gc.collect() calls.

…gement - Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management. - Adjusting long_length calculation to use a fixed value for local testing and a memory-based value for CI environments. - Enhancing _run_embedding_test to support both sequence_to_class and residue_to_class protocols. - Updating _verify_result to validate output based on the protocol used. - Adding new tests for comprehensive coverage of both embedding protocols.

heispv · 2024-08-25T10:59:37Z

Thank you for the suggestions on improving the changes I made. I just updated the code based on your feedback! :)

SebieF

Thanks, the embedding calculation is much easier to understand now with your solution to use only one loop! 👍

I only have a few small suggestions left, then we are ready to rebase. 😄

biotrainer/embedders/embedding_service.py

tests/test_embedding_service.py

- Moving progress bar initialization to ensure visibility during the first embedding calculation and updated its description. - Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls. - Updating docstring in _max_embedding_fit to explain constants (0.75 and 18).

heispv · 2024-08-26T09:50:24Z

Sebastian, I just updated the code as you suggested 😉

SebieF

Thank you very much! :)

heispv added 4 commits August 16, 2024 14:48

Implementing dynamic memory handling for embedding computations

7ab3d68

Enhanced the embedding computation function to dynamically manage RAM by estimating the maximum embeddings that can fit and automatically saving them to optimize memory usage

Removing SAVE_AFTER_N_EMBEDDINGS

8cbc6b1

Adding unit tests for EmbeddingService

3ef8037

SebieF added enhancement New feature or request maintenance Code or example maintainance labels Aug 19, 2024

SebieF self-requested a review August 19, 2024 07:31

SebieF requested changes Aug 19, 2024

View reviewed changes

SebieF assigned heispv Aug 19, 2024

heispv added 2 commits August 25, 2024 12:04

SebieF requested changes Aug 25, 2024

View reviewed changes

biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved

biotrainer/embedders/embedding_service.py Outdated Show resolved Hide resolved

biotrainer/embedders/embedding_service.py Show resolved Hide resolved

tests/test_embedding_service.py Show resolved Hide resolved

heispv added 2 commits August 26, 2024 11:44

Adding logging to clarify test duration in CI environments

0faefdc

SebieF approved these changes Aug 26, 2024

View reviewed changes

SebieF merged commit 35519a6 into sacdallago:develop Aug 26, 2024
1 check passed

SebieF mentioned this pull request Aug 26, 2024

Version 0.9.2 #107

Merged

heispv deleted the feature/embedding-saving-strategy branch October 29, 2024 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

heispv commented Aug 16, 2024 •

edited by SebieF

Loading

SebieF left a comment

heispv commented Aug 25, 2024

SebieF left a comment

heispv commented Aug 26, 2024

SebieF left a comment

Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103

Conversation

heispv commented Aug 16, 2024 • edited by SebieF Loading

Description:

Key Changes:

Issue Reference:

SebieF left a comment

Choose a reason for hiding this comment

heispv commented Aug 25, 2024

SebieF left a comment

Choose a reason for hiding this comment

heispv commented Aug 26, 2024

SebieF left a comment

Choose a reason for hiding this comment

heispv commented Aug 16, 2024 •

edited by SebieF

Loading