-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103
Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService #103
Conversation
Enhanced the embedding computation function to dynamically manage RAM by estimating the maximum embeddings that can fit and automatically saving them to optimize memory usage
- Extract core embedding service logic into embedding_service method - Add special case handling for ultra-long reads: - Immediately save ultra-long read embeddings to disk - Avoid loading additional sequences when ultra-long read detected - Dynamically calculate max embeddings that fit in available memory - Use this to determine when to flush embeddings to disk - Code cleanup: - Use type hints for improved readability - Docstrings for key methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work, thank you for looking into this quite complex problem! :)
Most of my comments are minor suggestions for improvements, the only major change that I would suggest is to unify the handling of the ultra-long reads and normal reads if at all possible, and removing the while loops altogether.
Other than that, the approach seems to work fine, I ran the tests locally and executed a run with ProtT5 for 2377 sequences successfully.
…ogging - Renamed embedding_service to _do_embeddings_computation for clarity. - Replaced while loop with for loop to prevent infinite loops and simplify processing. - Combined handling of ultra-long and normal reads into a single loop to reduce complexity. - Updated logging levels for better clarity and reduced unnecessary logs. - Optimized garbage collection by consolidating gc.collect() calls.
…gement - Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management. - Adjusting long_length calculation to use a fixed value for local testing and a memory-based value for CI environments. - Enhancing _run_embedding_test to support both sequence_to_class and residue_to_class protocols. - Updating _verify_result to validate output based on the protocol used. - Adding new tests for comprehensive coverage of both embedding protocols.
Thank you for the suggestions on improving the changes I made. I just updated the code based on your feedback! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the embedding calculation is much easier to understand now with your solution to use only one loop! 👍
I only have a few small suggestions left, then we are ready to rebase. 😄
- Moving progress bar initialization to ensure visibility during the first embedding calculation and updated its description. - Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls. - Updating docstring in _max_embedding_fit to explain constants (0.75 and 18).
Sebastian, I just updated the code as you suggested 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much! :)
Description:
This PR enhances the
EmbeddingService
by optimizing memory usage during embedding computations, refactoring core logic, and adding unit tests for robustness.Key Changes:
Dynamic Memory Management:
Refactor
EmbeddingService
:SAVE_AFTER_N_EMBEDDINGS
.Unit Testing:
Issue Reference:
Closes [#98]