Evo2 #694
base: main
Conversation
…ing, default 0 for topk/topp.
… debt in tokenizer and config, remove unused args in infer.py.
…d add transcript splicing script for preprocessing.
…ate test to use new checkpoint
Signed-off-by: John St John <[email protected]>
@@ -18,7 +18,7 @@ repos:
  hooks:
    - id: detect-secrets
      name: detect-secrets (everything but notebooks)
-     args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline)$', ]
+     args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline|.*\.fasta)$', ]
AFAIK we shouldn't have any .fasta files in the repo, since it is a data-specific format.
Right, and you removed the one case of a fasta in the repo already.
I suggest removing it from the pre-commit config; with the exclusion in place, this format might sneak in accidentally in the future and not be detected by the pre-commit hook.
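To make the effect of the exclusion concrete, here is a small sketch of which filenames the old and new `--exclude-files` patterns skip. It assumes Python's `re` behaves like the regex flavor detect-secrets uses for these simple alternations; the filenames are made up for illustration.

```python
import re

# The two --exclude-files patterns from the diff above. Any path the
# pattern matches is skipped by secret scanning, so adding the .fasta
# alternative means FASTA files would no longer be scanned.
old = re.compile(r"(.*\.ipynb|.*\.baseline)$")
new = re.compile(r"(.*\.ipynb|.*\.baseline|.*\.fasta)$")

# Both patterns skip notebooks and baseline files.
assert old.search("notebook.ipynb") and new.search("notebook.ipynb")
assert old.search(".secrets.baseline") and new.search(".secrets.baseline")

# Only the new pattern additionally skips FASTA files.
assert new.search("data/sample.fasta") is not None
assert old.search("data/sample.fasta") is None
```

This is the trade-off the comment raises: with the broader pattern, a stray FASTA file committed later would be invisible to the hook.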
["download_bionemo_data", "--source", "ngc", "single_cell/testdata-20240506"],
["download_bionemo_data", "--source", "pbss", "single_cell/testdata-20240506"],
Yes, we should use NGC-specific resources when they are available.
@@ -0,0 +1,6 @@
- dataset_prefix: /workspace/bionemo2/sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_datasets/test_promoters_uint8_distinct_byte-level_train
Can this config file be moved to tests/configs?
Signed-off-by: John St John <[email protected]>
Co-authored-by: Dorota Toczydlowska <[email protected]>
Signed-off-by: John St. John <[email protected]>
/build-ci
@@ -0,0 +1,81 @@
- dataset_prefix: /workspace/bionemo2/data/metagenomics/pretraining_data_metagenomics/data_metagenomics_train_text_CharLevelTokenizer_document
It would be great to have this YAML file moved somewhere else, and maybe renamed to full_training_config.yaml. When it lives under tests, it implicitly reads as related to testing/unit testing.
The same applies to the other training/fine-tuning/preprocessing configs not related to testing. I would place them under bionemo-evo2/configs.
@@ -118,28 +125,58 @@ def batch_collator(
case [None, *_]:
    return None
case [Tensor(), *_]:
    return torch.cat(batches, dim=batch_dim)
    # First shortcut if all tensors are 1D (they have at least one batch dim, and it must be at 0)
There is a fair bit of code here; do we want to have it unit tested?
return TransformerConfig(num_layers=2, hidden_size=864, num_attention_heads=1)


class TestParallelHyenaOperator:
This is a somewhat different style of writing tests than the one we use in bionemo; are we OK with that?
from bionemo.testing.data.fasta import ALU_SEQUENCE, create_fasta_file


def test_train_evo2_runs(
shouldn't the test be renamed?
The command is run in a subshell, and we assert that it returns an exit code of 0.
"""
fasta_file_path = tmp_path / "test.fasta"
create_fasta_file(
This test would execute much faster if we imported the predict method and tested it directly.
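The suggested speed-up, sketched with a hypothetical `main` entry point — the real predict entry point and its signature may differ:

```python
# Contrast with the subprocess-based test above: instead of spawning a
# subshell and asserting on the exit code, import the entry point and call
# it in-process. `main` here is a stand-in, not the real bionemo function.
def main(argv: list[str]) -> int:
    # Hypothetical CLI entry point: parse argv, run prediction, return exit code.
    if not argv:
        return 2  # usage error, mirroring typical CLI conventions
    return 0


def test_main_in_process():
    # Fast path: no subshell, no interpreter startup, and failures surface
    # as normal tracebacks instead of a bare nonzero exit code.
    assert main(["--fasta", "test.fasta"]) == 0
    assert main([]) == 2
```

The in-process variant also makes it easy to monkeypatch heavy dependencies (model loading, checkpoints), which a subprocess test cannot do.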
import pytest
from lightning.fabric.plugins.environments.lightning import find_free_network_port
TODO (Dorota): this test would execute much faster if we imported the train method and tested it directly. Port this unit test from GitLab.
assert results == ["T"]


# def test_infer_model_generates_expected_single_token_output_from_input_seq():
Should we remove the commented-out lines?
device = torch.device("cpu")


def profile_memory_decorator(func: Iterable):
Shouldn't those methods go under testing or another subpackage?
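For reference, a self-contained memory-profiling decorator can be built on the standard library's `tracemalloc`. This is a sketch of the general technique, not the `profile_memory_decorator` from the diff (whose `Iterable` annotation on `func` also looks like it should be `Callable`):

```python
import functools
import tracemalloc
from typing import Any, Callable


def profile_memory(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record the peak traced allocation (in bytes) while func runs."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        tracemalloc.start()
        try:
            result = func(*args, **kwargs)
        finally:
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        wrapper.last_peak_bytes = peak  # stash the measurement on the wrapper
        return result

    return wrapper


@profile_memory
def build_list(n: int) -> list[int]:
    return list(range(n))
```

Because it only uses the standard library, a helper like this would be easy to host in a shared testing subpackage, as the comment suggests.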
return ap.parse_args()


class HyenaPredictor(LightningPassthroughPredictionMixin, HyenaModel):
Shouldn't those classes be relocated to separate files and unit tested?
Signed-off-by: John St John <[email protected]>
@@ -0,0 +1,66 @@
- tag: 1b-8k:1.0
Could you change this to evo2-1b-8k?
As a user looking at the README in sub-packages/bionemo-evo2/, I'd appreciate it a lot if there were actually copy/paste-able command snippets to test something. This README plus the one in sub-packages/bionemo-evo2/src/bionemo/evo2/data/README.md both need snippets.
### Description

Slightly refactoring train script for evo2 to better handle unit testing and a bug fix

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---------

Signed-off-by: dorotat <[email protected]>
@@ -111,6 +111,7 @@ def test_load_with_file(mocked_s3_download, tmp_path):
)

mocked_s3_download.side_effect = lambda _1, output_file, _2: Path(output_file).write_text("test")
# TODO(dorotat-nv) remove source="pbss" when NGC resources are available
Switch to NGC after the artifacts are published.
### Description

Bugfixing and updating evo2 scripts for automated benchmarking execution

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
### Description

This provides an implementation of Evo2 supporting pre-training, fine-tuning, and preprocessing of data for Evo2 from FASTA files. It makes use of the new Hyena/Evo2 model support in NVIDIA/NeMo#12263.

### Known issues

### Type of changes

### Pre-submit Checklist