Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evo2 #694

Open
wants to merge 113 commits into
base: main
Choose a base branch
from
Open

Evo2 #694

wants to merge 113 commits into from

Conversation

jstjohn
Copy link
Collaborator

@jstjohn jstjohn commented Feb 19, 2025

Description

This provides an implementation of Evo2 supporting pre-training, fine-tuning and preprocessing of data for Evo2 from fasta files. This makes use of the new Hyena/Evo2 model support in NVIDIA/NeMo#12263.

Known issues

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

cspades and others added 30 commits November 16, 2024 10:46
… debt in tokenizer and config, remove unused args in infer.py.
…d add transcript splicing script for preprocessing.
@@ -18,7 +18,7 @@ repos:
hooks:
- id: detect-secrets
name: detect-secrets (everything but notebooks)
args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline)$', ]
args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline|.*\.fasta)$', ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik, we shouldnt have any .fasta file in the repo since it is data specific format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, and you removed the one case of a fasta in the repo already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to remove it from the pre config since this way this format might sneak in accidently in the future and not be detected by pre commit hook

Comment on lines 49 to 50
["download_bionemo_data", "--source", "ngc", "single_cell/testdata-20240506"],
["download_bionemo_data", "--source", "pbss", "single_cell/testdata-20240506"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we should use NGC specific resources when they are available

@@ -0,0 +1,6 @@
- dataset_prefix: /workspace/bionemo2/sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_datasets/test_promoters_uint8_distinct_byte-level_train
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this config file we moved to tests/configs?

jstjohn and others added 2 commits February 25, 2025 10:59
Co-authored-by: Dorota Toczydlowska <[email protected]>
Signed-off-by: John St. John <[email protected]>
Co-authored-by: Dorota Toczydlowska <[email protected]>
Signed-off-by: John St. John <[email protected]>
@jstjohn jstjohn added the INCLUDE_NOTEBOOKS_TESTS Add Jupyter notebook validation to the CI pipeline label Feb 25, 2025
@jstjohn
Copy link
Collaborator Author

jstjohn commented Feb 25, 2025

/build-ci

@@ -0,0 +1,81 @@
- dataset_prefix: /workspace/bionemo2/data/metagenomics/pretraining_data_metagenomics/data_metagenomics_train_text_CharLevelTokenizer_document
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great to have this yaml file moved somewhere else, and maybe rename to full_training_config.yaml? Somehow, when it is under tests, it inexplicitly relates to testing/unit testing

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same applies to onther training/filetuning/preprocessing configs not related to testing. I would see it under bionemo-evo2/configs

@@ -118,28 +125,58 @@ def batch_collator(
case [None, *_]:
return None
case [Tensor(), *_]:
return torch.cat(batches, dim=batch_dim)
# First shortcut if all tensors are 1D (they have at least one batch dim, and it must be at 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a bit of code here, do we want to have it unit tested maybe?

return TransformerConfig(num_layers=2, hidden_size=864, num_attention_heads=1)


class TestParallelHyenaOperator:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a bit different style of writing tests that we do in bionemo, are we ok with that?

from bionemo.testing.data.fasta import ALU_SEQUENCE, create_fasta_file


def test_train_evo2_runs(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the test be renamed?

The command is run in a subshell, and we assert that it returns an exit code of 0.
"""
fasta_file_path = tmp_path / "test.fasta"
create_fasta_file(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test would execute much faster is we import predict method and just test it

import pytest
from lightning.fabric.plugins.environments.lightning import find_free_network_port


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO Dorota - this test would execute much faster is we import train method and just tests it. Get this unit test from gitlab

assert results == ["T"]


# def test_infer_model_generates_expected_single_token_output_from_input_seq():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove commented lines?

device = torch.device("cpu")


def profile_memory_decorator(func: Iterable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt those methods go under testing or other subpackage?

return ap.parse_args()


class HyenaPredictor(LightningPassthroughPredictionMixin, HyenaModel):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt those classes be relocated to other files and be unit tested?

Signed-off-by: John St John <[email protected]>
@@ -0,0 +1,66 @@
- tag: 1b-8k:1.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change this to evo2-1b-8k ?

@dorotat-nv dorotat-nv added the INCLUDE_SLOW_TESTS Add unit tests marked as slow to CI pipeline label Feb 27, 2025
@darrenjhsu
Copy link
Collaborator

As a user looking at the README in sub-packages/bionemo-evo2/ I'd appreciate a lot if there are actually copy/paste-able command snippets to test something. This plus the one in sub-packages/bionemo-evo2/src/bionemo/evo2/data/README.md both need snippets

### Description
Slightly refactoring train script for evo2 to better handle unit testing
and a bug fix

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [ ]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [x]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

-
[SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci)
- Skip all continuous integration tests
-
[INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests)
- Execute notebook validation tests in pytest
-
[INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests)
- Execute tests labelled as slow in pytest for extensive testing


> [!NOTE]
> By default, the notebooks validation tests are skipped unless
explicitly enabled.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [x] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [x] I have added/updated tests as needed
 - [ ] All existing tests pass successfully

---------

Signed-off-by: dorotat <[email protected]>
@@ -111,6 +111,7 @@ def test_load_with_file(mocked_s3_download, tmp_path):
)

mocked_s3_download.side_effect = lambda _1, output_file, _2: Path(output_file).write_text("test")
# TODO(dorotat-nv) remove source="pbss" when NGC resources are available
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

swithc to ngc after artefacts are published

### Description
Bugfixing and updating evo2 scripts for automated benchmarking execution

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [x]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [ ]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

-
[SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci)
- Skip all continuous integration tests
-
[INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests)
- Execute notebook validation tests in pytest
-
[INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests)
- Execute tests labelled as slow in pytest for extensive testing


> [!NOTE]
> By default, the notebooks validation tests are skipped unless
explicitly enabled.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [x] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
INCLUDE_NOTEBOOKS_TESTS Add Jupyter notebook validation to the CI pipeline INCLUDE_SLOW_TESTS Add unit tests marked as slow to CI pipeline
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants