Evo2 #694

jstjohn · 2025-02-19T16:21:37Z

Description

This provides an implementation of Evo2 supporting pre-training, fine-tuning and preprocessing of data for Evo2 from fasta files. This makes use of the new Hyena/Evo2 model support in NVIDIA/NeMo#12263.

Known issues

1M context dataset pre-training depends on a nearly finished commit to Megatron-LM.
Verification of accuracy has been completed on the 7B parameter 8k context setting. Analysis of other settings is in progress.
Model implemented in unmerged upstream NeMo PR: Evo2 merge 20250214 NeMo#12263
Inference relies on Avoid init_ddp for inference NeMo#12011

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Signed-off-by: John St John <[email protected]>

dorotat-nv · 2025-02-25T13:21:29Z

.pre-commit-config.yaml

@@ -18,7 +18,7 @@ repos:
    hooks:
      - id: detect-secrets
        name: detect-secrets (everything but notebooks)
-        args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline)$', ]
+        args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline|.*\.fasta)$', ]


afaik, we shouldn't have any .fasta files in the repo since it is a data-specific format

Right, and you removed the one case of a fasta in the repo already.

I suggest removing it from the pre-commit config; with this exclusion in place, the format might accidentally sneak in again in the future and not be detected by the pre-commit hook
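To make the concern concrete, here is a minimal sketch of how the two `--exclude-files` patterns from the diff treat a few paths. It assumes the scanner applies the pattern with regular-expression search semantics; the file paths are hypothetical and only serve as illustration.

```python
import re

# Pattern copied from the proposed `--exclude-files` argument in the diff.
EXCLUDE_WITH_FASTA = re.compile(r"(.*\.ipynb|.*\.baseline|.*\.fasta)$")
# Pattern from before the change (no `.fasta` alternative).
EXCLUDE_WITHOUT_FASTA = re.compile(r"(.*\.ipynb|.*\.baseline)$")


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """Return True if `path` would be skipped (assumes search semantics)."""
    return pattern.search(path) is not None


# Hypothetical paths, used only for illustration.
paths = [
    "docs/tutorial.ipynb",
    ".secrets.baseline",
    "data/sample.fasta",
    "src/train.py",
]

skipped_with = [p for p in paths if is_excluded(EXCLUDE_WITH_FASTA, p)]
skipped_without = [p for p in paths if is_excluded(EXCLUDE_WITHOUT_FASTA, p)]

print("skipped with .fasta exclusion:   ", skipped_with)
print("skipped without .fasta exclusion:", skipped_without)
```

Under that assumption, a `.fasta` file added later is skipped by the first pattern and scanned by the second, which is the difference the comment above is pointing at.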

dorotat-nv · 2025-02-25T13:23:13Z

sub-packages/bionemo-core/tests/bionemo/core/data/test_load.py

-        ["download_bionemo_data", "--source", "ngc", "single_cell/testdata-20240506"],
+        ["download_bionemo_data", "--source", "pbss", "single_cell/testdata-20240506"],


yes, we should use NGC specific resources when they are available

dorotat-nv · 2025-02-25T14:45:07Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_dataset_config.yaml

@@ -0,0 +1,6 @@
+- dataset_prefix: /workspace/bionemo2/sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_datasets/test_promoters_uint8_distinct_byte-level_train


can this config file be moved to tests/configs?

ci/benchmarks/partial-conv/evo2_pretrain.yaml

ci/benchmarks/perf/evo2_pretrain.yaml

Signed-off-by: John St John <[email protected]>

Co-authored-by: Dorota Toczydlowska <[email protected]> Signed-off-by: John St. John <[email protected]>

jstjohn · 2025-02-25T19:01:00Z

/build-ci

dorotat-nv · 2025-02-25T18:52:25Z

sub-packages/bionemo-evo2/tests/config/test_dataset_config.yaml

@@ -0,0 +1,81 @@
+- dataset_prefix: /workspace/bionemo2/data/metagenomics/pretraining_data_metagenomics/data_metagenomics_train_text_CharLevelTokenizer_document


would be great to have this yaml file moved somewhere else, and maybe renamed to full_training_config.yaml? When it is under tests, it implicitly relates to testing/unit testing

the same applies to other training/fine-tuning/preprocessing configs not related to testing. I would see them under bionemo-evo2/configs

dorotat-nv · 2025-02-25T18:53:47Z

sub-packages/bionemo-llm/src/bionemo/llm/lightning.py

@@ -118,28 +125,58 @@ def batch_collator(
        case [None, *_]:
            return None
        case [Tensor(), *_]:
-            return torch.cat(batches, dim=batch_dim)
+            # First shortcut if all tensors are 1D (they have at least one batch dim, and it must be at 0)


this is a fair bit of code, do we want to have it unit tested maybe?

dorotat-nv · 2025-02-25T18:55:44Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/test_hyena_operators.py

+    return TransformerConfig(num_layers=2, hidden_size=864, num_attention_heads=1)
+
+
+class TestParallelHyenaOperator:


it is a bit different from the style of writing tests that we use in bionemo, are we ok with that?

dorotat-nv · 2025-02-25T18:57:33Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py

+from bionemo.testing.data.fasta import ALU_SEQUENCE, create_fasta_file
+
+
+def test_train_evo2_runs(


shouldn't the test be renamed?

dorotat-nv · 2025-02-25T18:58:40Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py

+    The command is run in a subshell, and we assert that it returns an exit code of 0.
+    """
+    fasta_file_path = tmp_path / "test.fasta"
+    create_fasta_file(


this test would execute much faster if we import the predict method and just test it
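A rough sketch of the two test styles being contrasted, using a toy CLI rather than the real predict script. The names `parse_args`, `predict`, and `main` below are hypothetical stand-ins, not the functions in `predict.py`; only the shape of the pattern is the point.

```python
import argparse
import subprocess
import sys


def parse_args(argv=None):
    """Toy argument parser standing in for a predict CLI (hypothetical)."""
    ap = argparse.ArgumentParser(description="Toy stand-in for a predict CLI.")
    ap.add_argument("--batch-size", type=int, default=1)
    return ap.parse_args(argv)


def predict(batch_size: int) -> int:
    """Hypothetical stand-in for the real work; echoes the batch size."""
    return batch_size


def main(argv=None) -> int:
    """Entry point that accepts an explicit argv so tests can call it directly."""
    args = parse_args(argv)
    return predict(args.batch_size)


def run_in_subprocess() -> int:
    """Subshell style: start a fresh interpreter and only observe the exit code."""
    code = "import sys; sys.exit(0)"
    return subprocess.run([sys.executable, "-c", code], check=False).returncode


def run_in_process() -> int:
    """In-process style: call the entry point and inspect its return value."""
    return main(["--batch-size", "4"])
```

The in-process variant avoids interpreter start-up and re-importing heavy dependencies, and it lets the test assert on returned values instead of only on an exit code. The subprocess variant still has value as an end-to-end check of the installed command.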

dorotat-nv · 2025-02-25T18:59:19Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py

+import pytest
+from lightning.fabric.plugins.environments.lightning import find_free_network_port
+
+


TODO Dorota - this test would execute much faster if we import the train method and just test it. Get this unit test from gitlab

dorotat-nv · 2025-02-25T18:59:48Z

sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_inference.py

+        assert results == ["T"]
+
+
+# def test_infer_model_generates_expected_single_token_output_from_input_seq():


should we remove commented lines?

dorotat-nv · 2025-02-25T19:01:08Z

sub-packages/bionemo-evo2/src/bionemo/evo2/utils/checkpoint/zero3_conversion_lib.py

+device = torch.device("cpu")
+
+
+def profile_memory_decorator(func: Iterable):


shouldn't those methods go under testing or another subpackage?

dorotat-nv · 2025-02-25T19:03:49Z

sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py

+    return ap.parse_args()
+
+
+class HyenaPredictor(LightningPassthroughPredictionMixin, HyenaModel):


shouldn't those classes be relocated to other files and be unit tested?

Signed-off-by: John St John <[email protected]>

skothenhill-nv · 2025-02-26T22:40:25Z

sub-packages/bionemo-core/src/bionemo/core/data/resources/evo2.yaml

@@ -0,0 +1,66 @@
+- tag: 1b-8k:1.0


Could you change this to evo2-1b-8k ?

darrenjhsu · 2025-02-27T16:43:41Z

As a user looking at the README in sub-packages/bionemo-evo2/, I'd really appreciate it if there were actual copy/paste-able command snippets to test something. This README and the one in sub-packages/bionemo-evo2/src/bionemo/evo2/data/README.md both need snippets

### Description

Slightly refactoring train script for evo2 to better handle unit testing and a bug fix

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---------

Signed-off-by: dorotat <[email protected]>

dorotat-nv · 2025-02-27T15:17:43Z

sub-packages/bionemo-core/tests/bionemo/core/data/test_load.py

@@ -111,6 +111,7 @@ def test_load_with_file(mocked_s3_download, tmp_path):
    )

    mocked_s3_download.side_effect = lambda _1, output_file, _2: Path(output_file).write_text("test")
+    # TODO(dorotat-nv) remove source="pbss" when NGC resources are available


switch to ngc after artefacts are published

### Description

Bugfixing and updating evo2 scripts for automated benchmarking execution

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

cspades and others added 30 commits November 16, 2024 10:46

[cye/evo2-llm-dev] Private internal development branch for Evo2 in Bi…

50db0ca

…oNeMo.

[cye/evo2-llm-dev] Add rough draft of data preprocessing for Evo2.

737f16c

Add manual data test for evo2

a142109

Change remotes for submodules for now

0ad0bee

Cye/nemo2 fixes

82c832f

Write model checkpoint context and set Evo2Dataset in the pre-training.

945506f

Fix inference script to make sense, i.e. no seq parallelism for decod…

4fc1d84

…ing, default 0 for topk/topp.

Cye/fix Hyena species biases

f5adde5

Hyena golden value test

b9dfd5c

[cye/blended-training] Expose blended weights for training Hyena.

e6278d9

Changes for 256 node training run

dd0aab1

Integrate BioNeMo Noodles into Hyena data preprocessing.

0560ee4

[cye/lineage-str] Clean up interface for taxonomic lineage tokens in …

5511fe7

…Hyena.

Changes made on 256 node branch

92d0352

Cye/hyena flops

923cbdf

Fix broken import of blended training config.

6460ea3

Cye/import fix

7e72f48

Add improved nsys profiling support

45923c6

[cye/hyena-doc-update] Add data preprocessing documentation, fix tech…

c805984

… debt in tokenizer and config, remove unused args in infer.py.

[cye/transcript-readme] Add main documentation snippets for Hyena, an…

f5b15f3

…d add transcript splicing script for preprocessing.

Bump nemo version to the new context length insensitive code, and upd…

9ba9e07

…ate test to use new checkpoint

added flag for tflops callback

854951f

[cye/evo2-ckpt-utils] Add Evo2 ZeRO-1/3 to NeMo checkpointing utils.

ada349e

Add test for evo2 tokenizer.

652dfe0

Fix nemo-savanna repo build in CI

265a0be

fixing format issues on evo2-dev

fb09377

Add tests for parallel hyena operators used in evo2

9cacf1b

Rebase on OSS.

9ac11eb

[cye/tp-comm-fix] Fix TP communication overlap inconsistency.

5631b93

Add temporary fix for shard-tensor bug in Megatron-LM

9ae9af0

jstjohn added 7 commits February 24, 2025 21:04

Adding in the predict method and test

78f92b5

Signed-off-by: John St John <[email protected]>

Merge branch 'main' of github.com:NVIDIA/bionemo-framework into evo2

bb5f5a1

Signed-off-by: John St John <[email protected]>

bump NeMo commit

d9e4952

Signed-off-by: John St John <[email protected]>

Fix multipart download naming in nemo

b148750

Signed-off-by: John St John <[email protected]>

Update docs for checkpoint conversion

ba1d9bf

Signed-off-by: John St John <[email protected]>

shrink tests down to 1b case

0af3e0a

add end to end fine-tuning tutorial

c5e42d8

Signed-off-by: John St John <[email protected]>

dorotat-nv reviewed Feb 25, 2025


jstjohn added 2 commits February 25, 2025 18:23

ignore object hashes in precommit

544b7a8

Signed-off-by: John St John <[email protected]>

Bump nemo pointer to latest PR pointer

d7a8ea7

Signed-off-by: John St John <[email protected]>

jstjohn force-pushed the evo2 branch from 4f552cb to d7a8ea7 on February 25, 2025 18:57

jstjohn and others added 2 commits February 25, 2025 10:59

Update ci/benchmarks/partial-conv/evo2_pretrain.yaml

07c48b8

Co-authored-by: Dorota Toczydlowska <[email protected]> Signed-off-by: John St. John <[email protected]>

Update ci/benchmarks/perf/evo2_pretrain.yaml

e779f60

Co-authored-by: Dorota Toczydlowska <[email protected]> Signed-off-by: John St. John <[email protected]>

jstjohn added the INCLUDE_NOTEBOOKS_TESTS label (Add Jupyter notebook validation to the CI pipeline) Feb 25, 2025

dorotat-nv reviewed Feb 25, 2025


jstjohn added 4 commits February 25, 2025 19:15

Slightly smaller test_train.py

a1c8048

Signed-off-by: John St John <[email protected]>

Add missing main function for inference cli

46edcb6

Add --batch-size option to predict

e81eef3

Signed-off-by: John St John <[email protected]>

Fixing the description of the 1b model

4e5acda

Signed-off-by: John St John <[email protected]>

jstjohn requested a review from polinabinder1 as a code owner February 26, 2025 01:15

remove hard-coded PBSS

5bd0e2c

Signed-off-by: John St John <[email protected]>

jstjohn force-pushed the evo2 branch from eb63558 to 5bd0e2c on February 26, 2025 01:15

Remove comment block from code

ca16c2a

Signed-off-by: John St John <[email protected]>

skothenhill-nv reviewed Feb 26, 2025


dorotat-nv added the INCLUDE_SLOW_TESTS label (Add unit tests marked as slow to CI pipeline) Feb 27, 2025

dorotat-nv reviewed Feb 28, 2025
