Evo2 merge 20250214 #12263

Open
wants to merge 53 commits into
base: main

Conversation

JRD971000
Collaborator

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

JRD971000 and others added 14 commits February 14, 2025 21:13
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: Cory Ye <[email protected]>
Co-authored-by: Dorota Toczydlowska <[email protected]>
Co-authored-by: Guy Jacob <[email protected]>
Co-authored-by: Jared Wilber <[email protected]>
Co-authored-by: John St. John <[email protected]>

Signed-off-by: John St John <[email protected]>
Signed-off-by: John St John <[email protected]>
Alit/evo2 merge 20250214

See merge request ataghibakhsh/nemo-savanna!41
Signed-off-by: John St John <[email protected]>
Signed-off-by: John St John <[email protected]>

@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Signed-off-by: John St John <[email protected]>
Signed-off-by: John St John <[email protected]>
TENorm,
TERowParallelLinear,
)
except ImportError:

Code scanning / CodeQL notice: Empty except

'except' clause does nothing but pass and there is no explanatory comment.
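
One common way to address this kind of notice is to make the guarded import record whether the optional dependency is available, and to explain the fallback in a comment. The following is only a minimal sketch, not the code from this PR: the module name `hypothetical_optional_backend`, the flag `HAVE_OPTIONAL_BACKEND`, and the helper `require_optional_backend` are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Hypothetical optional dependency, standing in for the real optional
    # import guarded in the PR (the actual module path is not shown here).
    import hypothetical_optional_backend  # noqa: F401

    HAVE_OPTIONAL_BACKEND = True
except ImportError:
    # The optional backend is not installed. Record that fact instead of
    # silently passing, so callers can fall back or fail with a clear message.
    logger.debug("Optional backend not found; dependent features are disabled.")
    HAVE_OPTIONAL_BACKEND = False


def require_optional_backend(feature: str) -> None:
    """Raise a descriptive error when a feature needs the missing backend."""
    if not HAVE_OPTIONAL_BACKEND:
        raise ImportError(
            f"{feature} requires the optional backend, which is not installed."
        )
```

Deferring the failure to the point of use keeps the module importable on installations without the optional library, which is what the pre-check on import guards in the PR template asks reviewers to verify.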
from nemo.collections.llm.gpt.model.megatron.hyena.hyena_utils import make_upper_case


class Evo2Dataset(GPTDataset):
Contributor

Should this dataset be here or in bionemo? cc @jstjohn

Contributor

It is very biotech-specific.

from pydantic import BaseModel, model_validator


def infer_global_batch_size(
Contributor

Why is this method in this file? We used to have it in bionemo; do you need it in NeMo? If so, it shouldn't be under megatron/hyena.
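
For readers unfamiliar with the helper under discussion, a global batch size is commonly derived from the micro batch size, the data-parallel size, and the number of gradient accumulation steps. The sketch below illustrates that arithmetic only. The function name, parameter names, and validation are assumptions, since the PR's actual signature is not shown in this thread.

```python
def infer_global_batch_size_sketch(
    micro_batch_size: int,
    num_nodes: int,
    devices_per_node: int,
    accumulate_grad_batches: int = 1,
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    context_model_parallel_size: int = 1,
) -> int:
    """Illustrative only: parameter names are assumed, not taken from the PR."""
    world_size = num_nodes * devices_per_node
    # Devices that cooperate on a single model replica.
    model_parallel_size = (
        tensor_model_parallel_size
        * pipeline_model_parallel_size
        * context_model_parallel_size
    )
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"model_parallel_size ({model_parallel_size})"
        )
    # Number of model replicas that each see a different micro batch.
    data_parallel_size = world_size // model_parallel_size
    global_batch_size = micro_batch_size * data_parallel_size * accumulate_grad_batches
    return global_batch_size
```

Because nothing in this calculation is specific to Hyena or Evo2, a helper of this shape could plausibly live in a shared utilities module, which is the point the reviewer raises.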

return global_batch_size


class Evo2BlendedDatasetConfig(BaseModel):
Contributor

also, shouldn't Evo2BlendedDatasetConfig in fact be BlendedDatasetConfig and be located somewhere here?

Contributor

It seems that in NeMo you mostly pass paths from the command line:

data = llm.PreTrainingDataModule(

Contributor

@dorotat-nv dorotat-nv left a comment

Approving on the condition that there will be a follow up PR with a cleanup

@JRD971000
Collaborator Author

Approving on the condition that there will be a follow up PR with a cleanup

Thanks @dorotat-nv, we'll have a separate follow-up PR with the cleanups you suggested.

@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Feb 28, 2025
@jstjohn jstjohn enabled auto-merge (squash) March 1, 2025 00:54

9 participants