Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BaseDataMix class #65

Merged
merged 6 commits into from
Oct 10, 2024
Merged

Add BaseDataMix class #65

merged 6 commits into from
Oct 10, 2024

Conversation

AkshitaB
Copy link
Contributor

@AkshitaB AkshitaB commented Oct 9, 2024

This allows extending the class to specify new mixes in projects using olmo_core.

@AkshitaB AkshitaB requested a review from epwalsh October 9, 2024 23:31
Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple minor comments

Base class for enumeration of data mixes.
"""

def build(self, base_dir: str, tokenizer: TokenizerName) -> Tuple[List[str], List[str]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the docstring for this method below to here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -7,10 +7,19 @@

from ..tokenizer import TokenizerName

__all__ = ["DataMix"]
__all__ = ["BaseDataMix", "DataMix"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also export BaseDataMix in src/olmo_core/data/__init__.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -56,6 +56,7 @@
"NumpyDatasetDType",
"TokenizerConfig",
"TokenizerName",
"BaseDataMix",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing the actual import above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fixed



class DataMix(StrEnum):
class BaseDataMix(StrEnum):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, minor nit, but I've gone with the convention of calling base classes XXXBase instead of BaseXXX. Can you change this to DataMixBase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AkshitaB AkshitaB merged commit 6e4ee4e into main Oct 10, 2024
15 checks passed
@AkshitaB AkshitaB deleted the akshitab/extend-data-mix branch October 10, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants