Subset Selection Integration #542
Signed-off-by: eshwarprasadS <[email protected]>
This reverts commit e4a4730. Signed-off-by: eshwarprasadS <[email protected]>
httpx>=0.25.0,<1.0.0
instructlab-schema>=0.4.0
jinja2>=3.0.0
langchain-text-splitters
openai>=1.13.3,<2.0.0
sentencepiece>=0.2.0
# Note: this dependency has to be built from source
submodlib @ git+https://github.com/decile-team/submodlib.git
This is what we were doing previously with GPTDolomite and Dolomite Engine. We will likely need to move this to live in its own repo as we did for GPTDolomite so we can do builds.
We have contacted the authors and expect submodlib to have a public release soon, so I think we can fall back to moving it into its own repo if that does not go through?
Signed-off-by: eshwarprasadS <[email protected]>
@eshwarprasadS can we also add a functional test? Something like this:

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets

if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
        "instruction": "conversation",
        "query_description": "conversation",
        "templates": {
            "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
            "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
        },
        "batch_size": 100000,
        "num_folds": 25,
        "subset_sizes": [0.97],
        "seed": 42,
        "template_name": "text",
        "combine_files": False,
        "encoder_type": "bge",
        "encoder_model": "BAAI/bge-base-en",
    }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the arctic snowflake model?
Thanks @abhi1092 for the comments. Yes, I am planning to add a host of unit and functional tests in the coming week for the feature, thanks for the suggestion. And yes, I am planning to incorporate the
I'll need to go through in further detail but here are some comments so far
src/instructlab/sdg/encoders.py
        self.model = self.model.to(self.cfg.device)

        if self.cfg.num_gpus > 1:
            print(f"Using {self.cfg.num_gpus} GPUs")
I think SDG has a logger class, should we be using that here instead of print?
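As a rough illustration of the suggestion, here is a minimal sketch that swaps the print call for a module-level logger. It uses the standard library logging module rather than any SDG-specific logger, since the exact logger SDG exposes is not shown in this thread; EncoderConfig and report_gpus are hypothetical names used only for this example.

```python
import logging
from dataclasses import dataclass

# Module-level logger; stdlib logging is an assumption standing in for
# whatever logger the SDG package actually provides.
logger = logging.getLogger(__name__)


@dataclass
class EncoderConfig:
    """Hypothetical stand-in for the encoder config in the diff above."""

    num_gpus: int = 1


def report_gpus(cfg: EncoderConfig) -> None:
    """Log the multi-GPU message instead of printing it."""
    if cfg.num_gpus > 1:
        # Lazy %-style formatting: the string is only built if INFO is enabled.
        logger.info("Using %d GPUs", cfg.num_gpus)
```

With a logger, callers control the verbosity and destination of the message through the usual logging configuration instead of always getting output on stdout.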
    tensor2: Optional[Tensor] = None,
    batch_size: int = 10000,
    metric: str = "cosine",
    device: str = __DEVICE,
It seems like __DEVICE is only used to pick a default for this function. I would recommend using the Python convention of setting this to None as a default, and then falling back to the initialization as it was via:
if not device:
    device = "cuda" if torch.cuda.is_available() else "cpu"
Also - I recommend going through torch.device rather than using a str, but if it works then it works.
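To make the None-default pattern concrete, a minimal sketch follows. resolve_device and similarity_device are hypothetical helpers, not code from this PR, and the torch import is guarded only so the sketch also runs on a machine without PyTorch installed. The sketch tests `device is not None` rather than `not device`, which is a slightly stricter variant of the check suggested above.

```python
from typing import Optional

try:
    import torch  # optional here so the sketch runs without PyTorch too
except ImportError:  # assumption: fall back to CPU when torch is missing
    torch = None


def resolve_device(device: Optional[str] = None) -> str:
    """Return the caller's device, or pick one lazily when none was given."""
    if device is not None:
        return device
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


def similarity_device(
    batch_size: int = 10000,
    metric: str = "cosine",
    device: Optional[str] = None,
) -> str:
    """Hypothetical function showing where the fallback would sit."""
    # The default is resolved at call time, not at import time,
    # so no module-level __DEVICE constant is needed.
    device = resolve_device(device)
    return device
```

Resolving the default inside the function means the CUDA check happens when the function is called, and an explicit argument such as "cuda:1" always wins. When PyTorch is available, the resolved string can be wrapped in torch.device(...) if a device object is preferred over a str.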
Oh definitely! Good catch, thanks for pointing these out.
Thanks a lot for looking at this! A good part of this implementation was adapted directly from the upstream repository, so it really helps to have extra eyes on it.
Minimal Implementation of Subset Selection
This resolves #541
Important Incoming Changes
h5py>=3.12.1 to requirements
subset_selection.py
encoders dir housing bge and arctic encoders
subset_selection_utils.py under src/instructlab/sdg/utils/