Subset Selection Integration #542

Open
wants to merge 13 commits into base: main

Conversation

eshwarprasadS
Contributor

@eshwarprasadS eshwarprasadS commented Feb 4, 2025

Minimal Implementation of Subset Selection

This resolves #541

Important Incoming Changes

  • Added h5py>=3.12.1 to requirements
  • Added submodlib source to requirements
  • Added subset_selection.py
  • Added encoders dir housing bge and arctic encoders
  • Added subset_selection_utils.py under src/instructlab/sdg/utils/
  • Support for snowflake-arctic-embed-l-v2.0 enabled

@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file labels Feb 4, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 4, 2025
@mergify mergify bot removed the ci-failure label Feb 4, 2025
@shivchander shivchander self-requested a review February 5, 2025 20:01
httpx>=0.25.0,<1.0.0
instructlab-schema>=0.4.0
jinja2>=3.0.0
langchain-text-splitters
openai>=1.13.3,<2.0.0
sentencepiece>=0.2.0
# Note: this dependency has to be built from source
submodlib @ git+https://github.com/decile-team/submodlib.git
Member

This is what we were doing previously with GPTDolomite and Dolomite Engine. We will likely need to move this to live in its own repo as we did for GPTDolomite so we can do builds.

Contributor Author

We have contacted the authors and expect submodlib to have a public release soon. If that does not go through, I think we can fall back to moving it into its own repo?

Signed-off-by: eshwarprasadS <[email protected]>
@abhi1092
Member

abhi1092 commented Feb 7, 2025

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
        "instruction": "conversation",
        "query_description": "conversation",
        "templates": {
            "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
            "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
        },
        "batch_size": 100000,
        "num_folds": 25,
        "subset_sizes": [0.97],
        "seed": 42,
        "template_name": "text",
        "combine_files": False,
        "encoder_type": "bge",
        "encoder_model": "BAAI/bge-base-en",
    }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the arctic snowflake model?

@eshwarprasadS
Contributor Author

Thanks @abhi1092 for the comments. Yes, I am planning to add a set of unit and functional tests for the feature in the coming week. Thanks for the suggestion.

And yes, I am planning to incorporate the arctic-snowflake encoder class in encoders.py and change the default encoder to that one. I have gotten hold of the implementation from the author, so we should have it in here soon.

Member

@RobotSail RobotSail left a comment

I'll need to go through in further detail but here are some comments so far

self.model = self.model.to(self.cfg.device)

if self.cfg.num_gpus > 1:
    print(f"Using {self.cfg.num_gpus} GPUs")
Member

I think SDG has a logger class. Should we be using that here instead of print?
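
A minimal sketch of what this suggestion could look like, assuming a module-level logger from Python's standard logging module. The logger that SDG actually exposes may be set up differently, and announce_gpus is a hypothetical helper used only for illustration:

```python
import logging

# Assumption: a standard module-level logger; SDG may provide its own.
logger = logging.getLogger(__name__)


def announce_gpus(num_gpus: int) -> None:
    """Hypothetical helper: report multi-GPU use via the logger, not print."""
    if num_gpus > 1:
        # Lazy %-style formatting: the message is only built if INFO is enabled.
        logger.info("Using %d GPUs", num_gpus)
```

Going through a logger lets callers control verbosity and destination with the usual logging configuration, which print does not allow.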

tensor2: Optional[Tensor] = None,
batch_size: int = 10000,
metric: str = "cosine",
device: str = __DEVICE,
Member

@RobotSail RobotSail Feb 8, 2025

It seems like __DEVICE is only used to pick a default for this function. I would recommend using the Python convention of setting this to None as a default, and then falling back to the initialization as it was via:

if not device:
  device = "cuda" if torch.cuda.is_available() else "cpu"

Also - I recommend going through torch.device rather than using a str, but if it works then it works.
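
Put together, the two recommendations above might look like the sketch below. It assumes the device is resolved in a small helper; resolve_device is a hypothetical name and is not taken from the PR:

```python
from typing import Optional, Union

import torch


def resolve_device(device: Optional[Union[str, torch.device]] = None) -> torch.device:
    """Hypothetical helper: default to None and pick the device at call time."""
    if device is None:
        # Same fallback as the module-level __DEVICE, evaluated lazily.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    if isinstance(device, torch.device):
        return device
    # Normalize string inputs such as "cpu" or "cuda:0" into a torch.device.
    return torch.device(device)
```

A function taking `device: Optional[str] = None` could then call `resolve_device(device)` on its first line, so the CUDA check runs when the function is called rather than at import time.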

Contributor Author

Oh definitely! Good catch, thanks for pointing these out.

@eshwarprasadS
Contributor Author

Thanks a lot for looking at this! A good part of this implementation was adapted directly from the upstream repository, so it really helps to have extra eyes on it.

Labels
dependencies Pull requests that update a dependency file
Development

Successfully merging this pull request may close these issues.

Minimal Integration of Subset Selection into SDG
4 participants