Subset Selection Integration #542

eshwarprasadS · 2025-02-04T20:15:37Z

Minimal Implementation of Subset Selection

This resolves #541

Important Incoming Changes

Added h5py>=3.12.1 to requirements
Added submodlib source to requirements
Added subset_selection.py
Added encoders dir housing bge and arctic encoders
Added subset_selection_utils.py under src/instructlab/sdg/utils/
Support for snowflake-arctic-embed-l-v2.0 enabled

Signed-off-by: eshwarprasadS <[email protected]>

This reverts commit e4a4730. Signed-off-by: eshwarprasadS <[email protected]>

Signed-off-by: eshwarprasadS <[email protected]>

RobotSail · 2025-02-06T02:46:56Z

requirements.txt

 httpx>=0.25.0,<1.0.0
 instructlab-schema>=0.4.0
 jinja2>=3.0.0
 langchain-text-splitters
 openai>=1.13.3,<2.0.0
 sentencepiece>=0.2.0
+# Note: this dependency has to be built from source
+submodlib @ git+https://github.com/decile-team/submodlib.git


This is what we were doing previously with GPTDolomite and Dolomite Engine. We will likely need to move this into its own repo, as we did for GPTDolomite, so that we can do builds.

eshwarprasadS · 2025-02-10

We have contacted the authors, and we expect submodlib to have a public release soon. If that does not go through, I think we can fall back to moving it into its own repo?

src/instructlab/sdg/subset_selection.py

Signed-off-by: eshwarprasadS <[email protected]>

abhi1092 · 2025-02-07T01:34:31Z

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
        "instruction": "conversation",
        "query_description": "conversation",
        "templates": {
            "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
            "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
        },
        "batch_size": 100000,
        "num_folds": 25,
        "subset_sizes": [0.97],
        "seed": 42,
        "template_name": "text",
        "combine_files": False,
        "encoder_type": "bge",
        "encoder_model": "BAAI/bge-base-en",
    }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the Arctic Snowflake model?

eshwarprasadS · 2025-02-07T20:37:47Z

Thanks @abhi1092 for the comments. Yes, I am planning to add a host of unit and functional tests in the coming week for the feature, thanks for the suggestion.

And yes, I am planning to incorporate the arctic-snowflake encoder class in encoders.py and make it the default encoder. I seem to have gotten hold of the implementation for that from the author, so we should have it in here soon.

RobotSail

I'll need to go through in further detail but here are some comments so far

RobotSail · 2025-02-08T02:45:09Z

src/instructlab/sdg/encoders.py

+        self.model = self.model.to(self.cfg.device)
+
+        if self.cfg.num_gpus > 1:
+            print(f"Using {self.cfg.num_gpus} GPUs")


I think SDG has a logger class, should we be using that here instead of print?
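A minimal sketch of that swap, assuming the standard library logging module with a module-level logger. The exact logger setup SDG uses is not shown in this thread, and report_gpu_count is a hypothetical helper name used only for illustration:

```python
import logging

# Module-level logger; assumed to follow the usual logging.getLogger(__name__) pattern.
logger = logging.getLogger(__name__)


def report_gpu_count(num_gpus: int) -> None:
    """Log the multi-GPU notice instead of printing it (hypothetical helper)."""
    if num_gpus > 1:
        # Lazy %-style formatting: the message is only built if INFO is enabled.
        logger.info("Using %d GPUs", num_gpus)
```

With a logger, callers decide whether the message is shown and where it goes, which print does not allow.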

RobotSail · 2025-02-08T03:57:30Z

src/instructlab/sdg/utils/subset_selection_utils.py

+    tensor2: Optional[Tensor] = None,
+    batch_size: int = 10000,
+    metric: str = "cosine",
+    device: str = __DEVICE,


It seems like __DEVICE is only used to pick a default for this function. I would recommend following the Python convention of defaulting this argument to None, and then falling back to the original initialization inside the function via:

if not device:
    device = "cuda" if torch.cuda.is_available() else "cpu"

Also, I recommend using torch.device rather than a str, but if it works then it works.
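The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: resolve_device is a hypothetical helper name, and the import guard exists only so the sketch also runs on machines without PyTorch installed.

```python
from typing import Optional

try:
    import torch
except ImportError:  # assumption: lets the sketch run where PyTorch is absent
    torch = None


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or pick one at call time if none is given.

    Hypothetical helper; the function in the PR takes `device` directly.
    """
    if device is None:
        # The CUDA check runs on each call rather than once at import time.
        has_cuda = torch is not None and torch.cuda.is_available()
        device = "cuda" if has_cuda else "cpu"
    return device
```

A caller that prefers a torch.device object could wrap the result as torch.device(resolve_device()). The sketch tests `device is None` rather than `not device`, which is a small deviation from the one-liner above and treats only a missing argument as "unset".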

eshwarprasadS · 2025-02-08

Oh, definitely! Good catch, thanks for pointing these out.

eshwarprasadS · 2025-02-08T04:11:04Z

Thanks a lot for looking at this! A good part of this implementation was adapted directly from the upstream repository, so it really helps to have extra eyes on it.

Signed-off-by: eshwarprasadS <[email protected]>

eshwarprasadS added 8 commits February 1, 2025 01:54

adding subset selection driver script and updating requirements

d07f6ac

Signed-off-by: eshwarprasadS <[email protected]>

feat: fix params add defaults and kwargs

c64ad8b

Signed-off-by: eshwarprasadS <[email protected]>

chore: linting

d09a44a

Signed-off-by: eshwarprasadS <[email protected]>

fix: num-gpus, add error for non-gpu sys, docstrings

e8d2991

Signed-off-by: eshwarprasadS <[email protected]>

fix: add submodlib dependency

e4a4730

Signed-off-by: eshwarprasadS <[email protected]>

Revert "fix: add submodlib dependency"

1b5666a

This reverts commit e4a4730. Signed-off-by: eshwarprasadS <[email protected]>

Merge branch 'main' into subset-selection-integration

84e9ffc

Signed-off-by: eshwarprasadS <[email protected]>

fix: add submodlin dependency to requirements

bf307a1

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot added the ci-failure and dependencies labels Feb 4, 2025

eshwarprasadS added 2 commits February 4, 2025 20:20

fix: fix submodlib dependency format

e351bc0

Signed-off-by: eshwarprasadS <[email protected]>

chore: linting

5ac0fed

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Feb 4, 2025

feat: refactor, modularize, lint

de945ad

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot removed the ci-failure label Feb 4, 2025

shivchander self-requested a review February 5, 2025 20:01

RobotSail reviewed Feb 6, 2025

shivchander requested changes Feb 6, 2025

ignore system role messages

a5c060c

Signed-off-by: eshwarprasadS <[email protected]>

RobotSail reviewed Feb 8, 2025

feat: add arctic encoder, reorg encoders, make arctic default encoder

93d0834

Signed-off-by: eshwarprasadS <[email protected]>
