Add tool to easily assess model compatibility #178

Open · 3 tasks · gkumbhat opened this issue Sep 6, 2023 · 5 comments

gkumbhat commented Sep 6, 2023

Description

As we explore support for more models, more tuning techniques, larger models, and multi-GPU vs. single-GPU setups with various context sizes, we often need to run tests to figure out whether a model is compatible and under which configuration (single-GPU vs. multi-GPU).

This story is to create a script that automates some of the above exploration and produces output that is easy to document (or that automatically generates a markdown file).

Discussion

Provide detailed discussion here

Acceptance Criteria

  • Script that is able to run the tests described above.
  • Add help text to the tool that guides users in running the script.
  • Update the docs with information about this script.

olson-ibm commented Oct 2, 2023

The goal here is to predict whether or not .train() is going to complete successfully, given:

  1. Tuning parameters
  2. A model (path to a model version)
  3. A compute configuration (GPU, CPU, RAM, conda config, etc.)

I see two ways to accomplish this, so if I am off base feel free to advise:

  1. The "Here's what worked in the past" strategy:
    We create some kind of publicly accessible data store that contains historical evidence (logs, etc.) of successful tuning exercises (all of the above inputs) and the software packaging and compute configuration used to perform the tuning. Maybe this gets reported into a governance module or onto a model card long term?

  2. The "Let's give it a quick try" strategy:
    We add a command line switch, --bootstrap_only, to the fine and peft tuning kickoff scripts that, if present, exits after a successful .bootstrap() or errors out with a message explaining why bootstrapping failed, including failures due to lack of resources (GPU, RAM, etc.); see the sketch after this list. This doesn't guarantee .train() will execute, however.
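For concreteness, a minimal sketch of what that switch could look like, assuming the bootstrap step reduces to loading the base model and tokenizer with Hugging Face transformers (the real caikit-nlp kickoff scripts and .bootstrap() do more than this):

```python
# Minimal sketch of the proposed --bootstrap_only switch. Assumes the
# bootstrap step reduces to loading the base model and tokenizer with
# Hugging Face transformers; the real caikit-nlp kickoff scripts and
# .bootstrap() do more than this.
import argparse
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    parser = argparse.ArgumentParser(description="Tuning kickoff (sketch)")
    parser.add_argument("--model-path", required=True, help="Path or hub id of the base model")
    parser.add_argument(
        "--bootstrap_only",
        action="store_true",
        help="Exit after a successful bootstrap instead of running .train()",
    )
    args = parser.parse_args()

    try:
        tokenizer = AutoTokenizer.from_pretrained(args.model_path)
        model = AutoModelForCausalLM.from_pretrained(args.model_path)
    except Exception as err:  # missing files, bad config, not enough memory, ...
        print(f"Bootstrap failed: {err}", file=sys.stderr)
        sys.exit(1)

    if args.bootstrap_only:
        print("Bootstrap succeeded; exiting without calling .train()")
        sys.exit(0)

    # ... the normal fine/peft tuning flow would continue here ...

if __name__ == "__main__":
    main()
```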

I don't see a way to predict whether .train() will complete successfully without actually calling it and waiting for it to fail.

I do have several bash shell scripts that handle the overhead of setting up a training session (fetching the model, public or 'bring your own'), setting up the output, etc. I don't know how much help that would be, but I could look into parameterizing them further.


gkumbhat commented Oct 2, 2023

@olson-ibm let's start with 2, i.e. the "Let's give it a quick try" approach. I was thinking we can expose this via a --compatibility-test option in the current script and do the following:

  1. bootstrap
  2. train

With train, we can set 1 epoch so that it doesn't keep going, and skip saving the model at the end. Later on I imagine we could also add a dry-run functionality to the .train function itself, which would just try to estimate whether training will work without actually executing it 🤔
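A rough sketch of this flow is below. The module class interface (bootstrap/train classmethods), the argument names, and the num_epochs parameter are assumptions for illustration, not the actual caikit-nlp API:

```python
# Illustrative sketch of the proposed --compatibility-test flow. The
# bootstrap/train interface and argument names below are assumptions,
# not the actual caikit-nlp API.
def run_compatibility_test(module_cls, base_model_path, train_stream, **tuning_kwargs):
    """Bootstrap, then run a single training epoch without saving the model."""
    # Step 1: bootstrap -- fails fast on missing files, bad configs, or
    # inability to even load the model onto the available hardware.
    try:
        base_model = module_cls.bootstrap(base_model_path)
    except Exception as err:
        print(f"Incompatible: bootstrap failed ({err})")
        return False

    # Step 2: train for exactly 1 epoch so the probe terminates quickly,
    # and deliberately skip the .save() step at the end.
    try:
        module_cls.train(
            base_model=base_model,
            train_stream=train_stream,
            num_epochs=1,  # keep the probe short
            **tuning_kwargs,
        )
    except Exception as err:
        print(f"Incompatible: training failed ({err})")
        return False

    print("Compatible: bootstrap and a 1-epoch train both succeeded")
    return True
```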

olson-ibm commented

> which just tries to estimate if it will work or not without actually executing training 🤔

Can't wait to see what your thinking is here :)

PR on the rest of the above will be out shortly...

olson-ibm mentioned this issue and pushed several commits to olson-ibm/caikit-nlp that referenced it Oct 3, 2023 (Signed-off-by: Joe Olson <[email protected]>)
chakrn moved this from ToDo to In Progress in caikit ecosystem Oct 3, 2023

chakrn commented Oct 4, 2023

Gaurav says to leverage the 'estimate' module for compatibility testing without doing actual training.
@gkumbhat said he will create a new issue for this part.
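
As a point of reference, a back-of-the-envelope version of such an estimate (purely illustrative; not the actual 'estimate' module) could compare a crude training-memory figure against available GPU memory:

```python
# Back-of-the-envelope GPU memory check for full fine-tuning with Adam.
# Purely illustrative, not the actual 'estimate' module. Assumes fp32
# weights, gradients, and two fp32 Adam moments (~16 bytes per parameter),
# and ignores activation memory, which varies with batch size and context
# length.
import torch

def estimate_train_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Crude lower bound on training memory for num_params parameters."""
    return num_params * bytes_per_param

def fits_on_gpu(num_params: int, device: int = 0) -> bool:
    """Compare the estimate against the total memory of one GPU."""
    total = torch.cuda.get_device_properties(device).total_memory
    needed = estimate_train_bytes(num_params)
    print(f"~{needed / 1e9:.1f} GB needed vs {total / 1e9:.1f} GB on GPU {device}")
    return needed < total

# Example: a 7B-parameter model needs roughly 112 GB by this estimate,
# so full fine-tuning would not fit on a single 80 GB GPU.
```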


chakrn commented Oct 30, 2023

Moving this back to ToDo for now, since Joe is working on a more pressing task in the internal repo.

chakrn moved this from In Progress to ToDo in caikit ecosystem Oct 30, 2023