Add tool to easily assess model compatibility #178

Open · 3 tasks · gkumbhat opened this issue Sep 6, 2023 · 5 comments

gkumbhat commented Sep 6, 2023

Description

As we explore support for more models, more tuning techniques, larger models, and multi-GPU vs. single-GPU setups with various context sizes, we often need to run tests to figure out whether a model is compatible and under which configuration (single-GPU vs. multi-GPU).

This story is to create a script that automates some of the above exploration and produces output that is easy to document (or that automatically generates a markdown file).

Discussion

Provide detailed discussion here

Acceptance Criteria

  • Script that is able to run the tests described above.
  • Add help text to the tool that guides users in running the script.
  • Update the docs with information about this script.

olson-ibm commented Oct 2, 2023

The goal here is to predict whether or not .train() is going to complete successfully, given:

  1. Tuning parameters
  2. A model (path to a model version)
  3. A compute configuration (GPU, CPU, RAM, conda config, etc.)

I see two ways to accomplish this, so if I am off base feel free to advise:

  1. The "Here's what worked in the past" strategy:
    We create some kind of publicly accessible data store that contains historical evidence (logs, etc.) of successful tuning exercises (all of the above inputs) and the software packaging and compute configuration used to perform the tuning. Maybe this gets reported into a governance module or onto a model card long term?

  2. The "Let's give it a quick try" strategy:
    We add a command line switch, --bootstrap_only, to the fine and peft tuning kickoff scripts that, if present, exits after a successful .bootstrap() or errors out with a message explaining why bootstrapping failed, including failures due to lack of resources (GPU, RAM, etc.); see the sketch after this list. This doesn't guarantee .train() will execute, however.
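For concreteness, a minimal sketch of what that switch could look like, assuming the bootstrap step reduces to loading the base model and tokenizer with Hugging Face transformers (the real caikit-nlp kickoff scripts and .bootstrap() do more than this):

```python
# Minimal sketch of the proposed --bootstrap_only switch. Assumes the
# bootstrap step reduces to loading the base model and tokenizer with
# Hugging Face transformers; the real caikit-nlp kickoff scripts and
# .bootstrap() do more than this.
import argparse
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    parser = argparse.ArgumentParser(description="Tuning kickoff (sketch)")
    parser.add_argument("--model-path", required=True, help="Path or hub id of the base model")
    parser.add_argument(
        "--bootstrap_only",
        action="store_true",
        help="Exit after a successful bootstrap instead of running .train()",
    )
    args = parser.parse_args()

    try:
        tokenizer = AutoTokenizer.from_pretrained(args.model_path)
        model = AutoModelForCausalLM.from_pretrained(args.model_path)
    except Exception as err:  # missing files, bad config, not enough memory, ...
        print(f"Bootstrap failed: {err}", file=sys.stderr)
        sys.exit(1)

    if args.bootstrap_only:
        print("Bootstrap succeeded; exiting without calling .train()")
        sys.exit(0)

    # ... the normal fine/peft tuning flow would continue here ...

if __name__ == "__main__":
    main()
```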

I don't see a way to predict whether .train() will complete successfully without actually calling it and waiting for it to fail.

I do have several bash shell scripts that handle the overhead of setting up a training session (fetching the model, public or 'bring your own'), setting up the output, etc. I don't know how much help that would be, but I could look into parameterizing them further.


gkumbhat commented Oct 2, 2023

@olson-ibm let's start with 2, i.e. the "Let's give it a quick try" approach. I was thinking we can expose this via a --compatibility-test option in the current script and do the following:

  1. bootstrap
  2. train

With train, we can set 1 epoch so that it doesn't keep going, and skip saving the model at the end. Later on I imagine we could also add a dry-run functionality to the .train function itself, which would just try to estimate whether training will work without actually executing it 🤔
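A rough sketch of this flow is below. The module class interface (bootstrap/train classmethods), the argument names, and the num_epochs parameter are assumptions for illustration, not the actual caikit-nlp API:

```python
# Illustrative sketch of the proposed --compatibility-test flow. The
# bootstrap/train interface and argument names below are assumptions,
# not the actual caikit-nlp API.
def run_compatibility_test(module_cls, base_model_path, train_stream, **tuning_kwargs):
    """Bootstrap, then run a single training epoch without saving the model."""
    # Step 1: bootstrap -- fails fast on missing files, bad configs, or
    # inability to even load the model onto the available hardware.
    try:
        base_model = module_cls.bootstrap(base_model_path)
    except Exception as err:
        print(f"Incompatible: bootstrap failed ({err})")
        return False

    # Step 2: train for exactly 1 epoch so the probe terminates quickly,
    # and deliberately skip the .save() step at the end.
    try:
        module_cls.train(
            base_model=base_model,
            train_stream=train_stream,
            num_epochs=1,  # keep the probe short
            **tuning_kwargs,
        )
    except Exception as err:
        print(f"Incompatible: training failed ({err})")
        return False

    print("Compatible: bootstrap and a 1-epoch train both succeeded")
    return True
```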

olson-ibm commented

> which just tries to estimate if it will work or not without actually executing training 🤔

Can't wait to see what your thinking is here :)

PR on the rest of the above will be out shortly...

olson-ibm mentioned this issue and pushed several commits to olson-ibm/caikit-nlp that referenced it Oct 3, 2023 (Signed-off-by: Joe Olson <[email protected]>)
chakrn moved this from ToDo to In Progress in caikit ecosystem Oct 3, 2023

chakrn commented Oct 4, 2023

Gaurav says to leverage the 'estimate' module for compatibility testing without doing actual training.
@gkumbhat said he will create a new issue for this part.
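
As a point of reference, a back-of-the-envelope version of such an estimate (purely illustrative; not the actual 'estimate' module) could compare a crude training-memory figure against available GPU memory:

```python
# Back-of-the-envelope GPU memory check for full fine-tuning with Adam.
# Purely illustrative, not the actual 'estimate' module. Assumes fp32
# weights, gradients, and two fp32 Adam moments (~16 bytes per parameter),
# and ignores activation memory, which varies with batch size and context
# length.
import torch

def estimate_train_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Crude lower bound on training memory for num_params parameters."""
    return num_params * bytes_per_param

def fits_on_gpu(num_params: int, device: int = 0) -> bool:
    """Compare the estimate against the total memory of one GPU."""
    total = torch.cuda.get_device_properties(device).total_memory
    needed = estimate_train_bytes(num_params)
    print(f"~{needed / 1e9:.1f} GB needed vs {total / 1e9:.1f} GB on GPU {device}")
    return needed < total

# Example: a 7B-parameter model needs roughly 112 GB by this estimate,
# so full fine-tuning would not fit on a single 80 GB GPU.
```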


chakrn commented Oct 30, 2023

Moving this back to ToDo for now, since Joe is working on a more pressing task in the internal repo.

chakrn moved this from In Progress to ToDo in caikit ecosystem Oct 30, 2023