
best practices for structuring nested experiment runs #416

Closed
kirk86 opened this issue Feb 13, 2019 · 11 comments

@kirk86

kirk86 commented Feb 13, 2019

Hi folks,
I was wondering if there's a way to structure experiments for each individual choice of dataset and algorithm.

For instance, you could have something in your code base like this:

for dataset in datasets:
    for algorithm in algorithms:
        run_and_record_experiment(dataset, algorithm)

Eventually, since these runs are based on individual combinations of dataset and algorithm, you would want to have an experiment for each of them.

How would you go about doing that?

One very bad way I came up with is the following:

experiments = []
for dataset in datasets:
    for algorithm in algorithms:
        experiments.append(Experiment(f"{dataset}_{algorithm}"))

for experiment in experiments:
    @experiment.main
    def main():
        ...  # main code calling the models and running training

Please let me know what's a better way of doing that. Thank you!

@JarnoRFB
Collaborator

Well, I think your basic idea is right. However, given that you want to run the same experiment just with different configurations (model, data), you can reuse the same experiment object and just update the config. It would look something like this:

from sacred import Experiment

ex = Experiment("generic_experiment")

@ex.main
def run(dataset, model):
    ...

# datasets and models are your own lists of dataset / model identifiers
for dataset in datasets:
    for model in models:
        ex.run(config_updates={"dataset": dataset, "model": model},
               options={"--name": f"{dataset}_{model}"})

For a more complete example you might look at Klaus' code https://github.com/Qwlouse/Binding/blob/master/run_evaluation.py

@kirk86
Author

kirk86 commented Feb 14, 2019

@JarnoRFB thanks for the reply. In the end that's what I ended up doing, but it was a bit more involved since I was using different files, src.py and main.py. src.py contains the Ingredient and main.py contains the actual Experiment with the ingredients. One issue that I faced was that I was loading the config from yaml files inside main.py using ex.add_config, but then in main.py I couldn't get those configs to be injected at run time, even though I had the ingredient.capture decorator on some of the methods. This bit threw me off. Maybe I was doing it wrong?

The other thing that I would like to ask is how anyone would go about saving validation folds in each experiment. Should that be in different columns using _run.info? Or should each of them be a separate experiment? But then it would unnecessarily populate many entries in the db. Is there a proper way to populate those folds, each as a separate entry (a.k.a. row) but under its own experiment?

@JarnoRFB
Collaborator

In the end that's what I ended up doing, but it was a bit more involved since I was using different files, src.py and main.py. src.py contains the Ingredient and main.py contains the actual Experiment with the ingredients. One issue that I faced was that I was loading the config from yaml files inside main.py using ex.add_config, but then in main.py I couldn't get those configs to be injected at run time, even though I had the ingredient.capture decorator on some of the methods. This bit threw me off. Maybe I was doing it wrong?

Sorry, but I cannot quite follow. A minimal code example would greatly help here.

The other thing that I would like to ask is how anyone would go about saving validation folds in each experiment. Should that be in different columns using _run.info? Or should each of them be a separate experiment? But then it would unnecessarily populate many entries in the db. Is there a proper way to populate those folds, each as a separate entry (a.k.a. row) but under its own experiment?

What exactly do you want to save from the validation fold? If it is just a metric, e.g. accuracy, why not save it as a metric of the run? You can call

_run.log_scalar("validation_fold_acc", acc)

for each validation fold.
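For illustration, a minimal self-contained sketch of that pattern, assuming a made-up experiment name, a fixed 5-fold split, and a stub evaluate_fold helper standing in for real training/evaluation:

from sacred import Experiment

ex = Experiment("cv_logging_example")  # hypothetical name, just for illustration

def evaluate_fold(fold_idx):
    # stand-in for real training/evaluation on one fold
    return 0.9

@ex.main
def main(_run):
    for fold_idx in range(5):  # e.g. 5-fold cross-validation
        acc = evaluate_fold(fold_idx)
        # each call appends one entry to the "validation_fold_acc" metric of this run
        _run.log_scalar("validation_fold_acc", acc, step=fold_idx)

if __name__ == "__main__":
    ex.run()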

It would also be nice if you could ask such general questions under the python-sacred tag on Stack Overflow, so the answers remain more visible to the general public.

@kirk86
Author

kirk86 commented Feb 15, 2019

Sorry, but I cannot quite follow. A minimal code example would greatly help here.

I apologize for the confusion. Let me provide an MWE, as you requested, to make things clear.
src.py

import sacred

ingred = sacred.Ingredient('default-params')
ingred.add_config('some/yaml/file')    # <--- added config
ingred.add_config('another/yaml/file')  # <--- added config

class MyModel(object):
    @ingred.capture
    def __init__(self):
        pass  # do some stuff...

main.py

import sacred

from src import MyModel, ingred

ex = sacred.Experiment('test-exper', ingredients=[ingred])

@ex.main   # <--- this also works as a capture decorator
def main(param1, param2):  # <--- if I pass my params they are not recognized, only if I access them through _run
    MyModel()

What exactly do you want to save from the validation fold? If it is just a metric, e.g. accuracy, why not save it as a metric of the run?

Yup, that's doable, but when you examine it through Omniboard I think it shows the validation loss of the last training epoch and not the best validation value.

In other words, since _run.log_scalar("validation_fold_acc", acc) always attaches a step counter, I am not sure whether the values are appended for each step or whether they are overwritten.

It would also be nice if you could ask such general questions under the python-sacred tag on Stack Overflow, so the answers remain more visible to the general public.

Thanks for the pointer, I wasn't aware of it. From now on I'll post related stuff there.

@JarnoRFB
Collaborator

Yup, that's doable, but when you examine it through Omniboard I think it shows the validation loss of the last training epoch and not the best validation value.
In other words, since _run.log_scalar("validation_fold_acc", acc) always attaches a step counter, I am not sure whether the values are appended for each step or whether they are overwritten.

I believe that if you do not set the step explicitly, it will append to the metrics array. If you want to see the current best validation metric, you could set it as the result. While the experiment is running, use

_run.result = best_validation_acc

and set the final result by returning the value from the main function. See also https://sacred.readthedocs.io/en/latest/collected_information.html#live-information. This way it is displayed in the result column in Omniboard.
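For illustration, a minimal sketch combining both, assuming a made-up experiment name, a fixed epoch count, and a stub validate function standing in for a real validation pass:

from sacred import Experiment
import random

ex = Experiment("result_example")  # hypothetical name, just for illustration

def validate(epoch):
    # stand-in for a real validation pass
    return random.random()

@ex.main
def main(_run):
    best_validation_acc = 0.0
    for epoch in range(10):  # placeholder training loop
        acc = validate(epoch)
        if acc > best_validation_acc:
            best_validation_acc = acc
            _run.result = best_validation_acc  # live result while the run is active
    return best_validation_acc  # final result, shown in the result column

if __name__ == "__main__":
    ex.run()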

On the ingredient issue I unfortunately cannot comment without looking a bit deeper into it. I have not really used ingredients myself. But do I understand correctly that you want to access parameters from the ingredient config in the experiment's main function?

@kirk86
Author

kirk86 commented Feb 15, 2019

On the ingredient issue I unfortunately cannot comment without looking a bit deeper into it. I have not really used ingredients myself. But do I understand correctly that you want to access parameters from the ingredient config in the experiment's main function?

Exactly, without having to use ex.add_config again or go through _run, just by accessing the params in the captured method def main(params). It seems that the params are not injected when using ingredients? Although I might be wrong!

@Qwlouse
Collaborator

Qwlouse commented Feb 18, 2019

Hi @kirk86,

ingredients create their own namespace in the configuration, as if the values were part of a dictionary with the name of the ingredient. If you slightly modify your example to use a Python-compatible name for the ingredient, you can access it from there:

src.py

import sacred

ingred = sacred.Ingredient('default_params')
ingred.add_config('some/yaml/file')    # <--- added config
ingred.add_config('another/yaml/file')  # <--- added config

class MyModel(object):
    @ingred.capture
    def __init__(self):
        pass  # do some stuff...

main.py

import sacred

from src import MyModel, ingred

ex = sacred.Experiment('test_exper', ingredients=[ingred])

@ex.main   # <--- this also works as a capture decorator
def main(default_params):
    param1 = default_params['param1']
    param2 = default_params['param2']
    MyModel()
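As a side note (a hedged sketch, not verified against this exact setup): since the values live under the ingredient's name in the config, overriding them from the outside should also go through that namespace, for example:

# Sketch: overriding the namespaced ingredient values when launching a run.
# Assumes the ex and default_params ingredient defined above; the nested dict
# mirrors how the values appear in the final config.
ex.run(config_updates={"default_params": {"param1": 1, "param2": 2}})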

@kirk86
Author

kirk86 commented Feb 18, 2019

@Qwlouse
thanks a lot. That's great!
I'll close the issue for now to keep things clean.

@pedropalb

pedropalb commented Oct 5, 2021

Well, I think your basic idea is right. However, given that you want to run the same experiment just with different configurations (model, data), you can reuse the same experiment object and just update the config. It would look something like this:

from sacred import Experiment

ex = Experiment("generic_experiment")

@ex.main
def run(dataset, model):
    ...

# datasets and models are your own lists of dataset / model identifiers
for dataset in datasets:
    for model in models:
        ex.run(config_updates={"dataset": dataset, "model": model},
               options={"--name": f"{dataset}_{model}"})

For a more complete example you might look at Klaus' code https://github.com/Qwlouse/Binding/blob/master/run_evaluation.py

@JarnoRFB, I came up with a similar solution, but I'm having problems passing the dataset as an argument in config_updates. Since I'm using the MongoObserver, the whole dataset is being saved to MongoDB.

Is there any way to pass data to the ex.run() method without touching the config? As a workaround, I thought of using a global variable to hold the dataset reference, but I wonder if there is a more elegant solution. Maybe a way to, at least, tell the MongoObserver to ignore some config entries.

@JarnoRFB
Collaborator

JarnoRFB commented Oct 5, 2021

@pedropalb Sorry, I am not quite sure what you mean. I guess in the example I meant dataset to represent a reference to the dataset, e.g. a string identifying the dataset or a path to the data. Otherwise all datasets would need to be loaded into memory upfront. As you pointed out, putting an instantiated dataset in the config is not great, but I think this would be the case irrespective of the observer used.
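For illustration, one possible pattern along those lines (not an official sacred feature; load_dataset and _DATASET_CACHE are made-up names for this sketch): keep only the dataset path in the config and cache the loaded, preprocessed data at module level, so repeated runs with the same path do not reload it and the observer only ever sees the path.

from sacred import Experiment

ex = Experiment("cached_dataset_example")  # hypothetical name

_DATASET_CACHE = {}  # module-level cache; only the path ends up in the stored config

def load_dataset(path):
    # stand-in for real loading + preprocessing
    return {"path": path}

@ex.main
def run(dataset_path, model):
    if dataset_path not in _DATASET_CACHE:
        _DATASET_CACHE[dataset_path] = load_dataset(dataset_path)
    dataset = _DATASET_CACHE[dataset_path]
    ...  # train `model` on `dataset`

for dataset_path in ["data/a", "data/b"]:  # hypothetical paths
    ex.run(config_updates={"dataset_path": dataset_path, "model": "cnn"})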

@pedropalb

pedropalb commented Oct 5, 2021

@JarnoRFB I see! I misunderstood your dataset variable.

I've been using the dataset path as a config entry. But now I need to run multiple times with the same dataset. Passing the dataset path is not an option anymore, since I would have to load it from the data path and preprocess it in every single call to ex.run.

I need a way to pass the same loaded and preprocessed dataset to multiple ex.run calls. The workaround I mentioned previously is to have a global variable hold this preloaded dataset:

from sacred import Experiment

ex = Experiment("generic_experiment")

dataset = None  # module-level holder for the preloaded dataset

@ex.command
def train(model):
    global dataset
    ...  # train `model` on the preloaded `dataset`

@ex.command
def run(dataset_paths, models):
    global dataset

    for dataset_path in dataset_paths:
        dataset = load_dataset(dataset_path)  # my own loading/preprocessing function

        for model in models:
            ex.run('train', config_updates={"model": model},
                   options={'--name': f"{dataset_path}_{model}"})

But I'm wondering if there is a better and more elegant way to do it.

Thanks!
