Using Mlflow logger #11197

JCardoso9 · 2021-12-21T11:17:13Z

I have been trying to use MLflow to track my experiments. I want to record my runs under the experiment name "Training".

However, I can't get a single run to record the run name and all of the parameters (from argparse and the PL callbacks), log the loss and the final model, and show the symbol indicating whether it finished correctly, all in one line of the UI.

If I use only the MLflow logger without autolog, I am unable to log the model. If I then log the model manually using mlflow.pytorch.log_model(model, "model"), another run is created just to log this model, while the original one still doesn't have the model.

If I use both the MLflow logger and autolog, two runs are created: one logs the model but not the parameters, and the other does the opposite. Only the run logging the model has the finished symbol.

If I use only autolog, I don't get metrics for training or the full set of parameters, as the argparse ones aren't saved.

Using with mlflow.start_run() as run: also doesn't seem to help.

Here's the basic idea of my code. This is one combination I tried.

experiment_name="Training"
run_name="baseline_xlmr"
exp = mlflow.set_experiment(experiment_name)

mlf_logger = MLFlowLogger(
    experiment_name=experiment_name,
    run_name=run_name, 
)

model = EfficientEL_c(**vars(args))

trainer = Trainer.from_argparse_args(args,logger=mlf_logger)

mlflow.pytorch.autolog()

# Train the model
# with mlflow.start_run() as run:

trainer.fit(model)

Here is how the two-run issue then appears in the MLflow UI.

twsl · 2022-01-27T17:13:58Z

I'm having the same issue. The problem is that the autolog function patches the fit function and tries to start a run on its own.
Try calling autolog after the manual start_run; I think that might help. But the logger creates an experiment and a run as well.
I'm currently trying to implement the autolog function as a PL callback.

https://docs.databricks.com/applications/mlflow/databricks-autologging.html#disable-databricks-autologging can help as well, if you are using databricks

anshugarg1 · 2022-02-08T18:31:46Z

I faced the same problem. In my case the solution was the following:
I didn't use MLFlowLogger. I just created an experiment, started a new run, and used mlflow.log_metric() etc. to log the information. The run created at the beginning stays active throughout the program. You can check this by getting the active run's ID with run_id = mlflow.active_run().info.run_id wherever you log information; it will be the same run ID as the one created at the beginning of the experiment.
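
A minimal sketch of the approach described above, assuming plain mlflow calls only (no MLFlowLogger, no autolog). The function name, the losses argument, and the metric name train_loss are hypothetical placeholders; the experiment and run names are the ones from the original question.

```python
def log_losses_in_single_run(losses, experiment_name="Training", run_name="baseline_xlmr"):
    """Log a sequence of loss values into one manually started MLflow run."""
    # Imported inside the function so this file still loads where mlflow is absent.
    import mlflow

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name=run_name) as run:
        for step, loss in enumerate(losses):
            # The run opened above stays the active run for every logging call.
            assert mlflow.active_run().info.run_id == run.info.run_id
            mlflow.log_metric("train_loss", loss, step=step)
        return run.info.run_id
```

Calling log_losses_in_single_run([0.9, 0.7, 0.5]) should leave a single run in the experiment, since nothing else here starts a run of its own.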

carsonmclean · 2022-10-13T16:42:02Z

If I use both mlflow logger and autolog two runs are created and one will log the model and not the parameters and the other the opposite. Only the run logging the model will have the finished symbol.

I was able to share the same MLflow run while using both mlflow.pytorch.autolog() and pytorch_lightning.loggers.MLFlowLogger by passing the run_id from mlflow to the MLFlowLogger.

mlflow.set_tracking_uri(mlflow_uri)
mlflow.set_experiment("Training")
mlflow.pytorch.autolog()
mlflow.start_run(run_name="baseline_xlmr")

mlflow.log_params(pl.utilities.logger._flatten_dict(config))

mlf_logger = MLFlowLogger(
    experiment_name=mlflow.get_experiment(mlflow.active_run().info.experiment_id).name,
    tracking_uri=mlflow.get_tracking_uri(),
    run_id=mlflow.active_run().info.run_id,
)

trainer = pl.Trainer(
    logger=mlf_logger,
    ...
)

...

mlf_logger.experiment.log_artifact(
    run_id=mlf_logger.run_id,
    local_path=checkpoint_callback.best_model_path)

So this yields only one run in MLflow UI with all functionality (autologged model params, manually logged params via mlflow, manually logged artifacts via mlf_logger, auto model logging, ...) from what I have been able to tell so far. Only outstanding "issue" is that the run ends in the UI and gets a checkmark when the PyTorch Lightning MLFlowLogger wraps up in the Trainer, but I am still able to add artifacts and metrics to the same run with basic mlflow after the fact. Not much of a problem in my eyes, but if there's a fix, let me know!
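
One caveat about the snippet above: pl.utilities.logger._flatten_dict is a private helper (note the leading underscore), so it may move or change between Lightning releases. A minimal stand-in, assuming config is a nested dict, could look like the sketch below. The name flatten_dict, the "/" delimiter, and the example config are my own choices for illustration, not part of either library.

```python
from collections.abc import Mapping


def flatten_dict(params, delimiter="/", parent_key=""):
    """Flatten a nested mapping into a single-level dict with joined keys."""
    flat = {}
    for key, value in params.items():
        full_key = f"{parent_key}{delimiter}{key}" if parent_key else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested mappings, carrying the key prefix along.
            flat.update(flatten_dict(value, delimiter, full_key))
        else:
            flat[full_key] = value
    return flat


# Hypothetical config, for illustration only.
config = {"model": {"lr": 1e-3, "layers": 4}, "seed": 42}
flat = flatten_dict(config)
```

The resulting single-level dict is the kind of input mlflow.log_params(flat) expects.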

6 replies

cbuob Jan 27, 2023

Has anyone gotten this to work on multiple GPUs?

carsonmclean Jan 27, 2023

I moved off the AI team a few months ago, after I posted the above, but I believe my team was using multi-GPU with the approach I mentioned.

Kalyankr Jan 27, 2023

@cbuob
I solved the multi-logging issue by creating experiments in AML pipelines (PythonScriptStep from azureml.pipeline.steps) and getting the experiment from the run context.

Model.py

model
............
......
............



from azureml.core.run import Run
run = Run.get_context()
mlflow_url = run.experiment.workspace.get_mlflow_tracking_uri()
mlf_logger = MLFlowLogger(experiment_name=run.experiment.name, tracking_uri=mlflow_url)
mlf_logger._run_id = run.id
trainer.logger = mlf_logger

run.complete()

from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    name="finetune",
    source_directory=experiment_folder,
    script_name="model.py",
    # Flat list of flag/value pairs passed to model.py
    arguments=[
        '--train_file', 'train.parquet',
        '--validation_file', 'evar.parquet',
        '--gpus', 2,
        '--strategy', 'ddp',
        '--max_seq_length', 75,
        '--max_epochs', 5,
        '--log_every_n_steps', 50,
        '--precision', 16,
        '--deterministic', True,
        '--default_root_dir', 'outputs/',
        '--fast_dev_run', False,
    ],
    compute_target=pipeline_cluster,
    runconfig=pipeline_run_config,
    allow_reuse=True,
)

cbuob Jan 27, 2023

@carsonmclean

I moved off the AI team a few months ago, after I posted the above, but I believe my team was using multi-GPU with the approach I mentioned.

Sadly, this did not work for me. It always creates multiple MLflow runs. Thanks anyway!

cbuob Jan 27, 2023

@Kalyankr Thanks a lot. Will give this a try tomorrow!

josh-melton-db · 2024-01-09T19:17:12Z

For Databricks users: try setting default_root_dir, as in
trainer = pl.Trainer(max_epochs=5, logger=logger, callbacks=[early_stopping], default_root_dir=log_path)
along with the other advice in the thread on disabling autolog and setting the experiment path to a workspace path. It's included in the docs here but not called out explicitly.

zACIID · 2024-04-19T09:46:38Z

I had a very similar problem and I was able to solve it as follows:

    mlflow.pytorch.autolog()

    with mlflow.start_run(log_system_metrics=True, run_name='training_test') as run:
        mlflow_logger = MLFlowLogger(
            # This should be by default (check MLFlowLogger source code),
            #   but apparently it doesn't correctly read the uri from the env
            tracking_uri=os.getenv('MLFLOW_TRACKING_URI'),
            experiment_name=EXPERIMENT_NAME,
            log_model=True,
            run_id=run.info.run_id
        )

        ...

        trainer.fit(...)

Basically, the solution is to initialize MLFlowLogger inside the start_run() context and pass it the run_id of the manually started run, because the logger starts a run of its own if no run_id argument is provided.

Hope it helps someone, cheers
