[2024-08-20 22:54:11.526][ERROR](ignite.engine.engine.SupervisedTrainer) - Engine run is terminating due to exception: Metric 'train_mean_dice' is malformed; persisted metric data contained 1 fields. Expected 2 or 3 fields.
2024-08-20 22:54:11,526 - ERROR - Exception: Metric 'train_mean_dice' is malformed; persisted metric data contained 1 fields. Expected 2 or 3 fields.
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/monai/handlers/mlflow_handler.py", line 215, in start
    runs = self.client.search_runs(self.experiment.experiment_id)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/tracking/client.py", line 3000, in search_runs
    return self._tracking_client.search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 941, in search_runs
    return self.store.search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/abstract_store.py", line 519, in search_runs
    runs, token = self._search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 937, in _search_runs
    runs.extend(self._get_run_from_info(r) for r in run_infos)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 937, in <genexpr>
    runs.extend(self._get_run_from_info(r) for r in run_infos)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 684, in _get_run_from_info
    metrics = self._get_all_metrics(run_info)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 775, in _get_all_metrics
    metrics.append(self._get_metric_from_file(parent_path, metric_file))
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 754, in _get_metric_from_file
    metric_objs = [
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 755, in <listcomp>
    FileStore._get_metric_from_line(metric_name, line)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 782, in _get_metric_from_line
    raise MlflowException(
mlflow.exceptions.MlflowException: Metric 'train_mean_dice' is malformed; persisted metric data contained 1 fields. Expected 2 or 3 fields.
W0820 22:54:13.636000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543169 via signal SIGTERM
W0820 22:54:13.636000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543170 via signal SIGTERM
W0820 22:54:13.637000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543171 via signal SIGTERM
W0820 22:54:13.638000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543172 via signal SIGTERM
W0820 22:54:13.639000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543174 via signal SIGTERM
W0820 22:54:13.641000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543177 via signal SIGTERM
W0820 22:54:13.645000 140489045842816 torch/multiprocessing/spawn.py:145] Terminating process 1543181 via signal SIGTERM
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/interfaces/utils/app.py", line 128, in <module>
    run_main()
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/interfaces/utils/app.py", line 113, in run_main
    result = a.train(request)
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/interfaces/app.py", line 423, in train
    result = task(request, self.datastore())
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/tasks/train/basic_train.py", line 466, in __call__
    torch.multiprocessing.spawn(main_worker, nprocs=world_size, args=(world_size, req, datalist, self))
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/tasks/train/basic_train.py", line 735, in main_worker
    task.train(rank, world_size, request, datalist)
  File "/home/dgxadmin/MONAI/MONAILabel/monailabel/tasks/train/basic_train.py", line 559, in train
    context.trainer.run()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/monai/engines/trainer.py", line 56, in run
    super().run()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/monai/engines/workflow.py", line 283, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 636, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
    raise e
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/monai/handlers/mlflow_handler.py", line 215, in start
    runs = self.client.search_runs(self.experiment.experiment_id)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/tracking/client.py", line 3000, in search_runs
    return self._tracking_client.search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 941, in search_runs
    return self.store.search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/abstract_store.py", line 519, in search_runs
    runs, token = self._search_runs(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 937, in _search_runs
    runs.extend(self._get_run_from_info(r) for r in run_infos)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 937, in <genexpr>
    runs.extend(self._get_run_from_info(r) for r in run_infos)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 684, in _get_run_from_info
    metrics = self._get_all_metrics(run_info)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 775, in _get_all_metrics
    metrics.append(self._get_metric_from_file(parent_path, metric_file))
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 754, in _get_metric_from_file
    metric_objs = [
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 755, in <listcomp>
    FileStore._get_metric_from_line(metric_name, line)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 782, in _get_metric_from_line
    raise MlflowException(
mlflow.exceptions.MlflowException: Metric 'train_mean_dice' is malformed; persisted metric data contained 1 fields. Expected 2 or 3 fields.

/root/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2024-08-20 22:54:16,066] [1538626] [ThreadPoolExecutor-1_0] [INFO] (monailabel.utils.async_tasks.utils:83) - Return code: 1