
How to modify baseline architectures #197

Closed · jainspoornima opened this issue Apr 15, 2022 · 12 comments · Fixed by #199

Labels: help wanted (Extra attention is needed)

Comments

@jainspoornima commented Apr 15, 2022

I just wanted to ask how we can modify the baseline architectures to include our own modifications before training them. Thanks.

jainspoornima added the `enhancement` (Improvement of existing feature) label Apr 15, 2022
@georgeyiasemis (Contributor)
Hi @jainspoornima. Could you elaborate a bit on your question?

@jainspoornima (Author)
Hi @georgeyiasemis. I want to modify the LPDNet architecture by replacing its convolution blocks with a different block for my research experiments, but I could not find the model definition in the repo. Is the code for the model implementation open-sourced so that we can experiment with it? Thank you.

@georgeyiasemis (Contributor) commented Apr 15, 2022

> Hi @georgeyiasemis. I wished to modify the LPDNet architecture by replacing its convolution blocks with a different block for my research experiments, but I could not find the model definition in the repo. Thus I wished to ask if the code for model implementation is open-sourced so that we can experiment with it. Thank you.

You can modify anything in the code. For the models specifically, refer to `direct/nn/<model_name>/<model_name>.py`; for example, for LPDNet you can modify `direct/nn/lpd/lpd.py`. Note that for any parameters you add or modify, you also need to update the model configuration (i.e. `LPDNetConfig` in `direct/nn/lpd/config.py`).

If you want to modify any code, it is best to install direct in dev mode using `python3 -m pip install -e ".[dev]"` instead of `python3 setup.py install`.

I hope this helps.
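As an illustration of the pattern described above (the field names here are hypothetical, not the actual LPDNet fields), adding a new block type to a model usually means adding a matching field to its config dataclass, in the style of `direct/nn/lpd/config.py`, so it can be set from the YAML file:

```python
from dataclasses import dataclass

# Hypothetical sketch of extending a model config; the fields below are
# illustrative only and do not mirror the real LPDNetConfig.
@dataclass
class LPDNetConfig:
    num_iter: int = 10          # illustrative existing hyperparameter
    num_filters: int = 64      # illustrative existing hyperparameter
    block_type: str = "conv"   # new parameter selecting the replacement block

# The new field can then be overridden, e.g. from a YAML config.
cfg = LPDNetConfig(block_type="residual")
print(cfg.block_type)  # -> residual
```

The model's `__init__` would then read `cfg.block_type` to decide which block to construct.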

georgeyiasemis added the `help wanted` (Extra attention is needed) label and removed the `enhancement` (Improvement of existing feature) label Apr 15, 2022
@jainspoornima (Author)

Hi @georgeyiasemis, I executed the following command inside the direct/direct folder to train LPDNet on the Calgary-Campinas dataset:

```
!python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
                  /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
                  LPD_Net_Real \
                  --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
                  --num-gpus 1
```

But it just runs for a few seconds, saves no logs in the LPD_Net_Real directory, and apparently does no training.

(I first tried `direct train <data_root>/Train/ <data_root>/Val/ <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> [--other-flags]` after installing direct through conda in Google Colab, since Docker is not supported by Colab. That gave the error `direct: command not found`, so I resorted to running the training file directly with the command above.)

@georgeyiasemis (Contributor)

> Hi @georgeyiasemis, I executed the following command inside direct/direct folder to train LPDNet on Calgary Campinas Dataset:
>
> !python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
>                   /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
>                   LPD_Net_Real \
>                     --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
>                     --num-gpus 1
>
> But it just runs for a few seconds and doesn't save any logs in LPD_Net_Real directory, or apparently do any training.
>
> (I executed the command direct train <data_root>/Train/ <data_root>/Val/ <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> [--other-flags] after installing 'direct' through conda in Google Colab, as Docker is not supported by Colab. But it gives the error 'direct is not a recognized bash command'. So I resorted to executing the training file through the above command).

Hi @jainspoornima. Direct is supposed to run on GPU nodes and was not designed or tested on Colab. I am not sure whether Colab is compatible with the torch.distributed module.

I will need more context to be able to help you. Is there some output you can show? Is it possible that Colab runs out of memory?

@jainspoornima (Author) commented Apr 16, 2022

Hi, I executed the following commands in Colab:

```
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/My Drive/Calgary_PDNet_Experiments
!git clone https://github.com/NKI-AI/direct.git
%cd direct
!python3 -m pip install -e ".[dev]"
```

Up to here everything was fine. Then I executed this command (I have saved the 12-channel data of the Calgary-Campinas dataset in the '/content/drive/MyDrive/Calgary_PDNet_Experiments/Data/' folder):

```
!direct train /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
              /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
              LPD_Net_Real \
              --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
              --num-gpus 1
```

which gave the following error:

```
/bin/bash: direct: command not found
```

So I executed these commands:

```
%cd direct
!python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
                  /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
                  LPD_Net_Real \
                  --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
                  --num-gpus 1
```

This command completes execution in 3-4 seconds with no error message and without RAM usage increasing at all.

I should also add that I am using Colab Pro, which offers a single 16 GB GPU, so I hoped the code might run on it. Maybe not, but I have not hit a resource-exhausted error yet: there is just no training, no error, and no logs saved in the LPD_Net_Real directory.

@jainspoornima (Author)

Hi @georgeyiasemis, could you tell me whether I can continue training on Colab? I have easy access to Colab, but for a single physical 24/32 GB GPU machine I would need to request access permissions, so it would help to know. Thanks.

@georgeyiasemis (Contributor)

@jainspoornima unfortunately I cannot provide support without any error output. I will let you know if I have any more insight about Colab.

@georgeyiasemis (Contributor)

Hi @jainspoornima. The following are directions for setting up DIRECT on Colab:

1. First mount your Google Drive in Colab, create a directory named e.g. DIRECT, and cd there:

```
%cd /content/drive/MyDrive/DIRECT/
```

2. Clone the repo:

```
!git clone https://github.com/NKI-AI/direct.git
```

3. Copy, paste, and run the following. This is needed to install Python 3.8 (somehow only older versions are available in Colab):

```
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_4.8.2-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!conda install -q -y jupyter
!conda install -q -y google-colab -c conda-forge
!python -m ipykernel install --name "py38" --user
```

4. Run the following:

```
!pip3 uninstall torch
!pip3 uninstall torchvision
!pip3 uninstall torchaudio
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip3 install omegaconf==2.1.1
```

5. Navigate to the repo:

```
%cd direct/
```

6. Install the package.
7. Run experiments.
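The final "install package" and "run experiments" steps are not spelled out above; based on the commands earlier in this thread, they would look roughly like this in a Colab cell (the data, experiment, and config paths are the ones from the earlier comments; substitute your own):

```shell
# Install the package in dev mode (run from inside the cloned direct/ directory).
!python3 -m pip install -e ".[dev]"

# Train LPDNet; paths below are from the earlier comments and are placeholders.
!direct train /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
              /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
              LPD_Net_Real \
              --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
              --num-gpus 1
```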

@georgeyiasemis (Contributor)

@jainspoornima if the above worked for you, I will go ahead and close the issue. Let me know.

@jainspoornima (Author) commented Apr 21, 2022

Hi @georgeyiasemis, that worked in Colab, thanks a lot. I trained LPDNet on the Calgary-Campinas multicoil dataset; it gave a CUDA out-of-memory error for a batch size of 3, but fit with batch size 1. I am not sure if this is the right place to ask, but the loss did not seem to decrease monotonically during training:
[screenshot: training loss curve]

@georgeyiasemis (Contributor)

@jainspoornima Glad to hear it worked.
Colab GPUs are generally not great memory-wise.
As for the loss not dropping monotonically, that makes sense given batch_size=1. Also, if you have crop_outer_slices enabled for the training datasets, the loss will likely decrease over time but be quite noisy. You should check whether the validation metrics improve, though; that is a better indication that everything works well.

@georgeyiasemis georgeyiasemis linked a pull request Apr 26, 2022 that will close this issue