
How to modify baseline architectures #197

Closed · jainspoornima opened this issue Apr 15, 2022 · 12 comments · Fixed by #199

Labels: help wanted (Extra attention is needed)

Comments

@jainspoornima commented Apr 15, 2022

I just wanted to ask how we can modify the baseline architectures to include our own modifications before training them. Thanks.

jainspoornima added the `enhancement` (Improvement of existing feature) label Apr 15, 2022
@georgeyiasemis (Contributor)
Hi @jainspoornima. Could you elaborate a bit on your question?

@jainspoornima (Author)
Hi @georgeyiasemis. I want to modify the LPDNet architecture by replacing its convolution blocks with a different block for my research experiments, but I could not find the model definition in the repo. Is the code for the model implementation open-sourced so that we can experiment with it? Thank you.

@georgeyiasemis (Contributor) commented Apr 15, 2022

> Hi @georgeyiasemis. I wished to modify the LPDNet architecture by replacing its convolution blocks with a different block for my research experiments, but I could not find the model definition in the repo. Thus I wished to ask if the code for model implementation is open-sourced so that we can experiment with it. Thank you.

You can modify anything in the code. For the models specifically, refer to `direct/nn/<model_name>/<model_name>.py`; for example, for LPDNet you can modify `direct/nn/lpd/lpd.py`. Note that for any parameters you add or modify, you also need to update the model configuration (i.e. `LPDNetConfig` in `direct/nn/lpd/config.py`).

If you want to modify any code, it is best to install direct in dev mode using `python3 -m pip install -e ".[dev]"` instead of `python3 setup.py install`.

I hope this helps.
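As an illustration of the pattern described above (the field names here are hypothetical, not the actual LPDNet fields), adding a new block type to a model usually means adding a matching field to its config dataclass, in the style of `direct/nn/lpd/config.py`, so it can be set from the YAML file:

```python
from dataclasses import dataclass

# Hypothetical sketch of extending a model config; the fields below are
# illustrative only and do not mirror the real LPDNetConfig.
@dataclass
class LPDNetConfig:
    num_iter: int = 10          # illustrative existing hyperparameter
    num_filters: int = 64      # illustrative existing hyperparameter
    block_type: str = "conv"   # new parameter selecting the replacement block

# The new field can then be overridden, e.g. from a YAML config.
cfg = LPDNetConfig(block_type="residual")
print(cfg.block_type)  # -> residual
```

The model's `__init__` would then read `cfg.block_type` to decide which block to construct.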

georgeyiasemis added the `help wanted` (Extra attention is needed) label and removed the `enhancement` (Improvement of existing feature) label Apr 15, 2022
@jainspoornima (Author)

Hi @georgeyiasemis, I executed the following command inside the direct/direct folder to train LPDNet on the Calgary-Campinas dataset:

```
!python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
                  /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
                  LPD_Net_Real \
                  --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
                  --num-gpus 1
```

But it just runs for a few seconds, saves no logs in the LPD_Net_Real directory, and apparently does no training.

(I first tried `direct train <data_root>/Train/ <data_root>/Val/ <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> [--other-flags]` after installing direct through conda in Google Colab, since Docker is not supported by Colab. That gave the error `direct: command not found`, so I resorted to running the training file directly with the command above.)

@georgeyiasemis (Contributor)

> Hi @georgeyiasemis, I executed the following command inside direct/direct folder to train LPDNet on Calgary Campinas Dataset:
>
> !python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
>                   /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
>                   LPD_Net_Real \
>                     --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
>                     --num-gpus 1
>
> But it just runs for a few seconds and doesn't save any logs in LPD_Net_Real directory, or apparently do any training.
>
> (I executed the command direct train <data_root>/Train/ <data_root>/Val/ <experiment_directory> --num-gpus <number_of_gpus> --cfg <path_or_url_to_yaml_file> [--other-flags] after installing 'direct' through conda in Google Colab, as Docker is not supported by Colab. But it gives the error 'direct is not a recognized bash command'. So I resorted to executing the training file through the above command).

Hi @jainspoornima. Direct is supposed to run on GPU nodes and was not designed or tested on Colab. I am not sure whether Colab is compatible with the torch.distributed module.

I will need more context to be able to help you. Is there some output you can show? Is it possible that Colab runs out of memory?

@jainspoornima (Author) commented Apr 16, 2022

Hi, I executed the following commands in Colab:

```
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/My Drive/Calgary_PDNet_Experiments
!git clone https://github.com/NKI-AI/direct.git
%cd direct
!python3 -m pip install -e ".[dev]"
```

Up to here everything was fine. Then I executed this command (I have saved the 12-channel data of the Calgary-Campinas dataset in the '/content/drive/MyDrive/Calgary_PDNet_Experiments/Data/' folder):

```
!direct train /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
              /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
              LPD_Net_Real \
              --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
              --num-gpus 1
```

which gave the following error:

```
/bin/bash: direct: command not found
```

So I executed these commands:

```
%cd direct
!python3 train.py /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
                  /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
                  LPD_Net_Real \
                  --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
                  --num-gpus 1
```

This command completes execution in 3-4 seconds with no error message and without RAM usage increasing at all.

I should also add that I am using Colab Pro, which offers a single 16 GB GPU, so I hoped the code might run on it. Maybe not, but I have not hit a resource-exhausted error yet: there is just no training, no error, and no logs saved in the LPD_Net_Real directory.

@jainspoornima (Author)

Hi @georgeyiasemis, could you tell me whether I can continue training on Colab? I have easy access to Colab, but for a single physical 24/32 GB GPU machine I would need to request access permissions, so it would help to know. Thanks.

@georgeyiasemis (Contributor)

@jainspoornima unfortunately I cannot provide support without any error output. I will let you know if I have any more insight about Colab.

@georgeyiasemis (Contributor)

Hi @jainspoornima. The following are directions for setting up DIRECT on Colab:

1. First mount your Google Drive in Colab, create a directory named e.g. DIRECT, and cd there:

```
%cd /content/drive/MyDrive/DIRECT/
```

2. Clone the repo:

```
!git clone https://github.com/NKI-AI/direct.git
```

3. Copy, paste, and run the following. This is needed to install Python 3.8 (somehow only older versions are available in Colab):

```
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_4.8.2-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!conda install -q -y jupyter
!conda install -q -y google-colab -c conda-forge
!python -m ipykernel install --name "py38" --user
```

4. Run the following:

```
!pip3 uninstall torch
!pip3 uninstall torchvision
!pip3 uninstall torchaudio
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip3 install omegaconf==2.1.1
```

5. Navigate to the repo:

```
%cd direct/
```

6. Install the package.
7. Run experiments.
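The final "install package" and "run experiments" steps are not spelled out above; based on the commands earlier in this thread, they would look roughly like this in a Colab cell (the data, experiment, and config paths are the ones from the earlier comments; substitute your own):

```shell
# Install the package in dev mode (run from inside the cloned direct/ directory).
!python3 -m pip install -e ".[dev]"

# Train LPDNet; paths below are from the earlier comments and are placeholders.
!direct train /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Train/ \
              /content/drive/MyDrive/Calgary_PDNet_Experiments/Data/Val/ \
              LPD_Net_Real \
              --cfg /content/drive/MyDrive/Calgary_PDNet_Experiments/direct/projects/calgary_campinas/configs/base_lpd.yaml \
              --num-gpus 1
```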

@georgeyiasemis (Contributor)

@jainspoornima if the above worked for you, I will go ahead and close the issue. Let me know.

@jainspoornima (Author) commented Apr 21, 2022

Hi @georgeyiasemis, that worked in Colab, thanks a lot. I trained LPDNet on the Calgary-Campinas multicoil dataset; it gave a CUDA out-of-memory error for a batch size of 3, but fit with batch size 1. I am not sure if this is the right place to ask, but the loss did not seem to decrease monotonically during training:
[screenshot: training loss curve]

@georgeyiasemis (Contributor)

@jainspoornima Glad to hear it worked.
Colab GPUs are generally not great memory-wise.
As for the loss not dropping monotonically, that makes sense given batch_size=1. Also, if you have crop_outer_slices enabled for the training datasets, the loss will likely decrease over time but be quite noisy. You should check whether the validation metrics improve, though; that is a better indication that everything works well.

@georgeyiasemis georgeyiasemis linked a pull request Apr 26, 2022 that will close this issue