Using taichi kernel within pytorch dataloader multiprocessing #6725
Comments
@binarydaddy would you mind sharing a minimal repro for this? Thanks!
Same issue. I wrote a function with Taichi for data preprocessing in a script and call it in my custom dataset class derived from torch.utils.data.Dataset. When the main training process is sent to a node of a Slurm cluster with num_workers > 0 in my dataloader, training hangs forever, but with num_workers = 0 everything is fine. I also tried submitting a single Taichi program to the node as a unit test, and it ran normally. So I think the problem is not Taichi on the cluster, but Taichi inside the PyTorch dataloader.

Basically, I first have my preprocess.py:

```python
# preprocess.py
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

def cross_difference(img):
    H, W, C = img.shape
    img_out = np.copy(img)
    _cross_diff(img_out)
    return img_out

@ti.kernel
def _cross_diff(img: ti.types.ndarray()):
    H, W, C = img.shape
    # some processing code ...
```

Then I have my dataset defined by:

```python
# dataset.py
import numpy as np
from torch.utils.data import Dataset

from preprocess import cross_difference

class MyDataset(Dataset):
    def __getitem__(self, idx):
        img = ...  # load an image
        img = cross_difference(img)
        # some other processing ...
```

In my main script I use it like:

```python
# main.py
from torch import utils

my_dataset = MyDataset()
train_loader = utils.data.DataLoader(my_dataset, batch_size=4, shuffle=False, num_workers=4)
# then use train_loader for data loading
```

From the printed log I got:

Does anyone know the possible causes and solutions? Thanks!
Same issue here - is there any solution or workaround?
Same issue here.
Using the "spawn" start method might be required... and ti.init needs to be called in the workers as well. A sketch of what that could look like is below.
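A minimal sketch of that suggested workaround, assuming the MyDataset instance from the comment above (untested; the helper name taichi_worker_init is made up here):

```python
# Sketch of the suggested workaround (an assumption, not a confirmed fix):
# spawn fresh worker processes instead of forking them, and call ti.init()
# again inside each worker before any Taichi kernel runs.
import taichi as ti
from torch.utils.data import DataLoader

def taichi_worker_init(worker_id):
    # Each spawned worker is a fresh interpreter, so the Taichi runtime
    # must be initialized here rather than inherited from the parent process.
    ti.init(arch=ti.cpu)

train_loader = DataLoader(
    my_dataset,                       # e.g. the MyDataset instance from above
    batch_size=4,
    shuffle=False,
    num_workers=4,
    multiprocessing_context="spawn",  # use "spawn" instead of the default "fork"
    worker_init_fn=taichi_worker_init,
)
```

Note that with "spawn" the dataset object must be picklable, and modules like preprocess.py are re-imported in each worker, so any module-level ti.init() call would also run again there.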
Hi, I am having trouble using a Taichi kernel within PyTorch's dataloader.
Currently I have a separate class dedicated to image augmentation using a Taichi kernel (taichi.init() is called when this class is initialized), and PyTorch's Dataset class holds an instance of it and calls it for augmentation on every __getitem__ call.
The issue I have been experiencing is that when num_workers is 0, everything works fine; however, when I use num_workers > 0, the program hangs forever.
I believe this has something to do with calling taichi.init() within Python's multiprocessing workers, but I am not entirely sure how to solve this issue.
Any help would be much appreciated. Thank you.
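Since a minimal repro was requested above, here is a rough sketch of the kind of setup described (the class names, image shape, and toy kernel are all hypothetical stand-ins, not the actual augmentation code, and whether it reproduces the hang will depend on the platform and start method):

```python
# Hypothetical minimal repro of the described setup (toy kernel, made-up names).
import numpy as np
import taichi as ti
from torch.utils.data import Dataset, DataLoader

@ti.kernel
def _invert(img: ti.types.ndarray()):
    # Toy "augmentation": invert pixel values in place.
    for i, j, k in ti.ndrange(img.shape[0], img.shape[1], img.shape[2]):
        img[i, j, k] = 255 - img[i, j, k]

class TaichiAugmenter:
    def __init__(self):
        # ti.init() runs when the augmenter is constructed, i.e. in the
        # parent process, before the dataloader workers are created.
        ti.init(arch=ti.cpu)

    def __call__(self, img):
        out = np.copy(img)
        _invert(out)
        return out

class AugmentedDataset(Dataset):
    def __init__(self):
        self.augment = TaichiAugmenter()

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = np.random.randint(0, 256, (32, 32, 3)).astype(np.int32)
        return self.augment(img)

if __name__ == "__main__":
    # Works with num_workers=0; reportedly hangs with num_workers > 0
    # under the default "fork" start method.
    loader = DataLoader(AugmentedDataset(), batch_size=4, num_workers=4)
    for batch in loader:
        print(batch.shape)
```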