
Using taichi kernel within pytorch dataloader multiprocessing #6725

Open
binarydaddy opened this issue Nov 24, 2022 · 5 comments
@binarydaddy

Hi, I am having trouble using a Taichi kernel within PyTorch's DataLoader.
Currently I have a separate class dedicated to image augmentation with a Taichi kernel (taichi.init() is called when this class is initialized), and PyTorch's Dataset class holds an instance of it and calls the augmentation on every __getitem__ call.
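Roughly, the setup looks like this (a simplified sketch; the class and method names are placeholders, not the actual code):

import numpy as np
import taichi as ti
from torch.utils.data import Dataset


@ti.kernel
def _augment_kernel(img: ti.types.ndarray()):
    # augmentation logic elided ...
    pass


class TaichiAugment:
    def __init__(self):
        # Taichi is initialized when the augmenter is constructed
        ti.init(arch=ti.cpu)

    def augment(self, img):
        out = np.copy(img)
        _augment_kernel(out)
        return out


class MyDataset(Dataset):
    def __init__(self):
        self.augmenter = TaichiAugment()

    def __getitem__(self, idx):
        img = ...  # load an image
        return self.augmenter.augment(img)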

The issue I have been experiencing is that when num_workers is 0, everything works fine. However, when I use num_workers > 0, the program hangs forever.

I believe this has something to do with calling taichi.init() within Python's multiprocessing workers, but I am not entirely sure how to solve this issue.

Any help would be much appreciated. Thank you.

@ailzhang
Contributor

@binarydaddy would you mind sharing a minimal repro for this? Thanks!

@ailzhang ailzhang moved this from Untriaged to Todo in Taichi Lang Nov 25, 2022
@ailzhang ailzhang self-assigned this Nov 25, 2022
@ailzhang ailzhang moved this from Todo to Backlog in Taichi Lang Nov 28, 2022
@li-yanhao

li-yanhao commented Jan 12, 2023

Same issue. I wrote a Taichi function for data preprocessing in a script and call it in my custom dataset class derived from torch.utils.data.Dataset. When the main training process is sent to a node of a Slurm cluster with num_workers > 0 in my DataLoader, the training process hangs forever. But with num_workers = 0 everything is fine.

I also tried submitting a standalone Taichi program to the node as a unit test, and it worked normally. So I think the problem is not Taichi on the cluster, but Taichi inside the PyTorch DataLoader.

Basically, I first have my preprocess.py:

# preprocess.py

import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)  # note: runs at import time

def cross_difference(img):
    H, W, C = img.shape

    img_out = np.copy(img)
    _cross_diff(img_out)
    return img_out


@ti.kernel
def _cross_diff(img: ti.types.ndarray()):
    H, W, C = img.shape
    # some processing code ...

Then I have my dataset defined by:

# dataset.py

import numpy as np
from torch.utils.data import Dataset
from preprocess import cross_difference
 
class MyDataset(Dataset):
    def __getitem__(self, idx):
        img = ... # load an image
        img = cross_difference(img)
        # some other processing ...

In my main script I used it like:

# main.py

from torch.utils.data import DataLoader
from dataset import MyDataset

my_dataset = MyDataset()
train_loader = DataLoader(my_dataset, batch_size=4, shuffle=False, num_workers=4)
# then use train_loader for data loading

From the printed log I got:
[Taichi] version 1.3.0, llvm 16.0.0git, commit 0f25b95, linux, python 3.8.13
[Taichi] Starting on arch=x64
then the training process hangs. But with num_workers=0 everything is normal.

Does anyone know the possible causes and solutions? Thanks!

@pableeto

pableeto commented Aug 3, 2023

Same issue here - is there any solution or workaround?

@eleboss

eleboss commented Jan 14, 2024

Same issue here

@bobcao3
Collaborator

bobcao3 commented Jan 14, 2024

Using the "spawn" start method might be required... and ti.init needs to be called in the workers as well.
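A minimal sketch of that workaround, assuming preprocess.py is changed so that ti.init() is no longer called at import time (worker_init_fn and multiprocessing_context are standard DataLoader parameters; everything else here is illustrative):

# main.py (workaround sketch)

import taichi as ti
from torch.utils.data import DataLoader

from dataset import MyDataset


def init_taichi_worker(worker_id):
    # Each spawned worker is a fresh process, so it needs its own Taichi runtime.
    ti.init(arch=ti.cpu)


if __name__ == "__main__":
    # Initialize Taichi in the main process too, if it also runs kernels.
    ti.init(arch=ti.cpu)

    train_loader = DataLoader(
        MyDataset(),
        batch_size=4,
        shuffle=False,
        num_workers=4,
        multiprocessing_context="spawn",    # avoid forking a process that already holds a Taichi runtime
        worker_init_fn=init_taichi_worker,  # call ti.init in every worker
    )

    for batch in train_loader:
        pass  # training loop

(With "spawn", worker processes re-import the dataset module, so a module-level ti.init() in preprocess.py would also run once per worker; the explicit worker_init_fn just makes this clearer.)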
