We decided to cap the number of tasks mapped in each mapping phase, like the old Parla runtime.
This is reasonable because, without the cap, the mapping phase did not finish until all tasks were mapped, which means scheduling did not overlap with task execution. So I pushed the threshold mechanism to the current main.
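For reference, the idea can be sketched roughly as below. This is only an illustration of the mechanism, not the actual Parla code; the names `MAPPING_THRESHOLD`, `run_mapping_phase`, and `map_task` are made up here:

```python
from collections import deque

# Illustrative cap on how many tasks the mapper handles per phase,
# so mapping can interleave with the execution of already-launched tasks.
MAPPING_THRESHOLD = 32

def run_mapping_phase(mappable: deque, map_task) -> int:
    """Map at most MAPPING_THRESHOLD tasks, then return control to the
    scheduler loop instead of draining the whole queue in one phase."""
    mapped = 0
    while mappable and mapped < MAPPING_THRESHOLD:
        task = mappable.popleft()
        map_task(task)  # assign the task to a device (stubbed out here)
        mapped += 1
    return mapped

# With 100 pending tasks, one phase maps only the first 32;
# the remaining 68 wait for later phases.
tasks = deque(range(100))
count = run_mapping_phase(tasks, map_task=lambda t: None)
```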
However, in independent experiments, I noticed that it actually degrades performance for small-granularity tasks.
The case where the difference becomes noticeable is 10MB data movement + 0.5ms compute + 1000 tasks. It previously took 1.79s, 0.96s, 0.71s, and 0.56s for 4, 3, 2, and 1 GPU, respectively, but now takes 2.9s, 1.8s, 1.4s, and 1.1s. I am still waiting for other configurations, but 0.5ms + 500 tasks did not change noticeably like this. My current hypothesis is that the task granularity is so small that each task finishes almost immediately after it is launched, while the scheduler takes longer than the task execution itself, which may degrade the total execution time.
I am not sure which option would be better, but I am trying to collect all the results to check whether there is any consistent change.