add utility to perform timings and some performance improvements #1237
Conversation
This is convenient but also means it is the same as HighResWallClockTimer
Thanks Kris. This looks super interesting. Unfortunately, I am currently busy relocating to Leuven and moving to a new house. @markus-jehl could you have a first look?
WARNING: This branch will be subject to rebases etc. and will be force-pushed occasionally to keep history clean.
TimedObject is not thread-safe, and timing results were incorrect. For now I just removed the calls. Work-around for UCL#1238.
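To illustrate the thread-safety problem (a minimal sketch of a shared timer object, not STIR's actual TimedObject code): if two threads share one timer, each `start()` overwrites the other's start point, so the accumulated time is wrong.

```cpp
#include <chrono>

// Minimal sketch of a non-thread-safe timer; NOT STIR's TimedObject.
class NaiveTimer
{
public:
  // If thread B calls start() between thread A's start() and stop(),
  // start_ is overwritten and A accumulates a bogus interval.
  void start() { start_ = clock::now(); }
  void stop() { total_ += clock::now() - start_; }
  double total_seconds() const { return std::chrono::duration<double>(total_).count(); }

private:
  using clock = std::chrono::high_resolution_clock;
  clock::time_point start_;
  clock::duration total_{0};
};
```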
The loop to construct xstart/end etc. is now multi-threaded (although a little bit uglier!). Testing shows a speed-up of about 2-3×. Using too many threads is counterproductive, so I limited it to 8 threads (not necessarily optimal!).
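A hedged sketch of that pattern, assuming OpenMP (the function and variable names here are illustrative, not the actual STIR code):

```cpp
#include <algorithm>
#include <omp.h>

// Hypothetical stand-in for the xstart/end construction loop described above.
void construct_xstart_xend(int num_segments)
{
  // Empirical cap of 8 threads, as more was counterproductive (not necessarily optimal).
  const int num_threads = std::min(omp_get_max_threads(), 8);
#pragma omp parallel for num_threads(num_threads) schedule(dynamic)
  for (int seg = 0; seg < num_segments; ++seg)
    {
      // ... fill the xstart/xend entries for this segment ...
    }
}
```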
Timers were stopped too early due to nested calls. This is now checked by asserts (by adding a HighResWallClockTimer), allowing me to catch these problems.
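A sketch of the assert-based checking (assuming a timer that tracks whether it is running; the actual HighResWallClockTimer interface may differ):

```cpp
#include <cassert>
#include <chrono>

// Illustrative wall-clock timer that asserts on mismatched start/stop,
// catching nested calls that stop a timer too early.
class CheckedWallClockTimer
{
public:
  void start()
  {
    assert(!running_ && "timer started while already running (nested call?)");
    running_ = true;
    start_ = clock::now();
  }
  void stop()
  {
    assert(running_ && "timer stopped while not running (stopped too early by a nested call?)");
    running_ = false;
    total_ += clock::now() - start_;
  }

private:
  using clock = std::chrono::high_resolution_clock;
  bool running_ = false;
  clock::time_point start_;
  clock::duration total_{0};
};
```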
Example timings that I'm currently getting on my desktop (AMD Ryzen 9 5900 12-Core Processor, 3001 MHz, 12 cores, 24 logical processors; 32GB RAM; GeForce RTX 3070; WSL2 with gcc 11.4.0 and nvcc 12.2) for a similar set-up to @gschramm https://arxiv.org/pdf/2212.12519v1.pdf, i.e. DMI 4-ring, span=1, but only 8 views, 215x215x71 image,
with the first column CPU time and the second wall-clock time, both in ms. For comparison, also with all 272 views:
Currently, #1236 doesn't make a lot of difference (PP_forward_file_first is slower, PP_back_file_first is faster; no idea why). Template files attached (I had to rename them to .txt for the GitHub upload). Running OSEM is still slow with subsets due to the GPU projector set-up. That needs some thought.
One factor slowing down the parallelproj projections is the call to `…`. In any case, loops in `…`
Very interesting comparison. Thanks a lot Kris! How do I interpret `…`?
I don't remember 100% why we added that. The projectors themselves shouldn't care about the FOV.
Sorry. (Note that it's the `…`.)
One good thing to add would be an OSEM update to the timings. This should be done, but it might be different from what @gschramm reports, as we normally use the "additive term" (I guess I could run without it).
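For context, the standard MLEM/OSEM update with an additive term $a$ (e.g. randoms plus scatter), with system matrix $A$ and measured data $y$ (notation mine, not from the PR):

$$
x^{\mathrm{new}} = \frac{x}{A^T \mathbf{1}} \odot A^T\!\left(\frac{y}{A x + a}\right)
$$

Running "without" the additive term corresponds to setting $a = 0$; OSEM applies the same update per subset of the data.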
Hi Kris, … Georg
Running without the additive term:
This is of course always going to be tricky. (Not sure if people ever report a "minimum wall-clock" time to avoid this.)
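A sketch of that "minimum wall-clock over repeats" idea (the helper name and signature are mine), which suppresses one-off effects such as first-run caching:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

// Run `work` several times and report the fastest wall-clock time in ms.
double min_wall_clock_ms(const std::function<void()>& work, int repeats = 5)
{
  double best_ms = std::numeric_limits<double>::max();
  for (int i = 0; i < repeats; ++i)
    {
      const auto t0 = std::chrono::high_resolution_clock::now();
      work();
      const auto t1 = std::chrono::high_resolution_clock::now();
      best_ms = std::min(best_ms, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
  return best_ms;
}
```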
Here are the timings on my machine (Intel Xeon CPU E5-2699 v3 @ 2.30GHz; 18 cores; 256GB RAM; NVIDIA Quadro M4000; WSL2 with clang 14.0.0-1ubuntu1 and nvcc V12.0.140) for the different templates. Unfortunately I still haven't found a solution for the extremely slow caching of the system matrix that happens in the first projection (most likely caused by WSL2/Docker memory allocation), and I don't have a GPU on the native Ubuntu system to compare timings there. Interestingly, though, it doesn't appear to be as bad for the DMI geometry!

DMI4_8v:
DMI4:
NeuroLF:
Thanks @markus-jehl. It seems that my system is about twice as fast as yours, also for parallelproj (could be that its performance is dominated by the CPU as well). Quite weird about your NeuroLF PMRT "first run" timings. Maybe you could compare memory usage. Aside from timing other things, I think we'll need some client code to be able to make some nice plots for different systems etc., as this will soon get unmanageable.
also added extra options for friendlier usage
This seems clean enough to merge now. We can always add some more later. I've added a log-likelihood run (the set-up is currently essentially the computation of the sensitivity; "grad_no_sens" is essentially the MLEM computation).
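My reading of the two timed quantities, assuming the usual Poisson log-likelihood with system matrix $A$, data $y$ and additive term $a$ (notation mine): the sensitivity is $A^T\mathbf{1}$, and "grad_no_sens" is the back-projection of the data ratio, which is the core of the MLEM computation:

$$
\nabla L(x) = A^T\!\left(\frac{y}{A x + a}\right) - A^T \mathbf{1}
$$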
Allow standardised timings. Could do with other tests of course. @gschramm @markus-jehl @NicoleJurjew want to have a look?