Replies: 3 comments
-
Hi Miguel, are you observing this 1e-9 discrepancy with double precision data or with single precision? Cheers,
-
Hi @mreineck, I had the larger arrays in single precision to save memory. I put everything in double precision and I still get the same problem: about 0.03% of the pixels show a difference larger than 1e-9 between the output and input maps. Furthermore, this is the output of np.testing.assert_almost_equal(input_maps, output_maps, decimal=9, verbose=True):
So, in (at least) one of the pixels, this is as bad as 1e-4.
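For reference, a minimal sketch of the kind of check being run, assuming `input_maps` and `output_maps` are the same-shaped NumPy float arrays discussed above:

```python
import numpy as np

# input_maps and output_maps are assumed to be same-shaped float64 arrays.
diff = np.abs(output_maps - input_maps)
print(f"worst pixel difference: {diff.max():.3e}")
print(f"fraction above 1e-9:    {np.mean(diff > 1e-9):.5%}")

# Equivalent assertion: fails if any pixel differs by more than ~1.5e-9.
np.testing.assert_almost_equal(input_maps, output_maps, decimal=9, verbose=True)
```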
-
Hi @mj-gomes, good catch! Yes, instruction reordering might be the culprit here. Could you try using …? Otherwise, we could investigate the machine code produced by Numba using …
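As an illustration of the kind of inspection meant here (with a toy stand-in kernel rather than the real hwp_sys one), Numba dispatchers expose `parallel_diagnostics()` and `inspect_asm()`:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def accumulate(values):
    # Toy stand-in for the hwp_sys kernels: a parallel reduction.
    total = 0.0
    for i in prange(values.size):
        total += values[i]
    return total

x = np.random.default_rng(0).standard_normal(1_000_000)
accumulate(x)  # trigger compilation for float64[:]

# Report how Numba transformed the loop (reductions, hoisting, fusion, ...).
accumulate.parallel_diagnostics(level=4)

# Dump the generated assembly for each compiled signature.
for sig, asm in accumulate.inspect_asm().items():
    print(sig)
    print(asm[:2000])  # the full listing is very long
```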
-
Hello everyone.
I came across an interesting problem with Numba parallelization and its influence on the precision of the computations, at least for the TOD and mapmaking operations in hwp_sys.py.
Context: I was running some tests for the hwp_sys.py code and the tests kept failing, even though I was only testing the ideal case, where I knew (from looking at the maps) that it should work: the output maps should be equal to the input maps. The problem was that a fraction of the pixels of (output maps - input maps) fell outside the 1e-9 tolerance I had set to allow for small numerical approximations. This fraction became smaller when increasing the nside.
Origin: I found out that this comes from the @njit(parallel=True) decorator, specifically the "parallel=True" part. Because addition and multiplication of floating-point numbers are not associative, and Numba in parallel mode apparently changes the order of the operations, there are roundings which accumulate and end up being fairly important. At least, this is the best explanation I could find. And indeed, with parallel=False I get 100% equality within the 1e-9 tolerance.
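A minimal sketch of this effect (not the hwp_sys code itself): the same reduction, jitted with and without parallel=True, combines the partial sums in a different order and therefore gives slightly different results.

```python
import numpy as np
from numba import njit, prange

@njit
def sum_serial(values):
    total = 0.0
    for i in range(values.size):
        total += values[i]
    return total

@njit(parallel=True)
def sum_parallel(values):
    # prange turns this into a reduction over per-thread partial sums,
    # so the additions happen in a different order than in the serial loop.
    total = 0.0
    for i in prange(values.size):
        total += values[i]
    return total

x = np.random.default_rng(42).standard_normal(10_000_000)
print(sum_serial(x) - sum_parallel(x))  # usually small but non-zero
```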
For nside 256, for example, using 2 detectors in a one-year simulation, about 0.03% of the pixels were above the 1e-9 tolerance. This is a very small percentage of the pixels, but it should increase with the number of detectors. For now I wrote a test method that ignores this, but I was talking with @hivon about it and it would be good to know if anyone has already run into this problem and has a good solution. Setting parallel=False is, in my opinion, not an option, because this parallelization speeds things up significantly (I can run some tests to obtain specific values for this speedup if needed).
I don't know much about Numba, and from what I have seen there is no straightforward way to force it to keep the order of operations. Should we restructure the methods to force the order of operations (see the sketch right after this paragraph)? Or should we consider other options, like parallelizing these methods with MPI instead of Numba? I suppose, however, that Numba is much faster than MPI for these NumPy operations in loops, since I understand it is optimized for that.
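One possible restructuring, only a sketch and not tested against the actual hwp_sys kernels: compute per-chunk partial sums in parallel, then combine them sequentially in a fixed order. The result is then reproducible and independent of the number of threads, although it will not be bit-identical to the fully serial loop.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_fixed_order(values, n_chunks=64):
    # Each chunk's partial sum is computed independently (in parallel),
    # but the partials are then combined sequentially in chunk order,
    # so the result does not depend on thread scheduling.
    partials = np.zeros(n_chunks)
    chunk = (values.size + n_chunks - 1) // n_chunks
    for c in prange(n_chunks):
        start = c * chunk
        stop = min(start + chunk, values.size)
        s = 0.0
        for i in range(start, stop):
            s += values[i]
        partials[c] = s
    total = 0.0
    for c in range(n_chunks):
        total += partials[c]
    return total
```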
Thank you in advance for the discussion,
Miguel