Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure there's a 16-byte aligned 128-bit int in the heap #50

Closed
wants to merge 1 commit into from

Conversation

dra27
Copy link
Contributor

@dra27 dra27 commented Oct 27, 2020

This sprang from the unexpected discovery in a library that Stdint.Uint32.max_int was -1! This PR consists of 3 sort of related pieces of work:

  • Long_val does not return long - it returns intnat. OCaml uses long to mean 64-bits (defined in its headers) but C does not. Uint32.max_int now returns 0xFFFFFFFF (and other fixes)
  • Windows with GCC 9.2 reveals alignment problems with 128-bit numbers which manifests itself by segfaulting on load as the top-level values in Stdint.Uint128 and Stdint.Int128 initialise (see Segfaults with 128-bit integers on Windows #45). I believe the problem is architecture-specific, just being triggered on Windows. It's impossible to store 128 bit values in the OCaml heap and expect them to be 16-byte aligned. Having tried various experiments and concluded that the fallbacks functions (for when compiler support isn't available) are simply too incomplete/buggy, I propose here a halfway house. Values are manipulated as __int128_t and __uint128_t, but are stored as a struct with two 64-bit members. There's obviously a performance impact - some functions may well be faster manipulating the struct directly, others (division, for example) will still be faster this way. This version at least works!
  • The custom_operations weren't setting up the compare_ext field and also weren't taking advantage of 4.08.0's fixed_length field. I'm not sure if changing fixed_length now means that the library cannot deserialize older values (@stedolan?)

Deals with 128-bit alignment. The value may move both if it is promoted to
the major heap and subsequently by the compactor and 16-byte alignment is
not guaranteed.

This approach allocated a 5 word custom value and stores the 128 bit number
in words 1+2 and 4+5. It will always be the case that one is aligned and one
is not.
@rgrinberg
Copy link
Collaborator

Looks good to me. Let's get a second opinion from @rixed

@rixed
Copy link
Contributor

rixed commented Oct 28, 2020

I do not understand what is meant by "Stdint.Uint32.max_int is -1". On a ARCH_64 I have Uint32.(max_int |> to_int) = 4294967295. On a 32bits machine I would indeed expect this to be -1.

Regarding the patch, and without spending too much thought on it, since the problem lies in the allocator and since there are only 2 calls to caml_alloc_custom for int128s, I would favor a fix in the allocator, for instance asking for 24 bytes instead of 16 and returning the address that's 16 bytes aligned. I think it would also lead to faster code, as allocation happens probably less often than the getter, and address arithmetic is faster than 128bits arithmetic.

I'd be happy to propose a patch implementing that so we can compare ((un?)fortunately I won't be able to test against the bug as I cannot repro it).

Other than that, don't you think the high and low members of int128_ocaml struct should be swapped in the hope the compiler might optimize the shift away on those architectures where it's safe to do so?

@rixed
Copy link
Contributor

rixed commented Oct 28, 2020

a fix in the allocator

Problem is, the allocator initializes the value header itself so we cannot move it to the next aligned address. So that offsetting would have to be done in Int128_val, which I believe would be quite more expensive.

@rixed
Copy link
Contributor

rixed commented Oct 28, 2020

This is what I was thinking about: #52
Honestly, I don't think it's any better, but I'm curious if there is any obvious drawback/advantage or any noticeable difference.

@dra27
Copy link
Contributor Author

dra27 commented Oct 28, 2020

I do not understand what is meant by "Stdint.Uint32.max_int is -1". On a ARCH_64 I have Uint32.(max_int |> to_int) = 4294967295. On a 32bits machine I would indeed expect this to be -1.

On a 64-bit Windows system, 0xffffffff gets passed to int_of_uint32 which contains (long)Uint32_val(v). long on Windows is 32-bit (even on a 64-bit system) and so 0xffffffff gets turned into -1 (i.e. 0x7fffffffffffffff).

@dra27
Copy link
Contributor Author

dra27 commented Oct 28, 2020

This is what I was thinking about: #52
Honestly, I don't think it's any better, but I'm curious if there is any obvious drawback/advantage or any noticeable difference.

This doesn't work. The alignment may change both when the value is promoted to the major heap and if/when the compactor runs.

@dra27
Copy link
Contributor Author

dra27 commented Oct 28, 2020

It is not possible to store a 16-byte aligned value in the heap but it would be possible to store 2 16-byte values where one of them is aligned. That may work, given that they're immutable - at the cost of more expensive creation and of course using 40 bytes per int128 instead of 16!

@dra27 dra27 changed the title 128-bit alignment, Windows and custom operations Ensure there's a 16-byte aligned 128-bit int in the heap Oct 28, 2020
@dra27
Copy link
Contributor Author

dra27 commented Oct 28, 2020

I've split the other two parts off, since they should be entirely "non-controversial"! Pushed now is a revised implementation of the suggestion above. The original struct-based version is in #55. There may be a good Hacker's Delight way of getting 0 from 0 and 3 from 8 for calculating the offset!

I can confirm that both approaches work on Windows with the same 11 tests failing as fail elsewhere. The approach here is certainly slower to create 128 bit numbers but should be faster for reading them than the struct-based approach. Both need benchmarking. My personal intuition is that the overhead of storing the int128 twice and the complicated pointer manipulation to read the aligned version will outweigh the saving over the struct version, but I have no evidence!

At the start of this, I was faced with a library whose initialisations were segfaulting - both of these fix it and allow me to use the 32-bit arithmetic I was actually after! I actually don't have any interest in the 128-bit stuff at the moment, so I'm nearing the end of my time on this, if that's OK 🙂 I am very interested in having a released version of this library which works on Windows (slow is better than not-at-all).

@msprotz - do you have anything which can provide any kind of benchmark for these two? If you're trying it on Windows, I'd pull #53 in as well!

@rixed
Copy link
Contributor

rixed commented Oct 28, 2020

This doesn't work. The alignment may change both when the value is promoted to the major heap and if/when the compactor runs.

Very true! My erroneous mental model told me the custom data was allocated in the heap outside of the custom block but it is actually embedded within the custom block so it is indeed subject to relocation with compaction.
This makes this alternative approach rather awful.
Thank you for fixing my bad assumption.

Unless we do run some bench and there is a large (and unexpected) difference in perf, my favorite implementation is therefore #55.

@msprotz
Copy link

msprotz commented Oct 28, 2020

@dra27 unfortunately I don't have anything to benchmark, I'm at the same stage as you: as long as we get something that compiles and doesn't require our setup script to fixup the opam-installed package, I'm a happier man

@msprotz
Copy link

msprotz commented Oct 28, 2020

Concerning two versions vs. struct: I think your struct-based patch is much more readable and easier to maintain in the long run, so I would advocate for merging #53, #55 as soon as possible, and issuing a new release. That would solve everyone's issues quite nicely I think. Thanks for digging into this @dra27! Much appreciated.

@rgrinberg
Copy link
Collaborator

Do we still need this PR now that #53 and #55 are merged?

@dra27
Copy link
Contributor Author

dra27 commented Oct 28, 2020

It’s an alternative which someone might like to pick up to benchmark... up to you if it wants keeping open or not!

@dra27 dra27 closed this Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants