Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Varbinary bit manipulations #23682

Open
brunokim opened this issue Oct 4, 2024 · 5 comments
Open

Varbinary bit manipulations #23682

brunokim opened this issue Oct 4, 2024 · 5 comments

Comments

@brunokim
Copy link

brunokim commented Oct 4, 2024

tl;dr: can bitwise_* functions be overloaded to varbinary types?

I wanted to generate an UUIDv4 from the first 128 bits of the output of a SHA256 hash. This requires some bit manipulations according to RFC4122:

  • set bits [48, 52) to "0100" (version 4 - random)
  • set bits [64, 66) to "10" (variant 2)

To accomplish that with Trino, I managed to do the following:

  1. Split the varbinary into 4 bigints, each containing only 32 bits of the input and left-padded with 32 zeros.
  2. Use bitwise_and and bitwise_or to reset the affected bits, and then set the desired values.
  3. Extract the lower half of bigints and concat them to recreate a 128-bit varbinary.

The reason to use 32 bits within a 64-bit integer is to be able to set the MSB, perhaps because hexadecimals are always considered positive? In the commented code below is my attempt to use full 64 bits, which fails because the MSB is set on the integer literals.

WITH
t1 AS (
    SELECT sha256(to_utf8('trino 🐰')) AS hash
),
t2 AS (
    SELECT
        hash
        -- Using int32
        , from_big_endian_64(X'0000 0000' || substr(hash, 1, 4)) AS octets_0_to_3
        , from_big_endian_64(X'0000 0000' || substr(hash, 5, 4)) AS octets_4_to_7
        , from_big_endian_64(X'0000 0000' || substr(hash, 9, 4)) AS octets_8_to_11
        , from_big_endian_64(X'0000 0000' || substr(hash, 13, 4)) AS octets_12_to_15
        -- Using int64
        , from_big_endian_64(substr(hash, 1, 8)) AS octets_0_to_7
        , from_big_endian_64(substr(hash, 9, 8)) AS octets_8_to_15
    FROM t1
),
t3 AS (
    SELECT
        hash
        -- Using int32
        , octets_0_to_3
        , bitwise_or(bitwise_and(octets_4_to_7, 0xFFFF_0FFF), 0x0000_4000) AS octets_4_to_7
        , bitwise_or(bitwise_and(octets_8_to_11, 0x3FFF_FFFF), 0x8000_0000) AS octets_8_to_11
        , octets_12_to_15
        -- Using int64
        --, bitwise_or(bitwise_and(octets_0_to_7, 0xFFFF_FFFF_FFFF_0FFF), 0x0000_0000_0000_4000) AS octets_0_to_7
        --, bitwise_or(bitwise_and(octets_8_to_15, 0x3FFF_FFFF_FFFF_FFFF), 0x8000_0000_0000_0000) AS octets_8_to_15
    FROM t2
),
t4 AS (
    SELECT
        hash
        -- Using int32
        , substr(to_big_endian_64(octets_0_to_3), 5, 4)
        || substr(to_big_endian_64(octets_4_to_7), 5, 4)
        || substr(to_big_endian_64(octets_8_to_11), 5, 4)
        || substr(to_big_endian_64(octets_12_to_15), 5, 4)
        AS uuid_octets_1
        -- Using int64
        --, to_big_endian_64(octets_0_to_7) || to_big_endian_64(octets_8_to_15)
        --AS uuid_octets_2
    FROM t3
),
t5 AS (
    SELECT
        to_hex(hash) AS hash_hex
        , CAST(CAST(uuid_octets_1 AS uuid) AS varchar) AS uuidv4_1
        --, CAST(CAST(uuid_octets_2 AS uuid) AS varchar) AS uuidv4_2
    FROM t4
)
SELECT * FROM t5;

This is the result

hash_hex: 13C1C0C6 818A 593E 5016 F2A4B463937B4592500D83F3F3B3134BF0CCA501F0B2
uuidv4_1: 13c1c0c6-818a-493e-9016-f2a4b463937b

I wish bitwise_and and bitwise_or were available to varbinary types, so this could have been only

bitwise_or(
  bitwise_and(
    substr(sha256(to_utf8('trino 🐰')), 1, 16),
    X'FFFF FFFF FFFF 0FFF 3FFF FFFF FFFF FFFF'),
  X'0000 0000 0000 4000 8000 0000 0000 0000')

Some more feature requests I surfaced from this exercise:

  • It was not clear from documentation that I could cast from varbinary to uuid, as well as the format used from and to varchar. BigQuery provides an exhaustive list of available casts.
  • varbinary literals could accept _ as a separator, to mirror numeric literals. It was just a guess that space was accepted within the string, because it's not included in the docs
@mosabua
Copy link
Member

mosabua commented Oct 8, 2024

It would be great if you can send pull requests for the documentation improvements.

With regards to the function I think it would be best for @martint to chime in and confirm if this would be okay to implement. If yes, then you could also go ahead and send a PR for that ideally.

@brunokim
Copy link
Author

brunokim commented Oct 9, 2024

I'd love to contribute! I may be able to work on this in November. Can you point me where are the docs and bitwise implementation, in the meantime?

@mosabua
Copy link
Member

mosabua commented Oct 9, 2024

Look at BitwiseFunctions.java in io.trino.operator.scalar

Also https://trino.io/development/process

@wendigo
Copy link
Contributor

wendigo commented Oct 10, 2024

The open question is how these functions are meant to work if the both varbinary have different lengths, i.e. 000011 & 0010100000

@martint
Copy link
Member

martint commented Oct 10, 2024

We should add support for varbinary, but I would start conservatively and disallow varbinaries of different lengths. We can expand later. In particular, we're going to have to look into how SQL does conversions between varbinary(n) with different n (currently, Trino only supports varbinary without length indicator). The semantics of the bitwise_* functions should be consistent with those behaviors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants