fix: `parse_vector` for f32s with a 48-char representation #531

alexquick · 2024-07-17T04:27:36Z

Howdy, thanks for such a cool project! I found an issue parsing vectors containing values with 48 character representations.
For example -0.000000000000000000000000000000000000023509886, a valid f32 given by 0x80FFFFFF with length 48.
I ran into this because I had saved a vector with this value in a prior version of pgvecto.rs (0.2.1) and was running into the Bad literal error when importing the same data into version 0.3.0.

I believe that the tokenizer in parse_vector is prepending a superfluous b'$' into the token buffer. This PR removes it.

Before

vectors=# CREATE TABLE items (embedding vector(3));
vectors=# insert into items VALUES ('[0.3,-1.3,-0.000000000000000000000000000000000000023509886]');
ERROR:  pgvecto.rs: Bad literal.
INFORMATION: hint = Too long number at position 52
LINE 1: insert into items VALUES ('[0.3,-1.3,-0.0000000000000000000000000...

After

  vectors=# insert into items VALUES ('[0.3,-1.3,-0.000000000000000000000000000000000000023509886]');
  INSERT 0 1
  vectors=# select * from items;
                          embedding                         
  ----------------------------------------------------------
   [0.3, -1.3, -0.000000000000000000000000000000000000023509886]
  (1 row)

Discussion

Taking Chesterson's fence to heart, I really did try squinting every which way to figure out why the '$' is pushed to token at first but came up blank. I think it may have been testing code in the initial benchmark rig that @usamoi was using when speeding up the parsing (#316)? At any rate, if it is required I think that the buffer on line 50 needs to be constructed with len=49 rather than 48 to account for the extra '$'. Note that very similar code parse_pgvector_svector does not seem to be affected by this issue.

Signed-off-by: Alex Quick <[email protected]>

usamoi · 2024-07-17T05:36:29Z

Thank you for pointing it! $ is used for rejecting inputs like [1,,,] at first but apparently it can be detected without a flag character:)

fix: parsing for f32s with a 48-char representation

b5e5c4c

Signed-off-by: Alex Quick <[email protected]>

alexquick force-pushed the fix-parse-vector branch from bb8ebd7 to b5e5c4c Compare July 17, 2024 04:36

usamoi approved these changes Jul 17, 2024

View reviewed changes

usamoi added this pull request to the merge queue Jul 17, 2024

Merged via the queue into tensorchord:main with commit 6d64e97 Jul 17, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: `parse_vector` for f32s with a 48-char representation #531

fix: `parse_vector` for f32s with a 48-char representation #531

alexquick commented Jul 17, 2024 •

edited

Loading

usamoi commented Jul 17, 2024 •

edited

Loading

fix: parse_vector for f32s with a 48-char representation #531

fix: parse_vector for f32s with a 48-char representation #531

Conversation

alexquick commented Jul 17, 2024 • edited Loading

Before

After

Discussion

usamoi commented Jul 17, 2024 • edited Loading

fix: `parse_vector` for f32s with a 48-char representation #531

fix: `parse_vector` for f32s with a 48-char representation #531

alexquick commented Jul 17, 2024 •

edited

Loading

usamoi commented Jul 17, 2024 •

edited

Loading