Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: parse_vector for f32s with a 48-char representation #531

Merged
merged 1 commit into from
Jul 17, 2024

Conversation

alexquick
Copy link
Contributor

@alexquick alexquick commented Jul 17, 2024

Howdy, thanks for such a cool project! I found an issue parsing vectors containing values with 48 character representations.
For example -0.000000000000000000000000000000000000023509886, a valid f32 given by 0x80FFFFFF with length 48.
I ran into this because I had saved a vector with this value in a prior version of pgvecto.rs (0.2.1) and was running into the Bad literal error when importing the same data into version 0.3.0.

I believe that the tokenizer in parse_vector is prepending a superfluous b'$' into the token buffer. This PR removes it.

Before

vectors=# CREATE TABLE items (embedding vector(3));
vectors=# insert into items VALUES ('[0.3,-1.3,-0.000000000000000000000000000000000000023509886]');
ERROR:  pgvecto.rs: Bad literal.
INFORMATION: hint = Too long number at position 52
LINE 1: insert into items VALUES ('[0.3,-1.3,-0.0000000000000000000000000...

After

  vectors=# insert into items VALUES ('[0.3,-1.3,-0.000000000000000000000000000000000000023509886]');
  INSERT 0 1
  vectors=# select * from items;
                          embedding                         
  ----------------------------------------------------------
   [0.3, -1.3, -0.000000000000000000000000000000000000023509886]
  (1 row)

Discussion

Taking Chesterson's fence to heart, I really did try squinting every which way to figure out why the '$' is pushed to token at first but came up blank. I think it may have been testing code in the initial benchmark rig that @usamoi was using when speeding up the parsing (#316)? At any rate, if it is required I think that the buffer on line 50 needs to be constructed with len=49 rather than 48 to account for the extra '$'. Note that very similar code parse_pgvector_svector does not seem to be affected by this issue.

@usamoi
Copy link
Collaborator

usamoi commented Jul 17, 2024

Thank you for pointing it! $ is used for rejecting inputs like [1,,,] at first but apparently it can be detected without a flag character:)

@usamoi usamoi added this pull request to the merge queue Jul 17, 2024
Merged via the queue into tensorchord:main with commit 6d64e97 Jul 17, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants