Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question Re. Image Search Query With Distances Annotated As A New Column #501

Closed
rsomani95 opened this issue Jun 13, 2024 · 10 comments
Closed
Labels
type/question 🙋 Further information is requested

Comments

@rsomani95
Copy link

rsomani95 commented Jun 13, 2024

Originally started this discussion in Discord (I think with @VoVAllen) but posting here to keep the discussion more focused:
https://discord.com/channels/974584200327991326/1220203706553860227/1245115744224542770

In spirit, what I'm trying to do is a standard vector query with an additional condition where the results are filtered by the cosine distance to the query vector.


Dataset Context

My dataset has 3 tables collection, asset, and segment, setup like so:

CREATE TABLE collection (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL
);

CREATE TABLE asset (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  collection_id UUID REFERENCES collection(id) ON DELETE CASCADE,
  title TEXT NOT NULL
);

CREATE TABLE segment (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  asset_id UUID REFERENCES asset(id) ON DELETE CASCADE,
  embedding vector(640) NOT NULL, -- 640 dimensions
  category TEXT NOT NULL
);

I'm using the 0.3.0-alpha2 docker container for all my tests.
I then build some standard indexes on theese tables, followed by the HNSW index. Here is a fully reproducible script to reproduce the dataset:
https://gist.github.com/rsomani95/56194e33364208019a09846cb9bacabb#file-build_dataset_with_indexes-sql-L1-L113


Queries

Before running any queries, I've set vectors.hnsw_ef_search=100 globally.

1. Correct results, but no index usage

This query does a brute force search. However, the results are useful because they serve as a ground truth for checking future results for other queries. In my case, I got 96 results (but this will vary because the embeddings are randomised)

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
;

2. Adding LIMIT to use index, but incorrect results

As suggested in the thread here, I added a LIMIT under the ORDER BY clause to activate the index.

Here's the query with the LIMIT added:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
    LIMIT
        100
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
;

This does activate the index, but now only returns 1 row as opposed to 96 earlier. I suspect this is because the query is formulated incorrectly; in its current formulation, we must be doing post filtering (I am assuming that the vbase implementation is correct and not the root cause of the issue here)

I tried another approach, adding LIMIT to the bottom of the query:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
LIMIT 100
;

and while this uses the index, the query runs a lot slower, and still produces incorrect results.


I'd love some guidance on how to formulate the SQL above correctly to use the index and get correct results.

@mertalev
Copy link

mertalev commented Jun 13, 2024

You can get it to use the index and still do pre-filtering with a query like this:

SELECT
    segment.*,
    embedding <=> :embedding AS distance
FROM
    segment
INNER JOIN
    asset ON segment.asset_id = asset.id
WHERE
    distance < 0.99
    AND segment.category = 'visual'
    AND asset.collection_id = :collection_id
ORDER BY
    embedding <=> :embedding
LIMIT 100

This will still be slower than the post-filtering approach because it needs to evaluate the cosine distance for (potentially many) assets in the WHERE condition, but the recall should be correct.

I used embedding <=> :embedding instead of distance for ordering because I remember this being necessary for the index to get used. If the query plan is the same with distance, feel free to use that instead.

@gaocegege gaocegege added type/question 🙋 Further information is requested labels Jun 14, 2024
@VoVAllen
Copy link
Member

VoVAllen commented Jun 14, 2024

The last SQL should be the right one. Does it still return only one result? Or multiple results but to be slow? And vector index is an approximate index which means the results might not be the exact same as the exact one (without using the index).'

For your cases, if you don't have too many categories, you can put category_id and vector in one table and create partial index on each category. This will accelerate the filter condition.

@VoVAllen
Copy link
Member

Unfortunately it's hard to co-optimize the vector search and filter condition together. We have some internal experiments and only a very limited cases can have improvments.

@rsomani95
Copy link
Author

@mertalev Have you had success running that exact query? When I tried running it, I get this error saying "distance" is not defined:

ERROR:  column "distance" does not exist
LINE 10:     distance < 0.99

I also tried another variant of what you suggested but ran into the same error:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding
            AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        distance < 0.99
        AND segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        embedding <=> :embedding
)
SELECT distances.id, distances.distance, distances.category
FROM distances
LIMIT 100
;

@rsomani95
Copy link
Author

@VoVAllen the last SQL I'd shared in the opening comment does give correct number of results, and yes I'm keeping in mind that the index gives approx matches so the results won't necessarily be identical to brute force.
It takes ~16.2s to run

Here's the query plan visualised:

CleanShot 2024-06-14 at 11 28 54@2x

PS: You can click on "Raw" to see the exact query planner output and "Query" for the query that generated it.
https://explain.dalibo.com/plan/345ga89gc5edefae

Unfortunately it's hard to co-optimize the vector search and filter condition together. We have some internal experiments and only a very limited cases can have improvments.

I see. I'm slightly confused by this, because I was under the impression that this was one of the promises of the VBASE algorithm? Are there ways you would adjust what my query is trying to achieve to leverage the indexes and vbase correctly that I'm missing out on?

@rsomani95
Copy link
Author

rsomani95 commented Jun 14, 2024

Just a thought: are compound indexes with a vector + non-vector column supported? If so, perhaps that could help?

if you don't have too many categories, you can put category_id and vector in one table and create partial index on each category. This will accelerate the filter condition.

Unfortunately, I don't think this will work for our use case

@rsomani95
Copy link
Author

Did some further experiments, and we found that returning the distance itself in a new column has negligible cost, and the part of the query responsible for the slowdown is the WHERE distance < 0.99

To be more speicifc, the commented out line is responsible for this query to go from 269ms -> 16200ms:

WITH distances AS (
    SELECT
        segment.id, segment.category,
        embedding <=> '[0.048492431640625,0.004207611083984375,0.020538330078125,-0.06622314453125,-0.009918212890625,0.0819091796875,-0.0247344970703125,-0.01090240478515625,0.0679931640625,-0.040924072265625,-0.0034122467041015625,0.072021484375,0.0008335113525390625,-0.006832122802734375,0.00824737548828125,-0.03314208984375,0.025909423828125,0.04852294921875,-0.0171661376953125,0.0048370361328125,-0.0230712890625,-0.01190185546875,-0.0043792724609375,0.01007080078125,-0.052276611328125,0.0180816650390625,-0.00858306884765625,0.0078582763671875,-0.050811767578125,0.007328033447265625,0.0712890625,0.033416748046875,-0.03240966796875,-0.0029735565185546875,-0.0399169921875,-0.07537841796875,-0.0631103515625,-0.02191162109375,-0.0292510986328125,0.063232421875,0.093017578125,0.022674560546875,-0.0186920166015625,0.0041046142578125,-0.0300140380859375,-0.0941162109375,0.038970947265625,0.036651611328125,-0.0223388671875,0.002079010009765625,0.00818634033203125,-0.0187225341796875,0.031982421875,-0.01050567626953125,-0.00460052490234375,-0.0517578125,-0.0269317626953125,-0.038055419921875,0.007781982421875,0.038482666015625,-0.05145263671875,0.00878143310546875,0.021575927734375,0.053375244140625,0.0005512237548828125,0.03448486328125,-0.01157379150390625,0.0237274169921875,-0.05712890625,0.099365234375,-0.0019664764404296875,0.056243896484375,-0.00019276142120361328,-0.027069091796875,0.1966552734375,-0.01096343994140625,0.00528717041015625,-0.0244140625,-0.0211944580078125,0.0116424560546875,-0.053863525390625,-0.042266845703125,-0.0241546630859375,0.016571044921875,0.01535797119140625,0.0203704833984375,-0.036376953125,0.029815673828125,-0.048858642578125,-0.027984619140625,0.0044097900390625,0.049072265625,-0.06146240234375,0.002292633056640625,-0.01023101806640625,-0.0404052734375,0.03216552734375,-0.0093536376953125,-0.01009368896484375,0.03692626953125,-0.0257568359375,0.0193939208984375,0.004894256591796875,0.005229949951171875,-0.007106781005859375,0.06427001953125,0.007354736328125,0.050048828125,0.05267333984375,0.044952392578125,-0.0908203125,-0.02679443359375,0.01000213623046875,0.0247955322265625,0.0106658935546875,-0.0261688232421875,0.0207672119140625,0.045806884765625,0.035491943359375,-0.014556884765625,0.0421142578125,-0.045654296875,-0.048065185546875,0.0234832763671875,-0.14111328125,-0.004230499267578125,0.024566650390625,-0.0244293212890625,0.00693511962890625,0.01145172119140625,0.040283203125,-0.06060791015625,0.00921630859375,0.029083251953125,-0.09417724609375,0.028656005859375,-0.00201416015625,0.09478759765625,-0.01023101806640625,0.0249786376953125,0.018310546875,-0.02783203125,-0.022979736328125,-0.032073974609375,-0.0009412765502929688,0.00971221923828125,0.0071563720703125,-0.06329345703125,-0.01146697998046875,0.0149078369140625,0.0007505416870117188,0.01476287841796875,-0.0672607421875,-0.05224609375,-0.004695892333984375,0.006256103515625,0.00579833984375,0.0203704833984375,-0.01812744140625,-0.003818511962890625,0.007167816162109375,0.041107177734375,-0.01318359375,0.01480865478515625,-0.05267333984375,-0.0017557144165039062,0.011199951171875,-0.0153656005859375,0.0550537109375,0.03387451171875,-0.0196990966796875,0.033782958984375,-0.0158233642578125,0.0,0.037567138671875,0.0020351409912109375,-0.0206451416015625,-0.0242767333984375,0.037933349609375,0.03271484375,-0.035003662109375,-0.0303802490234375,-0.0282440185546875,0.0411376953125,0.023040771484375,0.183349609375,0.034759521484375,-0.01788330078125,0.07275390625,-0.025360107421875,0.0416259765625,0.037567138671875,-0.04315185546875,-0.06610107421875,0.001766204833984375,0.06866455078125,0.0164642333984375,-0.014373779296875,-0.033905029296875,-0.01079559326171875,-0.03521728515625,0.07550048828125,-0.01229095458984375,0.0278778076171875,0.039764404296875,-0.046722412109375,-0.0186309814453125,0.033660888671875,0.00748443603515625,0.0007295608520507812,-0.0947265625,0.04132080078125,-0.0140380859375,0.01447296142578125,-0.051544189453125,-0.055419921875,0.02691650390625,0.0095062255859375,-0.01568603515625,0.004299163818359375,-0.041656494140625,-0.0280303955078125,0.07684326171875,0.00970458984375,0.04351806640625,-0.078125,-0.0384521484375,0.0222930908203125,-0.003314971923828125,-0.055999755859375,0.01373291015625,0.01485443115234375,-0.04168701171875,0.04571533203125,-0.04693603515625,-0.0166473388671875,-0.072998046875,-0.0240478515625,0.002040863037109375,0.0220794677734375,-0.0009369850158691406,-0.027069091796875,0.04248046875,-0.0750732421875,-0.0960693359375,-0.03765869140625,-0.0173187255859375,-0.0325927734375,0.050750732421875,-0.052520751953125,-0.038818359375,-0.177001953125,-0.0188751220703125,0.0130462646484375,0.04058837890625,0.042236328125,-0.03338623046875,-0.031982421875,0.04229736328125,-0.05474853515625,0.0287017822265625,0.007640838623046875,-0.033447265625,0.0033664703369140625,-0.055694580078125,-0.0714111328125,-0.036346435546875,0.00142669677734375,-0.03070068359375,0.0299530029296875,-0.03369140625,-0.00966644287109375,0.005718231201171875,0.0187225341796875,-0.0213775634765625,-0.0185546875,-0.04180908203125,-0.0450439453125,-0.0229034423828125,0.04693603515625,0.036163330078125,0.061004638671875,-0.0672607421875,0.0009226799011230469,-0.046600341796875,-0.0242767333984375,-0.058624267578125,-0.1256103515625,0.023223876953125,0.0295867919921875,-0.046234130859375,0.0163421630859375,0.050384521484375,-0.025115966796875,0.037994384765625,0.027984619140625,0.0261383056640625,-0.0117645263671875,-0.004352569580078125,0.00960540771484375,0.07293701171875,0.062225341796875,-0.01007080078125,-0.02685546875,-0.0005059242248535156,-0.0122833251953125,-0.024749755859375,0.036529541015625,-0.03192138671875,0.015869140625,0.0777587890625,0.0109405517578125,-0.050018310546875,0.06787109375,0.08477783203125,-0.035736083984375,0.01482391357421875,0.05218505859375,-0.004047393798828125,-0.0165252685546875,0.1019287109375,-0.04510498046875,0.022186279296875,0.005382537841796875,0.004451751708984375,0.01129913330078125,-0.048980712890625,0.047027587890625,-0.0146026611328125,-0.012939453125,-0.0210418701171875,-0.0308990478515625,-0.057220458984375,0.0005764961242675781,0.01168060302734375,-0.004848480224609375,0.01641845703125,-0.0226898193359375,0.0187835693359375,0.1494140625,0.055328369140625,-0.0178985595703125,-0.0169219970703125,-0.043792724609375,-0.0238800048828125,-0.013580322265625,0.0104522705078125,0.0526123046875,-0.033050537109375,0.00775909423828125,-0.025238037109375,0.00724029541015625,-0.025909423828125,-0.00861358642578125,0.040679931640625,0.1619873046875,0.0263519287109375,-0.0229034423828125,0.0101165771484375,-0.041107177734375,-0.0301361083984375,0.006916046142578125,-0.0175323486328125,0.01502227783203125,-0.0161895751953125,-0.01323699951171875,-0.004383087158203125,-0.0087432861328125,-0.0511474609375,0.00356292724609375,0.011383056640625,-0.0548095703125,-0.039947509765625,0.045318603515625,-0.01392364501953125,-0.01788330078125,0.025634765625,0.00868988037109375,-0.00244903564453125,0.0030956268310546875,0.0032634735107421875,0.06744384765625,0.0064697265625,0.005268096923828125,-0.014190673828125,-0.01392364501953125,0.01105499267578125,0.0307769775390625,-0.017425537109375,0.002033233642578125,0.034698486328125,-0.0250244140625,-0.00201416015625,-0.0233154296875,-0.019287109375,0.0181884765625,0.008575439453125,-0.03564453125,0.045379638671875,0.003162384033203125,0.0053863525390625,0.011322021484375,-0.01247406005859375,-0.04608154296875,0.02166748046875,0.0238189697265625,0.022125244140625,0.0035915374755859375,0.022064208984375,0.00432586669921875,-0.03546142578125,0.01241302490234375,0.02740478515625,0.004619598388671875,-0.09027099609375,-0.052764892578125,0.0325927734375,-0.02593994140625,0.0094146728515625,-0.01071929931640625,0.01453399658203125,0.00960540771484375,0.05364990234375,0.003635406494140625,-0.020416259765625,-0.0279083251953125,-0.004726409912109375,-0.007053375244140625,0.004291534423828125,-0.00629425048828125,-0.0200958251953125,-0.049163818359375,0.02899169921875,-0.06494140625,-0.0117034912109375,-0.04913330078125,0.0148468017578125,-0.070556640625,0.04052734375,0.0009059906005859375,0.021759033203125,-0.0352783203125,-0.0341796875,-0.050262451171875,0.00033593177795410156,0.043792724609375,-0.024658203125,0.0182952880859375,-0.01245880126953125,-0.01078033447265625,-0.0157012939453125,0.045257568359375,0.0224609375,-0.0308074951171875,0.06298828125,0.0051727294921875,0.00559234619140625,-0.025604248046875,-0.0242919921875,0.01508331298828125,-0.01375579833984375,0.0262298583984375,-0.043365478515625,-0.032958984375,-0.0101776123046875,-0.0250244140625,-0.0118408203125,0.0208282470703125,-0.006626129150390625,0.047698974609375,0.0027217864990234375,-0.023956298828125,-0.0016622543334960938,0.051513671875,-0.001529693603515625,-0.0250396728515625,-0.0016460418701171875,-0.0206451416015625,0.047576904296875,-0.06134033203125,0.03338623046875,-0.0251922607421875,0.17041015625,0.04351806640625,-0.07464599609375,0.03369140625,1.4543533325195312e-05,-0.0011463165283203125,-0.048004150390625,0.00016164779663085938,0.0247344970703125,-0.0215301513671875,0.03076171875,-0.016937255859375,-0.0943603515625,-0.026824951171875,0.01163482666015625,0.039276123046875,-0.01502227783203125,0.00400543212890625,0.01165771484375,0.0169677734375,-0.033050537109375,-0.03863525390625,-0.09918212890625,0.04339599609375,-0.04229736328125,-0.005825042724609375,-0.030731201171875,-0.0289306640625,0.0068206787109375,0.0158538818359375,0.043853759765625,0.005344390869140625,-0.0231170654296875,0.0287017822265625,0.01654052734375,0.041961669921875,0.005596160888671875,0.0307769775390625,0.0102081298828125,0.03472900390625,0.0164337158203125,0.00878143310546875,0.061492919921875,-0.0435791015625,-0.08380126953125,0.0120391845703125,-0.020660400390625,-0.049346923828125,0.03900146484375,-0.03497314453125,0.00788116455078125,-0.053192138671875,0.0191497802734375,0.02227783203125,0.0333251953125,-0.00919342041015625,-0.104248046875,-0.049774169921875,-0.0645751953125,0.028350830078125,-0.03839111328125,-0.01654052734375,-0.00263214111328125,-0.007259368896484375,0.017120361328125,-0.00391387939453125,-0.0848388671875,0.006214141845703125,-0.0175018310546875,-0.051971435546875,-0.048828125,-0.08807373046875,0.00550079345703125,0.030303955078125,0.00983428955078125,-0.0084686279296875,0.01213836669921875,0.01532745361328125,0.003978729248046875,0.03924560546875,-0.02044677734375,0.0391845703125,-0.051513671875,-0.02532958984375,-0.0233612060546875,0.03936767578125,-0.0286712646484375,-0.037811279296875,0.05712890625,0.0418701171875,0.0294036865234375,0.006137847900390625,-0.0252532958984375,0.003887176513671875,-0.0084991455078125,0.003604888916015625,-0.01824951171875,0.04150390625,-0.08380126953125,-0.00284576416015625,0.01535797119140625,0.0009307861328125,-0.00376129150390625,0.01319122314453125,-0.057586669921875,0.0261383056640625,0.01068878173828125,0.0295867919921875,-0.00908660888671875,-0.0201873779296875,-0.01419830322265625,0.0159149169921875,0.06597900390625,-0.0137176513671875,0.014495849609375,0.01540374755859375,0.01279449462890625,-0.0262298583984375,-0.0182952880859375,0.04644775390625,-0.00547027587890625,-0.0021648406982421875,-0.03570556640625,-0.04400634765625,0.00457000732421875,0.01441192626953125,0.0352783203125,0.043182373046875,-0.0010404586791992188,-0.032196044921875,-0.0079345703125,-0.043792724609375,-0.016204833984375,0.05029296875,0.004558563232421875,-0.055328369140625,-0.0265960693359375,0.028350830078125,0.03192138671875,0.029693603515625,0.040435791015625,0.004878997802734375,-0.034515380859375,0.025146484375,-0.01326751708984375,-0.0140838623046875,-0.0008788108825683594,-0.0230865478515625,0.0088653564453125,0.00522613525390625,-0.0179290771484375,0.0303802490234375,0.032928466796875,-0.0184326171875,0.033935546875,-0.002887725830078125,-0.0090484619140625,0.0176849365234375]'
            AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = 'ed0bbe8d-ce06-4b49-9e01-0d763439fa8a'
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
-- WHERE distances.distance < 0.99
LIMIT 100
;

@mertalev
Copy link

@rsomani95 Sorry, the earlier one was from memory. The corrected query is this:

SELECT
    segment.*
FROM
    segment
INNER JOIN
    asset ON segment.asset_id = asset.id
WHERE
    embedding <=> :embedding < 0.99
    AND segment.category = 'visual'
    AND asset.collection_id = :collection_id
ORDER BY
    embedding <=> :embedding
LIMIT 100;

@rsomani95
Copy link
Author

rsomani95 commented Jun 17, 2024

@mertalev thank you.

This query seems to have similar results to my last comment, where it takes ~16.2s, and removing the embedding <=> :embedding < 0.99 takes ~375ms

@mertalev
Copy link

I think it's because the limit is 100, but there are only 96 results that meet the threshold. It will end up checking every embedding in the table to find another 4 matches. Setting the limit to 96 should be much faster.

I wonder if a way to stop early would help performance here. The ordering should be informative of when there's no longer a point to keep searching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/question 🙋 Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants