Update pgvector loading method to use binary format (#488)

Particularly for large vectors, the pgvector module spent significant time converting floating-point values to ASCII text before transmitting them to the PostgreSQL server. This change keeps the data in binary format, reducing that overhead. One test demonstrated a 63% reduction in load time, which would have an impact on the overall "build" time as reported by this benchmark.
erikbern · Feb 29, 2024 · c091271
1 parent 77113e0
Showing 1 changed file with 2 additions and 1 deletion.
diff --git a/ann_benchmarks/algorithms/pgvector/module.py b/ann_benchmarks/algorithms/pgvector/module.py
@@ -30,7 +30,8 @@ def fit(self, X):
         cur.execute("CREATE TABLE items (id int, embedding vector(%d))" % X.shape[1])
         cur.execute("ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN")
         print("copying data...")
-        with cur.copy("COPY items (id, embedding) FROM STDIN") as copy:
+        with cur.copy("COPY items (id, embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
+            copy.set_types(["int4", "vector"])
             for i, embedding in enumerate(X):
                 copy.write_row((i, embedding))
         print("creating index...")
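
To illustrate the overhead described in the commit message, here is a minimal standalone sketch comparing a text encoding of a vector with a binary one. It is not part of this change and does not talk to PostgreSQL. The helper names `encode_text` and `encode_binary` are hypothetical, and the binary layout shown (a 16-bit dimension, a 16-bit unused field, then big-endian 32-bit floats) is an assumption about pgvector's wire format rather than something this diff confirms.

```python
import struct


def encode_text(vec):
    """Text form: each float is rendered as ASCII digits, e.g. b"[1.0,2.5]"."""
    return ("[" + ",".join(repr(float(v)) for v in vec) + "]").encode("ascii")


def encode_binary(vec):
    """Assumed binary form: uint16 dim, uint16 unused, then big-endian float32 values."""
    dim = len(vec)
    header = struct.pack(">HH", dim, 0)
    body = struct.pack(">%df" % dim, *vec)
    return header + body


vec = [0.12345678, -1.5, 3.0]
text_bytes = encode_text(vec)
binary_bytes = encode_binary(vec)

# The binary payload has a fixed size of 4 + 4 * dim bytes, while the
# text payload grows with the number of digits printed for each value.
print(len(text_bytes), len(binary_bytes))
```

In the text path each value must be formatted on the client and parsed again on the server, whereas the binary path copies fixed-width floats. That is the conversion cost this commit avoids by passing `FORMAT BINARY` to `COPY` and declaring the column types with `copy.set_types`.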