Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uploading batches of ndjson vectors from a source file #4034

Merged

Conversation

ndisidore
Copy link
Contributor

@ndisidore ndisidore commented Sep 26, 2023

Extends the Vectorize wrangler support work to allow uploading vectors from a file source.
Comments on the original PR here: #4029

What this PR solves / how to test:

We want to improve the Vector Store onboarding for our customers by allowing them an easy way to source vectors. This vector files may be quite big (100s of MB - GB). This supports a streaming approach to reading and loading these files in an efficient manner.

@ndisidore ndisidore requested a review from a team as a code owner September 26, 2023 12:48
@changeset-bot
Copy link

changeset-bot bot commented Sep 26, 2023

🦋 Changeset detected

Latest commit: a06767d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
wrangler Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions
Copy link
Contributor

github-actions bot commented Sep 26, 2023

A wrangler prerelease is available for testing. You can install this latest build in your project with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034

You can reference the automatically updated head of this PR with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/prs/6315244316/npm-package-wrangler-4034

Or you can use npx with this latest build directly:

npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034 dev path/to/script.js
Additional artifacts:
npm install https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-cloudflare-pages-shared-4034

Note that these links will no longer work once the GitHub Actions artifact expires.


[email protected] includes the following runtime dependencies:

Package Constraint Resolved
miniflare 3.20230922.0 3.20230922.0
workerd 1.20230922.0 1.20230922.0
workerd --version 1.20230922.0 2023-09-22

|

Please ensure constraints are pinned, and miniflare/workerd minor versions match.

@penalosa penalosa added the e2e Run e2e tests on a PR label Sep 26, 2023
"wrangler": patch
---

[Vectorize] Support uploading batches of ndjson vectors from a source file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this include an example?

Comment on lines +122 to +155
FormData.prototype.toString = mockFormDataToString;
export interface RestRequestWithFormData extends MockedRequest, RestRequest {
formData(): Promise<FormData>;
}
(MockedRequest.prototype as RestRequestWithFormData).formData =
mockFormDataFromString;

function mockFormDataToString(this: FormData) {
const entries = [];
for (const [key, value] of this.entries()) {
if (value instanceof Blob) {
const reader = new FileReaderSync();
reader.readAsText(value);
const result = reader.result;
entries.push([key, result]);
} else {
entries.push([key, value]);
}
}
return JSON.stringify({
__formdata: entries,
});
}

async function mockFormDataFromString(this: MockedRequest): Promise<FormData> {
const { __formdata } = await this.json();
expect(__formdata).toBeInstanceOf(Array);

const form = new FormData();
for (const [key, value] of __formdata) {
form.set(key, value);
}
return form;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// actually do any parsing - that will be handled on the backend
async function* getBatchFromFile(file: FileHandle, batchSize = 3) {
let batch: string[] = [];
for await (const line of file.readLines()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filehandle.readLines() was added in Node 18—Wrangler supports Node 16, and so this won't be usable, I don't think. https://nodejs.org/api/readline.html is probably the best alternative?

let index: Optional<VectorizeVectorMutation, "ids"> | undefined;
for await (const batch of getBatchFromFile(file, args.batchSize)) {
const formData = new FormData();
formData.append("vectors", new File([batch.join(`\n`)], "vectors.ndjson"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably include a mime type for the File?

Comment on lines 57 to 67
logger.warn(
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors`
);
logger.warn(
`🚧 You may run this again with another batch to upload further`
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.warn(
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors`
);
logger.warn(
`🚧 You may run this again with another batch to upload further`
);
logger.warn(
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors. You may run this again with another batch to upload further`
);

}

const index = await insertIntoIndex(config, args.name, vectors);
// remove the ids - skip tracking these for bulk uploads since this could be in the 100s of thousands.
if (index) delete index.ids;
logger.log(JSON.stringify(index, null, 2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command should have additional --json flag for outputting JSON, otherwise, it should output in a human readable format

if (args.vectors) {
// Parse each vector into a Vector type
// Think about potential limits on args.vectors.length?
if (index.count > VECTORIZE_MAX_UPSERT_VECTOR_RECORDS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This limiting behaviour should be covered in tests

@@ -30,13 +42,42 @@ export async function handler(
args: StrictYargsOptionsToInterface<typeof options>
) {
const config = readConfig(args.config, args);
const file = await open(args.file);

let index: Optional<VectorizeVectorMutation, "ids"> | undefined;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the ids field isn't being used here, could this just be an incrementing integer count?

@penalosa penalosa removed the e2e Run e2e tests on a PR label Sep 26, 2023
return await fetchResult(
`/accounts/${accountId}/vectorize/indexes/${indexName}/insert`,
{
method: "POST",
body: JSON.stringify(vectors),
headers: {
"Content-Type": "multipart/form-data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be omitted, so the boundary can be generated and included in the content-type. cc @Skye-31 for spotting this!

Copy link
Contributor Author

@ndisidore ndisidore Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting: we're sure about this? I check the incoming content type in my test in here and omitting this results in

    Expected: "multipart/form-data"
    Received: "text/plain;charset=UTF-8"

Which is not the correct type for a multipart request, but this may be an artifact of the test server

Copy link
Contributor

@penalosa penalosa Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an artifact of the MSW test setup, I think, which doesn't properly support FormData

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR work when manually testing against the API?

@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch from 3376d2e to bbc0e0d Compare September 26, 2023 14:51
@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch from bbc0e0d to eade8a5 Compare September 26, 2023 15:01
@@ -38,7 +38,9 @@ export function options(yargs: CommonYargsArgv) {
preset: {
type: "string",
choices: [
"workers-ai/bge-small-en",
"@cf/baai/bge-small-en-v1.5",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated per @elithrar 's wishes

});
});

FormData.prototype.toString = mockFormDataToString;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will pollute the FormData prototype for all tests

This needs to be done in a beforeEach, after storing the original value, and then resetting to the original value in an afterEach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this pattern already exists within wrangler. So no need to block this PR on this comment

Comment on lines +126 to +127
(MockedRequest.prototype as RestRequestWithFormData).formData =
mockFormDataFromString;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above for prototype pollution across all tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this pattern already exists within wrangler. So no need to block this PR on this comment

function mockFormDataToString(this: FormData) {
const entries = [];
for (const [key, value] of this.entries()) {
if (value instanceof Blob) {
Copy link
Contributor

@RamIdeas RamIdeas Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are failing because Blob is not a global prior to Node v18 nodejs/node#42220 (comment)

It needs to be imported from node:buffer

it is not a global prior to node v18
@codecov
Copy link

codecov bot commented Sep 26, 2023

Codecov Report

Merging #4034 (a06767d) into main (3cd7286) will increase coverage by 0.13%.
Report is 3 commits behind head on main.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4034      +/-   ##
==========================================
+ Coverage   74.94%   75.08%   +0.13%     
==========================================
  Files         207      208       +1     
  Lines       11826    11871      +45     
  Branches     3079     3085       +6     
==========================================
+ Hits         8863     8913      +50     
+ Misses       2963     2958       -5     
Files Coverage Δ
packages/wrangler/src/vectorize/client.ts 52.27% <ø> (+9.09%) ⬆️
packages/wrangler/src/vectorize/create.tsx 74.07% <ø> (ø)
packages/wrangler/src/vectorize/index.ts 100.00% <100.00%> (ø)
packages/wrangler/src/vectorize/insert.ts 86.84% <82.75%> (ø)

... and 4 files with indirect coverage changes

@lrapoport-cf lrapoport-cf merged commit bde9d64 into cloudflare:main Sep 26, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants