-
Notifications
You must be signed in to change notification settings - Fork 733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support uploading batches of ndjson vectors from a source file #4034
Support uploading batches of ndjson vectors from a source file #4034
Conversation
🦋 Changeset detectedLatest commit: a06767d The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
A wrangler prerelease is available for testing. You can install this latest build in your project with: npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034 You can reference the automatically updated head of this PR with: npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/prs/6315244316/npm-package-wrangler-4034 Or you can use npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034 dev path/to/script.js Additional artifacts:npm install https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-cloudflare-pages-shared-4034 Note that these links will no longer work once the GitHub Actions artifact expires.
| Please ensure constraints are pinned, and |
.changeset/eleven-monkeys-tan.md
Outdated
"wrangler": patch | ||
--- | ||
|
||
[Vectorize] Support uploading batches of ndjson vectors from a source file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this include an example?
FormData.prototype.toString = mockFormDataToString; | ||
export interface RestRequestWithFormData extends MockedRequest, RestRequest { | ||
formData(): Promise<FormData>; | ||
} | ||
(MockedRequest.prototype as RestRequestWithFormData).formData = | ||
mockFormDataFromString; | ||
|
||
function mockFormDataToString(this: FormData) { | ||
const entries = []; | ||
for (const [key, value] of this.entries()) { | ||
if (value instanceof Blob) { | ||
const reader = new FileReaderSync(); | ||
reader.readAsText(value); | ||
const result = reader.result; | ||
entries.push([key, result]); | ||
} else { | ||
entries.push([key, value]); | ||
} | ||
} | ||
return JSON.stringify({ | ||
__formdata: entries, | ||
}); | ||
} | ||
|
||
async function mockFormDataFromString(this: MockedRequest): Promise<FormData> { | ||
const { __formdata } = await this.json(); | ||
expect(__formdata).toBeInstanceOf(Array); | ||
|
||
const form = new FormData(); | ||
for (const [key, value] of __formdata) { | ||
form.set(key, value); | ||
} | ||
return form; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: it would be nice to not have to duplicate this mocking here and in https://github.com/cloudflare/workers-sdk/blob/4692d92c531fde29aaefbf7c0e2c375ede3e3b1d/packages/wrangler/src/__tests__/deploy.test.ts#L9207-L9247
// actually do any parsing - that will be handled on the backend | ||
async function* getBatchFromFile(file: FileHandle, batchSize = 3) { | ||
let batch: string[] = []; | ||
for await (const line of file.readLines()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filehandle.readLines()
was added in Node 18—Wrangler supports Node 16, and so this won't be usable, I don't think. https://nodejs.org/api/readline.html is probably the best alternative?
let index: Optional<VectorizeVectorMutation, "ids"> | undefined; | ||
for await (const batch of getBatchFromFile(file, args.batchSize)) { | ||
const formData = new FormData(); | ||
formData.append("vectors", new File([batch.join(`\n`)], "vectors.ndjson")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably include a mime type for the File
?
logger.warn( | ||
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors` | ||
); | ||
logger.warn( | ||
`🚧 You may run this again with another batch to upload further` | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.warn( | |
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors` | |
); | |
logger.warn( | |
`🚧 You may run this again with another batch to upload further` | |
); | |
logger.warn( | |
`🚧 While Vectorize is in beta, we've limited uploads to 100k vectors. You may run this again with another batch to upload further` | |
); |
} | ||
|
||
const index = await insertIntoIndex(config, args.name, vectors); | ||
// remove the ids - skip tracking these for bulk uploads since this could be in the 100s of thousands. | ||
if (index) delete index.ids; | ||
logger.log(JSON.stringify(index, null, 2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command should have additional --json
flag for outputting JSON, otherwise, it should output in a human readable format
if (args.vectors) { | ||
// Parse each vector into a Vector type | ||
// Think about potential limits on args.vectors.length? | ||
if (index.count > VECTORIZE_MAX_UPSERT_VECTOR_RECORDS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This limiting behaviour should be covered in tests
@@ -30,13 +42,42 @@ export async function handler( | |||
args: StrictYargsOptionsToInterface<typeof options> | |||
) { | |||
const config = readConfig(args.config, args); | |||
const file = await open(args.file); | |||
|
|||
let index: Optional<VectorizeVectorMutation, "ids"> | undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the ids
field isn't being used here, could this just be an incrementing integer count
?
return await fetchResult( | ||
`/accounts/${accountId}/vectorize/indexes/${indexName}/insert`, | ||
{ | ||
method: "POST", | ||
body: JSON.stringify(vectors), | ||
headers: { | ||
"Content-Type": "multipart/form-data", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be omitted, so the boundary can be generated and included in the content-type. cc @Skye-31 for spotting this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting: we're sure about this? I check the incoming content type in my test in here and omitting this results in
Expected: "multipart/form-data"
Received: "text/plain;charset=UTF-8"
Which is not the correct type for a multipart request, but this may be an artifact of the test server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an artifact of the MSW test setup, I think, which doesn't properly support FormData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR work when manually testing against the API?
3376d2e
to
bbc0e0d
Compare
bbc0e0d
to
eade8a5
Compare
@@ -38,7 +38,9 @@ export function options(yargs: CommonYargsArgv) { | |||
preset: { | |||
type: "string", | |||
choices: [ | |||
"workers-ai/bge-small-en", | |||
"@cf/baai/bge-small-en-v1.5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated per @elithrar 's wishes
}); | ||
}); | ||
|
||
FormData.prototype.toString = mockFormDataToString; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will pollute the FormData prototype for all tests
This needs to be done in a beforeEach, after storing the original value, and then resetting to the original value in an afterEach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently this pattern already exists within wrangler. So no need to block this PR on this comment
(MockedRequest.prototype as RestRequestWithFormData).formData = | ||
mockFormDataFromString; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above for prototype pollution across all tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently this pattern already exists within wrangler. So no need to block this PR on this comment
function mockFormDataToString(this: FormData) { | ||
const entries = []; | ||
for (const [key, value] of this.entries()) { | ||
if (value instanceof Blob) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are failing because Blob is not a global prior to Node v18 nodejs/node#42220 (comment)
It needs to be imported from node:buffer
it is not a global prior to node v18
Codecov Report
@@ Coverage Diff @@
## main #4034 +/- ##
==========================================
+ Coverage 74.94% 75.08% +0.13%
==========================================
Files 207 208 +1
Lines 11826 11871 +45
Branches 3079 3085 +6
==========================================
+ Hits 8863 8913 +50
+ Misses 2963 2958 -5
|
Extends the Vectorize wrangler support work to allow uploading vectors from a file source.
Comments on the original PR here: #4029
What this PR solves / how to test:
We want to improve the Vector Store onboarding for our customers by allowing them an easy way to source vectors. This vector files may be quite big (100s of MB - GB). This supports a streaming approach to reading and loading these files in an efficient manner.