-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add standalone partition assignment operation #2556
feat: add standalone partition assignment operation #2556
Conversation
|
||
dataset: LanceDataset | ||
the dataset containing the data | ||
column: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This IndicesBuilder
is vector only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, can probably rename in future PR.
row_id: uint64 | ||
partition: uint32 | ||
|
||
Note: There is no advantage to separately computing the partition assignment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is also useful for checkpoint purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...perhaps. Though I think a better way to checkpoint the "transform vectors" stage will be to break up the work by fragments.
python/python/lance/indices.py
Outdated
def assign_ivf_partitions( | ||
self, | ||
ivf_model: IvfModel, | ||
dst_dataset_uri: Optional[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dst_dataset_uri
=> output_uri
? we use less abbreviation in python SDKs.
Btw, should this be a dict arg, i.e., output_uri = None
if we expect that user won't set it for most of the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change to output_uri
. I could set it to output_uri = None
(which would use a random tmp path?). I think that would be confusing though since the other methods are going to need paths. E.g. if a user is going down the "piece by piece" standalone route then they want to know and be aware of all the temp files because that is what they are trying to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I came around. I changed to output_uri
and default to None
.
f446aa8
to
81759a6
Compare
This also fixes up the tqdm progress bar for partition assignment so that it has a definite end. This also makes one small tweak to the indices builder, moving the column argument into the builder constructor, since this argument will be shared by all methods in the class.