refactor(python)!: simplify marshalling of Fragment, DataFile, Operation, Transaction #3240
merged = Transaction(**merged)
# This logic is specific to append, which is all that should
# be returned here.
# TODO: generalize this to all other transaction types.
merged.operation["fragments"] = [
    FragmentMetadata.from_metadata(f) for f in merged.operation["fragments"]
]
merged.operation = LanceOperation.Append(**merged.operation)
if merged.blobs_op:
    merged.blobs_op["fragments"] = [
        FragmentMetadata.from_metadata(f) for f in merged.blobs_op["fragments"]
    ]
    merged.blobs_op = LanceOperation.Append(**merged.blobs_op)
This is the messy marshaling logic that made me want this refactor. As I extend to other transaction types, this would have become completely unmanageable.
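To illustrate why hand-written rebuilding does not scale, here is a minimal sketch of the alternative: once the structures carry type annotations, a single generic helper can rebuild any nested shape from plain dicts. This is not necessarily how the PR does it. `convert`, `FragmentSketch`, and `AppendSketch` are hypothetical names, and the fields are illustrative only.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, List, get_args, get_origin


def convert(tp: Any, value: Any) -> Any:
    """Rebuild `value` as type `tp`, recursing into lists and dataclasses."""
    if get_origin(tp) is list:
        (item_tp,) = get_args(tp)
        return [convert(item_tp, v) for v in value]
    if is_dataclass(tp):
        kwargs = {f.name: convert(f.type, value[f.name]) for f in fields(tp)}
        return tp(**kwargs)
    # Plain values (int, str, ...) are passed through unchanged.
    return value


# Hypothetical stand-ins, not Lance's actual classes.
@dataclass
class FragmentSketch:
    id: int
    files: List[str]


@dataclass
class AppendSketch:
    fragments: List[FragmentSketch]


raw = {"fragments": [{"id": 0, "files": ["a.lance"]}, {"id": 1, "files": []}]}
op = convert(AppendSketch, raw)
```

Adding another operation type under this scheme would mean declaring one more dataclass, with no new per-type rebuilding code.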
A good cleanup, thanks! Is the goal to eventually start moving more things into the PyLance<T> pattern?
Operation::Append { ref fragments } => {
    let fragments = export_vec(py, fragments.as_slice());
    let cls = namespace
        .getattr("Append")
        .expect("Failed to get Append class");
    cls.call1((fragments,)).unwrap().to_object(py)
}
_ => todo!(),
Do we actually use ToPyObject for Append, or is this more just an example of how you might do this in case we need to later?
Dataset::commit_batch returns a PyLance<Transaction>, which will contain an operation. Currently just Append.
python/src/utils.rs
Outdated
pub fn extract_vec<'a, T>(ob: &Bound<'a, PyAny>) -> PyResult<Vec<T>>
where
    PyLance<T>: FromPyObject<'a>,
{
    ob.extract::<Vec<PyLance<T>>>()
        .map(|v| v.into_iter().map(|t| t.0).collect())
}

pub fn export_vec<'a, T>(py: Python<'a>, vec: &'a [T]) -> Vec<PyObject>
Can we document these methods? extract_vec I guess is pretty obvious since extract is well-used in pyo3, but export_vec is maybe a touch confusing.
row_id_meta = json_data.get("row_id_meta")
if row_id_meta is not None:
    row_id_meta = RowIdMeta(**row_id_meta)
I think I've fallen behind the times again. What is row_id_meta? Is this where we put a row id mapping file for stable row id stuff?
Yeah, each fragment has its own row id index stored here. Either an inline index (if it's small), or a reference to a separate file.
python/python/lance/fragment.py
Outdated
fields : List[int]
    The field ids of the columns in this file.
column_indices : List[int]
    The column indices in the original schema.
- fields : List[int]
-     The field ids of the columns in this file.
- column_indices : List[int]
-     The column indices in the original schema.
+ fields : List[int]
+     The ids of the fields in this file.
+ column_indices : List[int]
+     The column indices where the fields are stored in the file. Will have the same length
+     as `fields`.
Minor terminology nit (pic for pedantic reference):
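The relationship described in the suggested docstring (one column index per field id, so both lists have the same length) can be sketched as a small dataclass. This is an illustration under assumptions: `DataFileSketch` and `column_for_field` are hypothetical names, not Lance's actual API, and the length check is something I added to make the invariant explicit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFileSketch:
    """Hypothetical stand-in for a data file record; not Lance's actual class."""

    path: str
    fields: List[int]          # ids of the fields stored in this file
    column_indices: List[int]  # column in the file where each field is stored

    def __post_init__(self) -> None:
        # The two lists are parallel: entry i of column_indices belongs to fields[i].
        if len(self.fields) != len(self.column_indices):
            raise ValueError("fields and column_indices must have the same length")

    def column_for_field(self, field_id: int) -> int:
        """Return the file column that stores the given field id."""
        return self.column_indices[self.fields.index(field_id)]


example = DataFileSketch("data/0.lance", fields=[0, 3, 7], column_indices=[0, 1, 2])
column = example.column_for_field(3)
```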
    file_major_version: int = 0,
    file_minor_version: int = 0,
):
    # TODO: only we eliminate the path method, we can remove this
- # TODO: only we eliminate the path method, we can remove this
+ # TODO: once we eliminate the path method, we can remove this
Maybe?
What does it mean to eliminate the path method?
Previously path was a method: DataFile.path(). But now, it's a property: DataFile.path. I jump through a lot of hoops to support both, but eventually I'd like to remove the method support, so it's only accessible as a property. Then all of this code will be much more straightforward.
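One way to support both `DataFile.path` and `DataFile.path()` during a transition is to return a string subclass that is also callable. The sketch below is an assumption about how such hoops could look, not the PR's actual code: `_CallableStr` and `LegacyPathFile` are hypothetical names.

```python
import warnings
from dataclasses import dataclass


class _CallableStr(str):
    """A str that can also be called, returning itself, for backwards compatibility."""

    def __call__(self) -> str:
        warnings.warn(
            "path() is deprecated; use the path property instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return str(self)


@dataclass
class LegacyPathFile:
    """Hypothetical data holder; not Lance's actual DataFile."""

    _path: str

    @property
    def path(self) -> str:
        # Behaves like a plain string, but still tolerates the old call syntax.
        return _CallableStr(self._path)


f = LegacyPathFile("data/0.lance")
as_property = f.path       # new style
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    as_method = f.path()   # old style, still works but warns
```

Once the method form is dropped, the property could return the plain string and the wrapper class would disappear.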
file_major_version : int
    The major version of the data storage format.
file_minor_version : int
    The minor version of the data storage format.
Now that we store the file version on the dataset these properties are kind of redundant and I guess we can deprecate them at some point. Not sure that affects this PR at all but figured I'd mention it.
Thanks. I forgot we are deprecating them.
I think it's a reasonable pattern for any class that is a data holder. For objects that have a lot of methods (such as
I don't have anything additional in mind to migrate immediately, but if there are others where it would be useful to transition, we should.
In PR #3240, the Python code was refactored and Fragment is now a dataclass. This PR refactors the Java code to make its API consistent with the Python API.
BREAKING CHANGE: DataFile.deletion_file is now a property, not a method.

For Fragment and Operation, we had a sort of intermediate inner layer to handle translating between Rust structs and Python objects. This worked fine in isolation, but once you needed to convert at the top of a hierarchy it became tedious. This was the case for Transaction. Transaction contained an Operation, which could contain many Fragments, which contain many DataFiles.

These structures are primarily data holders, so they've been made into dataclasses. A newtype wrapper struct PyLance<T> is used to provide implementations of FromPyObject and ToPyObject. This makes signatures more readable, and makes the wrappers thinner. For example, instead of a special Python FragmentMetadata struct, we just have a PyLance<Fragment>, where Fragment is from the Rust crate.
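The hierarchy described above (a Transaction holding an Operation, which holds Fragments, which hold DataFiles) maps naturally onto nested dataclasses. The following is a minimal sketch on the Python side only. The class names carry a `Sketch` suffix and the fields (`read_version`, `id`, `files`, `path`, `fields`) are assumptions for illustration, not the library's actual definitions.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DataFileSketch:
    path: str
    fields: List[int]


@dataclass
class FragmentSketch:
    id: int
    files: List[DataFileSketch]


@dataclass
class AppendSketch:
    fragments: List[FragmentSketch]


@dataclass
class TransactionSketch:
    read_version: int
    operation: AppendSketch


txn = TransactionSketch(
    read_version=1,
    operation=AppendSketch(
        fragments=[FragmentSketch(id=0, files=[DataFileSketch("a.lance", [0, 1])])]
    ),
)

# Dataclasses compare by value and convert to plain dicts recursively,
# so no intermediate "inner" wrapper objects are needed at any level.
plain = asdict(txn)
first_path = plain["operation"]["fragments"][0]["files"][0]["path"]
```

Because every level is a plain data holder, the whole tree can be inspected, compared, or serialized without per-class glue code.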