-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
German strings, attempt 3 #1082
Conversation
fn pack_views( | ||
chunks: &[Array], | ||
dtype: &DType, | ||
validity: Validity, | ||
) -> VortexResult<VarBinViewArray> { | ||
let mut views = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole section was copied over from the last 2 PRs
🇩🇪 |
the transmute was causing issues with deallocation.
values | ||
.into_canonical() | ||
.vortex_expect("VarBin to canonical") | ||
.into_varbinview() | ||
.vortex_expect("VarBinView"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preferably do this in a single-pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be able to "fast-path" build a VarBinView since we know that the strings in values
are unique (doesn't need to be in this PR)
too ambitious |
Seems this might be just an issue with to_string or parts of the codepath when you need to get pandas type out of arrow type |
Yea, so it works fine when I bind it to a variable, and then print the variable
But not if I let it log to the repl
I'll file an issue with pandas, in the meantime just fixed up the example to assign to variable then log |
Wat |
this does look like tostring being wrong |
Filed: pandas-dev/pandas#60068 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vortex bytes_at
Benchmark suite | Current: f524f8f | Previous: acdd46b | Ratio |
---|---|---|---|
bytes_at/array_data |
711.5127204052014 ns (1.031384300668492 ) |
688.4294123541362 ns (1.9603893620694635 ) |
1.03 |
bytes_at/array_view |
179.63084933526906 ns (0.3510182229461378 ) |
878.9548024570145 ns (2.514577200262977 ) |
0.20 |
This comment was automatically generated by workflow using github-action-benchmark.
Putting back into draft to make a couple of small changes. Namely I think we can just use Arrow casting to canonicalize VarBin -> View, it looks ~10x faster than what I have based on microbenchmarks. |
let array_arrow = self.clone().into_array().into_canonical()?.into_arrow()?; | ||
let array_ref = varbinview_as_arrow(self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one line made a ~20x difference in take performance.
take_strings benchmark was 40µs with old version, mostly due to the try_from
.
This brings it down to 2µs
vortex-array/Cargo.toml
Outdated
|
||
[[bench]] | ||
name = "take_strings" | ||
harness = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline EOF
dtype: &DType, | ||
validity: Validity, | ||
) -> VortexResult<VarBinViewArray> { | ||
let mut views: Vec<u128> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small optimization: chunks.iter().map(|a| a.len()).sum()
is the views capacity
// Create a view to hold the scalar bytes. | ||
// If the scalar cannot be inlined, allocate a single buffer large enough to hold it. | ||
let view: u128 = make_view(scalar_bytes, 0, 0); | ||
let mut buffers = Vec::with_capacity(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conversely, better for this to be Vec::new()
because current version will always allocate even if not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was sort of betting most constant strings will be > 12bytes but actually, most constant strings are probably very short. will change
} | ||
|
||
// Clone our constant view `len` times. | ||
// TODO(aduffy): switch this out for a ConstantArray once we support u128 scalars in Vortex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let offsets = match offsets.ptype() { | ||
PType::I32 | PType::I64 => offsets, | ||
PType::U64 => offsets.reinterpret_cast(PType::I64), | ||
PType::U32 => offsets.reinterpret_cast(PType::I32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should call try_cast
here, which IIRC will reinterpret cast after checking that offsets max is <= I64::MAX. (if not, that's what it should do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just copying existing code from canonical module into varbin module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but also, yes
} | ||
|
||
// TODO(aduffy): do we really need to do this with copying? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, can we just return a VortexResult<&[u8]>
instead, and the caller can copy if it wants owned bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a bit awkward with the inlined views (b/c you can't return a reference to a u128 that is on the stack)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can return a Buffer that points to the underlying bytes within the inlined view I suppose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the only place this is used currently is ScalarAt
which would already just take the reference and convert it to an owned bytes anyway
/// Binary and String views are a new, better encoding format for nearly all use-cases. For now, | ||
/// because DataFusion does not include pervasive support for compute over StringView, we opt to use | ||
/// the [`VarBinArray`] as the canonical encoding (which corresponds to the Arrow `BinaryViewArray`). | ||
/// Binary and String views, also known as "German strings" are a better encoding format for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🇩🇪
values | ||
.into_canonical() | ||
.vortex_expect("VarBin to canonical") | ||
.into_varbinview() | ||
.vortex_expect("VarBinView"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be able to "fast-path" build a VarBinView since we know that the strings in values
are unique (doesn't need to be in this PR)
@lwwmanning I'd prefer to do the DictArray thing in a followup, I think that also gets easier with u128 support, filed #1111 as an independent but related issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🇩🇪 🚀 🥳
Historically, we've gated the ability to go from Vortex -> Arrow arrays behind the `Canonical` type, which picks one "blessed" Arrow encoding for each of our DTypes. Since the introduction of VarBinView in #1082, we are in a position where there are now 2 Vortex string encodings that can each be directly converted to Arrow. What's more, FSSTArray internally uses a `VarBin` array to encode the FSST-compressed strings. It delegates in its CompareFn implementation to running a comparison against the values, which are `VarBin` that will use the default `compare` codepath which does `into_canonical()?.into_arrow()?` and then uses the Arrow codec. This is slow now, because VarBin.into_canonical() will iterate over all the strings to build a canonical `VarBinView`. This requires a full decompress which makes the pushdown pointless. This PR augments the existing `IntoCanonicalVTable` allowing encodings to implement their own `into_arrow()` method. The default continues to call `into_canonical().into_arrow()`, but we implement a fast version for VarBin.
Implementing VarBinView as our canonical string type. Concretely, this involves
Canonical
variant fromVarBin
toVarBinView
and updating all of those method names accordinglyimpl IntoCanonical for VarBinArray
that involves doing a single-pass construction of viewsUtf8/Binary
toUtf8View/BinaryView
and propagating that in Datafusion and Python bindingsSome caveats
bytes
heap backing the varbin as the buffer for varbinview. This means that if the array contains > 2GiB of string data, we will fail currently. I can work on the rollover logic, it's just going to make the PR larger. Alternatively we can say that having chunks > 2GiB is ok, but then it fails converting to Arrow and we just say you can't have a RecordBatch of strings > 2GiB in size. But I don't think that's a great API.