German strings, attempt 3 #1082

Merged
merged 39 commits into from
Oct 22, 2024

@a10y a10y commented Oct 17, 2024

Implementing VarBinView as our canonical string type. Concretely, this involves:

  1. Changing the Canonical variant from VarBin to VarBinView and updating all of the related method names accordingly
  2. Implementing IntoCanonical for VarBinArray with a single-pass construction of views
  3. Changing the Arrow type from Utf8/Binary to Utf8View/BinaryView and propagating that change in the DataFusion and Python bindings
  4. Changing how ChunkedArray canonicalization works so that it repacks views instead of repacking arrays

Some caveats:

  • Currently, in varbin -> varbinview, we reuse the bytes heap backing the varbin as the buffer for the varbinview. This means that if the array contains > 2GiB of string data, the conversion currently fails. I can work on the rollover logic, but it will make the PR larger. Alternatively, we could say that having chunks > 2GiB is OK, but then conversion to Arrow fails and we would have to say you can't have a RecordBatch of strings > 2GiB in size. I don't think that's a great API.
  • This change has little effect on TPC-H benchmarks. Most queries are the same, some are slightly faster and some are ~10-30% slower. Previously, canonicalizing FSST was pretty trivial and didn't require looping over the values. Now the additional view construction adds overhead, because we have to get all of the bytes and construct the views.
image
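
For readers unfamiliar with the layout, here is a minimal standard-library sketch of the single-pass view construction described above. `make_view_sketch` and `views_from_offsets` are hypothetical names, not the functions in this PR. The layout follows the Arrow view format (a 4-byte length, then either up to 12 inlined bytes or a 4-byte prefix, a buffer index and an offset). Treating the offset as a signed 32-bit value is an assumption made here to mirror the 2GiB caveat.

```rust
use std::convert::TryFrom;

/// Longest value that fits inline: 16 bytes minus the 4-byte length.
const MAX_INLINE: usize = 12;

/// Hypothetical helper: build one 16-byte view for `data`, which lives at
/// `offset` inside the buffer numbered `buffer_index`.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut raw = [0u8; 16];
    let len = u32::try_from(data.len()).expect("value longer than u32::MAX bytes");
    raw[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= MAX_INLINE {
        // Short value: the bytes are stored inside the view, zero padded.
        raw[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: 4-byte prefix, then where to find the full bytes.
        raw[4..8].copy_from_slice(&data[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(raw)
}

/// Single pass over VarBin-style offsets, reusing `bytes` as buffer 0.
fn views_from_offsets(bytes: &[u8], offsets: &[usize]) -> Result<Vec<u128>, String> {
    let mut views = Vec::with_capacity(offsets.len().saturating_sub(1));
    for window in offsets.windows(2) {
        let (start, end) = (window[0], window[1]);
        // Assumption: offsets must fit in a signed 32-bit integer, so a
        // single reused buffer cannot address more than 2GiB.
        let offset = i32::try_from(start)
            .map_err(|_| format!("offset {} exceeds i32::MAX", start))?;
        views.push(make_view_sketch(&bytes[start..end], 0, offset as u32));
    }
    Ok(views)
}
```

Calling `views_from_offsets(b"xxhello, world!", &[0, 2, 15])` yields two views: the first inlines `xx`, the second stores the prefix `hell` plus buffer index 0 and offset 2.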

@a10y a10y changed the title German strings, part 3 German strings, attempt 3 Oct 17, 2024
Comment on lines 196 to 201
fn pack_views(
    chunks: &[Array],
    dtype: &DType,
    validity: Validity,
) -> VortexResult<VarBinViewArray> {
    let mut views = Vec::new();
Contributor Author

this whole section was copied over from the last 2 PRs

@lwwmanning
Member

🇩🇪

a10y added 4 commits October 17, 2024 21:34
the transmute was causing issues with deallocation.
Comment on lines +102 to +106
values
    .into_canonical()
    .vortex_expect("VarBin to canonical")
    .into_varbinview()
    .vortex_expect("VarBinView"),
Contributor Author

preferably do this in a single-pass

Member

we should be able to "fast-path" build a VarBinView since we know that the strings in values are unique (doesn't need to be in this PR)

@robert3005
Member

Failed example:
    array.to_pandas()
Exception raised:
    Traceback (most recent call last):
      File "/home/runner/.rye/py/[email protected]/lib/python3.11/doctest.py", line 1355, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest default[1]>", line 1, in <module>
        array.to_pandas()
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 1214, in __repr__
        return self.to_string(**repr_params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/util/_decorators.py", line 333, in wrapper
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 1394, in to_string
        return fmt.DataFrameRenderer(formatter).to_string(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 962, in to_string
        string = string_formatter.to_string()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 29, in to_string
        text = self._get_string_representation()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 44, in _get_string_representation
        strcols = self._get_strcols()
                  ^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 35, in _get_strcols
        strcols = self.fmt.get_strcols()
                  ^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 476, in get_strcols
        strcols = self._get_strcols_without_index()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 729, in _get_strcols_without_index
        str_columns = self._get_formatted_column_labels(self.tr_frame)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 809, in _get_formatted_column_labels
        need_leadsp = dict(zip(fmt_columns, map(is_numeric_dtype, dtypes)))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/common.py", line 1119, in is_numeric_dtype
        return _is_dtype_type(
               ^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/common.py", line 1468, in _is_dtype_type
        tipo = pandas_dtype(arr_or_dtype).type
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/vortex/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 2169, in type
        raise NotImplementedError(pa_type)
    NotImplementedError: string_view

too ambitious

@robert3005
Member

This seems to be an issue only with to_string, or with the parts of the code path that need to derive a pandas type from an Arrow type.

@a10y
Contributor Author

a10y commented Oct 18, 2024

Yea, so it works fine when I bind it to a variable, and then print the variable

>>> df = names.to_arrow_table().to_pandas()
>>> df
  name
0    a

But not if I let it print to the REPL

>>> names.to_arrow_table().to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 1214, in __repr__
    return self.to_string(**repr_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 1394, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 962, in to_string
    string = string_formatter.to_string()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 29, in to_string
    text = self._get_string_representation()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 44, in _get_string_representation
    strcols = self._get_strcols()
              ^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/string.py", line 35, in _get_strcols
    strcols = self.fmt.get_strcols()
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 476, in get_strcols
    strcols = self._get_strcols_without_index()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 729, in _get_strcols_without_index
    str_columns = self._get_formatted_column_labels(self.tr_frame)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/io/formats/format.py", line 809, in _get_formatted_column_labels
    need_leadsp = dict(zip(fmt_columns, map(is_numeric_dtype, dtypes)))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/common.py", line 1119, in is_numeric_dtype
    return _is_dtype_type(
           ^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/common.py", line 1468, in _is_dtype_type
    tipo = pandas_dtype(arr_or_dtype).type
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Code/vortex/.venv/lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 2169, in type
    raise NotImplementedError(pa_type)
NotImplementedError: string_view

I'll file an issue with pandas. In the meantime, I fixed up the example to assign to a variable and then print it.

@danking
Member

danking commented Oct 18, 2024

Wat

@robert3005
Member

this does look like to_string being wrong

@a10y
Contributor Author

a10y commented Oct 18, 2024

Filed: pandas-dev/pandas#60068

@a10y a10y added the benchmark Run benchmarks on this branch label Oct 18, 2024
@github-actions github-actions bot removed benchmark Run benchmarks on this branch labels Oct 18, 2024
@github-actions github-actions bot left a comment

Vortex bytes_at

| Benchmark suite | Current: f524f8f | Previous: acdd46b | Ratio |
|---|---|---|---|
| bytes_at/array_data | 711.5127204052014 ns (1.031384300668492) | 688.4294123541362 ns (1.9603893620694635) | 1.03 |
| bytes_at/array_view | 179.63084933526906 ns (0.3510182229461378) | 878.9548024570145 ns (2.514577200262977) | 0.20 |

This comment was automatically generated by workflow using github-action-benchmark.

vortex-scalar/src/arrow.rs (outdated, resolved)
@a10y a10y marked this pull request as draft October 22, 2024 13:59
@a10y
Contributor Author

a10y commented Oct 22, 2024

Putting back into draft to make a couple of small changes. Namely, I think we can just use Arrow casting to canonicalize VarBin -> View. Based on microbenchmarks, it looks ~10x faster than what I have.

@a10y
Contributor Author

a10y commented Oct 22, 2024

Alright that seems to have closed the perf gap for the scan-heavy queries.

image

let array_arrow = self.clone().into_array().into_canonical()?.into_arrow()?;
let array_ref = varbinview_as_arrow(self);
Contributor Author

this one line made a ~20x difference in take performance.

take_strings benchmark was 40µs with old version, mostly due to the try_from.

This brings it down to 2µs

@a10y a10y marked this pull request as ready for review October 22, 2024 14:56

[[bench]]
name = "take_strings"
harness = false
Member

nit: newline EOF

    dtype: &DType,
    validity: Validity,
) -> VortexResult<VarBinViewArray> {
    let mut views: Vec<u128> = Vec::new();
Member

small optimization: chunks.iter().map(|a| a.len()).sum() is the views capacity
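
A minimal sketch of that suggestion, assuming a stand-in `ChunkSketch` type (hypothetical, not a Vortex type) whose views can simply be concatenated. The real `pack_views` also has to deal with buffers and validity, which this leaves out.

```rust
/// Hypothetical stand-in for a chunk; only its views matter here.
struct ChunkSketch {
    views: Vec<u128>,
}

/// Concatenate the views of all chunks with a single up-front allocation.
fn pack_views_sketch(chunks: &[ChunkSketch]) -> Vec<u128> {
    // The total number of views is known before any copying starts.
    let total: usize = chunks.iter().map(|c| c.views.len()).sum();
    let mut views = Vec::with_capacity(total);
    for chunk in chunks {
        // No reallocation happens here because capacity was reserved above.
        views.extend_from_slice(&chunk.views);
    }
    views
}
```

With `Vec::new()` the vector would instead grow by repeated reallocation as chunks are appended.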

// Create a view to hold the scalar bytes.
// If the scalar cannot be inlined, allocate a single buffer large enough to hold it.
let view: u128 = make_view(scalar_bytes, 0, 0);
let mut buffers = Vec::with_capacity(1);
Member

conversely, better for this to be Vec::new() because current version will always allocate even if not needed

Contributor Author

I was sort of betting most constant strings would be > 12 bytes, but actually most constant strings are probably very short. Will change.
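
A small sketch of the trade-off discussed in this thread. `constant_buffers_sketch` is a hypothetical name, and the 12-byte inline limit is taken from the comment above. `Vec::new()` does not touch the heap until the first push, so the short-string case pays nothing.

```rust
/// Values up to this length are stored inside the view itself.
const INLINE_LIMIT: usize = 12;

/// Return the data buffers needed for a constant string array.
fn constant_buffers_sketch(scalar_bytes: &[u8]) -> Vec<Vec<u8>> {
    // Vec::new() performs no heap allocation until something is pushed.
    let mut buffers = Vec::new();
    if scalar_bytes.len() > INLINE_LIMIT {
        // Only long values need a backing buffer.
        buffers.push(scalar_bytes.to_vec());
    }
    buffers
}
```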

}

// Clone our constant view `len` times.
// TODO(aduffy): switch this out for a ConstantArray once we support u128 scalars in Vortex.
Member

file an issue?

Contributor Author

let offsets = match offsets.ptype() {
    PType::I32 | PType::I64 => offsets,
    PType::U64 => offsets.reinterpret_cast(PType::I64),
    PType::U32 => offsets.reinterpret_cast(PType::I32),
Member

we should call try_cast here, which IIRC will reinterpret cast after checking that offsets max is <= I64::MAX. (if not, that's what it should do)

Contributor Author

This is just copying existing code from canonical module into varbin module

Contributor Author

but also, yes
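
A sketch of the checked conversion the reviewer describes, on plain vectors rather than Vortex arrays. `offsets_u64_to_i64` is a hypothetical name and says nothing about what `try_cast` actually does. A real implementation would presumably reinterpret the buffer in place after the check; a plain map stands in for that here.

```rust
/// Convert unsigned offsets to signed ones, refusing values that would wrap.
fn offsets_u64_to_i64(offsets: Vec<u64>) -> Result<Vec<i64>, String> {
    // Offsets are non-decreasing in practice, but taking the max avoids
    // relying on that.
    let max = offsets.iter().copied().max().unwrap_or(0);
    if max > i64::MAX as u64 {
        return Err(format!("offset {} does not fit in i64", max));
    }
    // Every value is <= i64::MAX, so the conversion below is lossless.
    Ok(offsets.into_iter().map(|o| o as i64).collect())
}
```

Without the check, an offset above `i64::MAX` would silently become negative after the reinterpretation.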

vortex-array/src/array/varbinview/mod.rs (outdated, resolved)
}

// TODO(aduffy): do we really need to do this with copying?
Member

+1, can we just return a VortexResult<&[u8]> instead, and the caller can copy if it wants owned bytes?

Contributor Author

it's a bit awkward with the inlined views (b/c you can't return a reference to a u128 that is on the stack)

Contributor Author

can return a Buffer that points to the underlying bytes within the inlined view I suppose

Contributor Author

FWIW the only place this is used currently is ScalarAt which would already just take the reference and convert it to an owned bytes anyway
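
To illustrate the borrowing question in this thread, a sketch under the assumption that views are kept as `[u8; 16]` values in a slice owned by the array. `bytes_at_sketch` is a hypothetical function, not the Vortex `bytes_at`. Borrowing the inlined bytes works here only because the view is read from that storage and never copied to the stack first.

```rust
/// Return the bytes of element `index` without copying them.
fn bytes_at_sketch<'a>(
    views: &'a [[u8; 16]],
    buffers: &'a [Vec<u8>],
    index: usize,
) -> &'a [u8] {
    // Borrow the view in place; copying it into a local would shorten the
    // lifetime of any slice taken from it.
    let view = &views[index];
    let len = u32::from_le_bytes([view[0], view[1], view[2], view[3]]) as usize;
    if len <= 12 {
        // Inlined value: the bytes live inside the view itself.
        &view[4..4 + len]
    } else {
        // Out-of-line value: follow the buffer index and offset.
        let buffer = u32::from_le_bytes([view[8], view[9], view[10], view[11]]) as usize;
        let offset = u32::from_le_bytes([view[12], view[13], view[14], view[15]]) as usize;
        &buffers[buffer][offset..offset + len]
    }
}
```

A caller that needs owned bytes, such as the ScalarAt path mentioned above, can call `.to_vec()` on the result.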

/// Binary and String views are a new, better encoding format for nearly all use-cases. For now,
/// because DataFusion does not include pervasive support for compute over StringView, we opt to use
/// the [`VarBinArray`] as the canonical encoding (which corresponds to the Arrow `BinaryViewArray`).
/// Binary and String views, also known as "German strings" are a better encoding format for
Member

🇩🇪

vortex-array/src/lib.rs (outdated, resolved)

@a10y
Contributor Author

a10y commented Oct 22, 2024

@lwwmanning I'd prefer to do the DictArray thing in a followup. I think that also gets easier with u128 support. Filed #1111 as an independent but related issue.

@lwwmanning lwwmanning left a comment

🇩🇪 🚀 🥳

@a10y a10y enabled auto-merge (squash) October 22, 2024 18:54
@a10y a10y merged commit b9335aa into develop Oct 22, 2024
5 checks passed
@a10y a10y deleted the aduffy/german-strings-take3 branch October 22, 2024 19:07
a10y added a commit that referenced this pull request Dec 6, 2024
Historically, we've gated the ability to go from Vortex -> Arrow arrays
behind the `Canonical` type, which picks one "blessed" Arrow encoding
for each of our DTypes.

Since the introduction of VarBinView in #1082, we are in a position
where there are now 2 Vortex string encodings that can each be directly
converted to Arrow.

What's more, FSSTArray internally uses a `VarBin` array to encode the
FSST-compressed strings. It delegates in its CompareFn implementation to
running a comparison against the values, which are `VarBin` that will
use the default `compare` codepath which does
`into_canonical()?.into_arrow()?` and then uses the Arrow codec.

This is slow now, because VarBin.into_canonical() will iterate over all
the strings to build a canonical `VarBinView`. This requires a full
decompress which makes the pushdown pointless.

This PR augments the existing `IntoCanonicalVTable` allowing encodings
to implement their own `into_arrow()` method. The default continues to
call `into_canonical().into_arrow()`, but we implement a fast version
for VarBin.
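
The shape of that change can be sketched with a trait default method. Every name below is hypothetical and heavily simplified; it is not the actual `IntoCanonicalVTable` API. It only shows a default `into_arrow` that goes through the canonical form, plus one encoding that overrides it with a direct conversion.

```rust
/// Which Arrow layout a conversion produced (illustrative only).
#[derive(Debug, PartialEq)]
enum ArrowLayoutSketch {
    Utf8,
    Utf8View,
}

/// Stand-in for the canonical (VarBinView) form.
struct CanonicalSketch;

impl CanonicalSketch {
    fn into_arrow(self) -> ArrowLayoutSketch {
        ArrowLayoutSketch::Utf8View
    }
}

trait IntoArrowSketch: Sized {
    fn into_canonical(self) -> CanonicalSketch;

    /// Default path: canonicalize first, then convert to Arrow.
    fn into_arrow(self) -> ArrowLayoutSketch {
        self.into_canonical().into_arrow()
    }
}

struct OtherEncodingSketch;
struct VarBinSketch;

impl IntoArrowSketch for OtherEncodingSketch {
    fn into_canonical(self) -> CanonicalSketch {
        CanonicalSketch
    }
}

impl IntoArrowSketch for VarBinSketch {
    fn into_canonical(self) -> CanonicalSketch {
        CanonicalSketch
    }

    /// Fast path: skip view construction and convert directly.
    fn into_arrow(self) -> ArrowLayoutSketch {
        ArrowLayoutSketch::Utf8
    }
}
```

Encodings that do not override the method keep the old behavior, so the change is opt-in per encoding.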

5 participants